Second Croatian Computer Vision Workshop September 19, 2013, Zagreb, Croatia P R OF CCVW E 2 0 1 3 E D I N G S University of Zagreb Center of Excellence for Computer Vision Proceedings of the Croatian Computer Vision Workshop CCVW 2013, Year 1 Publisher: University of Zagreb Faculty of Electrical Engineering and Computing, Unska 3, HR-10000 Zagreb, Croatia ISSN 1849-1227 http://www.fer.unizg.hr/crv/proceedings CCVW 2013 Proceedings of the Croatian Computer Vision Workshop Zagreb, Croatia, September 19, 2013 S. Lončarić, S Šegvić (Eds.) Organizing Institution Center of Excellence for Computer Vision, Faculty of Electrical Engineering and Computing, University of Zagreb, Croatia Technical Co-Sponsors IEEE Croatia Section IEEE Croatia Section Computer Society Chapter IEEE Croatia Section Signal Processing Society Chapter Donators Smartek d.o.o. AVL–AST d.o.o. Peek promet d.o.o. PhotoPay Ltd. Proceedings of the Croatian Computer Vision Workshop CCVW 2013, Year 1 Editor-in-chief Sven Lončarić ([email protected]) University of Zagreb Faculty of Electrical Engineering and Computing Unska 3, HR-10000, Croatia Editor Siniša Šegvić ([email protected]) University of Zagreb Faculty of Electrical Engineering and Computing Unska 3, HR-10000, Croatia Production, Publishing and Cover Design Tomislav Petković ([email protected]) University of Zagreb Faculty of Electrical Engineering and Computing Unska 3, HR-10000, Croatia Publisher University of Zagreb Faculty of Electrical Engineering and Computing Unska 3, HR-10000 Zagreb, OIB: 57029260362 c 2013 by the University of Zagreb. Copyright This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. http://creativecommons.org/licenses/by-nc-sa/3.0/ ISSN 1849-1227 Proceedings of the Croatian Computer Vision Workshop, Year 1 September 19, 2013, Zagreb, Croatia ii Preface On behalf of the Organizing Committee it is my pleasure to invite you to Zagreb for the 2nd Croatian Computer Vision Workshop. The objective of the Workshop is to bring together professionals from academia and industry in the area of computer vision theory and applications in order to foster research and encourage academia-industry collaboration in this dynamic field. The Workshop program includes oral and poster presentations of original peer reviewed research from Croatia and elsewhere. Furthermore, the program includes invited lectures by distinguished international researchers presenting state-of-the-art in computer vision research. Workshop sponsors will provide perspective on needs and activities of the industry. Finally, one session shall be devoted to short presentations of activities at Croatian research laboratories. The Workshop is organized by the Center of Excellence for Computer Vision, which is located at the Faculty of Electrical Engineering and Computing (FER), University of Zagreb. The Center joins eight research laboratories at FER and research laboratories from six constituent units of the University of Zagreb: Faculty of Forestry, Faculty of Geodesy, Faculty of Graphic Arts, Faculty of Kinesiology, Faculty of Mechanical Engineering and Naval Architecture, and Faculty of Transport and Traffic Sciences. Zagreb is a beautiful European city with many cultural and historical attractions, which I am sure all participants will enjoy. I look forward to meet you all in Zagreb for the 2nd Croatian Computer Vision Workshop. 
September 2013 Sven Lončarić, General Chair Proceedings of the Croatian Computer Vision Workshop, Year 1 September 19, 2013, Zagreb, Croatia iii Acknowledgements The 2013 2nd Croatian Computer Vision Workshop (CCVW) is the result of the committed efforts of many volunteers. All included papers are results of dedicated research. Without such contribution and commitment this Workshop would not have been possible. Program Committee members and reviewers have spent many hours reviewing submitted papers and providing extensive reviews which will be an invaluable help in future work of collaborating authors. Managing the electronic submissions of the papers, the preparation of the abstract booklet and of the online proceedings also required substantial effort and dedication that must be acknowledged. The Local Organizing Committee members did an excellent job to guarantee a successful outcome of the Workshop. We are grateful to the Technical Co-Sponsors, who helped us in granting the high scientific quality of the presentations, and to the Donators that financially supported this Workshop. Proceedings of the Croatian Computer Vision Workshop, Year 1 September 19, 2013, Zagreb, Croatia iv Contents Organizing Committee ....................................................... 1 Reviewers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 CCVW 2013 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 Oral Session . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 N. Banić, S. Lončarić Using the Random Sprays Retinex Algorithm for Global Illumination Estimation . . . . . . . . . . . 3 K. Brkić, S. Rašić, A. Pinz, S. Šegvić, Z. Kalafatić Combining Spatio-Temporal Appearance Descriptors and Optical Flow for Human Action Recognition in Video Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 M. Lopar, S. Ribarić An Overview and Evaluation of Various Face and Eyes Detection Algorithms for Driver Fatigue Monitoring Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 I. Sikirić, K. Brkić, S. Šegvić Classifying Traffic Scenes Using The GIST Image Descriptor . . . . . . . . . . . . . . . . . . . . . . 19 Poster Session . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 K. Kovačić, E. Ivanjko, H. Gold Computer Vision Systems in Road Vehicles: A Review . . . . . . . . . . . . . . . . . . . . . . . . . . 25 R. Cupec, E. K. Nyarko, D. Filko, A. Kitanov, I. Petrović Global Localization Based on 3D Planar Surface Segments . . . . . . . . . . . . . . . . . . . . . . . . 31 V. Zadrija, S. Šegvić Multiclass Road Sign Detection using Multiplicative Kernel . . . . . . . . . . . . . . . . . . . . . . . 37 I. Krešo, M. Ševrović, S. Šegvić A Novel Georeferenced Dataset for Stereo Visual Odometry . . . . . . . . . . . . . . . . . . . . . . . 43 V. Hrgetić, T. Pribanić Surface Registration Using Genetic Algorithm in Reduced Search Space . . . . . . . . . . . . . . . . 49 M. Muštra, M. Grgić Filtering for More Accurate Dense Tissue Segmentation in Digitized Mammograms . . . . . . . . . . 53 T. Petković, D. Jurić, S. Lončarić Flexible Visual Quality Inspection in Discrete Manufacturing . . . . . . . . . . . . . . . . . . . . . . 58 Author Index . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Proceedings of the Croatian Computer Vision Workshop, Year 1 September 19, 2013, Zagreb, Croatia 63 v Organizing Committee General Chair Sven Lončarić, University of Zagreb, Croatia Technical Program Committe Chair Siniša Šegvić, University of Zagreb, Croatia Local Arrangements Chair Zoran Kalafatić, University of Zagreb, Croatia Publications Chair Tomislav Petković, University of Zagreb, Croatia Technical Program Committee Bart Bijnens, Spain Hrvoje Bogunović, USA Mirjana Bonković, Croatia Karla Brkić, Croatia Andrew Comport, France Robert Cupec, Croatia Albert Diosi, Slovakia Hrvoje Dujmić, Croatia Ivan Fratrić, Switzerland Mislav Grgić, Croatia Sonja Grgić, Croatia Edouard Ivanjko, Croatia Zoran Kalafatić, Croatia Stanislav Kovačič, Slovenia Zoltan Kato, Hungary Josip Krapac, Croatia Matej Kristan, Slovenia Proceedings of the Croatian Computer Vision Workshop, Year 1 September 19, 2013, Zagreb, Croatia Sven Lončarić, Croatia Lidija Mandić, Croatia Thomas Mensink, Netherlands Vladan Papić, Croatia Tomislav Petković, Croatia Ivan Petrović, Croatia Thomas Pock, Austria Tomislav Pribanić, Croatia Arnau Ramisa, Spain Gianni Ramponi, Italy Anthony Remazeilles, Spain Slobodan Ribarić, Croatia Andres Santos, Spain Damir Seršić, Croatia Darko Stipaničev, Croatia Marko Subašić, Croatia Federico Sukno, Ireland 1 Reviewers Albert Diosi, Austraila Andres Santos, Spain Andrew Comport, France Anthony Remazeilles, France Arnau Ramisa, Spain Axel Pinz, Austria Bart Bijnens, Spain Darko Stipaničev, Croatia Edouard Ivanjko, Croatia Emmanuel Karlo Nyarko, Croatia Federico Sukno, Argentina Hrvoje Bogunović, USA Hrvoje Dujmić, Croatia Ivan Fratrić, Croatia Ivan Marković, Croatia Josip Krapac, Croatia Josip Ćesić, Croatia Karla Brkić, Croatia Lidija Mandić, Croatia Marko Subašić, Croatia Matej Kristan, Slovenia Maxime Meilland, France Mirjana Bonković, Croatia Mislav Grgić, Croatia Robert Cupec, Croatia Siniša Šegvić, Croatia Slobodan Ribarić, Croatia Sonja Grgić, Croatia Stanislav Kovačič, Slovenia Thomas Mensink, The Netherlands Thomas Pock, Austria Tomislav Petković, Croatia Tomislav Pribanić, Croatia Vladan Papić, Croatia Zoltan Kato, Hungary Zoran Kalafatić, Croatia Proceedings of the Croatian Computer Vision Workshop, Year 1 September 19, 2013, Zagreb, Croatia 2 Proceedings of the Croatian Computer Vision Workshop, Year 1 September 19, 2013, Zagreb, Croatia Using the Random Sprays Retinex algorithm for global illumination estimation Nikola Banić and Sven Lončarić University of Zagreb, Faculty of Electrical Engineering and Computing Unska 3, 10000 Zagreb, Croatia E-mail: {nikola.banic, sven.loncaric}@fer.hr Abstract—In this paper the use of Random Sprays Retinex (RSR) algorithm for global illumination estimation is proposed and its feasibility tested. Like other algorithms based on the Retinex model, RSR also provides local illumination estimation and brightness adjustment for each pixel and it is faster than other path-wise Retinex algorithms. As the assumption of the uniform illumination holds in many cases, it should be possible to use the mean of local illumination estimations of RSR as a global illumination estimation for images with (assumed) uniform illumination allowing also the accuracy to be easily measured. Therefore we propose a method for estimating global illumination estimation based on local RSR results. 
To our best knowledge this is the first time that RSR algorithm is used to obtain global illumination estimation. For our tests we use a publicly available color constancy image database for testing. The results are presented and discussed and it turns out that the proposed method outperforms many existing unsupervised color constancy algorithms. The source code is available at http://www.fer.unizg.hr/ipg/resources/color constancy/. information (HLVI) [7], probabilistic algorithms [8] and combination of existing methods [9]. The result of all mentioned algorithms is a single vector representing the global illumination estimation which is than used in chromatic adaptation to create an appearance of another desired illumination. Algorithms like those based on gamut mapping have the advantage of greater accuracy, but they need to be trained first. Some simpler and unsupervised algorithms like Gray-world or Gray-edge are of lesser accuracy, but are easy to implement and have a low computation cost. Keywords—white balance, color constancy, Retinex, Random Sprays Retinex, sampling I. e= ! = Z I(λ)ρc (λ)dλ (1) ω and it represents the illumination estimation. Examples of color constancy algorithms include the grayworld algorithm [2], shades of gray [3], grey-edge [4], gamut mapping [5], using neural networks [6], using high-level visual CCVW 2013 Oral Session (b) Fig. 1: Same object under different illuminations. I NTRODUCTION A well known feature of human vision system (HVS) is its ability to recognize the color of objects under variable illumination when this color depends on the color of the light source [1]. An example of an object under different light sources can be seen on Fig. 1. This feature is called color constancy and achieving it computationally can significantly enhance the quality of digital images. Even though the HVS has generally no problem with it, computational color constancy is an illposed problem and assumptions have to be made for color constancy algorithms. Some of these assumptions include color distribution, uniform illumination, presence of white patches etc. After taking the Lambertian and one single light source assumption, the dependence of observed color of the light source e on the light source I(λ) and camera sensitivity function ρ(λ), which are both unknown, can be written as eR eG eB (a) A number of algorithms like [10] estimate the illumination locally and than combine multiple local results into one global thus also producing the global illumination estimation. In this paper we propose a similar method based on the Random Sprays Retinex (RSR) [11], an algorithm of Retinex model, which deals with locality of color perception [12], a phenomenon by which the HVS’s perception of colors is influenced by the colors of adjacent areas in the scene. The algorithms of the Retinex model provide local white balancing and brightness adjustment producing so an enhanced image and not a single vector. If the assumption of uniform illumination is taken, then a single illumination estimation vector can be created from combined local estimations. RSR was chosen because it has the advantage of being faster than other path-wise Retinex algorithms [11]. This is the structure of the paper: in Section II a simple explanation of the Random Sprays Retinex algorithms is given, in Section III our proposed method for global illumination estimation is explained and in Section IV the evaluation results are presented and discussed. II. 
RANDOM SPRAYS RETINEX ALGORITHM

The Random Sprays Retinex algorithm was developed by taking into consideration the mathematical description of Retinex provided in [13].

Fig. 2: Examples of images for various RSR parameters: (a) original image from the ColorChecker database, (b) N = 1, n = 16, (c) N = 5, n = 20, (d) N = 20, n = 400.

After simplifying the initial model, it can be proved that the lightness value of pixel i for a given channel can be calculated by using this formula

L(i) = \frac{1}{N} \sum_{k=1}^{N} \frac{I(i)}{I(x_{H_k})}    (2)

where I(i) is the initial intensity of pixel i, N is the number of paths and x_{H_k} is the index of the pixel with the highest intensity along the k-th path. The next step towards RSR in [11] is to notice three reasons for which paths should be replaced with something else: they are redundant, their ordering is completely unimportant and they have inadequate topological dimension. This leads to the use of 2-D objects as representations of the pixel neighbourhood, which is taken into consideration when calculating the new pixel intensity. Random sprays are finally chosen as these 2-D objects, leading to several parameters that need to be tuned. The value of the spray radius is set to be equal to the diagonal of the image. The identity function is taken as the radial density function. The minimal number of sprays (N) and the minimal number of pixels per spray (n), representing a trade-off between results quality and computation cost, were determined to be 20 and 400 respectively [11]. Fig. 2(a) shows a test image from the ColorChecker image database [8]. RSR processed images for various parameters are shown on Fig. 2(b), Fig. 2(c) and Fig. 2(d).

III. PROPOSED METHOD

A. Basic idea

From the original image and the RSR resulting image for a pixel i it is possible to calculate its relative intensity change for a given channel c using the equation

C_c(i) = \frac{I_c(i)}{R_c(i)}    (3)

where C_c(i) is the intensity change of pixel i for channel c, I_c(i) the original intensity and R_c(i) the intensity obtained by RSR. Considering the way it is calculated, the vector p(i) = [C_r(i), C_g(i), C_b(i)]^T composed of one pixel intensity change element for each channel can be interpreted as the RSR local illumination estimation and, since it is not necessarily normalized, it also represents the local brightness adjustment. That means that p(i) can also be written as p(i) = w(i)\hat{p}(i), where w(i) = \|p(i)\| is the norm of p(i) and \hat{p}(i) is the unit vector with the same direction as p(i). The merged result of Eq. 3 for all channels calculated by using the RSR result from Fig. 2(b) is shown on Fig. 3(a).

Fig. 3: (a) Local intensity changes for image shown on Fig. 2(a). (b) Local intensity changes after blurring the original and the RSR image.

As it is obvious that for some cases like the one shown on Fig. 3(a) there is a higher level of visible noise when visualising the intensity changes, it might be a good thing to try to lessen the noise in some way before further using the calculated changes. A good way of doing so is to apply a modification to Eq. 3:

C_{c,k}(i) = \frac{(I_c * k)(i)}{(R_c * k)(i)}    (4)

where k is a chosen kernel and * is the convolution operator. By applying Eq. 4 with an averaging kernel of size 25 × 25 instead of Eq. 3, the result shown on Fig. 3(a) turns into the one shown on Fig. 3(b).

A simple way of obtaining the global illumination estimation now is to calculate the vector

e = \sum_{i=1}^{M} p(i) = \sum_{i=1}^{M} w(i)\hat{p}(i)    (5)

where e is the final global illumination estimation vector and M is the number of pixels in the image. The division by M is omitted since we are only interested in the direction of e and not its norm. It is obvious that vectors p(i) with a greater corresponding value of w(i) will have a greater impact on the final direction of vector e, so w(i) can also be interpreted as the weight. A simple alternative to Eq. 5 is to omit the weight w(i):

e = \sum_{i=1}^{M} \hat{p}(i)    (6)
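To make Eqs. (3)-(5) concrete, the following sketch (not the authors' C++ implementation) computes the global estimate in Python with NumPy/OpenCV; the function rsr() is a hypothetical stand-in for an RSR implementation, and the averaging-kernel blur corresponds to Eq. (4).

```python
import numpy as np
import cv2


def estimate_global_illumination(img, rsr_img, kernel_size=5, eps=1e-6):
    """Global illumination estimate from an RSR-processed image (sketch of Eqs. 3-5).

    img, rsr_img: float32 arrays of shape (H, W, 3) in linear RGB.
    kernel_size:  size of the averaging kernel k from Eq. (4).
    """
    img = img.astype(np.float32)
    rsr_img = rsr_img.astype(np.float32)

    # Eq. (4): blur both images with an averaging kernel before the division.
    img_blur = cv2.blur(img, (kernel_size, kernel_size))
    rsr_blur = cv2.blur(rsr_img, (kernel_size, kernel_size))

    # Eq. (3)/(4): per-pixel, per-channel relative intensity change, i.e. p(i).
    change = img_blur / (rsr_blur + eps)          # shape (H, W, 3)

    # Eq. (5): sum the unnormalized local estimates; their norms w(i) act as
    # implicit weights. Division by M is deliberately omitted.
    e = change.reshape(-1, 3).sum(axis=0)

    # Only the direction of e matters, so return the unit vector.
    return e / np.linalg.norm(e)


# Usage sketch (rsr() is hypothetical and not provided here):
# rsr_img = rsr(img, N=1, n=225)
# e = estimate_global_illumination(img, rsr_img, kernel_size=5)
```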
B. Pixel sampling

By looking at Fig. 3(b), it is clear that the difference between intensity changes of spatially close pixels is not large. This can be taken advantage of by not calculating p(i) for all pixels, but only for some of them. For that purpose the row step r and the column step c are introduced, meaning that only every c-th pixel in every r-th row is processed. By doing so, the computation cost is reduced, which can have an important impact on speed when greater values for r and c are used.

C. Parameters

The proposed method inherits all parameters from RSR, the two most important being the number of sprays N and the size of individual sprays n. Additional parameters include the filter kernel type and size, and r and c, the row step and column step respectively, for choosing the pixels for which to estimate the illumination. As all of these parameters have a potential influence on the final direction, tuning is necessary. Due to the many possible combinations of parameters, the parameters inherited from RSR and not mentioned in this subsection are set to the values they were tuned to in [11]. Also, in order to simplify further testing, only the averaging kernel was used, due to its simplicity, low computation cost and no apparent disadvantage over other kernels after several simple tests (it even provided for a greater accuracy over the case when Gaussian kernels were used).

D. Method's name

As described in a previous subsection, the proposed method is designed to have an adjustable pixel sampling rate. As this allows the proposed method to "fly" over some pixels without visiting them, we decided to name the proposed method Color Sparrow (CS).

IV. EVALUATION AND RESULTS

A. Image database and testing method

For testing the proposed method and tuning the parameters, the publicly available re-processed version [14] of the ColorChecker database described in [8] was used. It contains 568 different linear RGB images taken both outside and in closed space, each of them containing a GretagMacbeth color checker used to calculate the groundtruth illumination data, which is provided together with the images. The positions of the whole color checker and of the individual color patches of the color checker in each image are also provided, which allows for the color checker to be simply masked out during the evaluation. As the error measure for white balancing, the angle between the RGB value of the groundtruth illumination and the estimated illumination of an image was chosen. It should be noted that because the images in the Color Checker database are of medium variety [15], it might be necessary to retune the parameters by using some other, larger databases in order to obtain parameter values that would be a good choice for images outside the Color Checker database.

Earlier in this paper two ways were proposed to estimate the direction of the global illumination, represented by Eq. 5 and Eq. 6. As in the performed experiments the former equation slightly outperformed the latter, all results mentioned in this paper were obtained by using Eq. 5, which means that there was no normalization of local intensity change vectors.
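The angular error used as the evaluation measure is the standard one and can be written compactly as below (an illustrative snippet, not the paper's code): both vectors are normalized and the angle is recovered from their dot product.

```python
import numpy as np


def angular_error_deg(e_gt, e_est):
    """Angle in degrees between ground-truth and estimated illumination (RGB vectors)."""
    u = np.asarray(e_gt, dtype=np.float64)
    v = np.asarray(e_est, dtype=np.float64)
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    cosine = np.clip(np.dot(u, v), -1.0, 1.0)   # guard against rounding outside [-1, 1]
    return np.degrees(np.arccos(cosine))


# Example: a neutral estimate against a slightly warm ground truth (~4.7 degrees).
print(angular_error_deg([1.1, 1.0, 0.9], [1.0, 1.0, 1.0]))
```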
Fig. 4: Performance of several parameter settings with respect to different kernel sizes: (a) mean angle error, (b) median angle error.

B. Tuning the kernel size

As explained before, the kernel type was fixed to the averaging kernel. Fig. 4 shows two aspects of kernel size influence on method performance. All parameter settings have their r and c parameters set to 10 and N was set to 1, while the value of n varies. The reason to use only one value for N is that raising it has insignificant impact on the performance. The graphs on Fig. 4 show a clear performance difference between application of Eq. 3, which uses no kernel, and Eq. 4. It is interesting to note that different kernel sizes have only a slight impact on the mean angular error, but this impact is greater on the median error. While the mean rises from 3.7° to 3.8° only when using the 205 × 205 kernel, the median rises from 2.8° to 2.9° when using kernels of size greater than 25 × 25. Therefore the kernel size should be in the range from 5 × 5 to 25 × 25 inclusive, and in further tests the kernel size 5 × 5 is used.

C. Tuning N and n parameters

Both Fig. 5 and Fig. 4 show the mean and median angle between the groundtruth global illumination and the global illumination estimation of CS, calculated on Shi's images with various values of parameter n for several fixed combinations of other parameters. It can be seen that the lowest mean angular error of 3.7° and the lowest median error of 2.8° are achieved for the value of n between 200 and 250, and that this is almost invariant to the values of other parameters. For that reason the result of parameter n tuning is the value 225. As in the previous tuning, in this case using greater values for parameter N also had only a slight impact and was therefore omitted. Performance for distinct image sizes was also tested by simply shrinking the original Shi's images and there was no significant difference in results.

Fig. 5: Performance of several parameter settings with respect to different sampling rates: (a) mean angle error, (b) median angle error.

D. Tuning r and c parameters

In order to retain the lowest achieved median and mean errors, the r and c parameter values can be raised up to 50, setting at the same time n to 225, as shown on Fig. 5. It is interesting to mention that even setting r and c to 200 raises the mean angular error only to 3.8°. However, as the median is less stable, the values of r and c should not exceed 50 in order to avoid a significant loss of accuracy, as after this point the median angle error rises to 2.9°.
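Several of the unsupervised baselines in the comparison that follows are simple image statistics; for reference, a minimal Gray-world estimator (illustrative only, not the exact implementation timed later in Section IV-F) can be written as:

```python
import numpy as np


def gray_world(img):
    """Gray-world illumination estimate: per-channel mean of a linear RGB image."""
    e = img.reshape(-1, 3).mean(axis=0)
    return e / np.linalg.norm(e)   # only the direction is used for the angular error
```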
E. Comparison to other algorithms

Table I shows the performance of various methods. For the Standard Deviation Weighted Gray-world (SDWGW) the images were subdivided into 100 blocks. Color Sparrow was used with parameters N = 1, n = 225, r = 50, c = 50 and with an averaging kernel of size 5 × 5. The performance of the other mentioned methods was taken from [16], where parameter values are also provided. The method described in [17] is not mentioned because the comparison would not be fair, since it was not designed for images recorded under an assumed single light source. It is possible to see that CS outperforms many methods and is therefore a suitable choice for white balancing.

TABLE I: Performance of different color constancy methods

method                      mean (°)   median (°)   trimean (°)   max (°)
do nothing                    13.7        13.6          13.5        27.4
Gray-world                     6.4         6.3           6.3        24.8
SDWGW                          5.4         4.9           4.9        22.9
Shades of gray                 4.9         4.0           4.2        22.4
Gray-edge                      5.1         4.4           4.6        23.9
Intersection-based Gamut       4.2         2.3           2.9        24.2
Pixel-based Gamut              4.2         2.3           2.9        23.2
HLVI                           3.5         2.5           2.6        25.2
proposed method                3.7         2.8           3.1        23.4

F. Speed comparison

As speed is an important property of white balancing algorithms, especially in digital cameras that have lower computation power, a speed test was also performed for CS. In order to compare the result of the speed test to something, a speed test was also performed for the Gray-world algorithm, because it is one of the simplest and fastest white balancing algorithms. For both speed tests a C++ implementation of both algorithms was run on the Windows 7 operating system on a computer with an i5-2500K CPU, and only one core was used. Color Sparrow was again used with parameters N = 1, n = 225, r = 50, c = 50 and with an averaging kernel of size 5 × 5. The algorithms were run several times on 100 images from Shi's version of the Color Checker database. The average time for several runs of the Gray-world algorithm was 2.9 s for 100 images and the average time for CS was 3.03 s. These results show that even though CS is slower, it still performs almost as fast as Gray-world, but with more accuracy.

V. CONCLUSION AND FUTURE RESEARCH

Color Sparrow is a relatively fast and accurate method for white balancing. Even though in its core it is a modification of RSR, it calculates only the global illumination estimation and it also outmatches several other white balancing methods. It has the advantage of being unsupervised and performing well under lower sampling rates, allowing a lower computation cost. This leads to the conclusion that using RSR for global illumination estimation has a good potential of being a fast and accurate unsupervised color constancy method. However, as the tests used in this paper were relatively simple, more exhaustive tests need to be performed in order to see if the accuracy can be improved even further. In the future it would be good to test the proposed method on other color constancy databases and to perform experiments with other types of areas around particular pixels.

ACKNOWLEDGMENTS

The authors acknowledge the advice of Arjan Gijsenij on proper testing and comparison of color constancy algorithms. This work has been supported by the IPA2007/HR/16IPO/001-040514 project "VISTA - Computer Vision Innovations for Safe Traffic."

REFERENCES

[1] M. Ebner, Color constancy. Wiley, 2007, vol. 6.
[2] G. Buchsbaum, "A spatial processor model for object colour perception," Journal of the Franklin Institute, vol. 310, no. 1, pp. 1-26, 1980.
[3] G. D. Finlayson and E. Trezzi, "Shades of gray and colour constancy," 2004.
[4] J.
Van De Weijer, T. Gevers, and A. Gijsenij, “Edge-based color constancy,” Image Processing, IEEE Transactions on, vol. 16, no. 9, pp. 2207–2214, 2007. [5] G. Finlayson, S. Hordley, and I. Tastl, “Gamut constrained illuminant estimation,” International Journal of Computer Vision, vol. 67, no. 1, pp. 93–109, 2006. [6] V. C. Cardei, B. Funt, and K. Barnard, “Estimating the scene illumination chromaticity by using a neural network,” JOSA A, vol. 19, no. 12, pp. 2374–2386, 2002. [7] J. Van De Weijer, C. Schmid, and J. Verbeek, “Using high-level visual information for color constancy,” in Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on. IEEE, 2007, pp. 1–8. [8] P. V. Gehler, C. Rother, A. Blake, T. Minka, and T. Sharp, “Bayesian color constancy revisited,” in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. IEEE, 2008, pp. 1–8. [9] M. Šavc, D. Zazula, and B. Potočnik, “A Novel Colour-Constancy Algorithm: A Mixture of Existing Algorithms,” Journal of the Laser and Health Academy, vol. 2012, no. 1, 2012. [10] H.-K. Lam, O. Au, and C.-W. Wong, “Automatic white balancing using standard deviation of RGB components,” in Circuits and Systems, 2004. ISCAS’04. Proceedings of the 2004 International Symposium on, vol. 3. IEEE, 2004, pp. III–921. [11] E. Provenzi, M. Fierro, A. Rizzi, L. De Carli, D. Gadia, and D. Marini, “Random spray retinex: a new retinex implementation to investigate the local properties of the model,” Image Processing, IEEE Transactions on, vol. 16, no. 1, pp. 162–171, 2007. [12] E. H. Land, J. J. McCann et al., “Lightness and retinex theory,” Journal of the Optical society of America, vol. 61, no. 1, pp. 1–11, 1971. [13] E. Provenzi, L. De Carli, A. Rizzi, and D. Marini, “Mathematical definition and analysis of the Retinex algorithm,” JOSA A, vol. 22, no. 12, pp. 2613–2621, 2005. [14] B. F. L. Shi. (2013, May) Re-processed Version of the Gehler Color Constancy Dataset of 568 Images. [Online]. Available: http://www.cs.sfu.ca/colour/data/ [15] A. Gijsenij, T. Gevers, and J. Van De Weijer, “Computational color constancy: Survey and experiments,” Image Processing, IEEE Transactions on, vol. 20, no. 9, pp. 2475–2489, 2011. [16] T. G. A. Gijsenij and J. van de Weijer. (2013, May) Color Constancy — Research Website on Illuminant Estimation. [Online]. Available: http://colorconstancy.com/ [17] A. Gijsenij, R. Lu, and T. Gevers, “Color constancy for multiple light sources,” Image Processing, IEEE Transactions on, vol. 21, no. 2, pp. 697–707, 2012. CCVW 2013 Oral Session 8 Proceedings of the Croatian Computer Vision Workshop, Year 1 September 19, 2013, Zagreb, Croatia Combining spatio-temporal appearance descriptors and optical flow for human action recognition in video data Karla Brkić∗ , Srd̄an Rašić∗ , Axel Pinz‡ , Siniša Šegvić∗ and Zoran Kalafatić∗ ∗ University of Zagreb, Faculty of Electrical Engineering and Computing, Unska 3, 10000 Zagreb, Croatia Email: [email protected] ‡ Graz University of Technology, Inffeldgasse 23/II, A-8010 Graz, Austria Email: [email protected] Abstract—This paper proposes combining spatio-temporal appearance (STA) descriptors with optical flow for human action recognition. The STA descriptors are local histogram-based descriptors of space-time, suitable for building a partial representation of arbitrary spatio-temporal phenomena. Because of the possibility of iterative refinement, they are interesting in the context of online human action recognition. 
We investigate the use of dense optical flow as the image function of the STA descriptor for human action recognition, using two different algorithms for computing the flow: the Farnebäck algorithm and the TVL1 algorithm. We provide a detailed analysis of the influencing optical flow algorithm parameters on the produced optical flow fields. An extensive experimental validation of optical flow-based STA descriptors in human action recognition is performed on the KTH human action dataset. The encouraging experimental results suggest the potential of our approach in online human action recognition. I. I NTRODUCTION Human action recognition is one of the central topics of interest in computer vision, given its wide applicability in human-computer interfaces (e.g. Kinect [1]), assistive technologies [2] and video surveillance [3]. Although action recognition can be done from static images, the focus of current research is on using video data. Videos offer insight into the dynamics of the observed behavior, but with the price of increased storage and processing requirements. Efficient descriptors that compactly represent the video data of interest are therefore a necessity. This paper proposes combining the spatio-temporal appearance (STA) descriptors [4] with two variants of dense optical flow for human action recognition. STA descriptors are a family of histogram-based descriptors that efficiently integrate temporal information from frame to frame, producing a fixedlength representation that is refined as new information arrives. As such, they are suitable for building a model of an action online, while the action is still happening, i.e. not all frames of the action are available. The topic of building a refinable action model is under-researched and of great practical importance, especially in video surveillance and assistive technologies, where it is important to raise an alarm as soon as possible if an action is classified as dangerous. Originally, STA descriptors were built on a per-frame basis, by calculating and histogramming the values of an arbitrary CCVW 2013 Oral Session image function (e.g. hue, gradient) over a regular grid within a region of interest in the frame. We propose extending this concept to pairs of frames, so that optical flow computed between pairs of frames is used as an image function whose values are histogrammed within the STA framework. We consider two variants of dense optical flow, the Farnebäck optical flow [5], and the TV-L1 optical flow [6]. II. R ELATED WORK There is a large body of research concerning human action recognition, covered in depth by a considerable number of survey papers (e.g. [7], [8], [9], [10]). Poppe [8] divides methods for representing human actions into two categories: local and global. In local representation methods, the video containing the action is represented as a collection of mutually independent patches, usually calculated around space-time interest points. In global representation methods, the whole video is used in building the representation. An especially popular category of local representation methods are based on a generalization of the 2D bag-of-visualwords framework [11] to spatio-temporal data. They model local appearance, often using optical flow, and commonly use histogram-based features similar to HOG [12] or SIFT [13]. In this overview we focus on these methods, as they bear the most resemblance to our approach. The standard processing chain the bag-of-visual-words methods use is summarized in [14]. 
It includes finding spatiotemporal interest points, extracting local spatio-temporal volumes around these points, representing them as features and using these features in a bag-of-words framework for classification. Interest point detectors and features used are most commonly generalizations of well-known 2D detectors and features. For instance, one of the earliest proposed spatiotemporal interest point detectors, proposed by Laptev and Lindeberg [15], is a spatio-temporal extension of the Harris corner detector [16]. Another example is the Kadir-Brady saliency [17], extended to temporal domain by Oikonomopoulos et al. [18]. Willems et al. [19] introduce a dense scale-invariant spatio-temporal interest point detector, a spatio-temporal counterpart of the Hessian saliency measure. However, there are 9 Proceedings of the Croatian Computer Vision Workshop, Year 1 space-time-specific interest point detectors as well, such as the one proposed by Dollár et al. [20]. Representations of spatio-temporal volumes extracted around interest points are typically histogram-based. For example, Dollár et al. [20] propose three methods of representing volumes as features: (i) by simply flattening the grayscale values in the volume into a vector, (ii) by calculating the histogram of grayscale values in the volume or (iii) by dividing the volume into a number of regions, constructing a local histogram for each region and then concatenating all the histograms. Laptev et al. [21] propose dividing each volume into a regular grid of cuboids using 24 different grid configurations, and representing each cuboid by normalized histograms of oriented gradient and optical flow. Willems et al. [19] build on the work of Bay et al. [22] and represent spatio-temporal volumes using a spatio-temporal generalization of the SURF descriptor. In the generalization, they approximate Gaussian derivatives of the second order with their box filter equivalents in space-time, and use integral video for efficient computation. Wang et al. [14] present a detailed performance comparison of several human action recognition methods outlined here. The actions are represented using combinations of six different local feature descriptors, and either three different spatiotemporal interest point detectors or dense sampling instead of using interest points. Three datasets are used for the experiments: the KTH action dataset [23], the UCF sports action dataset [24] and the highly complex Hollywood2 action dataset [25]. The best results are 92.1% for the KTH dataset (Harris3D + HOF), 85.6% for the UCF sports dataset (dense sampling + HOG3D), and 47.4% for the Hollywood2 dataset (dense sampling + HOG/HOF). Experiments indicate that regular sampling of space-time outperforms interest point detectors. Also note that histograms of optical flow (HOF) are the bestperforming features in 2 of 3 datasets. The use of STA descriptors [4] can offer a somewhat different perspective on the problem of human action recognition. To build an STA descriptor, one needs a person detector and a tracker, as the STA algorithm assumes that bounding boxes around the object of interest are known. Although this is a shortcoming when compared with the outlined bag-of-visualwords methods that do not need any kind of information about the position of the human, the STA descriptors come with an out-of-the box capability of building descriptions of partial actions, and are therefore worth considering. 
The bagof-visual-words methods could also be generalized to support partial actions, but the generalization is not straightforward. III. B UILDING STA S WITH OPTICAL FLOW A. Spatio-temporal appearance descriptors Spatio-temporal appearance (STA) descriptors [4] are fixedlength descriptors that represent a series of temporally related regions of interest at a given point in time. Two variants exist: STA descriptors of the first order (STA1) and STA descriptors of the second order (STA2). Let us assume that we have a series of regions of interest defined as bounding boxes of a human performing an action. The action need not be complete: CCVW 2013 Oral Session September 19, 2013, Zagreb, Croatia assume that the total duration of the action is T , and we have seen t < T frames. To build an STA descriptor, one first divides each available region of interest into a regular grid of rectangular patches. The size of the grid, m × n (m is the number of grid columns and n is the number of grid rows), is a parameter of the algorithm. One then calculates an arbitrary image function (e.g. hue, gradient) for all the pixels of each patch, and represents the distribution of the values of this function over the patch by a k1 -bin histogram. Therefore, for each available region of interest, one obtains an m × n grid of k1 -binned histograms, called grid histograms. Let g(θ) denote a m × n × k1 vector that contains concatenated histogram frequencies of all the m × n histograms of the grid in time θ, called the grid vector. The STA1 descriptor at time t is obtained by weighted averaging of grid vectors, STA1 (t) = t X αθ g(θ) . (1) θ=1 As a weighted average, the STA1 descriptor is an adequate representation of simpler spatio-temporal phenomena, but fails to capture the dynamics of complex behaviors such as human actions. When averaging, the information on the distribution of relative bin frequencies is lost. The STA2 descriptor solves this problem, by explicitly modeling the distribution of each grid (t) histogram bin value over time. Let the vector ci , called the component vector, be a vector of values of the i-th component g(θ) (i) of the grid vector g(θ) up to and including time t, 1 ≤ θ ≤ t: h iT (t) (2) ci = g(1) (i), g(2) (i), g(3) (i), . . . , g(t) (i) . The STA2 descriptor in time t is obtained by histogramming the m × n × k1 component vectors, so that each component vector is represented by a k2 -bin histogram, called the STA2 histogram. To obtain the final descriptor, one concatenates the bin frequencies of m × n × k1 STA2 histograms into a feature vector, h iT (t) (t) (t) STA2 (t) = Hk2 (c1 ), Hk2 (c2 ), . . . , Hk2 (cmnk1 ) . (3) Here the notation Hk2 (c) indicates a function that builds a k2 bin histogram of values contained in the vector c and returns a vector of histogram bin frequencies. Note that the STA1 descriptor has a length of m × n × k1 components, while the STA2 descriptor has a length of m × n × k1 × k2 components. B. Farnebäck and TV-L1 optical flow We consider using the following two algorithms for estimating dense optical flow: the algorithm of Farnebäck [5] and the TV-L1 algorithm [6]. Farnebäck [5] proposes an algorithm for estimating dense optical flow based on modeling the neighborhoods of each pixel by quadratic polynomials. The idea is to represent the image signal in the neighborhood of each pixel by a 3D surface, and determine optical flow by finding where the surface has moved in the next frame. The optimization is not 10 precision and running time. 
done on a pixel-level, but rather on a neighborhood-level, so that the optimum displacement is found both for the pixel and its neighbors.

The TV-L1 optical flow of Zach et al. [6] is based on a robust formulation of the classical Horn and Schunck approach [26]. It allows for discontinuities in the optical flow field and robustness to image noise. The algorithm efficiently minimizes a functional containing a data term using the L1 norm and a regularization term using the total variation of the flow [27].

C. Combining STAs and optical flow

The STA descriptors are built using grids of histograms of values of an arbitrary image function. Optical flow is an image function that considers pairs of images, and assigns a vector to each pixel of the first image. The main issue to be addressed in building spatio-temporal appearance descriptors that use optical flow is how to build histograms of vectors. The issue of building histograms of vectors is well-known, and addressed in e.g. HOG [12] or SIFT [13] descriptors. The idea is to divide the 360° interval of possible vector orientations into the desired number of bins, and then count the number of vectors falling into each bin. In HOG and SIFT descriptors, the vote of each vector is additionally weighted by its magnitude, so that vectors with greater magnitudes bear more weight. When building optical flow histograms, it is interesting to consider both the optical flow orientation histograms weighted by magnitude, and the "raw" optical flow orientation histograms, where there is no weighting by magnitude, i.e. all orientation votes are considered equal, as depending on the data the magnitude information can often be noisy.

Note that when combining STA descriptors with optical flow, there is an inevitable lack of flow information for the last frame of the sequence, as there are no subsequent frames. If one wishes to represent the entire sequence, the problem of the lack of flow can be easily solved by building the STA descriptors in the frame before last.
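As an illustration of this combination (not the original C++/OpenCV 2.4.5 code), the sketch below computes dense Farnebäck flow between two grayscale frames and turns the flow inside a bounding box into a grid vector of magnitude-weighted orientation histograms, followed by the STA2 step from Section III-A. The calcOpticalFlowFarneback signature shown is the one in recent OpenCV Python bindings, and the [0, 1] histogram range in sta2() is an assumption that holds because the cell histograms are normalized.

```python
import numpy as np
import cv2


def flow_grid_vector(prev_gray, next_gray, bbox, m=8, n=6, k1=8,
                     winsize=2, poly_n=5, poly_sigma=1.1):
    """Grid vector g(theta): per-cell, magnitude-weighted flow orientation histograms.

    bbox is (x, y, w, h) around the person; m, n are grid columns and rows;
    k1 is the number of orientation bins; winsize, poly_n and poly_sigma
    correspond to the paper's w, s and sigma.
    """
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, winsize, 3, poly_n, poly_sigma, 0)
    x, y, w, h = bbox
    roi = flow[y:y + h, x:x + w]                                  # (h, w, 2) flow vectors
    angle = np.arctan2(roi[..., 1], roi[..., 0]) % (2 * np.pi)
    magnitude = np.linalg.norm(roi, axis=2)

    cell_h, cell_w = h // n, w // m
    cells = []
    for r in range(n):
        for c in range(m):
            sl = np.s_[r * cell_h:(r + 1) * cell_h, c * cell_w:(c + 1) * cell_w]
            hist, _ = np.histogram(angle[sl], bins=k1,
                                   range=(0.0, 2 * np.pi), weights=magnitude[sl])
            total = hist.sum()
            cells.append(hist / total if total > 0 else hist)     # normalized cell histogram
    return np.concatenate(cells)                                   # length m * n * k1


def sta2(grid_vectors, k2=5):
    """STA2 descriptor: a k2-bin histogram of every grid-vector component over time."""
    g = np.asarray(grid_vectors, dtype=np.float64)                # shape (t, m*n*k1)
    t = g.shape[0]
    parts = [np.histogram(component, bins=k2, range=(0.0, 1.0))[0] / float(t)
             for component in g.T]                                # component vectors c_i
    return np.concatenate(parts)                                   # length m * n * k1 * k2
```

Stacking flow_grid_vector outputs for consecutive frame pairs of a sequence and passing them to sta2 yields the fixed-length representation used in the experiments below; a TV-L1 flow field of the same (H, W, 2) shape can be substituted for the Farnebäck call.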
IV. EXPERIMENTAL EVALUATION

In our experiments, we consider the standard benchmark KTH action dataset and first investigate the effect of optical flow parameters on the produced optical flow and the descriptivity of the derived STA descriptors. Using the findings from these experiments, we then use the best parameters of the flow and optimize the parameters of the STA descriptors to arrive at the final performance estimate. Our experiments include only the STA descriptors of the second order (STA2), as STA1 descriptors are of insufficient complexity to capture the dynamics of human actions.

A. The KTH action dataset

The KTH human action dataset [23] consists of videos of 25 actors performing six actions in four different scenarios. The actions are walking, jogging, running, boxing, hand waving and hand clapping, and the scenarios are outdoors, outdoors with scale variation, outdoors with different clothes and indoors. The videos are grayscale and of a low resolution (160 × 120 pixels). There is some variation in the viewpoint. The performance of the actions varies among actors, as does the duration. The background is static and homogeneous. A few example frames from the KTH dataset are shown in Fig. 1.

Fig. 1: A few example frames from the KTH dataset: (a) walking, (b) jogging, (c) running, (d) boxing, (e) waving, (f) clapping.

The sequences from the KTH action dataset do not come with per-frame bounding box annotations. Therefore, in this paper we use the publicly available annotations of Lin et al. [28]. These annotations were obtained automatically, using a HOG detector, and are quite noisy, often only partially enclosing the human.

B. Optimizing the parameters of optical flow

In order to compute the optical flow fields necessary for building the STA descriptors, we used the implementations of the Farnebäck and TV-L1 optical flow from OpenCV 2.4.5 [29]. Both implementations have a number of parameters that can influence the output optical flow and, in the end, classification performance. To evaluate the influence of individual optical flow parameters on the overall descriptivity of the representation, we set up a simple test environment based on a support vector machine (SVM) classifier. We fixed the parameters of the STA descriptor, using a grid of 8 × 6 patches, the number of grid histogram bins k1 = 8 and the number of STA2 histogram bins k2 = 5. We then performed a series of experiments where we built a number of STA2 descriptors of the data, varying the values of a single optical flow parameter for each descriptor built. The parameter evaluation procedure can be summarized as follows:

(i) For the given parameter set of the optical flow algorithm, calculate the STA2 descriptors over all sequences of the KTH action dataset.
(ii) Train the SVM classifier using 25-fold cross-validation, so that in each iteration the sequences of one person are used for testing, and the sequences of the remaining 24 persons for training.
(iii) Record the cross-validation performance.

We used an OpenCV implementation of a linear SVM classifier, with termination criteria of either 10^5 iterations or an error tolerance of 10^-12. Other classifier parameters were set to OpenCV defaults, as the goal was not to optimize performance, but to use it as a comparison measure for different optical flow parameters. The optical flow histograms were built both with and without magnitude-based weighting of orientation votes. The obtained results were slightly better when using weighting, so the use of weighting is assumed in all experiments presented here.
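A minimal sketch of this leave-one-person-out protocol (steps (i)-(iii)) is given below; scikit-learn's LinearSVC is used purely as a stand-in for the OpenCV linear SVM, and descriptors, labels and person_ids are assumed to be precomputed arrays with one entry per sequence.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import confusion_matrix


def leave_one_person_out(descriptors, labels, person_ids):
    """25-fold cross-validation where each fold holds out all sequences of one actor."""
    descriptors = np.asarray(descriptors)
    labels = np.asarray(labels)
    person_ids = np.asarray(person_ids)

    predicted = np.empty_like(labels)
    for person in np.unique(person_ids):
        test = person_ids == person
        classifier = LinearSVC()              # stand-in for the OpenCV linear SVM
        classifier.fit(descriptors[~test], labels[~test])
        predicted[test] = classifier.predict(descriptors[test])

    accuracy = float(np.mean(predicted == labels))
    return accuracy, confusion_matrix(labels, predicted)
```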
1) Farnebäck optical flow: For the Farnebäck algorithm, we evaluated the influence of three parameters: the averaging window size w, the size of the pixel neighborhood considered when finding the polynomial expansion in each pixel s, and the standard deviation σ of the Gaussian used to smooth derivatives in the polynomial expansion. The remaining parameters were set to their default OpenCV values. Visualizations of the computed optical flow when the parameters are varied, along with the corresponding obtained SVM recognition rates, are shown in Fig. 3, on an example frame of a running action. In this figure, we adhere to the color scheme for visualization of optical flow proposed by Sánchez et al. [27], where each amplitude and direction of optical flow is assigned a different color (see Fig. 2).

Fig. 2: An illustration of the optical flow coloring scheme that we use. Figure reproduced from [27].

Subfigures 3 (a)-(c) show the influence of the change of parameter w, with fixed values of s and σ, subfigures 3 (d)-(f) show the influence of the change of parameter s, with fixed values of w and σ, and subfigures 3 (g)-(i) show the influence of the change of parameter σ, with fixed values of w and s. It can be seen that the changes along any of the three parameter axes significantly impact the derived optical flow field. Still, the recognition rate in all cases is close to 80%, except when the averaging window size is set to 1, resulting in a recognition rate of 69%. As illustrated in Fig. 3 (a), the optical flow in that case is present only in pixels very close to the human contour, which seems to be insufficient information to build an adequately descriptive representation. Use of a too big averaging window (Fig. 3 (c)) results in noise, and seems to cause a drop in performance. The size of the considered pixel neighborhood does not seem to have a considerable impact on performance, indicating the robustness of the Farnebäck method. Increasing σ smooths the derived optical flow, but this blurring lowers performance, so the values of σ should be kept low.

2) TV-L1 optical flow: For the TV-L1 optical flow, we consider the following parameters: the weight parameter for the data term λ, the tightness parameter θ, and the time step of the numerical scheme τ.

Fig. 3: Obtained Farnebäck optical flow fields and corresponding obtained recognition rates (in parentheses) for varying values of: (a)-(c) parameter w (s = 5, σ = 1.1): (a) w = 1 (69.4%), (b) w = 2 (81.1%), (c) w = 5 (79.7%); (d)-(f) parameter s (w = 2, σ = 1.1): (d) s = 3 (80.9%), (e) s = 5 (81.1%), (f) s = 7 (81.0%); (g)-(i) parameter σ (w = 2, s = 5): (g) σ = 1.1 (81.1%), (h) σ = 1.3 (79.4%), (i) σ = 1.7 (78.0%).
Fig. 4 shows visualizations of the computed optical flow when the parameters are varied, along with the corresponding obtained SVM recognition rates. Subfigures 4 (a)-(c) show the influence of the change of parameter λ, with fixed values of θ and τ, subfigures 4 (d)-(f) show the influence of the change of parameter θ, with fixed values of λ and τ, and subfigures 4 (g)-(i) show the influence of the change of parameter τ, with fixed values of λ and θ. The computed optical flow is noticeably crisper than when using the Farnebäck method, and there is significantly more noise in the background. Setting a mid-range value of the parameter λ, which influences the smoothness of the derived optical flow, seems to provide the optimum balance between reducing background noise and retaining optical flow information. Parameter θ should be kept low, as increasing it results in blurring of the derived flow and loss of information on the contour of the human. Parameter τ does not seem to significantly impact performance, but a mid-range value performs best. Overall, Fig. 4 suggests that STA descriptors benefit from a crisp optical flow around the contour of the human and as low background noise as possible.

Fig. 4: Obtained TV-L1 optical flow fields and corresponding obtained recognition rates (in parentheses) for varying values of: (a)-(c) parameter λ (τ = 0.25, θ = 0.3): (a) λ = .01 (80.7%), (b) λ = .05 (82.2%), (c) λ = .30 (79.4%); (d)-(f) parameter θ (τ = 0.25, λ = 0.15): (d) θ = .1 (81.9%), (e) θ = .3 (80.1%), (f) θ = .5 (79.5%); (g)-(i) parameter τ (λ = 0.15, θ = 0.3): (g) τ = .05 (79.6%), (h) τ = .15 (80.1%), (i) τ = .35 (79.4%).

C. Optimizing the parameters of STA descriptors

Based on our analysis of the properties of individual optical flow parameters for the Farnebäck and the TV-L1 algorithm, we were able to select a good parameter combination that should result in solid classification performance. Still, one should also consider optimizing the parameters of the STA descriptors, which were fixed in the previous experiment (grid size of 8 × 6, k1 = 8, k2 = 5). Additionally, one should optimize the parameters of the used classifier in order to obtain the optimum performance estimate. We considered simultaneously optimizing STA and classifier parameters with the goal of finding the best cross-validation performance on the KTH action dataset, using STA2 descriptors built using either Farnebäck or TV-L1 optical flow. We fixed the parameters of the Farnebäck and the TV-L1 algorithms to the best-performing ones, as found in the previous section (Farnebäck: w = 2, s = 5, σ = 1.1; TV-L1: λ = .05, θ = .1, τ = .15). A Cartesian product of the following STA parameter values was considered: m = {3, 6}, n = {6, 8}, k1 = {4, 5, 8}, k2 = {5, 8}. For each parameter combination, we performed 25-fold cross-validation multiple times, doing an exhaustive search over the classifier parameter space to obtain optimum classifier parameters. Due to the heavy computational load involved, we switched from using an SVM to using a random forest classifier, because it is faster to train and offers performance that in our experiments turned out to be only slightly reduced when compared to the SVM. We optimized the number of trees, the number of features and the depth of the random forest classifier. A custom implementation based on the Weka library was used [30]. We repeated the experiments for both Farnebäck and TV-L1-based STA descriptors.

The best recognition rate obtained when using Farnebäck-based descriptors was 82.4%, obtained for an 8 × 6 grid, grid histograms of 8 bins and STA2 histograms of 5 bins. The same recognition rate is obtained when using 8 STA2 histogram bins, but we favor shorter representations. The confusion table for the best-performing Farnebäck-based descriptor is shown in Table I. Notice how the confusion centers around two groups: boxing, clapping and waving, and jogging, running and walking. Clapping, waving and boxing are visually similar, as they both include arm movement and static legs. Although in the KTH sequences the boxing action is filmed with the person facing sideways, so the arm movement should occur only on one side of the bounding box, due to noisy annotations it is common that the bounding box of a clapping or a waving action includes only one arm of a person, resulting in a similar motion pattern. The jogging, running, and walking actions are also similar, due to the movement of the legs and the general motion of the body. The greatest confusion is between jogging and running, which is understandable given the variations of performing these actions among actors. Some actors run very similarly to the jogging of others, and vice versa.

TABLE I: The confusion table for the best-performing classifier that uses STA2 feature vectors based on the Farnebäck optical flow to represent KTH action videos. Vertical axis: the correct class label, horizontal axis: distribution over predicted labels.

            Boxing  Clapping  Waving  Jogging  Running  Walking
Boxing         333        27      29        0        7        0
Clapping        34       324      36        0        1        0
Waving          28        34     335        0        1        0
Jogging          0         0       0      313       44       43
Running          2         0       0       84      304       10
Walking          1         0       0       22       17      360

For the TV-L1-based STA descriptors, the best obtained recognition rate was 81.6%, obtained for a 3 × 6 grid, with 8-bin grid histograms and 5-bin STA2 histograms. As less flow information is generated when using TV-L1, the STA descriptors seem to benefit from using a coarser grid than in the Farnebäck case. The confusion table for the best-performing TV-L1-based descriptor is shown in Table II. Again, the most commonly confused classes can be grouped into two groups: boxing, clapping and waving, and jogging, running and walking. The Farnebäck-based descriptors seem to be better in separating examples among these two groups (e.g. when using Farnebäck-based descriptors no jogging actions got classified as boxing, clapping and waving, while when using TV-L1-based descriptors 10 of them did).

TABLE II: The confusion table for the best-performing classifier that uses STA2 feature vectors based on the TV-L1 optical flow to represent KTH action videos. Vertical axis: the correct class label, horizontal axis: distribution over predicted labels.

            Boxing  Clapping  Waving  Jogging  Running  Walking
Boxing         337        43       9        0        5        2
Clapping        21       318      51        3        3        0
Waving          25        51     321        0        1        0
Jogging          7         3       0      287       71       32
Running         12         4       0       64      318        2
Walking          7         0       0       16        6      371

V. CONCLUSION

In this paper we proposed an extension of the STA descriptors with optical flow and applied the concept to the problem of human action recognition. A detailed experimental evaluation of two different optical flow algorithms has been provided, with an in-depth study of the properties of individual parameters of each algorithm. We obtained encouraging performance rates, with the descriptors based on the Farnebäck optical flow performing slightly better than the descriptors based on TV-L1. The obtained results suggest that a combination of STA descriptors and optical flow could be used as a feasible representation in a setting that requires partial action models. In [31], similar performance (83.4%) on the KTH action dataset was obtained using gradient-based STA descriptors, and that approach generalized well to a partial action setting. Although better performance rates have been obtained on the KTH dataset [32], our approach is simple, easily extended to other applications, and suitable for building a refinable representation. Therefore, we believe that it merits further investigation. In future work, we plan to obtain better annotations of humans in video in hopes of improving the overall performance, explore the suitability of optical flow-based STAs to partial action data, and train an SVM classifier with optimized parameters and compare it with the random forest classifier.

ACKNOWLEDGMENT

This research has been supported by the Research Centre for Advanced Cooperative Systems (EU FP7 #285939).

REFERENCES

[1] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake, "Real-Time Human Pose Recognition in Parts from Single Depth Images," 2011. [Online]. Available: http://research.microsoft.com/apps/pubs/default.aspx?id=145347
[2] R. Manduchi, S. Kurniawan, and H. Bagherinia, "Blind guidance using mobile computer vision: A usability study," in ACM SIGACCESS Conference on Computers and Accessibility (ASSETS), 2010. [Online]. Available: http://users.soe.ucsc.edu/~manduchi/papers/assets1065o-manduchi.pdf
[3] C. Chen and J.-M. Odobez, "We are not contortionists: Coupled adaptive learning for head and body orientation estimation in surveillance video," in Proc. CVPR, 2012, pp. 1544-1551.
[4] K. Brkić, A. Pinz, S. Šegvić, and Z. Kalafatić, "Histogram-based description of local space-time appearance," in SCIA, ser. Lecture Notes in Computer Science, A. Heyden and F. Kahl, Eds., vol. 6688. Springer, 2011, pp. 206-217. [Online]. Available: http://dblp.uni-trier.de/db/conf/scia/scia2011.html#BrkicPSK11
[5] G. Farnebäck, "Two-frame motion estimation based on polynomial expansion," in Proceedings of the 13th Scandinavian Conference on Image Analysis, ser. SCIA'03. Berlin, Heidelberg: Springer-Verlag, 2003, pp. 363-370.
[6] C. Zach, T. Pock, and H. Bischof, "A duality based approach for realtime TV-L1 optical flow," in Proceedings of the 29th DAGM Conference on Pattern Recognition. Berlin, Heidelberg: Springer-Verlag, 2007, pp. 214-223.
[7] V. Krüger, D. Kragić, A. Ude, and C. Geib, "The Meaning of Action: A Review on action recognition and mapping," Advanced Robotics, vol. 21, no. 13, pp. 1473-1501, 2007.
[8] R. Poppe, "A survey on vision-based human action recognition," Image and Vision Computing, vol. 28, no. 6, pp. 976-990, 2010.
[9] D. Weinland, R. Ronfard, and E. Boyer, "A survey of vision-based methods for action representation, segmentation and recognition," Computer Vision and Image Understanding, vol. 115, no. 2, pp. 224-241, 2011.
[10] J. Aggarwal and M. Ryoo, "Human activity analysis: A review," ACM Computing Surveys, vol. 43, no. 3, pp. 16:1-16:43, 2011. [Online]. Available: http://doi.acm.org/10.1145/1922649.1922653
[11] G. Csurka, C. R. Dance, L. Fan, J. Willamowski, and C.
Bray, “Visual categorization with bags of keypoints,” in Proceedings of the European Conference on Computer Vision (ECCV), 2004, pp. 1–22. [12] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Proceedings of the International Conference on Pattern Recognition, C. Schmid, S. Soatto, and C. Tomasi, Eds., vol. 2, INRIA Rhône-Alpes, ZIRST-655, av. de l’Europe, Montbonnot-38334, 2005, pp. 886–893. [Online]. Available: http://lear.inrialpes.fr/pubs/2005/DT05 [13] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004. [Online]. Available: http://dx.doi.org/10.1023/B: VISI.0000029664.99615.94 [14] H. Wang, M. M. Ullah, A. Kläser, I. Laptev, and C. Schmid, “Evaluation of local spatio-temporal features for action recognition,” in BMVC, 2009. [15] I. Laptev and T. Lindeberg, “Space-time interest points,” in Proc. ICCV, 2003, pp. 432–439. [16] C. Harris and M. Stephens, “A combined corner and edge detector,” in Proceedings of Fourth Alvey Vision Conference, 1988, pp. 147–151. [17] T. Kadir and M. Brady, “Saliency, scale and image description,” International Journal of Computer Vision, vol. 45, no. 2, pp. 83–105, 2001. [Online]. Available: http://dx.doi.org/10.1023/A:1012460413855 [18] A. Oikonomopoulos, I. Patras, and M. Pantić, “Spatiotemporal salient points for visual recognition of human actions,” Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 36, no. 3, pp. 710–719, 2005. [Online]. Available: http://dx.doi.org/10.1109/TSMCB. 2005.861864 [19] G. Willems, T. Tuytelaars, and L. Van Gool, “An efficient dense and scale-invariant spatio-temporal interest point detector,” in Proceedings of the European Conference on Computer Vision (ECCV). Berlin, Heidelberg: Springer-Verlag, 2008, pp. 650–663. [Online]. Available: http://dx.doi.org/10.1007/978-3-540-88688-4_48 [20] P. Dollár, V. Rabaud, G. Cottrell, and S. Belongie, “Behavior recognition via sparse spatio-temporal features,” in VS-PETS, 2005. [21] I. Laptev, M. Marszałek, C. Schmid, and B. Rozenfeld, “Learning realistic human actions from movies,” in Proc. CVPR, 2008. [Online]. Available: http://lear.inrialpes.fr/pubs/2008/LMSR08 [22] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, “Speeded-up robust features (SURF),” Computer Vision and Image Understanding, vol. 110, no. 3, pp. 346–359, 2008. [Online]. Available: http: //dx.doi.org/10.1016/j.cviu.2007.09.014 [23] C. Schüldt, I. Laptev, and B. Caputo, “Recognizing human actions: A local SVM approach,” in Proceedings of the International Conference on Pattern Recognition, 2004, pp. 32–36. [Online]. Available: http://dx.doi.org/10.1109/ICPR.2004.747 [24] M. D. Rodriguez, J. Ahmed, and M. Shah, “Action MACH: a spatiotemporal maximum average correlation height filter for action recognition,” in Proc. CVPR, 2008. [25] M. Marszałek, I. Laptev, and C. Schmid, “Actions in context,” in Proc. CVPR, 2009. [Online]. Available: http://lear.inrialpes.fr/pubs/ 2009/MLS09 [26] B. K. P. Horn and B. G. Schunck, “Determining optical flow,” Artificial Intelligence, vol. 17, pp. 185–203, 1981. [27] J. Sánchez Pérez, E. Meinhardt-Llopis, and G. Facciolo, “TV-L1 Optical Flow Estimation,” Image Processing On Line, vol. 2013, pp. 137–150, 2013. [28] Z. Lin, Z. Jiang, and L. S. Davis, “Recognizing actions by shape-motion prototype trees,” in Proc. ICCV, 2009, pp. 444–451. [29] G. Bradski, “The OpenCV Library,” Dr. Dobb’s Journal of Software Tools, 2000. [30] M. Hall, E. 
Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, “The WEKA data mining software: an update,” ACM SIGKDD Explorations Newsletter, vol. 11, no. 1, pp. 10–18, 2009. [31] K. Brkić, “Structural analysis of video by histogram-based description of local space-time appearance,” Ph.D. dissertation, University of Zagreb, Faculty of Electrical Engineering and Computing, 2013. [32] S. Sadanand and J. J. Corso, “Action bank: A high-level representation of activity in video,” Proc. CVPR, vol. 0, pp. 1234–1241, 2012. 14 Proceedings of the Croatian Computer Vision Workshop, Year 1 September 19, 2013, Zagreb, Croatia An Overview and Evaluation of Various Face and Eyes Detection Algorithms for Driver Fatigue Monitoring Systems Markan Lopar and Slobodan Ribarić Department of electronics, microelectronics, computer and intelligent systems Faculty of Electrical Engineering and Computing, University of Zagreb Zagreb, Croatia [email protected], [email protected] Abstract—In this work various methods and algorithms for face and eyes detection are examined in order to decide which of them are applicable for use in a driver fatigue monitoring system. In the case of face detection the standard Viola-Jones face detector has shown best results, while the method of finding the eye centers by means of gradients has proven to be most appropriate in the case of eyes detection. The later method has also a potential for retrieving behavioral parameters needed for estimation of the level of driver fatigue. This possibility will be examined in future work. Keywords—driver fatigue monitoring; face detection; eyes detection; behavioral parameters I. INTRODUCTION The driver’s loss of attention due to drowsiness or fatigue is one of the major contributors in road accidents. According to National Highway Traffic Safety Administration [1] more than 100,000 crashes per year in United States are caused by drowsy driving. This is a conservative estimation, because the drowsy driving is underreported as a cause of crashes. Also the accidents caused by driver’s inattention are not accounted. According to some French statistics for 2003 [2], more accidents were caused by inattention due to fatigue or sleep deprivation than by alcohol or drug intoxication. Therefore, significant efforts were made over the past two decades to develop robust and reliable safety systems intended to reduce the number of accidents, as well as to introduce new, or improve existing techniques and technologies for development of these systems. Different techniques are used in driver-fatigue monitoring systems. These techniques are divided into four categories [3]. The first category includes intrusive techniques, which are mostly based on monitoring biomedical signals, and therefore require physical contact with the driver. The second category includes non-intrusive techniques based on visual assessment of driver’s bio-behavior from face images. The third category includes methods based on driver’s performance, which monitor vehicle behaviors such as moving course, steering angle, speed, braking, etc. Finally, the fourth category combines techniques from the abovementioned three categories. The computer vision based techniques from the second category are particularly effective, because the drowsiness can be detected by observing the facial features and visual biobehavior such as head position, gaze, eye openness, eyelid movements, and mouth openness. 
This category encompasses a wide range of methods and algorithms with respect to data acquisition (standard, IR and stereo cameras), image processing and feature extraction (Gabor wavelets, Gaussian derivatives, Haar-like masks, etc.), and classification (neural networks, support vector machines, knowledge-based systems, etc.). A widely accepted parameter that is measured by most of methods is percentage of eye closure over time [4], commonly referred as PERCLOS. This measure was established by in [5] as the proportion of time in a specific time interval during which the eyes are at least 80% closed, not including eye blinks. Also, there are other parameters that can be measured such as average eye closure speed (AECS) proposed by Ji and Yang [6]. The initial task in development of driver fatigue monitoring system is to build a module for reliable real-time detection and tracking of facial features such as face, eyes, mouth, gaze, and head movements. This paper examines some of existing techniques for face and eyes detection. II. RELATED WORKS Many efforts in about past twenty years were made to develop feature detection systems for purpose of driver fatigue monitoring. One of widely popular methods is to use active illumination for eye detection. Grace et al. [7] have used special camera for capturing images of a driver at different wavelengths. Using the fact that at different wavelengths human retina reflects different amount of light, they subtract two images of a driver taken in this way in order to obtain an image which contains non-zero pixels at the eyes locations. Many variations of this image subtracting technique are implemented later. One such notable implementation is presented by Ji and Yang [6]. In their work they use active infra-red illumination system which consists of CCD camera with two sets of infra-red LEDs distributed evenly and This work is supported by EU-funded IPA project “VISTA – Computer Vision Innovations for Safe Traffic”. CCVW 2013 Oral Session 15 Proceedings of the Croatian Computer Vision Workshop, Year 1 symmetrically along the circumference of two coplanar and concentric rings, where the center of both rings coincides with the camera optical axis. The light from two rings enters the eye at different angle resulting in images with bright pupil effect in case when the inner ring is turned on, and with dark pupil effect when the outer ring is switched on. The resulting images are subtracted in order to obtain the eyes positions. This method of eye detection is used by many of other researchers [8, 9]. While the abovementioned approaches use active illumination and as such rely on special equipment, there are also methods that employ other strategies. Thus Smith, Shah, and da Vitoria Lobo [10] present a system which relies on estimation of global motion and color statistics to track a person’s head and facial features. For the eyes and lips detection they employ color predicates, histogram-like structures where the cells are labeled as either a members or non-members of desired color range [11]. Another work that uses color information for facial features detection is presented by Rong-ben, Ke-you, Shu-ming, and Jiang-wei [12]. Their method is based on the assumption that in RGB color space red and green components of human skin follow Gaussian planar distribution. The R and G values of each pixel are examined, and if they fall within 3 standard deviations of R and G’s means, the pixel is considered to belong to skin. 
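As a hedged illustration of this per-pixel rule, the 3-standard-deviation test could be written as below. The mean and standard deviation values are placeholders that would have to be estimated from training skin samples, and the independent per-channel test is our simplified reading of the Gaussian model used in [12], not code from that work.

```python
# Sketch of the 3-sigma skin test on the R and G channels (statistics are placeholders).
import numpy as np

def skin_mask(image_rgb: np.ndarray,
              mean_r: float = 150.0, std_r: float = 25.0,
              mean_g: float = 100.0, std_g: float = 20.0) -> np.ndarray:
    r = image_rgb[..., 0].astype(np.float32)
    g = image_rgb[..., 1].astype(np.float32)
    within_r = np.abs(r - mean_r) <= 3.0 * std_r
    within_g = np.abs(g - mean_g) <= 3.0 * std_g
    return within_r & within_g   # True where the pixel is treated as skin
```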
When the face is localized the Gabor wavelets are applied for extracting the eye features. The same procedure is later applied for mouth detection [13]. D’Orazio, Leo, Spagnolo, and Guaragnella [14] present the method for eyes detection using operator based on Circle Hough Transform [15]. This operator is applied on the entire image and the result is a maximum that represents the region possibly containing an eye. The second eye is then searched in two opposite directions compatible with the range of possible eyes position concerning the distance and orientation between the eyes. The obtained results are subjected to testing of similarity, which is evaluated by calculating the mean absolute error applied on mirrored domains. If this similarity measure falls below certain threshold, the regions are considered the best match for eye candidate. Ribarić, Lovrenčić, and Pavešič [3] have presented a driver fatigue monitoring system which combines a neural-networkbased feature extraction module with knowledge-based inference module. The feature extraction module consists of several components that perform various tasks such as face detection, detection of in-plane and out-of-plane head rotations, and estimation of eye and mouth openness. The face detection procedure uses combined appearance-based and neural network approaches. The first step of algorithm applies a hybrid method based on some features of the HMAX model [16] and ViolaJones face detector [17]: A linear template matching is performed using fixed-size Haar-like masks, and then a nonlinear MAX operator is applied in order to gain invariance to transformations of an input image. The features obtained by applying MAX operator are used as inputs to the 3-layered perceptron in the second step of face detection algorithm. These features are chosen from fixed-size window that slides through the maps of features, and the value of a single output neuron for each position of the sliding window is recorded in CCVW 2013 Oral Session September 19, 2013, Zagreb, Croatia an activation map. The activation map is binarized afterwards, and the center of gravity of largest region with non-zero value is considered to be the center of detected face. The estimation of the angle of the in-plane rotation is implemented using 5layered convolutional neural network. The inputs are the features obtained in the face detection routine before applying MAX operator. The output layer has 36 neurons which correspond to various rotation angles. The estimation of the angle of the out-of-plane rotation is carried out in a similar manner, except that the angles of pan and tilt rotations are estimated from horizontal and vertical offsets from the center of the face and the point that lies in the intersection of the nose symmetry axis and the line connecting the eyes. The location of the eyes and mouth is determined on the basis of the center of the face, as well as the angles of in-plane and out-of-plane rotations. Two separate convolutional neural networks are employed to determine the eye and mouth openness. Two more approaches are presented here which deal with the problem of eyes detection. Though they are originally not discussed in the context of driver fatigue monitoring, they can nevertheless be applied for that purpose. The first work is by Valenti and Gevers [18]. Their approach is based on the observation that eyes are characterized by radially symmetric brightness pattern, and isophotes are used to infer the center of circular patterns. 
The center is obtained by a voting mechanism which increases the weight of important votes in order to reinforce the center estimates. The second method is proposed by Timm and Barth [19]. It is a relatively simple method based on the fact that all gradients of a circle intersect at its center. Thus the eye centers can be obtained at locations where most of the image gradients intersect. Moreover, this method is robust enough to detect partially occluded circular objects, e.g. half-open eyes. This is a great advantage over the Circle Hough Transform, which would fail in this case.

III. ANALYSIS OF VARIOUS METHODS FOR FACE AND EYES DETECTION

The aim of this work is to test and compare some methods for face and eyes detection and to decide which of them are feasible for implementation in a driver fatigue monitoring system. Particular attention has been paid to the accuracy of feature detection and the ability to operate in real time. The selected methods are tested on video sequences captured with a web camera in dark and bright ambient conditions and with various background scenery. The tests are performed on a machine with two cores, each working at 2.4 GHz. Since extensive quantitative measurements are not available at the moment, the obtained results are presented in descriptive form only. These results are acquired by visual inspection: the accuracy is estimated by observing the frequency of correctly detected features, and the feasibility for real-time processing is evaluated with regard to how smoothly the respective algorithm runs.

A. Face Detection

Face detectors are implemented using several methods. These methods include face detection by means of searching for ellipses, detection using SIFT [20] and SURF features [21], application of the Viola-Jones face detector [17], and use of the feature extraction module described in [3]. Since the human head is shaped close to an ellipse, it is convenient to try to detect it by searching for ellipses in a given image. Ellipses can be found by the Generalized Hough Transform [22], but this procedure is computationally intensive. Prakash and Rajesh [23] have recently presented a method for ellipse detection which reduces the whole issue to an application of the Circle Hough Transform, and this method is used in this work. Nevertheless, it is not fast enough to operate in real time. Also, since there are lots of false positives, it is necessary to implement some learning procedure, and this additionally slows down the detection.

SIFT and SURF are well-known algorithms for object matching. Both of these methods perform well in terms of matching facial features between the template and the target image, with SURF running slightly faster. Unfortunately, both of them have one major drawback: they are very sensitive to even small changes in illumination conditions. An attempt was made to circumvent this problem by applying histogram equalization and histogram fitting [24], but with little success, since in that case the detection runs much slower.

The Viola-Jones detector is a well-known face detector that has found its use in many applications involving face detection. It performs very well in terms of both accuracy and the speed needed for real-time processing. The facial feature extractor described in [3], which implements some features of the Viola-Jones detector, also performs the task of face detection very well. The advantages and drawbacks of each face detector are summarized in Table I.

TABLE I. ADVANTAGES AND DISADVANTAGES OF VARIOUS FACE DETECTION METHODS
Ellipse detection. Advantages: matches well the shape of the human head. Disadvantages: very slow, high number of false positives.
SIFT. Advantages: an excellent method for object matching. Disadvantages: very high sensitivity to changes in illumination conditions.
SURF. Advantages: an excellent method for object matching, faster than SIFT. Disadvantages: very high sensitivity to changes in illumination conditions.
Viola-Jones. Advantages: accurate and very fast. Disadvantages: small number of false positives, needs learning.
Ribarić, Lovrenčić, Pavešič [3]. Advantages: accurate and fast. Disadvantages: small number of false positives, long training procedure.

B. Eyes Detection

Four methods are used for the eyes detection. These include the Viola-Jones eyes detector, the algorithm described by Valenti and Gevers [18], the approach proposed by Timm and Barth [19], and the feature extractor by Ribarić, Lovrenčić, and Pavešič [3]. All of these methods rely on a previously detected face, and for the purpose of face detection the Viola-Jones detector is used, since it performs best. The exception is the method by Ribarić, Lovrenčić, and Pavešič, since it uses its own face detector. The Viola-Jones detector performs reasonably well, but it produces a slightly higher number of false positives, especially in the regions of the mouth corners. The detector described by Valenti and Gevers performs excellently regarding both speed and accuracy. It never fails to find the eyes, but it is not perfect in the accurate localization of eye centers, since it more often than not misdetects the eye corner as the eye center. The detector by Timm and Barth performs the best of all four eye detectors. It runs slightly slower than the detector by Valenti and Gevers, but it almost never fails to accurately find the eye center. Finally, the feature extractor by Ribarić, Lovrenčić, and Pavešič is not as good in the task of eye detection as in the case of face detection. The cause of this lies in the dependency chain of detected features. The detected face is a prerequisite for in-plane rotation detection. The out-of-plane rotation detection needs the previously detected in-plane rotation, and finally the eyes and mouth detection takes place after the out-of-plane rotation is detected. Thus the detection error grows larger in each step of this cascade, and it is highest at the end of the chain, i.e. in the case of eyes and mouth detection. The advantages and drawbacks of each eyes detector are summarized in Table II.

TABLE II. ADVANTAGES AND DISADVANTAGES OF VARIOUS EYES DETECTION METHODS
Viola-Jones. Advantages: accurate and very fast. Disadvantages: has more false positives than the face detector, needs learning.
Valenti and Gevers. Advantages: reliable and very fast. Disadvantages: misdetects the eye corners as the eye centers.
Timm and Barth. Advantages: very accurate in eye center detection. Disadvantages: slower than the detector by Valenti and Gevers.
Ribarić, Lovrenčić, Pavešič [3]. Advantages: fast detector. Disadvantages: lower accuracy due to the feature dependency chain.

IV. CONCLUSIONS AND FUTURE WORK

In this work we have tested some methods for face and eyes detection in order to decide which of them is the best for implementation in a driver fatigue monitoring system. The Viola-Jones face detector has proven to be the fastest and most accurate among the face detectors, while the algorithm proposed by Timm and Barth outperformed the rest of the eyes detectors. The eyes detection method described by Timm and Barth also has one interesting property that may be utilized in driver fatigue monitoring systems.
Namely, by simple counting the number of intersecting gradients and comparing it to some referent value it is possible to determine the degree of eyes openness/closeness and to calculate some behavioral parameters such as PERCLOS. This possibility will definitely be examined in the future. 17 Proceedings of the Croatian Computer Vision Workshop, Year 1 The future work also includes researching and development of methods for mouth detection and determination of degree of mouth openness. The use of thermovision camera for image acquisition is also planned in the future in order to examine whether some facial features can be obtained from images acquired in this way. Thermovision cameras are at the moment too expensive to be used in commercial driver fatigue monitoring systems, but their potential usefulness nevertheless may and should be examined in researching activities. After the facial features are obtained, a certain number of behavioral parameters – such as PERCLOS, degree of mouth openness, degree of head rotation, and gaze estimation – will be extracted from facial features. These parameters will be used in inference module which should decide whether the driver is fatigued or not, and issue an appropriate warning or alarm signal if it is necessary. In any case, the task of monitoring driver fatigue is a challenge to be handled, and many efforts should be made to deal with it successfully. September 19, 2013, Zagreb, Croatia [9] [10] [11] [12] [13] [14] [15] REFERENCES [1] [2] [3] [4] [5] [6] [7] [8] Research on Drowsy Driving, National Highway Traffic Safety Administration, [online] Available: http://www.nhtsa.gov/Driving+Safety/Distracted+Driving/Research+on +Drowsy+Driving M. Kutila, “Methods for Machine Vision Based Driver Monitoring Applications”, VTT Technical Research Centre of Finland, pp. 1-70, 2006. S. Ribarić, J. Lovrenčić, N. Pavešić, “A Neural-Network-Based System for Monitoring Driver Fatigue”, MELECON 2010 – 2010 15th IEEE Mediterranean Electrotechnical Conference, pp. 1356-1361, Apr. 2010, Valletta. R. Knipling, and P. Rau, “PERCLOS: A valid psychophysiological measure of alertness as assessed by psychomotor vigilance”, Office of Motor Carriers, FHWA, Tech. Rep. FHWA-MCRT-98-006, Oct. 1998, Washington DC. W. W. Wierwille, L. A. Ellsworth, S. S. Wreggit, R. J. Fairbanks, C. L. Kirn, “Research on Vehicle-Based Driver Status/Performance Monitoring: Development, Validation, and Refinement of Algorithms for Detection of Driver Drowsiness”, National Highway Traffic Safety Administration Final Report: DOT HS 808 247, 1994. Q, Ji, and X. Yang, “Real-Time Eye, Gaze, and Face Pose Tracking for Monitoring Driver Vigilance”, Real-Time Imaging 8, pp. 357-377, 2002. R. Grace, V. E. Byrne, D. M. Bierman, J.-M. Legrand, D. Gricourt, B. K. Davis, J. J. Staszewski, B. Carnahan, “A Drowsy Driver Detection System for Heavy Vehicles”, Proc. of the 17th DASC. AIAA/IEEE/SAE. Digital Avionics System Conference, vol. 2, issue 36, pp. 1-8, Nov. 1998, Bellevue, WA. H. Gu, Q. Ji, Z. Zhu, “Active Facial Tracking for Fatigue Detection”, Proc. of the 6th IEEE Workshop on Applications of Computer Vision, pp. 137-142, Dec. 2002, Orlando, FL. CCVW 2013 Oral Session [16] [17] [18] [19] [20] [21] [22] [23] [24] Q. Ji, Z. Zhu, P. Lan, “Real-Time Nonintrusive Monitoring and Prediction of Driver Fatigue”, IEEE Transactions on Vehicular Technology, vol. 53, No. 4, pp. 1052-1068, July 2004. P. Smith, M. Shah, N. 
da Vitoria Lobo, “Determining Driver Visual Attention with One Camera”, IEEE Transactions on Intelligent Transportation Systems, vol. 4, No. 4, pp. 205-218, December 2003. R. Kjeldsen, and J. Kender, “Finding Skin in Color Images”, Proc. of the 2nd International Conference on Automatic Face and Gesture Recognition, pp. 312-317, Oct. 1996, Killington, VT. W. Rong-ben, G. Ke-you, S. Shu-ming, C. Jiang-wei, “A Monitoring Method of Driver Fatigue Behavior Based on Machine Vision”, Intelligent Vehicles Symposium, 2003. Proceedings. IEEE, pp. 110-113, Jun. 2003. W. Rongben, G. Lie, T. Bingliang, J. Lisheng, “Monitoring Mouth Movement for Driver Fatigue or Distraction with One Camera”, Proc. of the 7th International IEEE Conference on Intelligent Transportation Systems, pp. 314-319, Oct. 2004, Washington, DC. T. D’Orazio, M. Leo, P. Spagnolo, C. Guaragnella, “A Neural System for Eye Detection in a Driver Vigilance Application”, Proc. of the 7th International IEEE Conference on Intelligent Transportation Systems, pp. 320-325, Oct. 2004, Washington, DC. R.O. Duda, P.E. Hart, “Use of the Hough Transform to detect lines and curves in pictures,” Comm. ACM 15, pp. 11-15, 1972. T. Serre, L. Wolf, T. Poggio, “Object Recognition with Features Inspired by Visual Cortex”, IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), vol. 2, pp. 9941000, Jun. 2005, San Diego, CA. P. Viola, and M. Jones, “Robust Real-Time Object Detection”, 2nd International Workshop on Statistical and Computational Theories of Vision – Modeling, Learning, Computing, and Sampling, Jul. 2001, Vancouver. R. Valenti, and T. Gevers, “Accurate Eye Center Location Through Invariant Isocentric Patterns”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, pp. 1785-1798, Sept. 2012. F. Timm, and E. Barth, “Accurate eye centre localisation by means of gradients”, In: Int. Conference on Computer Vision Theory and Applications (VISAPP), vol. 1, pp. 125-130, March 2011, Algarve, Portugal. D.G. Lowe, “Object recognition from local scale-invariant features”, Proceedings of the Seventh IEEE International Conference on Computer Vision, vol 2, pp. 1150-1157, Sept. 1999, Kerkyra, Greece. H. Bay, A. Ess, T. Tuytelaars, L. van Gool, “SURF: Speeded up robust features”, Computer Vision and Image Understanding (CVIU), vol. 110, No. 3, pp. 346-359, 2008. D.H. Ballard, “Generalizing the Hough Transform to detect arbitrary shapes”, Pattern Recognition, vol. 13, No. 2, pp. 111-122, 1981. J. Prakash, and K. Rajesh, “Human Face Detection and Segmentation using Eigenvalues of Covariance Matrix, Hough Transform and Raster Scan Algorithms”, World Academy of Science, Engineering and Technology 15, 2008. R.C. Gonzalez, and R.E. Woods, “Digital image processing”, Prentice Hall, 3rd ed., 2007. 18 Proceedings of the Croatian Computer Vision Workshop, Year 1 September 19, 2013, Zagreb, Croatia Classifying traffic scenes using the GIST image descriptor Ivan Sikirić Karla Brkić, Siniša Šegvić Mireo d.d. Buzinski prilaz 32, 10000 Zagreb e-mail: [email protected] University of Zagreb Faculty of Electrical Engineering and Computing e-mail: [email protected] Abstract—This paper investigates classification of traffic scenes in a very low bandwidth scenario, where an image should be coded by a small number of features. 
We introduce a novel dataset, called the FM1 dataset, consisting of 5615 images of eight different traffic scenes: open highway, open road, settlement, tunnel, tunnel exit, toll booth, heavy traffic and the overpass. We evaluate the suitability of the GIST descriptor as a representation of these images, first by exploring the descriptor space using PCA and k-means clustering, and then by using an SVM classifier and recording its 10-fold cross-validation performance on the introduced FM1 dataset. The obtained recognition rates are very encouraging, indicating that the use of the GIST descriptor alone could be sufficiently descriptive even when very high performance is required. I. I NTRODUCTION This paper aims to improve current fleet management systems by introducing visual data obtained by cameras mounted inside vehicles. Fleet management systems track the position and monitor the status of many vehicles in a fleet, to improve the safety of their cargo and to prevent their unauthorized and improper use. They also accumulate this data for the purpose of generating various reports which are then used to optimise the use of resources and minimise the expenses. Fleet management systems usually consist of a central server to which hundreds of simple clients report their status. Clients are typically inexpensive embedded systems placed inside a vehicle, equipped with a GPS chip, a GPRS modem and optional additional sensors. Their purpose is to inform the server of the vehicle’s current position, speed, fuel level, and other relevant information in regular intervals. Our contribution is introduction of visual cues to the vehicle’s status. The server could use these cues to infer the properties of the vehicle’s surroundings, which would help it in the further decision making. For example, server could infer the location of the vehicle (e.g. open road, tunnel, gas station), or cause of stopping (e.g. congestion, traffic lights, road works). In cases of missing or inaccurate GPS data, the fleet management server could use the location inferred from visual data to determine which of the several possible vehicle routes is the correct one. In some cases the server is not even aware of the loss of GPS precision, and fails to discard improbable data. This usually occurs in closed environments and near tall objects. Detecting such scenarios using visual data would be very beneficial. Also, detecting the cause of losing the GPS signal and the cause of vehicle stopping is important in systems that offer real-time tracking of vehicles used in transporting valuables. CCVW 2013 Oral Session We aim to solve this problem by learning a classification model for each scene or scenario. Transmitting the entire image taken from a camera in every status report would raise the size of transmitted data by several orders of magnitude (typically the size of status is less than a hundred bytes), which would be too expensive. This can be resolved by calculating the descriptor of the image on the client itself, before transmitting it to the server for further analysis. We chose to use the GIST descriptor by Oliva and Torralba [1] because it describes the shape of the scene using low dimensional feature vectors, and it performs well in scene classification setups. Our second contribution is introduction of a new dataset, called the FM1 dataset, containing 5615 traffic scenes and associated labels. Using this dataset we perform experimental evaluation of the proposed method and report some preliminary results. 
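To make the bandwidth argument concrete, a client-side sketch of the intended usage might look as follows. The 512-dimensional GIST vector, the 8-bit quantization and the payload layout are purely illustrative assumptions of ours, not a protocol defined in the paper.

```python
# Sketch: pack a per-frame scene descriptor into a small status payload.
import struct
import numpy as np

def pack_descriptor(gist: np.ndarray) -> bytes:
    # Quantize each GIST component to 8 bits so the visual cue adds on the order
    # of 512 bytes to a status report instead of a full 640x480 image.
    lo, hi = float(gist.min()), float(gist.max())
    scale = 255.0 / (hi - lo + 1e-12)
    quantized = np.round((gist - lo) * scale).astype(np.uint8)
    header = struct.pack("<ff", lo, hi)   # range needed to dequantize server-side
    return header + quantized.tobytes()

def unpack_descriptor(payload: bytes, dim: int = 512) -> np.ndarray:
    lo, hi = struct.unpack("<ff", payload[:8])
    q = np.frombuffer(payload[8:8 + dim], dtype=np.uint8).astype(np.float32)
    return q * (hi - lo) / 255.0 + lo
```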
The remainder of the paper is organized as follows: In the next section we give an overview of previous related work, followed by a brief description of the GIST descriptor. We then describe the introduced dataset in detail, and explore the data using well known techniques. After that we define the classification problem, describe the classification setup and present the results. We conclude the paper by giving an overview of contributions and discussing some interesting future directions. II. R ELATED WORK Although image/scene classification is a very active topic in computer vision research, the work specific to road scene classification is limited. Bosch et al. [2] divide image classification approaches into low-level approaches and semantic approaches. Low-level approaches, e.g. [3], [4], model the image using low-level features, either globally or by dividing the image into sub-blocks. Semantic approaches aim to add a level of understanding what is in the image. Bosch et al. [2] identify three subtypes of semantic approaches: (i) methods based on semantic objects, e.g. [5], [6], where objects in the image are detected to aid scene classification, (ii) methods based on local semantic concepts, such as the bag-of-visualwords approach [7], [8], [9], where meaningful features are learned from the training data, and methods based on semantic properties, such as the GIST descriptor [1], [10], where an image is described by a set of statistical properties, such as roughness or naturalness. One might argue that the GIST descriptor should be categorized as a low-level approach, as it uses low-level features to obtain the representation. However, 19 Proceedings of the Croatian Computer Vision Workshop, Year 1 unlike low-level approaches [3], [4] where local features are the representation, the GIST descriptor merely uses low-level features to quantify higher-level semantic properties of the scene. Ess et al. [11] propose a segmentation-based method for urban traffic scene understanding. An image is first divided into patches and roughly segmented, assigning each patch one of the thirteen object labels including car, road etc. This representation is used to construct a feature set fed to a classifier that distinguishes between different road scene categories. Tang and Breckon [12] propose extracting a set of color, edge and texture-based features from predefined regions of interest within road scene images. There are three predefined regions of interest: (i) a rectangle near the center of the image, sampling the road surface, (ii) a tall rectangle on the left side of the image, sampling the road side, and (iii) a wide rectangle on the bottom of the image, sampling the road edge. Each predefined region of interest has its own set of preselected color features, including various components of RGB, HSV and YCrCb color spaces. The texture features are based on grey-level co-occurrence matrix statistics and Gabor filters. Additionally, edge-based features are extracted in the road edge region. A training set of 800 examples of four road scene categories is introduced: motorway, offroad, trunkroad and urban road. The testing is performed on approximately 600 test image frames. The k-NN and the artificial neural network classifiers are considered. The best obtained recognition rate is 86% when considering all four classes as separate categories, improving to 90% when classes are merged into two categories: off-road and urban. Mioulet et al. 
[13] consider using Gabor features for road scene classification, using the dataset of Tang and Breckon [12]. Gabor features are extracted within the same three regions of interest used by Tang and Breckon. Grayscale histograms are built from Gabor result images and concatenated over all three regions of interest to form the final descriptor. A random forest classifier is trained and evaluated on a dataset from [12] consisting of four image classes: motorway, offroad, trunkroad and urban road, achieving the 10-fold cross-validation recognition rate of 97.6%. In this paper, we are limited by our target application that imposes a constraint on bandwidth and processing time. Therefore, sophisticated approaches that require a lot of processing power to obtain the scene feature vector, such as the segmentation-based method of Ess et al. [11], or the manyfeature method of Tang and Breckon [12], are unsuitable for our problem. The closest to our application is the work of Mioulet et al. [13], where a simple Gabor feature-based approach performs better on the same dataset than the various preselected features of Tang and Brackon. We take the idea of Mioulet et al. a step further by using Gabor feature-based GIST descriptor, which we hope will be an improvement over using raw Gabor features. Furthermore, as our goal is classifying the road scene into a much larger number of classes than are available in the dataset of Tang and Breckon [12], we introduce a new dataset of 8 scene categories. CCVW 2013 Oral Session September 19, 2013, Zagreb, Croatia III. T HE GIST DESCRIPTOR The GIST descriptor [1], [14] focuses on the shape of scene itself, on the relationship between the outlines of the surfaces and their properties, and ignores the local objects in the scene and their relationships. The representation of the structure of the scene, termed spatial envelope is defined, as well as its five perceptual properties: naturalness, openness, roughness, expansion and ruggedness, which are meaningful to human observers. The degrees of those properties can be examined using various techniques, such as Fourier transform and PCA. The contribution of spectral components at different spatial locations to spatial envelope properties is described with a function called windowed discriminant spectral template (WDST), and its parameters are obtained during learning phase. The implementation we used first preprocesses the input image by converting it to grayscale, normalizing the intensities and locally scaling the contrast. The resulting image is then split into a grid on several scales, and the response of each cell is computed using a series of Gabor filters. All of the cell responses are concatenated to form the feature vector. We believe this descriptor will perform very well in the context of traffic scenes classification. Expressing the degree of naturalness of the scene should enable us to differentiate urban from open road environments (even more so in case of unpaved roads). The degree of openness should differentiate the open roads from tunnels and other types of closed environments, and is especially useful in the context of fleet management, where closed environments usually result in the loss of GPS signal. Texture and shape of large surfaces is also taken into account, and this could help separate the highways from other types of roads. IV. 
THE FM1 DATASET

Since we want to build a classifying model for traffic scenes and scenarios relevant to fleet management problems, we introduce a novel dataset of labeled traffic scenes, called the FM1 dataset (available at http://www.zemris.fer.hr/~ssegvic/datasets/fer-fm1-2013-09.zip). The data was acquired using a camera mounted on the windshield of a personal vehicle. Videos were recorded at 30 frames per second, at a resolution of 640×480. Several drives were made on Croatian roads, ranging in length from 37 to 60 minutes, for a total of about 190 minutes (see Table I). All videos were recorded on a clear sunny day, and the largest percentage of the footage was recorded on highways. The camera, its position and orientation were not changed between videos. We plan to vary the camera and its position in later versions of the dataset, as well as to include nighttime images, and images taken during moderate to heavy rain (when the windshield wipers are operating).

TABLE I: Overview of traffic videos
Video #    Duration    Number of extracted images
1          54:08       1612
2          39:33       1181
3          1:00:00     1766
4          37:11       1106

A typical frame of a video is shown in Figure 1. Some parts of the image are not parts of the traffic scene, but rather the interior of the vehicle, such as the camera mount visible in the upper right corner, the dashboard in the bottom part, and occasional reflections of the car interior visible on the windshield. The windshield itself can be dirty, and various artefacts can appear on it depending on the position of the sun. The camera mount and the dashboard are in identical positions in all of the images, so they should not influence the classification results. However, we cannot expect this will always be the case in the future and plan to develop a system that will be robust enough to accommodate those changes.

Fig. 1: A typical traffic scene. Note the camera mount (1), the dashboard (2), the reflection of the car interior (3) and the speck of dirt (4).

For each recorded video, every 60th frame was extracted, i.e. one every two seconds. This ensured there was no bias in the selection process, and that subsequently selected images would not be too similar. Choosing the classes for this preliminary classification evaluation was not an easy task. Perhaps the most obvious classes are "tunnel" and "open highway", but even they are not clearly defined. Should we insist the image be classified as an open highway if there is a traffic jam, and almost the entire scene is obstructed by a large truck directly in front of us? We can introduce the "heavy traffic" class to accommodate this case. What if it is not a truck, but a car instead, or if it is a bit further away, so more of the highway is visible? There are many cases in which it is impossible to set the exact moment when a scene stops being a member of one class and becomes a member of another. Perhaps these issues can be resolved by allowing each scene to be assigned multiple class labels, which we plan to explore at later stages of our research. Before choosing the class labels, we first analyzed the data using PCA and clustering algorithms, to see if there are obvious and easily separable classes in the dataset.

Fig. 2: Data points projected into 2D (note the three clusters).

A. Exploring the data

The first step in data analysis was computing the descriptor for every image.
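Purely as an illustration of the grid-of-Gabor-responses pipeline summarized in Section III, a descriptor of this flavour can be sketched as below. As the next paragraph notes, the experiments in this paper use the original MATLAB implementation of Oliva and Torralba [1]; the grid size, filter bank and normalization in this Python/OpenCV sketch are therefore our own illustrative choices, not the parameters actually evaluated.

```python
# Minimal GIST-like descriptor sketch (not the reference implementation):
# preprocess, filter with a small Gabor bank, average responses on a grid.
import cv2
import numpy as np

def gist_like_descriptor(image_bgr, grid=4, scales=(7, 15), orientations=4):
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32)
    gray = cv2.resize(gray, (128, 128))
    gray = (gray - gray.mean()) / (gray.std() + 1e-6)   # crude contrast normalization
    features = []
    for ksize in scales:
        for k in range(orientations):
            theta = np.pi * k / orientations
            kernel = cv2.getGaborKernel((ksize, ksize), sigma=0.5 * ksize,
                                        theta=theta, lambd=0.8 * ksize,
                                        gamma=0.5, psi=0)
            response = np.abs(cv2.filter2D(gray, cv2.CV_32F, kernel))
            # Average the filter response inside each grid cell.
            h, w = response.shape
            for i in range(grid):
                for j in range(grid):
                    cell = response[i * h // grid:(i + 1) * h // grid,
                                    j * w // grid:(j + 1) * w // grid]
                    features.append(cell.mean())
    return np.asarray(features, dtype=np.float32)

# Example: descriptor = gist_like_descriptor(cv2.imread("frame.png"))
```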
We used the MATLAB implementation of the GIST descriptor provided by Oliva and Torralba [1], using the default parameters, which produces a feature vector of length 512. Since the resolution of the original image is 640×480, this represents a large dimensionality reduction. PCA was used to find the principal components of the data, which was then projected onto the plane determined by the first two principal components. The results can be seen in Figure 2. We can see that most of the data points belong to one of three clusters, with some data points appearing to be outliers. Around 3300 points belong to cluster 1, 90% of them corresponding to scenes of open highway. The rest of the points in cluster 1 correspond to various types of scenes, but not a single one is a scene inside a tunnel. Around 1400 data points belong to cluster 2, 70% of them corresponding to scenes of open highway, and 20% to scenes of other types of open road. Around 820 points belong to cluster 3, 45% of them corresponding to scenes inside tunnels, 35% to the open highway, and 6% to scenes at or under the toll booth. More than 95% of tunnel scenes belong to this cluster, and most of the images in this cluster depict closed environments.

The next step was using clustering algorithms to discover easily separable classes in the dataset. K-means clustering was used on the dataset with values for K varying from 2 to 20. For the case K = 2, one cluster contained almost all of the tunnel and toll booth scenes, as was expected, but it also contained about one third of the open road scenes. For the case K = 3, one cluster contained most of the tunnel and toll booth scenes, but very few open road scenes. The second cluster contained almost all of the scenes with a large number of other vehicles, i.e. scenes of heavy traffic. For K = 5, we can notice a good separation of non-highway open road scenes. Raising K further shows that similar types of scenes in urban environments often end up in the same cluster. Some clusters contained very similar images, which suggested some easily separable classes, such as "a scene on a highway with a rocky formation on the right side of the road", but most of those classes were not considered useful to our application. In conclusion, the clustering approach gave us some ideas about easily separable and useful class labels to choose for our final classification experiment. Once we obtain more data, the clustering approach should be reapplied, probably in the form of hierarchical clustering for easier analysis.
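A compact sketch of this exploration step is given below; it assumes the descriptors are already stacked in a NumPy array and uses scikit-learn's PCA and KMeans as stand-ins for whatever tooling the authors actually used, so treat it as an illustration of the procedure rather than a reproduction of their analysis.

```python
# Sketch: project GIST descriptors to 2D with PCA and look for coarse clusters.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def explore(descriptors: np.ndarray, k_values=range(2, 21)):
    # descriptors: one 512-dimensional GIST vector per image, shape (n_images, 512)
    points_2d = PCA(n_components=2).fit_transform(descriptors)

    assignments = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(descriptors)
        assignments[k] = labels   # inspect label/class co-occurrence for each k
    return points_2d, assignments
```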
B. Selected classes

The set of classes we chose is listed in Table II. Some of the classes were chosen because the data analysis indicated they would be easy to classify, and one of the functions of fleet management is to simply archive any data that might be required for purposes yet unknown. As was discussed in the introduction, we are very interested in detecting the environments in which the loss of GPS signal precision is likely, or the vehicle is likely to stop or drive slowly. The tunnel is a class which is both easy to classify and often causes loss of GPS signal. The tunnel exit was separated into its own class because the camera reaction to the sudden increase in sunlight is very slow, which results in extremely bright images (a similar problem is not encountered during tunnel entry). The settlement is an environment in which we are likely to encounter tall objects (which may or may not be visible in the scene), so the loss of GPS signal precision is more likely to occur than on an open road, but less likely than in a tunnel. The overpass is chosen because going under it can cause a slight loss of GPS precision. Toll booths are included because going through them can cause a loss of GPS precision, and also because the vehicle must always stop at a toll booth. The heavy traffic scenario is interesting because it can cause the vehicle to stop or drive very slowly, and because the presence of many vehicles can obstruct the view of the camera enough that the proper classification of the location becomes impossible. Intersection is a class which would be very interesting to investigate, but unfortunately we did not have enough samples, and their variability was too great. The distribution of class instances across videos is shown in Table III.

Fig. 3: Examples of selected classes: (a) highway, (b) road, (c) tunnel, (d) exit, (e) settlement, (f) overpass, (g) booth, (h) traffic.

TABLE II: Selected classes
highway: an open highway
road: an open non-highway road
tunnel: in a tunnel, or directly in front of it, but not at the tunnel exit
exit: directly at the tunnel exit (extremely bright image)
settlement: in a settlement (e.g. visible buildings)
overpass: in front of, or under an overpass (the overpass is dominant in the scene)
booth: directly in front of, or at the toll booth
traffic: many vehicles are visible in the scene, or completely obstruct the view

TABLE III: Distribution of classes across videos
Video #   highway   road   tunnel   exit   settlement   overpass   booth   traffic
1            1382      0      185     12            0          8       0         0
2             652    312      134      7           59          3       3         0
3            1418    140        7      3           94         13       9        74
4             885     64       62      9           23          9      43         5
Total        4337    516      388     31          176         33      55        79

V. EXPERIMENTS

We used the data mining tool Weka 3 [15] to train several types of general purpose classifiers, performing grid-search optimisation of parameters for each of them.

TABLE IV: Detailed accuracy by class
Class              TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area
highway              0.994     0.065       0.981    0.994       0.987      0.968
settlement           0.864     0.003       0.899    0.864       0.881      0.965
booth                0.855     0           0.979    0.855       0.913      0.99
tunnel               0.979     0.001       0.992    0.979       0.986      0.999
exit                 0.903     0           0.933    0.903       0.918      1
overpass             0.485     0.001       0.696    0.485       0.571      0.95
traffic              0.861     0           0.971    0.861       0.913      0.986
road                 0.899     0.007       0.93     0.899       0.914      0.957
Weighted average     0.973     0.051       0.973    0.973       0.973      0.969

TABLE V: Confusion matrix (rows: correct class, columns: classified as)
              highway   settlement   booth   tunnel   exit   overpass   traffic   road
highway          4310            1       0        0      1          6         1     18
settlement         12          152       0        0      0          0         0     12
booth               5            2      47        0      0          0         0      1
tunnel              6            1       0      380      1          0         0      0
exit                1            0       0        2     28          0         0      0
overpass           15            0       0        1      0         16         0      1
traffic             5            2       1        0      0          0        68      3
road               39           11       0        0      0          1         1    464

Fig. 4: Examples of scenes misclassified as highway: (a) road, (b) settlement, (c) traffic, (d) tunnel, (e) overpass, (f) booth.

The best results were obtained using an SVM classifier with soft margin C = 512 and an RBF kernel with γ = 0.125. The testing method was stratified cross-validation with 10 folds, using the entire dataset (5615 feature vectors of length 512). Since the dataset is greatly biased towards the highway class, we expect most other classes to often be confused with the highway. The achieved recognition rate was 97.3%.
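For concreteness, the reported classifier setup (an RBF-kernel SVM with C = 512 and γ = 0.125, evaluated with stratified 10-fold cross-validation over the 5615 descriptors) could look roughly like the sketch below. The paper used Weka 3 [15], so the scikit-learn calls and the feature-loading step are substitutions of ours, not the authors' code.

```python
# Sketch of the reported classifier setup using scikit-learn in place of Weka.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold, cross_val_score

def evaluate_gist_svm(X: np.ndarray, y: np.ndarray) -> float:
    # X: (5615, 512) GIST descriptors, y: labels for the eight scene classes
    clf = SVC(kernel="rbf", C=512, gamma=0.125)
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    scores = cross_val_score(clf, X, y, cv=cv)
    return scores.mean()   # recognition rate averaged over the 10 folds
```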
The detailed accuracy by class is shown in Table IV, while the confusion matrix is shown in Table V. We can see the performance is very good across all classes, except the overpass, which is often confused with a highway. This is probably because most of the overpass scenes were in fact on a highway, and the overpass was not equally dominant in all of them, as well as because of low number of class instances. The settlement class is often confused with highway and road scenes. We plan to resolve this by introducing more data, because instances of this class vary in appearance more CCVW 2013 Oral Session then of any other class. We also note that highway is often confused with a plain road, and vice versa, which is to be expected for classes with similar appearance. We do not consider this confusion to be problematic, as fleet management can usually use GPS data to correctly infer the type of the road. Some examples of images misclassified as a highway are shown in Figure 4. VI. C ONCLUSION AND FUTURE WORK Our preliminary results show that the GIST descriptor alone is sufficiently descriptive for the purpose of classification of many types of traffic scenes. This indicates viability of the proposed method as an improvement of current fleet management systems, and invites further research. Further efforts should be directed towards expanding the traffic scenes dataset, to include images during nighttime, and bad weather, as well as images obtained by different cameras mounted at other positions and angles. Also, the set of classes 23 Proceedings of the Croatian Computer Vision Workshop, Year 1 should be expanded to include many other interesting scenes and scenarios. Poor classification of the overpass class should be addressed, at first by adding more data. It is possible that in some cases the overpass can not be considered an integral part of the scene, but rather its attribute. There are also other types of attributes of the traffic scenes that we might want to consider, e.g. traffic lights. We plan to introduce additional image features and detectors of such attributes. One of our future goals would be to reduce or eliminate the need for labeling. We are considering using automatic labeling in cases where we can infer the details of vehicle’s environment with high degree of probability using non-visual cues. September 19, 2013, Zagreb, Croatia [13] L. Mioulet, T. Breckon, A. Mouton, H. Liang, and T. Morie, “Gabor features for real-time road environment classification,” in Proc. International Conference on Industrial Technology, pp. 1117–1121, IEEE, February 2013. [14] M. Douze, H. Jégou, H. Sandhawalia, L. Amsaleg, and C. Schmid, “Evaluation of gist descriptors for web-scale image search,” in Proceedings of the ACM International Conference on Image and Video Retrieval, CIVR ’09, (New York, NY, USA), pp. 19:1–19:8, ACM, 2009. [15] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, “The weka data mining software: an update,” SIGKDD Explor. Newsl., vol. 11, pp. 10–18, Nov. 2009. ACKNOWLEDGMENTS The authors would like to thank Josip Krapac for his very helpful input during our work on this paper. This research has been supported by the research projects Research Centre for Advanced Cooperative Systems (EU FP7 #285939) and Vista (EuropeAid/131920/M/ACT/HR). R EFERENCES [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] A. Oliva and A. Torralba, “Modeling the shape of the scene: A holistic representation of the spatial envelope,” Int. J. Comput. Vision, vol. 42, pp. 
145–175, May 2001. A. Bosch, X. Muñoz, and R. Martí, “Review: Which is the best way to organize/classify images by content?,” Image Vision Comput., vol. 25, pp. 778–791, June 2007. N. Serrano, A. E. Savakis, and J. Luo, “Improved Scene Classification using Efficient Low-Level Features and Semantic Cues,” Pattern Recognition, vol. 37, no. 9, pp. 1773–1784, 2004. A. Vailaya, M. A. T. Figueiredo, A. K. Jain, and H. J. Zhang, “Image classification for content-based indexing,” IEEE Trans. on Image Processing, vol. 10, pp. 117–130, January 2001. J. Luo, A. E. Savakis, and A. Singhal, “A bayesian network-based framework for semantic image understanding,” Pattern Recogn., vol. 38, pp. 919–934, June 2005. J. Fan, Y. Gao, H. Luo, and G. Xu, “Statistical modeling and conceptualization of natural images,” Pattern Recogn., vol. 38, pp. 865–885, June 2005. F.-F. Li and P. Perona, “A bayesian hierarchical model for learning natural scene categories,” in Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05) - Volume 2 - Volume 02, CVPR ’05, (Washington, DC, USA), pp. 524–531, IEEE Computer Society, 2005. A. Bosch, A. Zisserman, and X. Muñoz, “Scene classification via plsa,” in Proceedings of the 9th European conference on Computer Vision Volume Part IV, ECCV’06, (Berlin, Heidelberg), pp. 517–530, SpringerVerlag, 2006. S. Lazebnik, C. Schmid, and J. Ponce, “Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories,” in Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Volume 2, CVPR ’06, (Washington, DC, USA), pp. 2169–2178, IEEE Computer Society, 2006. A. Oliva and A. B. Torralba, “Scene-centered description from spatial envelope properties,” in Proceedings of the Second International Workshop on Biologically Motivated Computer Vision, BMCV ’02, (London, UK, UK), pp. 263–272, Springer-Verlag, 2002. A. Ess, T. Mueller, H. Grabner, and L. v. Gool, “Segmentationbased urban traffic scene understanding,” in Proceedings of the British Machine Vision Conference, pp. 84.1–84.11, BMVA Press, 2009. doi:10.5244/C.23.84. I. Tang and T. Breckon, “Automatic road environment classification,” IEEE Transactions on Intelligent Transportation Systems, vol. 12, pp. 476–484, June 2011. CCVW 2013 Oral Session 24 Proceedings of the Croatian Computer Vision Workshop, Year 1 September 19, 2013, Zagreb, Croatia Computer Vision Systems in Road Vehicles: A Review Kristian Kovačić, Edouard Ivanjko and Hrvoje Gold Department of Intelligent Transportation Systems Faculty of Transport and Traffic Sciences University of Zagreb Email: [email protected], [email protected], [email protected] Abstract—The number of road vehicles significantly increased in recent decades. This trend accompanied a build-up of road infrastructure and development of various control systems to increase road traffic safety, road capacity and travel comfort. In traffic safety significant development has been made and today’s systems more and more include cameras and computer vision methods. Cameras are used as part of the road infrastructure or in vehicles. In this paper a review on computer vision systems in vehicles from the stand point of traffic engineering is given. Safety problems of road vehicles are presented, current state of the art in-vehicle vision systems is described and open problems with future research directions are discussed. I. 
I NTRODUCTION Recent decades can be characterized by a significant increase of the number of road vehicles accompanied by a buildup of road infrastructure. Simultaneously various traffic control systems have been developed in order to increase road traffic safety, road capacity and travel comfort. However, even as technology significantly advanced, traffic accidents still take a large number of human fatalities and injuries. Statistics show that there is about 1.2 million fatalities and 50 million injuries related to road traffic accidents per year worldwide [1]. Also there were 3 961 pedestrian fatalities in 2005 in the EU. Most accidents happened because drivers were not aware of pedestrians. Furthermore, there has been approximately 500 casualties a year in EU because of driver’s lack of visibility in the blind zone. Data published in [1] estimated that there has been around 244 000 police reported crashes and 224 fatalities related to lane switching in 1991. Statistics published by the Croatian Bureau of Statistics [2] reveal that in 2011 there were 418 killed and 18 065 injured persons who were participating in the traffic of Croatia. To reduce the amount of annual traffic accidents, a large number of different systems has been researched and developed. Such systems are part of road infrastructure or road vehicles (horizontal and vertical signalization, variable message signs, driver support systems, etc.). Vehicle manufactures have implemented systems such as lane detection and lane departure systems, parking assistants, collision avoidance, adaptive cruise control, etc. Main task of these systems is to make the driver aware when leaving the current lane. These systems are partially based on computer vision algorithms that can detect and track pedestrians or surrounding vehicles [1]. Development of computing power and cheap video cameras enabled today’s traffic safety systems to more and more include CCVW 2013 Poster Session cameras and computer vision methods. Cameras are used as part of road infrastructure or in vehicles. They enable monitoring of traffic infrastructure, detection of incident situations, tracking of surrounding vehicles, etc. The goal of this paper is to make a review of existing computer vision systems from the problem stand point of traffic engineering unlike the most reviews done from the problem stand point of computer science. Emphasis is on computer vision methods that can be used in systems build in vehicles to assist the driver and increase traffic safety. This paper is organized as follows. Second section gives a review of road vehicles safety problems followed by the third section that describes vision systems requirements for the use in vehicles. Fourth section gives a review of mostly used image processing methods followed by the fifth section that describes existing systems. Paper ends with a discussion about open problems and conclusion. II. S AFETY PROBLEMS OF ROAD VEHICLES According to the survey study results published in [3], annually about 100 000 crashes in the USA result directly from driver fatigue. It is the main cause of about 30% of all severe traffic accidents. In the case of French statistics, a lack of attention due to driver fatigue or sleepiness was a factor in one in three accidents, while alcohol, drugs or distraction was a factor in one in five accidents in 2003 [4]. 
Based on the accidents records from [3] most common traffic accidents causes are: • Frontal crashes, where vehicles driving in opposite directions have collided; • Lane departure collisions, where a lane changing vehicle collided with a vehicle from an adjacent lane; • Crashes with surrounding vehicles while parking, passing through a intersection or a narrow alley, etc.; • Failures to see or recognize the road signalization and consequently cause a traffic accident due to inappropriate driving. A. Frontal crashes According to the data published in [3], 30 452 of total 2 041 943 vehicles crashes recorded in the USA in the period from 2005 to 2008 occurred because of too close following 25 Proceedings of the Croatian Computer Vision Workshop, Year 1 September 19, 2013, Zagreb, Croatia of a vehicle in a convoy. About 5.5% of accidents happened with vehicles moving in opposite direction. Although there are various types of systems that can help to reduce the number of accidents caused by a frontal vehicle crash, computer vision gives ability to prevent a possible or immediate vehicle crash from happening. System can detect a dangerous situation using one or more frontal cameras and afterward produce appropriate response, haptic or audio. When a possible frontal crash situation occurs, the time available to prevent the crash is usually very short. Efficiency of crash avoidance systems depends on the time in which possible crash has been detected prior to its occurrence. If short amount of time is taken for possible frontal crash detection, then longer amount of time is available for driver warning or evasive actions to be performed. Such actions need to be taken in order to prevent or alleviate the crash (preparation of airbags, autonomous braking, etc.). B. Lane departure Statistics published in [3] denote that 10.8% accidents happened because vehicles overran the lane line and 22.2% because vehicles were over the edge of the road. From all vehicles that participated in accidents, 3.1% were intentionally changing lanes or overtaking another vehicle. Number of accidents can be significantly reduced by using a computer vision system for lane departure warning. If the driver is distracted, such system can simply just accentuate the dangerous action (lane changing) and focus the driver’s attention. When such system is built into a vehicle, drivers also tend to be more careful so that they do not cross the lane markers because they receive some kind of warning otherwise (vibrating driver seat, audio signal or graphical notice). C. Surrounding vehicles Surrounding vehicles represent a threat if the driver is not aware of them. In case of driver fatigue, capability of tracking surrounding vehicles is significantly reduced. This can cause an accident in events such as lane changing, driving in a convoy, parking and similar situations. Depending on complexity of a system that tracks surrounding vehicles, appropriate warnings and actions can be performed based on current vehicle environment state or state that is estimated (predicted) to happen in the near future. System that tracks surrounding vehicles can be used in lane departure warning or parking assistant system also. Main problem of this system are false detections of surrounding vehicles. Vehicles can be in many different shapes and colors. This can cause false vehicle estimations and wrong driver actions endangering other drivers and road users [5]. D. 
Recognition of road signalization Road signalization (horizontal and vertical) represents the key element in traffic safety. Driver should always detect and recognize both road signalization groups correctly, especially traffic signs. In case when the driver is unfocused or tired, horizontal and vertical signalization is mostly ignored. According to the survey [3], 3.8% of total vehicle crashes happened when drivers did not adjust their driving to road signalization CCVW 2013 Poster Session Fig. 1. General in-vehicle driver support system architecture. recommendation. This accidents number could be reduced if a convenient road signalization detection and driver assistant system had been used. Such system could warn the driver when he is performing a dangerous or illegal maneuver. According to data in [3], there were 109 932 accidents caused by speeding and ignoring speed limits. In some cases drivers are overloaded with information and have a hard time prioritizing information importance. For example, when driving through an area for the first time, drivers are usually more concentrated on route information traffic signs. Systems for road sign detection and recognition, can improve driving safety and reduce above mentioned accidents percentage by drawing the driver attention to warning signs. III. V ISION SYSTEMS REQUIREMENTS To define an appropriate vision system architecture, some application area related requirements need to be considered. Because of the specific field of use, the computer vision system has to work in real time. Obtained data should be available in time after a possible critical situation has been detected. Time window for computation depends on the vehicle speed and shortens with the vehicle speed increase. This can represent a significant problem when small and cheap embedded computers are used. To overcome this problem algorithms that support parallel computing, multi-core processor platforms or computer vision dedicated hardware platforms are used. Alternatively special intelligent video cameras with preinstalled local image processing software can be used. Such cameras can perform extraction of basic image features, object recognition and tracking. Second important requirement is the capability to adapt to rapid changes of environment monitored using cameras. Such changes can be caused by fog, rain, snow, and illumination changes related to night and day or entering and exiting tunnels. In cases where whole driver support or vehicle control systems use more then one sensor type (eg. optical cameras and ultrasonic sensors, lidars), system adaptiveness to environment changes is better then in cases with one sensor type. Furthermore, whole system needs to be resistant to various physical influences (e.g. vibrations, acceleration/deceleration) [5]. In Fig. 1 general in-vehicle vision system architecture is given. To detect and track vehicles or other kind of objects (pedestrians, lane markings, traffic signs, etc.) often different sensors are used. Sensors can be divided into two categories: active and passive. Most active sensors are time of flight sensors, i.e. they measure the distance between a specific object and the sensor. Millimeter-wave radars and laser sensors are 26 Proceedings of the Croatian Computer Vision Workshop, Year 1 September 19, 2013, Zagreb, Croatia Fig. 2. Same image in visible spectrum (a) and in far-infrared spectrum (b). mostly used as active sensors in automotive industry. 
Active sensors are so mainly used for surrounding vehicles detection and tracking. Various cameras are mostly used as passive sensors. Basic camera characteristics include their working spectrum, resolution, field of view, etc. In traffic application camera spectrum is crucial since night-time cameras use far, mid and near infrared spectrum, while daylight cameras use visible spectrum. Far infrared spectrum camera is suitable for detection of pedestrians at night as shown in Fig. 2. Computer vision systems in vehicles can be based on stereo vision also. Stereo vision is a technique that enables extraction of 3D data from a pair of 2D flat images. In case of a 2D image pair, usually whole overlapping image or only a specific region of it is translated from a 2D to 3D coordinate system. IV. I MAGE PROCESSING APPROACHES Image processing consists of basic operations that provides features for high level information extraction. Simplest image processing transformations are done by point operators. Such operators can perform color transformation, compositing and matting (process of separating the image into two layers, foreground and background), histogram equalization, tonal adjustment and other operations. Point operators do not use surrounding points in calculations. Contrariwise, neighborhood operators (linear filtering, non-linear filtering, etc.) use information from surrounding pixels also. For image enhancement, linear and non-linear filters (moving-average, Bartlett filter, median filter, morphology, etc.) are used. Non-linear filters can be used for shape recognition methods also. For example, as part of vehicle recognition methods [6]. Affine transformation is a type of transformation that preserves symmetries (parallelism) and distance ratios when applied on certain points or parts of an image. As a consequence of this, specific geometric shape on an image can be rotated, scaled, translated, etc. This transformation is usually used for basic operations with a whole specific image region. Distance estimation between vehicles is important in traffic applications. To calculate the distance between the camera and a point in space represented by a specific pixel in the image, perspective (projection) transformation is performed. It is computed by applying a perspective matrix transformation on every pixel in the image. Perspective transformation assigns a third component (z coordinate) to every pixel depending on the x-y components in the image. In Fig. 3, result of perspective image transformation is given. Pixel components in the original image are x and y coordinates. After perspective transformation a new pixel component (z coordinate) is calculated and swapped with the y coordinate. So, pixels near the image bottom of the image represent points in 3D space that are closer to the camera and pixels near the image top are further away. CCVW 2013 Poster Session Fig. 3. Original image (left) and same image with perspective projection transformation applied (right) [7]. Hough transformation is used to extract parameters of image features (lines). Main advantage of this method is its relatively high tolerance to noise and ability to ignore small gaps on shapes. After required processing (edge detection) is done on an image, Hough transformation translates every pixel of a feature in the image from the Cartesian to the polar coordinate system. 
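For illustration, the following minimal OpenCV sketch (not taken from any of the reviewed systems; the Canny and Hough parameters are arbitrary placeholder values) extracts straight line segments, such as lane-marking candidates, with the probabilistic Hough transform:

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Extract straight line-segment candidates (e.g. lane markings) using
// Canny edge detection followed by the probabilistic Hough transform.
std::vector<cv::Vec4i> detectLineSegments(const cv::Mat& bgrFrame)
{
    cv::Mat gray, edges;
    cv::cvtColor(bgrFrame, gray, cv::COLOR_BGR2GRAY);
    cv::Canny(gray, edges, 80, 160);            // edge map required by Hough

    std::vector<cv::Vec4i> segments;            // each entry: x1, y1, x2, y2
    cv::HoughLinesP(edges, segments,
                    1.0,                        // rho resolution [pixels]
                    CV_PI / 180.0,              // theta resolution [rad]
                    50,                         // accumulator vote threshold
                    40,                         // minimum segment length [pixels]
                    10);                        // maximum gap within a segment [pixels]
    return segments;
}
```

Such raw segments would typically still be filtered by orientation, length and position before being interpreted as lane boundaries.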
Original Hough transform method is used for extracting only line parameters although various modified versions of Hough transform have been developed to extract corner, circle and ellipse parameters. For object detection Paul Viola and Michael Jones proposed an approach called ’Haar-like features’ that has some resemblance to Haar-wavelet functions [8]. Haar-like-features methods are used where fast object detection is required [9], [10]. This approach is based on dividing a detection window (window of interest) into smaller regions. Region size depends on the size of an object that needs to be detected. For each region a sum is computed based on pixel intensity within the processed region. Summed area table (integral image) is used to perform a fast summarization. Possible object candidates can be found comparing differences between sums of adjacent regions. For a more accurate object detection, many regions need to be tested. Applications such as vehicle detection require use of more complex methods that have computational demands similar to the Haar-like-features method. One of the algorithms used for simplification of extracted shapes is the Ramer-Douglas-Peucker algorithm [11]. It reduces the number of points in a curve that is approximated by a series of points. Complex curves with many points can be too computationally (time) demanding for processing. Simplifying the curve with this algorithm allows faster and simpler processing. V. E XISTING ROAD VEHICLE COMPUTER VISION SYSTEMS Since the first intelligent vehicle was introduced in the mid 1970s [12], computer vision systems also started to develop as one of its main sensor systems. Currently there are many projects that are researching and developing computer vision systems in road vehicles. Project goals are related to development of driver support systems, driver state recognition and road infrastructure recognition modules for autonomous vehicles. Computer vision systems in vehicles are part of autonomous vehicle parking systems, adaptive cruise control, lane departure warning, driver fatigue detection, obstacle and traffic sign detection, and others. Currently only high class vehicles are equipped with several such systems but they are being more and more included in middle class vehicles. 27 Proceedings of the Croatian Computer Vision Workshop, Year 1 A. Lane detection and departure warning Lane detection system represents not only a useful driver aid, but an important component in autonomous vehicles also. To resolve lane detection and departure warning problems, key element is to detect road lane boundaries (horizontal road signalization and width of the whole road). In work described in [13], lane detection is done in two parts. First part includes image enhancement and edge detection. Second part deals with lane features extraction and road shape estimation from the processed image. In Fig. 4, more detailed flow diagram of this algorithm is shown. Authors of [14] developed an algorithm that handles extraction of lane features based on estimation methods. Besides determining shape of main (current) lane, algorithm has the ability to estimate position and shape of other surrounding lanes. Applied lane detection principle is based on two methods. On a straight lane, Hough and simplified perspective transformations mentioned in previous section are used first. These transformations enable detection of the center lane. To detect surrounding lanes, further processing needs to be done. 
Required processing includes affine transformations and other calculations that retrieve relative vehicle position. If the lane is a curve with a large radius, this method can be used also. In second method for curved lane with small radius, edge features are extracted from the original image first. After extracting these features, complete perspective transformation is performed based on camera’s installation parameters. By recognizing points on the center lane, curve estimation is performed. Positions of other adjacent lanes are then calculated based on known current lane curve parameters. Problems with this approach are: (i) incapability to detect lane markers due the poor visibility or object interference (infront vehicles or other objects block the camera’s field of view); and (ii) error in estimating current vehicle position due to the curvature of the road [14]. B. Driver fatigue detection The process of falling asleep at the steering wheel can be characterized by a gradual decline in alertness from a normal September 19, 2013, Zagreb, Croatia state due to monotonous driving conditions or other environmental factors. This diminished alertness leads to a state of fuzzy consciousness followed by the onset of sleep [15]. If drivers in fuzzy consciousness would get a warning one second before possible accident situation, about 90% of accidents could be prevented [9]. Various techniques have been developed to help to reduce the number of accident caused by driver fatigue. These driver fatigue detection techniques can be divided into three groups: (i) methods that analyze driver’s current state related to eyelid and head movement, gaze and other facial expression; (ii) methods based on driver performance, with a focus on the vehicle’s behavior including position and headway; and (iii) methods based on combination of the driver’s current state and driver performance [15]. In work described in [9], two CCD cameras are used. First camera is fixed and it is used to track the driver’s head position. Camera output is an image in low-resolution like 320x240, which allows the use of fast image processing algorithms. As the driver’s head position is known, processing of second camera image can focus only on that region. Image of second camera is high-resolution one (typically 720x576) which raises level of detail of the driver’s head. This image is used in further analysis for determining driver’s performance from mouth position, eye lid opening frequency and diameter of opened eye lid [9]. It is visible from experimental results that at least 30 frames (high-resolution images) can be processed per second. Such frame rate represents good basis for further research and development of driver fatigue detection systems. In review [15], an approach to driver fatigue detection that first detects regions of eyes and afterwards extracts information about driver fatigue from this region is briefly described. To achieve this, driver’s face needs to be found first. After finding the face, eyes locations are estimated by finding the darkest pixels on face. When eye locations are known, software tries to extract further information for driver fatigue detection. In order to test the algorithm different test drives were made. If the driver’s head direction was straight into the camera none false detection were observed. After head is turned for 30 or more degrees, system starts to produce false detections. 
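As an illustration of the coarse-to-fine strategy described above (locating the face first and then searching for the eyes only inside the face region), the following hedged OpenCV sketch uses the stock Haar cascade models distributed with OpenCV; the cascade file names, the input image and the detection parameters are placeholders and are not tuned for in-vehicle use:

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Coarse-to-fine face and eye localization with stock Haar cascades.
int main()
{
    cv::CascadeClassifier faceCascade("haarcascade_frontalface_default.xml");
    cv::CascadeClassifier eyeCascade("haarcascade_eye.xml");

    cv::Mat frame = cv::imread("driver.png");   // hypothetical input frame
    if (frame.empty()) return 1;

    cv::Mat gray;
    cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);
    cv::equalizeHist(gray, gray);               // reduce illumination variation

    std::vector<cv::Rect> faces;
    faceCascade.detectMultiScale(gray, faces, 1.1, 3, 0, cv::Size(80, 80));

    for (const cv::Rect& face : faces) {
        std::vector<cv::Rect> eyes;
        eyeCascade.detectMultiScale(gray(face), eyes, 1.1, 3, 0, cv::Size(20, 20));
        // The eye rectangles (relative to the face region) can now be tracked
        // over consecutive frames, e.g. to estimate eyelid closure frequency.
    }
    return 0;
}
```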
Beside described techniques, processing additional features like blinking rate of driver’s eyes or position of eye’s pupil are used to improve driver fatigue detection [9]. C. Vehicle detection There are several sensors (active and passive) that can be used for vehicle detection. Lot of research was done regarding vehicle detection using optical cameras and this still presents a challenge. Main problem of video image processing based vehicle detection is significant variability in vehicle and environment appearance (vehicle size and shape, color, sunlight, snow, rain, dust, fog, etc.). Other problems arise from the need for system robustness and requirements for fast processing [7]. Fig. 4. Flowchart of image processing for lane detection [13]. CCVW 2013 Poster Session Most algorithms for vehicle detection in computer vision systems are based on two steps: (i) finding all candidates in an image that could be vehicles; and (ii) performing tests that can verify the presence of a vehicle. Finding all vehicle candidates in an image is mostly done by three types of methods: (i) knowledge-based methods; (ii) stereo-vision methods; and (iii) motion-based methods. 28 Proceedings of the Croatian Computer Vision Workshop, Year 1 Knowledge-based methods work on the presumption of knowing specific parameters of the object (vehicle) being detected (vehicle position, shape, color, texture, shadow, lighting, etc.). Symmetry can be used as one of the most important vehicle specifications. Every vehicle is almost completely symmetrical viewed from front and rear side and this is used in vehicle detection algorithms. Besides symmetry, color of the road or lane markers can be used in vehicle detection also. Using color, moving objects (vehicles) can be segmented from the background (road). Other knowledge-based methods use vehicle shadow, corners (vehicle rectangular shape), vertical/horizontal edges, texture, vehicle lights, etc. [7]. Stereo-vision methods are based on disparity maps or inverse perspective mapping. In disparity maps, position of pixels on both cameras that represents the same object are compared. Third component of pixel vector (z coordinate) can be calculated by comparing changes in pixels position. Inverse perspective mapping method is based on transforming the image and removing the perspective effect from it. In stereovision systems mostly two cameras are used although systems with more than two cameras are more accurate, but more computationally expensive also. In [16], a computer vision system with three cameras is described. It can detect objects that are 9 [cm] in size and up to 110 [m] in distance. Such a system has a high level of accuracy when compared to a system with two cameras. Motion-based methods use optical flow for image processing. Optical flow in computer vision represents the pattern of visible movement of objects, surfaces and edges in the camera field of view. Optical flow calculated from an image is shown in Fig. 5, where arrow’s length and direction represents average speed and direction (optical flow) of subimages. Algorithms for this method divide the image into smaller parts. Smaller image parts are analyzed with previously saved images in order to calculated its relative direction and speed. With this method static and dynamic image parts can be separated and additionally processed [17]. Once optical flow of sub-images is calculated, moving objects (vehicles) candidates can be recognized. 
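A minimal sketch of this idea using OpenCV's dense Farnebäck optical flow is given below; the motion threshold is an arbitrary example value, and compensation of the camera's own motion, which is necessary on a moving vehicle, is omitted for brevity:

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Dense optical flow between two consecutive grayscale frames; pixels whose
// flow magnitude exceeds a threshold are kept as moving-object candidates.
cv::Mat movingRegionMask(const cv::Mat& prevGray, const cv::Mat& currGray,
                         double minMotionPx = 2.0)
{
    cv::Mat flow;                               // 2-channel float: (dx, dy) per pixel
    cv::calcOpticalFlowFarneback(prevGray, currGray, flow,
                                 0.5, 3, 15, 3, 5, 1.2, 0);

    std::vector<cv::Mat> xy(2);
    cv::split(flow, xy);
    cv::Mat magnitude;
    cv::magnitude(xy[0], xy[1], magnitude);     // per-pixel displacement length

    cv::Mat mask;                               // non-zero where apparent motion is large
    cv::threshold(magnitude, mask, minMotionPx, 255.0, cv::THRESH_BINARY);
    mask.convertTo(mask, CV_8U);
    return mask;
}
```

On a moving vehicle the dominant flow is induced by ego-motion, so in practice such a threshold is applied only after that component has been estimated and subtracted.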
A verification method decides which candidates will be referenced as real objects. Verification is done by calibrating a speed threshold on the most efficient value. With this filtering, false detections are reduced. For further verification, methods that are also used in knowledge based vehicle detection methods (appearance and template based) are used. These kind of methods recognize Fig. 5. Original image (a) and computed optical flow (b) [7]. CCVW 2013 Poster Session September 19, 2013, Zagreb, Croatia vehicle features like edges, texture, shadow, etc. D. Real time traffic sign detection Traffic sign detection is performed in two phases. First phase is usually done by color segmentation. Since traffic signs have edges of a specific color, color segmentation represents a good base for easier extraction of traffic signs from image background. In [10], color-based segmentation is done in two steps: color quantization followed by region of interest analysis. For color model, RGB color space is usually used, although YIQ, YUV, L*a*b and CIE can be used also. After color segmentation, shape-based segmentation is performed for final detection of circles, ellipses and triangles. Second phase is traffic sign classification where various methods like template matching, linear discriminant analysis, support vector machine (SVM), artificial neural networks, and other machine learning methods can be used [10]. For traffic sign detection AdaBoost [18] and SURF [19] algorithms are used also. Traffic sign detection approach described in [11] performs recognition in two main steps. First step is related to a traffic sign detection algorithm whose goal is to localize a traffic sign in the image with minimum noise. For detection, image preprocessing with color-based method threshold is used first. This method is fast and often suitable for use in real-time systems. After preprocessing, traffic sign approximation is performed with Ramer-Douglas-Peucker algorithm. On localized traffic sign image, preprocessing is performed and image is converted to binary image with resolution of 30x30 pixels. Second step represents the traffic sign recognition based on SVM, often C-SVM with linear kernel. The developed traffic sign detection and recognition system described in [20] is based on a video sequence analysis rather than a single image analysis. Authors use a cascade of boosted classifiers of Haar-like-features (usually pixel intensity value) method to detect traffic signs. OpenCV framework is used to alleviate the implementation. Main disadvantages of traffic sign detection based on a Haar-like-features method boosted classifier cascade are: (i) poor precision compared to the required precision; and (ii) poor traffic sign localization accuracy where its true location (position in the image) is in most cases deviated. Framework for detecting, tracking and recognizing traffic signs proposed in [20] is shown in Fig. 6. Fig. 6. Flowchart of sign detection and recognition algorithm [20]. 29 Proceedings of the Croatian Computer Vision Workshop, Year 1 Since traffic sign detection is performed on a sequence of images obtained from a moving vehicle, a region trajectory where most probably the signs are located can be constructed. Detection system learns on each detection. If a significant number of learning samples (traffic sign images) is used, system can achieve precision of up to 54% with recall of 96%. VI. 
September 19, 2013, Zagreb, Croatia supported by the project VISTA financed from the Innovation Investment Fund Grant Scheme and EU COST action TU1102. R EFERENCES [1] O PEN PROBLEMS Main constrain of computer vision systems in vehicles are their limitations due to hardware specifications. Processing an image through a large number of consecutive methods for data recognition and extraction can be very CPU demanding. Trade off between quality and cost of every step in image processing should be done in order to make a system maximally optimized. In image processing, two most widely used approaches are basic image features extraction using Hough transformation and learning classificators for whole object recognition. Implementation of classificators like the one based on Haar-like-features methods allows algorithms to use less system resources. Drawback of this method is its low ratio of accuracy. Contrary to this, Hough method is much more precise, but uses more system resources. Methods that perform complex calculations on all image pixels should be avoided and used only where they are necessary. [2] One of the factors that affects vehicle detection accuracy, tracking and recognition of objects in an outdoor environment are environmental conditions. Weather conditions such as snow, rain, fog and others can significantly reduce computer vision system accuracy. Poor quality of horizontal road signalization and omitted field of view due to various obstacles (often problem with traffic signs) presents an additional problem to vision based systems also. Significant robustness is required to overcome this problem. To reduce the influence that these factors can have on the system, new cameras and appropriate image processing algorithms still need to be developed. [9] VII. C ONCLUSION Computer vision software used for vehicle detection and obtained recognition consist of many subsystems. Incoming images from one or several cameras need to be preprocessed in the desired format. Obtained image is then further processed for high level object detection and recognition. Detecting an object in a scene can be done by multiple methods (eg. Hough and Haar-like-features methods). Use of this methods alone is not enough to make the system reliable for object detection and recognition. In cases of bad environment conditions, image can be full of various noise (snow, fog and rain on the image). Besides environment conditions, high vehicle velocity can also affects the image (blur effect). Current systems that are used in more expensive vehicles are part of driver support systems based on simplified image processing. Examples are lane detection and traffic signs for speed limitations. Still problems related to robustness to different outdoor environment conditions and real time processing of high level features recognition exist. ACKNOWLEDGMENT Authors wish to thank Nikola Bakarić for his valuable comments during writing this paper. This research has been CCVW 2013 Poster Session [3] [4] [5] [6] [7] [8] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] [20] C. Hughes, R. O’Malley, D. O’Cualain, M. Glavin, and E. Jones, New Trends and Developments in Automotive System Engineering. Intech, 2011, ch. Trends towards automotive electronic vision systems for mitigation of accidents in safety critical situations, pp. 493–512. E. Omerzo, S. Kos, A. Veledar, K. Šakić Pokrivač, L. Pilat, M. Pavišić, A. Belošević, and Š. V. Njegovan, Deaths in Traffic Accidents, 2011. Croatian Bureau of Statistics, 2012. N. H. T. S. 
Administration, “National motor vehicle crash causation survey,” U.S. Department of Transportation, Tech. Rep., Apr 2008. S. Ribarić, J. Lovrenčić, and N. Pavešić, “A neural-network-based system for monitoring driver fatigue,” in MELECON 2010 15th IEEE Mediterranean Electrotechnical Conference, 2010, pp. 1356–1361. M. Bertozzi, A. Broggi, M. Cellario, A. Fascioli, P. Lombardi, and M. Porta, “Artificial vision in road vehicles,” in Proceedings of the IEEE, vol. 90, no. 7, 2002, pp. 1258–1271. R. Szeliski, Computer Vision: Algorithms and Applications. Springer - Verlag London Limited, 2011. Z. Sun, G. Bebis, and R. Miller, “On-road vehicle detection: A review,” Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 5, pp. 694–711, May 2006. P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,” in Computer Vision and Pattern Recognition, vol. 1, 2001, pp. I–511–I–518 vol.1. L. Lingling, C. Yangzhou, and L. Zhenlong, “Yawning detection for monitoring driver fatigue based on two cameras,” in Proceedings of the 12th International IEEE Conference on Intellligent Transportation Systems, St. Louis, Missouri, USA, 4-7 Oct 2009, pp. 12–17. C. Long, L. Qingquan, L. Ming, and M. Qingzhou, “Traffic sign detection and recognition for intelligent vehicle,” in Proceedings of IEEE Intelligent Vehicles Symposium, 2011, pp. 908–913. D. Soendoro and I. Supriana, “Traffic sign recognition with color-based method, shape-arc estimation and svm,” in Proceedings of International Conference on Electrical Engineering and Informatics, 2011. S. Tsugawa, “Vision-based vehicles in japan: the machine vision systems and driving control systems,” Industrial Electronics, IEEE Transactions on, vol. 41, no. 4, pp. 398–305, August 1994. T. Quoc-Bao and L. Byung-Ryong, “New lane detection algorithm for autonomous vehicles using computer vision,” in Proceedings of International Conference on Control, Automation and Systems, ICCAS 2008., Seoul, Korea, 14-17 Oct 2008, pp. 1208–1213. Y. Jiang, F. Gao, and G. Xu, “Computer vision-based multiple-lane detection on straight road and in a curve,” in Proceedings of International Conference on Image Analysis and Signal Processing, Huaqiao, China, 9-11 April 2010, pp. 114–117. W. Qiong, W. Huan, Z. Chunxia, and Y. Jingyu, “Driver fatigue detection technology in active safety systems,” in Proceedings of the 2011 International Conference RSETE, 2011, pp. 3097–3100. T. Williamson and C. Thorpe, “A trinocular stereo system for highway obstacle detection,” in 1999 International Conference on Robotics and Automation (ICRA ’99), 1999. M. Bertozzi, A. Broggi, A. Fascioli, and R. Fascioli, “Stereo inverse perspective mapping: Theory and applications,” Image and Vision Computing Journal, vol. 8, pp. 585–590, 1998. X. Baro, S. Escalera, J. Vitria, O. Pujol, and P. Radeva, “Traffic sign recognition using evolutionary adaboost detection and forest-ecoc classification,” Intelligent Transportation Systems, IEEE Transactions on, vol. 10, no. 1, pp. 113–126, 2009. D. Ding, J. Yoon, and C. Lee, “Traffic sign detection and identification using surf algorithm and gpgpu,” in SoC Design Conference (ISOCC), 2012 International, 2012, pp. 506–508. S. Šegvić, K. Brkić, Z. Kalafatić, and A. Pinz, “Exploiting temporal and spatial constraints in traffic sign detection from a moving vehicle,” Machine Vision and Applications, pp. 1–17, 2011. [Online]. 
Available: http://dx.doi.org/10.1007/s00138-011-0396-y 30 Proceedings of the Croatian Computer Vision Workshop, Year 1 September 19, 2013, Zagreb, Croatia Global Localization Based on 3D Planar Surface Segments Detected by a 3D Camera Robert Cupec, Emmanuel Karlo Nyarko, Damir Filko, Andrej Kitanov, Ivan Petrović Faculty of Electrical Engineering J. J. Strossmayer University of Osijek Osijek, Croatia [email protected] Faculty of Electrical Engineering and Computing University of Zagreb Zagreb, Croatia [email protected] Abstract—Global localization of a mobile robot using planar surface segments extracted from depth images is considered. The robot’s environment is represented by a topological map consisting of local models, each representing a particular location modeled by a set of planar surface segments. The discussed localization approach segments a depth image acquired by a 3D camera into planar surface segments which are then matched to model surface segments. The robot pose is estimated by the Extended Kalman Filter using surface segment pairs as measurements. The reliability and accuracy of the considered approach are experimentally evaluated using a mobile robot equipped by a Microsoft Kinect sensor. Keywords—global localization; Extended Kalman Filter I. planar surfaces; Kinect; INTRODUCTION The ability of determining its location is vital to any mobile machine which is expected to execute tasks which include autonomous navigation in a particular environment. The basic robot localization problem can be defined as determining the robot pose relative to a reference coordinate system defined in its environment. This problem can be divided into two sub-tasks: initial global localization and local pose tracking. Global localization is the ability to determine the robot's pose in an a-priori or previously learned map, given no other information than that the robot is somewhere on the map. Local pose tracking, on the other hand, compensates small, incremental errors in a robot’s odometry given the initial robot’s pose thereby maintaining the robot localized over time. In this paper, global localization is considered. There are two main classes of vision-based global localization approaches, appearance-based approaches and feature-based approaches. In appearance-based approaches, each location in a robot's operating environment is represented by a camera image. Robot localization is performed by matching descriptors assigned to each of these images to the descriptor computed from the current camera image. The location corresponding to the image which is most similar to the currently acquired image according to a particular descriptor similarity measure CCVW 2013 Poster Session is returned by the localization algorithm as the solution. The appearance-based techniques have recently been very intensively explored and some impressive results have been reported [1], [2]. In feature-based approaches, the environment is modeled by a set of 3D geometric features such as point clouds [3], points with assigned local descriptors [4], line segments [5], [6], surface patches [7] or planar surface segments [8], [9], [10], where all features have their pose relative to a local or a global coordinate system defined. Localization is performed by searching for a set of model features with similar properties and geometric arrangement to that of the set of features currently detected by the applied sensor. The robot pose is then obtained by registration of these two feature sets, i.e. 
by determining the rigid body transformation which maps one feature set onto the other. An advantage of the methods based on registration of sets of geometric features over the appearance-based techniques is that they provide accurately estimated robot pose relative to its environment which can be directly used for visual odometry or by a SLAM system. In this paper, we consider a feature-based approach which relies on an active 3D perception sensor. An advantage of using an active 3D sensor in comparison to the systems which rely on a 'standard' RGB camera is their lower sensitivity to changes in lighting conditions. A common approach for registration of 3D point clouds obtained by 3D cameras is Iterative Closest Point (ICP) [11], [12], [13]. Since this method requires a relatively accurate initial pose estimate, it can be used for local pose tracking and visual odometry, but it is not appropriate for global localization. Furthermore, ICP is not suitable for applications where significant changes in the scene are expected. Hence, we use a multi hypothesis Extended Kalman Filter (EKF). Localization methods based on registration of 3D planar surface segments extracted from depth images obtained by a 3D sensor are proposed in [10] and [14]. In [14], a highly efficient method for registration of planar surface segments is 31 Proceedings of the Croatian Computer Vision Workshop, Year 1 proposed and its application for pose tracking is considered. In this paper, the approach proposed in [14] is adapted for global localization and its performance is analyzed. The environment model which is used for localization is a topological map consisting of local metric models. Each local model consists of planar surface segments represented in the local model reference frame. Such a map can be obtained by driving a robot with a camera mounted on it along a path the robot would follow while executing its regular tasks. The rest of the paper is structured as follows. In Section II, the global localization problem is defined and a method for registration of planar surface segments is described which can be used for global localization. An experimental analysis of the proposed approach is given in Section III. Finally, the paper is concluded with Section IV. II. REGISTRATION OF PLANAR SURFACE SEGMENT SETS The global localization problem considered in this paper can be formulated as follows. Given an environment map consisting of local models M1, M2, ..., MN representing particular locations in the considered environment together with spatial relations between them and a camera image acquired somewhere in this environment, the goal is to identify the camera pose at which the image is acquired. The term 'image' here denotes a depth image or a point cloud acquired by a 3D camera such as the Microsoft Kinect sensor. Let SM,i be the reference frame assigned to a local model Mi. The localization method described in this section returns the index i of the local model Mi representing the current robot location together with the pose of the camera reference frame SC relative to SM,i. The camera pose can be represented by T vector w = φT t T , where φ is a 3-component vector describing the orientation and t is a 3-component vector describing the position of SC relative to SM,i. Throughout the paper, symbol R(φ φ) is used to denote the rotation matrix corresponding to the orientation vector φ. 
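To make the notation concrete, the following sketch computes R(φ) with OpenCV's Rodrigues routine, under the assumption that φ is an axis-angle vector (the paper does not prescribe a particular parameterization of the orientation), and uses it to express an example point given in a local model frame in the camera frame:

```cpp
#include <opencv2/opencv.hpp>
#include <iostream>

// Illustrative only: compute R(phi) from a 3-component orientation vector phi,
// here interpreted as an axis-angle (Rodrigues) vector, and map a point from a
// local model frame S_M into the camera frame S_C for a pose w = (phi, t).
int main()
{
    cv::Vec3d phi(0.0, 0.0, CV_PI / 2.0);       // example: 90 deg about the z-axis
    cv::Mat Rmat;
    cv::Rodrigues(phi, Rmat);                   // 3x3 rotation matrix R(phi)
    cv::Matx33d R = Rmat;                       // fixed-size copy for convenience

    cv::Vec3d t(1.0, 0.0, 0.5);                 // example camera position in S_M
    cv::Vec3d pM(2.0, 1.0, 0.0);                // a point expressed in S_M
    cv::Vec3d pC = R.t() * (pM - t);            // the same point expressed in S_C
    std::cout << "pC = " << pC << std::endl;
    return 0;
}
```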
September 19, 2013, Zagreb, Croatia Mi in the map, where the initial pose estimate is set to the zero vector with a high uncertainty of the position and orientation. The proposed algorithm returns the pose hypothesis with the highest consensus measure [14] as the final result. A. Detection and Representation of Planar Surface Segments Depth images acquired by a 3D camera are segmented into sets of 3D points representing approximately planar surface segments using a similar split-and-merge algorithm as in [15], which consists of an iterative Delaunay triangulation method followed by region merging. Instead of a region growing approach used in the merging stage of the algorithm proposed in [15], we applied a hierarchical approach proposed in [16] which produces less fragmented surfaces while keeping relevant details. By combining these two approaches a fast detection of dominant planar surfaces is achieved. The result is a segmentation of a depth image into connected sets of approximately coplanar 3D points each representing a segment of a surface in the scene captured by the camera. An example of image segmentation to planar surface segments is shown in Fig. 1. The parameters of the plane supporting a surface segment are determined by least-square fitting of a plane to the supporting points of the segment. Each surface segment is assigned a reference frame with the origin in the centroid tF of the supporting point set and z-axis parallel to the supporting plane normal. The orientation of x and y-axis are defined by the eigenvectors of the covariance matrix Σp computed from the positions of the supporting points of the considered surface segment within its supporting plane. The purpose of assigning reference frames to surface segments is to provide a framework for surface segment matching and EKF-based pose estimation explained in Section Error! Reference source not The basic structure of the proposed approach is the standard feature-based localization scheme consisting of the following steps: 1. 2. 3. 4. feature detection, feature matching, hypothesis generation, selection of the best hypothesis. (a) (b) (c) (d) Features used by the considered approach are planar surface segments obtained by segmentation of a depth image. These features are common in indoor scenes, thus making our approach particularly suited for this type of environments. The surface registration algorithm considered in this paper is basically the same as the one proposed in [14]. The only difference is that instead of implementing visual odometry by registration between the currently acquired image and the previous image, global localization is achieved by registration between the currently acquired image and every local model CCVW 2013 Poster Session Fig. 1. An example of image segmentation to planar surface segments: (a) RGB image; (b) depth image obtained by Kinect, where darker pixels represent points closer to the camera, while black points represent points of undefined depth; (c) extracted planar surface segments delineated by green lines and (d) 3D model consisting of dominant planar surface segments. 32 Proceedings of the Croatian Computer Vision Workshop, Year 1 September 19, 2013, Zagreb, Croatia variable r is computed as the uncertainty of the surface segment centroid tF in the direction of the segment normal n, computed by found.. Let the true plane be defined by the equation n ⋅ F p = Fρ , F T (1) σ2r = nT ⋅ ΣC ( t F ) ⋅ n . 
(4) F where n is the unit normal of the plane represented in the surface segment reference frame SF, Fρ is the distance of the plane from the origin of SF and Fp is an arbitrary point represented in SF. In an ideal case, where the measured plane is identical to the true plane, the true plane normal is identical to the z-axis of SF, which means that Fn = [0, 0, 1]T, while F ρ = 0. In a general case, however, the true plane normal deviates from the z-axis of SF and this deviation is described by the random variables sx and sy, representing the deviation in directions of the x and y-axis of SF respectively, as illustrated in Fig. 2 for x direction. The unit normal vector of the true plane can then be written as The variances σ2sx and σ 2sy describing the uncertainty of the segment plane normal are estimated using a simple model. The considered surface segment is represented by a flat 3D ellipsoid centered in the segment reference frame SF, as illustrated in Fig. 3. Assuming that the orientation of the surface segment is computed from four points at the ellipse perimeter which lie on the axes x and y of SF, as illustrated in Fig. 3, the uncertainty of the surface segment normal can be computed from the position uncertainties of these four points. According to this model the variances σ2sx and σ 2sy can be computed by F n= 1 s + s +1 2 x 2 y sx s y 1 T (2) σ2sx ≈ Furthermore, let the random variable r represent the distance of the true plane from the origin of SF, i. e. ρ =r. F (3) The uncertainty of the supporting plane parameters can be described by the disturbance vector q = [sx, sy, r]T. We use a Gaussian uncertainty model, where the disturbance vector q is assumed to be normally distributed with 0 mean and covariance matrix Σq. Covariance matrix Σq is a diagonal matrix with variances σ2sx , σ 2sy and σ2r on its diagonal describing the uncertainties of the components sx, sy and r respectively. These variances are computed from the uncertainties of the supporting point positions, which are determined using a triangulation uncertainty model analogous to the one proposed in [17]. Let this uncertainty be represented by the 3×3 covariance matrix ΣC(p) assigned to each point position vector p obtained by a 3D camera. In order to achieve high computational efficiency, the centroid tF of a surface segment is used as the representative supporting point and it is assumed that the uncertainties of all supporting points of this surface segment are similar to the centroid uncertainty. The variance σ2r describing the uncertainty of the disturbance σr2 , λ1 + σ2r σ2sy ≈ σr2 , λ 2 + σ2r (5) where λ1 and λ2 are the two largest eigenvalues of the covariance matrix Σp. Alternatively, a more elaborate uncertainty model can be used, such as those proposed in [6] and [10]. Finally, a scene surface segment is denoted in the following by the symbol F associated with the quadruplet F = ( CRF , C t F , Σ q , Σ p ) , (6) where CRF and CtF are respectively the rotation matrix and translation vector defining the pose of SF relative to the camera coordinate system SC. Analogously, a local model surface segment is represented by F ′ = ( MRF ′ , M t F ′ , Σ q′ , Σ p′ ) , (7) where index M denotes the local model reference frame. B. Initial Feature Matching The pose estimation process starts by forming a set of surface segment pairs (F, F’), where F is a planar surface ZF sx ZF λ2 r SF XF Fig. 2. Displacement of the true plane from the measured plane. CCVW 2013 Poster Session SF σr XF λ1 YF Fig. 3. Plane uncertainty model. 
33 Proceedings of the Croatian Computer Vision Workshop, Year 1 September 19, 2013, Zagreb, Croatia segment detected in the currently acquired image and F’ is a local model surface segment. A pair (F, F’) represents a correct correspondence if both F and F’ represent the same surface in the robot’s environment. The surface segments detected in the currently acquired image are transformed into the local model reference frame using an initial estimate of the camera pose relative to this frame and every possible pair (F, F’) of surface segments is evaluated according to the coplanarity and overlap criteria explained in [14]. These two criteria take into account the uncertainty of the assumed camera pose. In the experiments reported in this paper, the initial robot pose estimate used for feature matching is set to zero vector with the uncertainty described by a diagonal covariance matrix Σw,match = diag([ σ φ2 , σ t2 , σ t2 ]), where σα = 20° and σt = 1 m describe the uncertainty of the robot orientation and position in xy-plane of the world reference frame respectively. This uncertainty is propagated to the camera reference frame, taking into account the uncertainty of the camera inclination due to the uneven floor. The deviation of the floor surface from a perfectly flat surface is modeled by zero-mean Gaussian noise with standard deviation σf = 5 mm. C. Hypothesis Generation Given a sequence of correct pairs (F, F’), the camera pose relative to a local model reference frame can be computed using the EKF approach. Starting from the initial pose estimate, the pose information is corrected using the information provided by each pair (F, F’) in the sequence. After each correction step, the pose uncertainty is reduced. This general approach is applied e.g. in [18] and [5]. Some specific details related to our implementation are given in the following. Let (F, F’) be a pair of corresponding planar surface segments. Given a vector F'p representing the position of a point relative to SF', the same point is represented in SF by F ( ) p = C RFT RT ( φ ) ( M RF ′ F ′p + M t F ′ − t ) − C t F , where w = φT (8) T t T is an estimated camera pose relative to a local model reference frame. By substituting (8) into (1) we obtain F′ T n ⋅ F ′p = Fρ′ between the plane normals and their distances from the origin of SF'. Assuming that F and F' represent the same surface, the following equations hold F′ F′ n= F′ M ρ = Fρ + F nT ⋅ C RFT R R ( φ ) RF ( T F′ C C ρ = Fρ′ ′ , (13) where F n′ ′ and Fρ′ ′ are the parameters of the plane supporting F' represented in reference frame SF'. Since F n′ and F n′ ′ are unit vectors with two degrees of freedom, it is appropriate to compare only their two components. We choose the first two components to formulate the coplanarity constraint 1 0 0 F ′ ′ ′) n − Fn ( 0 1 0 =0 F′ F′ ρ − ρ′ (14) Note that the vector on the left side of equation (14) is actually a function of the disturbance vectors q and q' representing the uncertainty of the parameters of the planes supporting F and F' respectively, the pose w and the estimated poses of F and F' relative to SC and SM respectively. Hence, (14) can be written as (15) h ( F , F ′, w , q, q′ ) = 0 . Equation (15) represents the measurement equation from which EKF pose update equations can be formulated using the general approach described in [18]. This EKF-based procedure will give a correct pose estimate assuming that the sequence of surface segment pairs used in the procedure represent correct correspondences. 
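A schematic sketch of a single EKF correction step for such an implicit measurement equation is given below, using OpenCV matrices. The dimensions assume a 6-DoF pose and the three-component constraint per surface segment pair; the Jacobians H and G stand for the linearization of h with respect to the pose and to the disturbances, which are not derived here, so this is an illustration of the update structure rather than the authors' implementation:

```cpp
#include <opencv2/opencv.hpp>

// One generic EKF correction step for an implicit measurement h(w, q) = 0,
// linearized around the current estimate: H = dh/dw, G = dh/d[q; q'],
// with [q; q'] ~ N(0, Q).
void ekfUpdate(cv::Mat& w, cv::Mat& P,              // 6x1 pose, 6x6 pose covariance
               const cv::Mat& h,                    // 3x1 residual h(w, 0)
               const cv::Mat& H,                    // 3x6 Jacobian w.r.t. the pose
               const cv::Mat& G,                    // 3x6 Jacobian w.r.t. q and q'
               const cv::Mat& Q)                    // 6x6 covariance of [q; q']
{
    cv::Mat S = H * P * H.t() + G * Q * G.t();      // innovation covariance
    cv::Mat K = P * H.t() * S.inv(cv::DECOMP_SVD);  // Kalman gain
    w = w - K * h;                                  // the expected residual is zero
    P = (cv::Mat::eye(6, 6, CV_64F) - K * H) * P;   // covariance shrinks each step
}
```

In the actual procedure such a step is repeated for every surface segment pair in the sequence, so the pose covariance P is reduced with each processed pair.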
Since the initial correspondence set usually contains many false pairs, a number of pose hypotheses are generated from different pair sequences and the most probable one is selected as the final solution. We use the efficient hypothesis generation method described in [14]. The result of the described pose estimation procedure is a set of pose hypotheses ranked according to a consensus measure explained in [14]. Each hypothesis consists of the index of a local model to which the current depth image is matched and the camera pose relative to the reference frame of this model. (9) F n, (10) ) t F + RT ( φ ) ( t − M t F ′ ) . (11) Vector F n′ and value Fρ′ are the normal of F represented in SF' and the distance of the plane supporting F from the origin of SF' respectively. The deviation of the plane supporting the scene surface segment from the plane containing the local model surface segment can be described by the difference CCVW 2013 Poster Session (12) F′ III. where F′ n′ , n= EXPERIMENTAL EVALUATION In this section, the results of an experimental evaluation of the proposed approach are reported. We implemented our system in C++ programming language using OpenCV library [19] and executed it on a 3.40 GHz Intel Pentium 4 Dual Core CPU with 2GB of RAM. The algorithm is experimentally evaluated using 3D data provided by a Microsoft Kinect sensor mounted on a wheeled mobile robot Pioneer 3DX also equipped with a laser range finder SICK LMS-200. For the purpose of this experiment, two datasets were generated by manually driving the mobile robot on two different occasions 34 Proceedings of the Croatian Computer Vision Workshop, Year 1 Fig. 4. The map of the Department of Control and Computer Engineering, (FER, Zagreb) obtained using SLAM and data from a laser range finder and the trajectory of the wheeled mobile robot while generating images used in creating the topological map. through a section of a previously mapped indoor environment of the Department of Control and Computer Engineering, Faculty of Electrical Engineering and Computing (FER), University of Zagreb. The original depth images of resolution 640 × 480 were subsampled to 320 × 240. Fig. 4 shows the previously mapped indoor environment generated with the aid of SLAM using data from the laser range finder together with the trajectory of the robot when generating the first dataset. The first dataset consists of 444 RGB-D images recorded along with the odometry data. The corresponding ground truth data as to the exact pose of the robot in the global coordinate frame of the map was determined using laser data and Monte Carlo localization. These images were used to create the environment model – a database of local metric models with topological links. This environment model or topological map consisted of 142 local models, generated such that the local model of the first image was automatically added to the map and every consecutive image or local model added to the map satisfied at least one of the following conditions: (1) the translational distance between the candidate image and the latest added local model in the map was at least 0.5 m or (2) the difference in orientation between the candidate image and the latest added local model in the map was at least 15°. September 19, 2013, Zagreb, Croatia as the environment model. Thus, all 792 images were tested, among which 142 were used for model building. 
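The rule used above for deciding when a new local model is added to the topological map can be summarized by a small helper of the following form (a sketch only; the function and parameter names are ours, with dx, dy and dyawDeg denoting the pose difference with respect to the most recently added local model):

```cpp
#include <cmath>

// Add a newly acquired image to the topological map as a new local model if it
// is sufficiently far, in translation or orientation, from the last added one.
bool addNewLocalModel(double dx, double dy, double dyawDeg)
{
    const double minTranslation = 0.5;   // metres, as in the reported experiment
    const double minRotation    = 15.0;  // degrees, as in the reported experiment
    const double translation = std::sqrt(dx * dx + dy * dy);
    return translation >= minTranslation || std::fabs(dyawDeg) >= minRotation;
}
```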
As explained in Section II, each generated hypothesis provides the index of the local model from the topological map as well as the relative robot pose corresponding to the test image with respect to the local model. This robot pose is referred to herein as a calculated pose. By comparing the calculated pose of the test image to the corresponding ground truth data, the accuracy of the proposed approach can be determined. For each test image, a correct hypothesis is considered to be one where: (1) the translational distance between the calculated pose and the ground truth data is at most 0.2m and; (2) the absolute difference in orientation between the calculated pose and the ground truth is at most 2°. Using this criterion, the effectiveness of the proposed global localization method can be assessed not only on the basis of the accuracy of the solutions, but also on the minimum number of required hypotheses that need to be generated in order to obtain at least one correct hypothesis. Examples of images from both datasets are given in Fig. 5. An overview of the results of the initial global localization experiment is given in Table I. Of the 792 test images, the proposed approach was not able to generate any hypothesis in 30 cases. In all 30 cases, the scenes were deficient in information needed to estimate all six degrees of freedom (DoF) of the robot's pose. Such situations normally arise when the camera is directed towards a wall or door at a close distance, e.g. while the robot is turning around in a corridor. TABLE I. GLOBAL LOCALIZATION RESULTS Total No hypothesis No correct hypothesis Correct hypothesis Number of images 792 30 44 718 Percentage (%) 100.00 3.79 5.56 90.65 In 44 cases, no correct hypothesis was generated by the proposed approach. There were two main reasons for such a situation: (1) the topological map did not contain a local model covering the scene of the test image; (2) the existence of repetitive structures in the indoor environment. An example of such situations is a pair of scenes shown in the last column The trajectory of the robot during the generation of the second sequence was not the same as the first sequence but covered approximately the same area. With the aid of odometry information from the robot encoders, the second sequence was generated by recording RGB-D images every 0.5 m or 5° difference in orientation between consecutive images. The corresponding ground truth data was determined using laser data and Monte Carlo localization and recorded as well. This second dataset consisted of a total of 348 images. The global localization procedure was performed for all the images in both datasets with the topological map serving CCVW 2013 Poster Session Fig. 5. Examples of images used in the global localization experiment. 35 Proceedings of the Croatian Computer Vision Workshop, Year 1 September 19, 2013, Zagreb, Croatia hypothesis in 90% of cases. On average, the first correct hypothesis is the 4th ranked hypothesis. For the highest ranked correct hypotheses, the error in position was on average approximately 37 mm, while the difference in orientation was on average approximately 0.6°. For 99% of these hypotheses, the pose error was at most 164 mm and 1.9°. REFERENCES [1] [2] [3] [4] [5] Fig. 6 Normalized cumulative histogram of the error in position (top-left) error in orientation (top-right) index of the first correct hypothesis (bottom). [6] of Fig. 5, where one can notice the similar repeating doorways on the left side of the corridor. 
The accuracy of the proposed approach is determined using the 718 images with a correct hypothesis. The results are shown statistically in Table II as well as in Fig. 6, in terms of the absolute error in position and orientation between the correct pose and the corresponding ground truth pose of the test sequence images, as well as the index of the first correct hypothesis. The error bounds as well as the number of highest ranked hypotheses containing at least one correct hypothesis for 99% of the samples are specially denoted in Fig. 6.

Fig. 6. Normalized cumulative histogram of the error in position (top-left), error in orientation (top-right) and index of the first correct hypothesis (bottom).

TABLE II. STATISTICAL DETAILS OF THE GLOBAL LOCALIZATION POSE ERROR AND THE INDEX OF THE FIRST CORRECT HYPOTHESIS
        Translation Error (mm)   Orientation Error (°)   Index of the first correct hypothesis
Avg.    36.83                    0.62                    4.29
Std.    38.22                    0.56                    25.53
Max.    199.65                   1.99                    530.00

IV. CONCLUSION

In this paper a global localization approach based on the environment model consisting of planar surface segments is discussed. Planar surface segments detected in the local environment of the robot are matched to the planar surface segments in the model and the robot pose is estimated by registration of planar surface segment sets based on EKF. The result is a list of hypotheses ranked according to a measure of their plausibility. The considered approach is experimentally evaluated using depth image sequences acquired by a Microsoft Kinect sensor mounted on a mobile robot. The analyzed approach generated at least one correct pose hypothesis in 90% of cases. On average, the first correct hypothesis is the 4th ranked hypothesis. For the highest ranked correct hypotheses, the error in position was on average approximately 37 mm, while the difference in orientation was on average approximately 0.6°. For 99% of these hypotheses, the pose error was at most 164 mm and 1.9°.
REFERENCES
[1] M. Cummins and P. Newman, “Highly scalable appearance-only SLAM - FAB-MAP 2.0,” in Proceedings of Robotics: Science and Systems, Seattle, USA, 2009.
[2] M. Milford, “Vision-based place recognition: how low can you go,” The International Journal of Robotics Research, vol. 32, no. 7, 2013, pp. 766–789.
[3] S. Thrun, W. Burgard, and D. Fox, Probabilistic Robotics, The MIT Press, 2005.
[4] S. Se, D. G. Lowe, and J. J. Little, “Vision-based global localization and mapping for mobile robots,” IEEE Transactions on Robotics, vol. 21, no. 3, 2005, pp. 364–375.
[5] A. Kosaka and A. Kak, “Fast vision-guided mobile robot navigation using model-based reasoning and prediction of uncertainties,” CVGIP: Image Understanding, vol. 56, 1992, pp. 271–329.
[6] O. Faugeras, Three-Dimensional Computer Vision: A Geometric Viewpoint. Cambridge, Massachusetts: The MIT Press, 1993.
[7] J. Stückler and S. Behnke, “Multi-Resolution Surfel Maps for Efficient Dense 3D Modeling and Tracking,” Journal of Visual Communication and Image Representation, 2013.
[8] D. Cobzas and H. Zhang, “Mobile Robot Localization using Planar Patches and a Stereo Panoramic Model,” in Proceedings of Vision Interface, 2001, pp. 94–99.
[9] M. F. Fallon, H. Johannsson, and J. J. Leonard, “Efficient scene simulation for robust Monte Carlo localization using an RGB-D camera,” in Proceedings of the IEEE International Conference on Robotics and Automation, 2012, pp. 1663–1670.
[10] K. Pathak, A. Birk, N. Vaskevicius, and J. Poppinga, “Fast Registration Based on Noisy Planes with Unknown Correspondences for 3D Mapping,” IEEE Transactions on Robotics, vol. 26, no. 3, 2010, pp. 424–441.
[11] P. Besl and N. McKay, “A Method for Registration of 3-D Shapes,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 14, no. 2, 1992.
[12] S. Rusinkiewicz and M. Levoy, “Efficient Variants of the ICP Algorithm,” in Proceedings of the 3rd Int. Conf. on 3D Digital Imaging and Modeling (3DIM), 2001.
[13] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon, “KinectFusion: Real-Time Dense Surface Mapping and Tracking,” in Proceedings of the 10th Int. Symp. on Mixed and Augmented Reality (ISMAR), 2011.
[14] R. Cupec, E. K. Nyarko, D. Filko, and I. Petrović, “Fast Pose Tracking Based on Ranked 3D Planar Patch Correspondences,” in Proceedings of the IFAC Symposium on Robot Control, Dubrovnik, Croatia, 2012.
[15] F. Schmitt and X. Chen, “Fast segmentation of range images into planar regions,” in Proceedings of the IEEE Computer Society Conf. on Computer Vision and Pattern Recognition (CVPR), 1991, pp. 710–711.
[16] M. Garland, A. Willmott, and P. S. Heckbert, “Hierarchical Face Clustering on Polygonal Surfaces,” in Proceedings of the ACM Symposium on Interactive 3D Graphics, 2001.
[17] L. Matthies and S. A. Shafer, “Error Modeling in Stereo Navigation,” IEEE Journal of Robotics and Automation, vol. 3, 1987, pp. 239–248.
[18] N. Ayache and O. D. Faugeras, “Maintaining Representations of the Environment of a Mobile Robot,” IEEE Trans. on Robotics and Automation, vol. 5, no. 6, 1989, pp. 804–819.
[19] G. Bradski and A. Kaehler, Learning OpenCV, O'Reilly, 2008.

Multiclass Road Sign Detection using Multiplicative Kernel

Valentina Zadrija, Mireo d. d., Zagreb, Croatia, [email protected]
Siniša Šegvić, Faculty of Electrical Engineering and Computing, University of Zagreb, Zagreb, Croatia, [email protected]

Abstract—We consider the problem of multiclass road sign detection using a classification function with a multiplicative kernel comprised of two kernels. We show that the problems of detection and within-foreground classification can be jointly solved by using one kernel to measure object-background differences and another one to account for within-class variations. The main idea behind this approach is that road signs from different foreground variations can share features that discriminate them from backgrounds. The classification function training is accomplished using SVM, thus feature sharing is obtained through support vector sharing. Training yields a family of linear detectors, where each detector corresponds to a specific foreground training sample. The redundancy among detectors is alleviated using k-medoids clustering. Finally, we report detection and classification results on a set of road sign images obtained from a camera on a moving vehicle.

Keywords—multiclass object detection; object classification; road sign; SVM; multiplicative kernel; feature sharing; clustering

I. INTRODUCTION

Road sign detection and classification is an exciting field of computer vision. There are various applications of road sign detection and classification in driving assistance systems, autonomous intelligent vehicles and automated traffic inventories. The latter is of particular interest to us, since traffic inventories include periodical on-site assessments carried out by trained safety expert teams. The current road condition is then manually compared against the reference state in the inventory. In current practice, the process is tedious, error prone and costly in terms of expert time. Recently, there have been several attempts to at least partially automate the process [1], [2]. This paper presents an attempt to partially automate this process in terms of road sign detection and classification using the multiplicative kernel. In general, detection and classification are typically two separate processes.
The most common detection method is sliding window approach, where each window in the original image is evaluated by a binary classifier in order to determine whether the window contains an object. When the prospective object locations are known, classification stage is applied only on the resulting windows. One common classification approach includes partitioning the object space into subclasses and then training a dedicated classifier for each subclass. This approach is known as one-versus-all classification. In this paper, we focus on detection and classification of ideogram-based road signs as a one-stage process. We employ a jointly learned family of linear detectors obtained through Support Vector Machine (SVM) learning with multiplicative kernel as presented in [3]. In its original form, SVM is a binary classifier, but multiplicative kernel formulation enables the multiclass detection. Multiplicative kernel is defined as a product of two kernels, namely between-class kernel kx and within-class kernel k as described in Section IV. Betweenclass kernel kx is dedicated for detection, i.e. the foregroundbackground classification. Within-class kernel k is used for within-foreground classification, i.e. to discriminate between various object subclasses. The result of SVM training is a set of support vectors and corresponding weights which are then used to generate a family of detectors. The detectors are obtained by tuning within-class state into the multiplicative kernel as described in Section IV-B. The key point here is that all detectors share the same support vectors, but weights of specific support vectors vary depending on the within-class state value. We evaluate the described approach on a set of predefined road sign subclasses defined in Section III and present experimental results in Section V. II. RELATED WORK In recent years, a lot of work has been proposed in the area of multiclass object detection. We will address the issues from both road sign detection aspect as well as general multiclass object detection and feature sharing. The approach presented in [4] tries to solve multi-view face detection problem. Foreground class is partitioned into subclasses according to variations in face orientation. For each subclass, a corresponding detector is learned. This approach exhibits several problems when used with a larger number of subclasses. More specifically, number of features, number of false positives and the total training time grow linearly with the number of subclasses. This work has been supported by research projects Vista (EuropeAid/131920/M/ACT/HR) and Research Centre for Advanced Cooperative Systems (EU FP7 #285939). CCVW 2013 Poster Session 37 Proceedings of the Croatian Computer Vision Workshop, Year 1 September 19, 2013, Zagreb, Croatia III. Fig. 1. Distribution of samples with respect to road sign subclasses S = {1, 2, 3, 4, 5} for training dataset. Below each subclass label v, representative subclass members are shown. Road signs shown for subclass v=1 are informal, i.e. subclass contains 9 different road signs. Further, authors in [5] focus on feature sharing using JointBoost procedure. In contrast to [4], where number of features grows linearly with the number of subclasses, the authors have experimentally shown that the number of features grows logarithmically with respect to the number of subclasses. 
Additionally, the authors showed that a jointly trained detector requires a significantly smaller number of features in order to achieve the same performance as a independent detector. The approach presented in [6] deals with problem of multiview detection using the so called Vector Boosted Tree procedure. At the detection time, the input is examined by a sequence of nodes starting from the root node. If the input is classified by the detector of current node as a member of the object class, it is passed to the node children. Otherwise, it is rejected as a background. The drawback of this approach is that it requires the user to predefine the tree structure and choose the position to split. Similar to [6], the approach presented in [7] also employs a classifier structured in a form of a tree. However, in contrast to [6], the tree classifier is constructed automatically - the node splits are achieved through unsupervised clustering. The algorithm is iterative, i.e. at the beginning it starts with an empty tree and training samples which are assigned weights. By adding a node to the tree, the sample weights are modified accordingly. Additionally, if a split is achieved, parent classifiers of all nodes along the way to the root node are modified. The concept of feature sharing is also explored through shape-based hierarchical compositional models [8], [9], [10]. These models are used for object categorization [8], [10], but also for multi-view detection, [9]. Different object categories can share parts or appearance. Parts on lower hierarchy levels are combined into larger parts on higher levels. In general, parts on lower levels are shared amongst various object categories, while those in higher levels are more category specific. This is similar to approach employed in this paper, however, in this paper, the feature sharing is obtained through a single-level support vector sharing. The approach presented in [1] describes a road sign detection and recognition system based on sharing features. The detection subsystem is a two stage process comprised of color-based segmentation and SVM shape detection. The recognition subsystem comprises GentleBoost algorithm and Rotation-Scale-Translation Invariant template matching. CCVW 2013 Poster Session PROBLEM DEFINITION In this paper, we focus on detection and classification of ideogram-based road signs. The class of all road signs exhibits various foreground variations with respect to the sign shape but also on presence or absence of thick red rim and ideogram type. We aim to jointly train detectors to discriminate road signs from backgrounds as well as to produce foreground variation estimates, i.e. subclass labels. We will use terms "foreground variation" and "subclass" interchangeably in the rest of the paper denoting the same concept. Fig. 1 depicts within-class road sign subclasses which we aim to estimate. The class of all road signs is comprised out of five variations denoted with label v from S = {1, 2, 3, 4, 5}. Subclass v=1 includes triangular warning signs with thick red. A small subset of the subclass members is shown in Fig. 1, i.e. the subclass contains nine different road signs. These signs belong to the category A according to the Vienna Convention [11]. Subclass v=2 contains circular "End of no overtaking zone" sign which belongs to the category C of informative signs. On the other hand, members of subclass v=3 "Priority road" and "End of Priority Road" are rhomb-shaped. 
Subclass v=4 includes square-shaped signs which are a subset of informative category C road signs. Finally, the subclass v=5 conatins circular "Speed Limit" signs characterized with the thick red rim which belongs to the category B of prohibitory signs. We discuss the described road-sign variations in our dataset and the motivation for our approach as follows. First, we discuss motivation for partitioning road signs into subclasses according to the distribution shown in Fig. 1. The dataset is extracted from the video recorded with camera mounted on top of a moving vehicle. Video sequences are recorded at daytime, at different weather conditions [2]. Further, as Fig. 1 shows, the distribution of signs in the dataset is unbalanced, i.e. certain variations like triangular signs are characterized with large number of instances, while some others have a small number of occurrences. In particular, the "End of Priority Road" sign shown as a part of the subclass v=3 in Fig. 1 has only nine instances in training dataset. In the approach where we build a single detector for a particular subclass, it is clear that detector trained with only nine samples would have very poor detection rate. However, the "End of Priority Road" sign shares the same shape as the "Priority Road" sign. If we were to group them into a single subclass, we could exploit foreground-variation feature sharing. Other heterogeneous subclasses are designed with the same motivation. Note that the subclass v=5 is also a heterogeneous subclass, i.e. it contains various speed limit signs which share the thick red rim and the zero digit, since the speed limits are usually multiples of ten. Second, according to the subclasses defined in Fig. 1, observe that signs belonging to different subclasses also share similarities. For example, the "End of no overtaking zone" (subclass v=2) sign and "End of Priority Road" (subclass v=3) both share the same distinctive crossover mark. This similarity could improve discrimination capability of both signs with respect to the background class. This suggests that if we solve detection and classification problem for all subclasses together, we could benefit from within-class feature sharing. 38 Proceedings of the Croatian Computer Vision Workshop, Year 1 September 19, 2013, Zagreb, Croatia Therefore, due to the described characteristics of the dataset distribution, as well as the nature of sign similarities, we decided to employ a method presented in [3] where a classification function is learned jointly for all within-class variations. The aim of this approach is to form the classification function which could exploit the fact that different variations share features against backgrounds, but at the same time provide within-class discrimination. IV. DETECTION AND CLASSIFICATION APPROACH The overall road sign detection and classification process is shown in Fig. 2. The detailed description is as follows. For a given feature vector x ∈ ℝn computed for an image patch, the goal is to decide whether it represents an instance of a road sign class and, if so, to produce the corresponding subclass estimate v from S = {1, 2, 3, 4, 5}. Let xi denote feature vector of the i-th road sign training sample belonging to the subclass with label vi. The feature vectors are given as HOG features [15]. 
According to the above defined parameters, the classification function C(x, i) is defined as follows:

C(x, i) > 0, if x is a foreground sample from the same subclass as xi,
C(x, i) ≤ 0, otherwise.   (1)

This corresponds to the non-parametric approach presented in [3]. The parametric approach [3] is simpler; however, it requires each subclass v to be described with a specific parameter. The role of the parameter is to describe members of the subclass in a unique way. In the multiview domain, the parameter typically corresponds to the view angle or object pose. However, in the road sign domain the subclasses are heterogeneous and cannot be described with a single parameter. For this reason, the classification function employs the foreground sample feature vector xi in order to describe a specific subclass. Since the road sign subclasses are designed with the goal that signs within a subclass are similar, it is to be expected that they will also exhibit similar feature vector values xi.

In order to satisfy the requirements of within-class feature sharing against the background class, as well as within-class discrimination, the classification function C(x, i) is represented as a product of two kernels:

C(x, i) = Σs∈SV αs · k(xs, xi) · kx(xs, x) .   (2)

Parameter αs denotes the Lagrange multiplier of the s-th support vector [12], xs the particular support vector and xi the i-th foreground training sample, while k(xs, xi) denotes the within-class kernel and kx(xs, x) the between-class kernel. In addition, the product of the first two terms in equation (2) can be summarized into a single term α′s(i), which denotes the weight of the s-th support vector xs for the foreground sample xi:

α′s(i) = αs · k(xs, xi) .   (3)

As a result, the support vectors for which the within-class kernel k yields higher values will have a larger influence on the classification function. In this way, we achieve the within-class feature sharing as well as the within-class discrimination.

Fig. 2. Training and detector construction outline.

A. Classification Function Training

The training of the classification function (2) is achieved using SVM. The training samples take the form of tuples (x, i). Each foreground training sample x is assigned its corresponding sample index i. Background training samples x are obtained from image patches without road signs. Each background training sample x can be associated with any index of a foreground training sample in order to form a valid tuple. More specifically, a background sample x is a negative with respect to all foreground samples. The number of such combinations is huge and corresponds to

#(NB) · #(NF) .   (4)

The parameter #(NB) corresponds to the total number of backgrounds and #(NF) to the total number of foreground training samples. Due to the combinatorial complexity, including all negative samples in SVM training would not be practical. Therefore, bootstrap training is employed as a hard negative mining technique. In bootstrap training, only #(NB) negatives are initially included in the training. These samples are assigned foreground sample indices in a random fashion. After each training round, all negative samples are evaluated by the classification function (2). False positives are added to the negative set and the SVM training is repeated. This is an iterative process which converges when there are no more false positives to add.

B. Individual Detector Construction

C(x, i) is learned as a function of a foreground variation parameter i, rather than learning separate detectors for each i.
Individual detectors w(x, i) are obtained from the classification function by plugging specific foreground sample values xi into (2) and (3):

w(x, i) = Σs∈SV α′s(i) · kx(xs, x) .   (5)

Note that with a fixed foreground variation i and a known set of support vectors xs, we can precompute the within-class kernel values k(xs, xi) and consequently the support vector weights α′s(i). Therefore, the within-class kernel k is not evaluated at detection time. Rather, at detection time we only evaluate the between-class kernel kx. This fact affects our choice of the within-class and between-class kernels.

First, we discuss the within-class kernel k. Road sign subclasses are difficult to separate and it is therefore important for k to be able to separate nonlinear problems. Therefore, the Gaussian RBF kernel was chosen for that purpose:

k(x, xi) = exp(−γ · D(x, xi)) ,   (6)

where D(x, xi) denotes the Euclidean distance and γ the corresponding kernel parameter. Due to the fact that the RBF kernel is evaluated only during training and detector construction, this choice doesn't impose a performance penalty during detection. Secondly, since the between-class kernel kx is evaluated during detection, it is important for kx to be fast. Therefore, we chose the linear kernel for that purpose. By substituting the linear kernel formulation kx(xs, x) = xsT · x into (5) we obtain the final form of our detectors:

w(x, i) = Σs∈SV α′s(i) · (xsT · x) = w(i)T · x .   (7)

In this way, the detection is achieved by applying a simple dot product between the image patch and the detector weights denoted as w(i). Note that all detectors share the same set of support vectors. In this way, feature sharing among various detectors is achieved.

C. Detection Approach

In the detection process, we employ the well known sliding window technique. Each window is evaluated by a family of linear detectors w(x, i) constructed according to (7). From all detector responses, the one with the maximum value is chosen as the result. If this value is positive, the window is classified as a road sign belonging to the subclass of xi. Otherwise, the window is discarded as background. Note that this is similar to the k-nearest neighbors (k-NN) method with parameter k=1, i.e. the object is simply assigned to the class of the nearest neighbor selected among all detectors [13]. However, in the 1-NN approach the number of evaluated detectors is significant, i.e. it corresponds to the number of foreground samples #(NF). Evaluating all #(NF) detectors at the detection stage would make the detection extremely slow. In addition, since foreground samples belonging to the same subclass are similar, there may be redundancy among the detectors. In order to identify a representative set from the family of total #(NF) detectors, we use the k-medoids clustering technique. The k-medoids technique is chosen due to its simplicity and also because it is less prone to outlier influence than, for instance, the k-means method. Clustering yields a set of k < #(NF) medoids which are then used in the detection phase. In clustering, each detector w(x, i) is represented with the vector of its support vector weights:

α′(i) = [α′1(i), α′2(i), …, α′SV(i)] ,   (8)

where α′s(i), s ∈ {1, …, SV}, denotes the particular support weight defined with (3), while SV denotes the total number of support vectors.
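The detector construction and evaluation given by (3), (5) and (7) can be illustrated with the following sketch. This is our own simplified reconstruction (the names and the omission of the SVM bias term are assumptions), not the authors' code: it precomputes the weight vector w(i) for one foreground sample and evaluates the detector family on an image patch.

#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;   // HOG feature vector of fixed dimension

static double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (std::size_t d = 0; d < a.size(); ++d) s += a[d] * b[d];
    return s;
}

// Within-class RBF kernel (6): k(x, xi) = exp(-gamma * D(x, xi)),
// with D(x, xi) taken as the Euclidean distance as stated above.
static double withinClassKernel(const Vec& x, const Vec& xi, double gamma) {
    double sq = 0.0;
    for (std::size_t d = 0; d < x.size(); ++d) {
        double diff = x[d] - xi[d];
        sq += diff * diff;
    }
    return std::exp(-gamma * std::sqrt(sq));
}

// Support vector weights (3): alphaPrime_s(i) = alpha_s * k(x_s, x_i).
Vec computeAlphaPrime(const std::vector<Vec>& sv, const Vec& alpha, const Vec& xi, double gamma) {
    Vec alphaPrime(sv.size());
    for (std::size_t s = 0; s < sv.size(); ++s)
        alphaPrime[s] = alpha[s] * withinClassKernel(sv[s], xi, gamma);
    return alphaPrime;
}

// Linear detector (7): w(i) = sum_s alphaPrime_s(i) * x_s, later evaluated as w(i)^T x.
Vec buildDetector(const std::vector<Vec>& sv, const Vec& alphaPrime) {
    Vec w(sv.front().size(), 0.0);
    for (std::size_t s = 0; s < sv.size(); ++s)
        for (std::size_t d = 0; d < w.size(); ++d)
            w[d] += alphaPrime[s] * sv[s][d];
    return w;
}

// Sliding-window decision: take the maximum response over the detector family;
// a positive maximum assigns the window to the subclass of the winning sample xi,
// while -1 means the window is rejected as background.
int classifyWindow(const std::vector<Vec>& detectors, const Vec& window, double& bestResponse) {
    int best = -1;
    bestResponse = 0.0;
    for (std::size_t i = 0; i < detectors.size(); ++i) {
        double r = dot(detectors[i], window);
        if (r > bestResponse) { bestResponse = r; best = static_cast<int>(i); }
    }
    return best;
}

Since the within-class kernel only appears inside computeAlphaPrime and buildDetector, the per-window cost is a single dot product per detector, which is the motivation for the linear between-class kernel.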
As a distance measure, we used Dist(i, j) defined as follows:

Dist(i, j) = 1 − (α′(i)T · α′(j)) / (‖α′(i)‖ · ‖α′(j)‖) .   (9)

The appropriate number of medoids k is chosen in an iterative process, where we gradually decrease the number of medoids and measure the clustering quality. Initially, the target number of medoids, i.e. cluster centers k, is set to 50% of the initial number of detectors, i.e. foreground samples. This number was chosen by a rule of thumb. Then, we apply clustering according to the chosen number of medoids k. In order to measure the clustering quality, we compute the corresponding silhouette value [14]. The silhouette value provides an estimate of how well the obtained medoids represent the data in their corresponding clusters. In each iteration, the target number of clusters is decreased by a certain factor and the above process is repeated. The final clustering outline is chosen as the one which yields the best silhouette value.

V. EXPERIMENTAL RESULTS

In this section, we describe the evaluation of the above described method according to the foreground variation distribution presented in Section III. In our experiments, we used three disjoint datasets, described below [2]. As we already stressed, the distribution shown in Fig. 1 is unbalanced, i.e. subclass v=1 contains at least three times more samples than the other classes. In such unbalanced datasets, there is a possibility for a specific road sign variation with a smaller number of samples to be treated as noise, e.g. subclass v=3. In order to test this hypothesis, we experimented with the number of samples per class. Namely, we observed detection performance for NPOS=200, 300 and 400 samples per subclass. For the subclasses with a number of samples lower than NPOS, we simply use all the samples of the subclass, e.g. subclasses v=2, v=3 and NPOS=400. The training samples are extracted randomly from a training dataset pool comprising 2153 road signs. The negative dataset contains 4000 image patches extracted randomly from images of background outdoor scenes. In order to monitor the clustering performance, we use a test dataset comprised of 3000 cropped images. Details of the test dataset are given in Section V-B.

Fig. 3. Examples of detection and classification: (a) correct classification, (b) correct classification and false positive.

A. Training Approach

The training of detectors is achieved as described in Section IV. As features, we used HOG vectors [15] computed from training images cropped and scaled to 24 x 24 pixels. The cell size is set to 4 pixels, where each cell is normalized within a block of four cells. In order to increase performance, we use block overlapping with the block stride set to the size of a single cell. The training is achieved using SVMlight [16] with the multiplicative kernel. In contrast to [3], where the parameter γ of the RBF kernel (6) is set to a fixed value, we perform cross validation on the training dataset in each bootstrapping round in order to obtain the best γ value. We compared both training approaches and the one with the optimized γ value yields a lower number of support vectors. This suggests a better mapping in the transformed feature space. More specifically, with a training set comprised of 1325 positive samples and 4000 negative samples, the training with the fixed γ value yields a set of 857 support vectors. On the other hand, the training with the optimized γ decreases the number of support vectors by 10%. The training yields a family of #(NF) detectors.
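The clustering-based reduction of this detector family (cf. Section IV-C) can be made concrete with the sketch below. This is an illustrative reconstruction, not the authors' Matlab implementation; the PAM clustering step itself is assumed to be available elsewhere. The sketch computes the pairwise distance (9) between detectors and the mean silhouette value used to compare different choices of k.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

using Vec = std::vector<double>;   // alphaPrime(i) weight vector of one detector, cf. (8)

// Cosine distance (9) between the weight vectors of detectors i and j.
double detectorDistance(const Vec& a, const Vec& b) {
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (std::size_t d = 0; d < a.size(); ++d) {
        dot += a[d] * b[d];
        na += a[d] * a[d];
        nb += b[d] * b[d];
    }
    return 1.0 - dot / (std::sqrt(na) * std::sqrt(nb));
}

// Mean silhouette value for a given cluster assignment (one label per detector):
// a(i) is the mean distance to detectors of the same cluster, b(i) the smallest
// mean distance to another cluster, and s(i) = (b - a) / max(a, b).
double meanSilhouette(const std::vector<Vec>& detectors, const std::vector<int>& label, int k) {
    const std::size_t n = detectors.size();
    std::vector<std::vector<double>> dist(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = i + 1; j < n; ++j)
            dist[i][j] = dist[j][i] = detectorDistance(detectors[i], detectors[j]);

    double total = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        std::vector<double> sum(k, 0.0);
        std::vector<int> count(k, 0);
        for (std::size_t j = 0; j < n; ++j) {
            if (j == i) continue;
            sum[label[j]] += dist[i][j];
            ++count[label[j]];
        }
        double a = count[label[i]] > 0 ? sum[label[i]] / count[label[i]] : 0.0;
        double b = std::numeric_limits<double>::max();
        for (int c = 0; c < k; ++c)
            if (c != label[i] && count[c] > 0) b = std::min(b, sum[c] / count[c]);
        total += (b - a) / std::max(a, b);
    }
    return total / static_cast<double>(n);
}

Starting from k equal to half the number of detectors and decreasing it by a fixed factor, the configuration with the highest mean silhouette is kept, which matches the selection procedure described above.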
The #(NF) corresponds to the number of foreground training samples which depends on the number of training samples per subclass NPOS, Table I. Due to performance reasons, we use k-medoids clustering in order to select a representative set of detectors from total #(NF) detectors. The clustering is implemented in Matlab according to the Partitioning Around Medoids method [17]. In each clustering iteration, we monitor the silhouette value and the validation results on the test dataset comprised out of cropped images. Note that this dataset is disjoint from test dataset used for detection. Interestingly, better silhouette values correspond to a smaller number false negatives obtained from validation on test data. The results of clustering are summarized in Table I. depending on the NPOS and #(NF). Resulting number of detectors k corresponds approximately to 30% of total training samples #(NF). Lower k values lead to poor validation results and small silhouette values. B. Detection and Classification Results The test dataset for detection evaluation contains 1038 images in 720x576 resolution. From the total 1038 images, we selected 200 images and used them for performance evaluation. In these images, there were 214 physical road signs. Table II. shows the detection and foreground estimation results for the three case studies. We report the detection rate D, the classification rate C, the false positive rate FP and the false positive rate per image FP/I. D is defined as a number of detected signs with respect to the total number of signs, while C and FP correspond to the number of correct classifications and false detections with respect to the total number of signs, respectively. FP/I corresponds to the number of false detections with respect to the number of images. Columns denoted with sign show differences in above metrics with respect to configuration denoted by NPOS=200. We started the experiment with NPOS=200 samples per subclass. This configuration achieves overall detection rate of CCVW 2013 Poster Session September 19, 2013, Zagreb, Croatia 90% with false positive rate of 43%. Next, NPOS=300 achieves 4% rise in detection rate giving total 94%, and 3% rise in classification rate, i.e. 93%. However, it is characterized with a large false positive rate of 55%. In sliding window detection, some computer vision libraries like OpenCV [18] employ false positive detection policy where an object must exhibit at least n detections in order to be accounted as a result. This is understandable, since sliding window technique exhibits multiple responses around single object. In this work, we didn't experiment with this property, however we believe that it would decrease false positive rate. Finally, the configuration NPOS=400 exhibits worse results with respect to NPOS=300. This supports our hypothesis that the unbalanced dataset NPOS=400 treats subclasses with a lower number of samples as a noise giving the lower overall detection rate. Table III. depicts D and C rates, as well as the FP rate per specific subclass v for configurations NPOS=300 and NPOS=200, respectively. The subclass v=1 achieves better results when a larger number of samples is used. This is understandable, since this subclass comprises a large number of different road signs. Interestingly, the subclass v=3 which has only 98 samples (8% of the total samples for subclass v=1) achieves detection and classification rate of 100% in all case studies. 
Subclasses v=2 and v=4 achieve lower detection and classification rates with respect to the other subclasses. Subclass v=2 is circle-shaped but lacks the red rim needed to share features with subclass v=5. On the other hand, subclass v=4 is rectangle-shaped and gains less benefit from feature sharing with other subclasses. Subclass v=5 obtains similar results for NPOS=300 and NPOS=200. The FP distribution per subclass for NPOS=200 shows that subclass v=4 exhibits the largest number of false positives, i.e. 58%. Examples of false positives classified as members of subclass v=4 include building windows. On the other hand, NPOS=300 yields a rather balanced FP distribution, where subclasses v=1, 4 and 5 obtain an FP rate of approximately 30%.

TABLE I. CLUSTERING RESULTS
NPOS   #(NF)   k
200    833     288
300    1132    332
400    1325    388

TABLE II. DETECTION AND CLASSIFICATION
NPOS   D      ΔD     C      ΔC     FP     ΔFP     FP/I    ΔFP/I
200    91%    –      90%    –      45%    –       47%     –
300    94%    +3%    93%    +3%    55%    +10%    58%     +11%
400    90%    -1%    90%    0%     43%    -2%     45%     -3%

TABLE III. RESULTS PER SUBCLASS
       NPOS = 300            NPOS = 200
v      D      C      FP      D      C      FP
1      100%   100%   34%     91%    91%    15%
2      82%    77%    0%      82%    82%    0%
3      100%   100%   4%      100%   100%   5%
4      88%    82%    31%     76%    71%    58%
5      92%    92%    31%     93%    93%    22%

Examples of detection and classification are given in Fig. 3a and Fig. 3b. Fig. 3a illustrates an example of a correct classification, where a "Speed Limit" sign is classified as a member of subclass v=5 (orange dotted line), while the "Children" sign is classified as a member of subclass v=1 (green dotted line). Fig. 3b illustrates a correct classification, as well as two false positives. The "Priority Road" sign is correctly classified as subclass v=3 (cyan dotted line). The "Weight Limit" sign was not present in the training data; however, due to its similarity with the "Speed Limit" sign, it is classified as a member of subclass v=5 (orange dotted line). The latter indicates within-class feature sharing. The triangle-like object is incorrectly classified as a member of subclass v=1 (green dotted line).

VI. CONCLUSION

In this paper, we considered a road-sign detection technique based on a multiplicative kernel. One of the major challenges was a poorly balanced dataset, where triangular warning signs have at least three times more instances than the other subclasses. Our approach is based on the premise that different sign subclasses share features which discriminate them from backgrounds. Therefore, instead of learning a dedicated detector for each subclass, we trained a single classification function for all subclasses using SVM [3]. Individual detectors are afterwards constructed from a shared set of support vectors. A major benefit of such an approach with respect to separately trained detectors lies in feature sharing, which enhances the detection rate for subclasses with a lower number of samples. In comparison to partition based approaches, this approach does not require the training samples to be labeled with subclass parameters in order to learn the classification function. This fact has proven to be useful for our domain, since the road sign subclasses are heterogeneous and it is hard to describe a subclass with a single parameter. Instead, each training sample is labeled with its corresponding HOG feature vector. In this way we obtain #(NF) subclass parameters and consequently #(NF) detectors from the classification function, where #(NF) corresponds to the number of foreground samples.
However, due to performance issues, we perform clustering in order to obtain a representative set of detectors from the set of total #(NF) detectors. The reduced set of detectors is used in detection. Using the described method, we achieved the best detection rate of 94% at a relatively high false positive rate of 55%. We experimented with different numbers of samples per subclass in order to observe the effect on the detection rate. The obtained results showed that detectors trained on a limited number of samples, i.e. 300 samples per class, obtain better detection results with respect to the larger number of samples. Since this method has shown promising results in the road sign domain, in future work we plan to explore its applicability in the domain of multiview vehicle detection.

ACKNOWLEDGMENT

The authors wish to thank Josip Krapac for useful suggestions on early versions of this paper.

REFERENCES
[1] J.-Y. Wu, C.-C. Tseng, C.-H. Chang, J.-J. Lien, J.-C. Chen, and C. T. Tu, “Road sign recognition system based on GentleBoost with sharing features,” in System Science and Engineering (ICSSE), 2011 International Conference on, 2011, pp. 410–415.
[2] S. Šegvić, K. Brkić, Z. Kalafatić, and A. Pinz, “Exploiting temporal and spatial constraints in traffic sign detection from a moving vehicle,” Machine Vision and Applications, pp. 1–17, 2011. [Online]. Available: http://dx.doi.org/10.1007/s00138-011-0396-y
[3] Q. Yuan, A. Thangali, V. Ablavsky, and S. Sclaroff, “Learning a family of detectors via multiplicative kernels,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 33, no. 3, pp. 514–530, 2011.
[4] P. Viola and M. Jones, “Fast Multi-View Face Detection,” in Proc. of IEEE Conf. Computer Vision and Pattern Recognition, 2003.
[5] A. Torralba, K. Murphy, and W. Freeman, “Sharing visual features for multiclass and multiview object detection,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 29, no. 5, pp. 854–869, 2007.
[6] C. Huang, H. Ai, Y. Li, and S. Lao, “High-performance rotation invariant multiview face detection,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 29, no. 4, pp. 671–686, 2007.
[7] B. Wu and R. Nevatia, “Cluster boosted tree classifier for multi-view, multi-pose object detection,” in Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, 2007, pp. 1–8.
[8] S. Fidler and A. Leonardis, “Towards scalable representations of object categories: Learning a hierarchy of parts,” in Computer Vision and Pattern Recognition, 2007. CVPR '07. IEEE Conference on, 2007, pp. 1–8.
[9] L. Zhu, Y. Chen, A. Torralba, W. Freeman, and A. Yuille, “Part and appearance sharing: Recursive compositional models for multiview,” in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, 2010, pp. 1919–1926.
[10] Z. Si and S.-C. Zhu, “Unsupervised learning of stochastic and-or templates for object modeling,” in ICCV Workshops, 2011, pp. 648–655.
[11] Inland Transport Committee, “Convention on road signs and signals,” Economic Commission for Europe, 1968.
[12] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics). Secaucus, NJ, USA: Springer-Verlag New York, Inc., 2006.
[13] D. Bremner, E. Demaine, J. Erickson, J. Iacono, S. Langerman, P. Morin, and G. Toussaint, “Output-sensitive algorithms for computing nearest-neighbour decision boundaries,” in Algorithms and Data Structures, ser. Lecture Notes in Computer Science, F. Dehne, J.-R. Sack, and M.
Smid, Eds. Springer Berlin Heidelberg, 2003, vol. 2748, pp. 451–461. [Online]. Available: http://dx.doi.org/10.1007/978-3-540-45078-8_39
[14] P. Rousseeuw, “Silhouettes: a graphical aid to the interpretation and validation of cluster analysis,” J. Comput. Appl. Math., vol. 20, no. 1, pp. 53–65, Nov. 1987. [Online]. Available: http://dx.doi.org/10.1016/0377-0427(87)90125-7
[15] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1, 2005, pp. 886–893.
[16] T. Joachims, “Making large-scale support vector machine learning practical,” in Advances in Kernel Methods, B. Scholkopf, C. J. C. Burges, and A. J. Smola, Eds. Cambridge, MA, USA: MIT Press, 1999, pp. 169–184. [Online]. Available: http://dl.acm.org/citation.cfm?id=299094.299104
[17] S. Theodoridis and K. Koutroumbas, Pattern Recognition, 4th ed. Academic Press, 2008.
[18] G. Bradski, “The OpenCV Library,” Dr. Dobb's Journal of Software Tools, 2000.

A novel georeferenced dataset for stereo visual odometry

Ivan Krešo, University of Zagreb, Faculty of Electrical Engineering and Computing, Zagreb, HR-10000, Croatia, Email: [email protected]
Marko Ševrović, University of Zagreb, Faculty of Transport and Traffic Engineering, Zagreb, HR-10000, Croatia, Email: [email protected]
Siniša Šegvić, University of Zagreb, Faculty of Electrical Engineering and Computing, Zagreb, HR-10000, Croatia, Email: [email protected]

Abstract—In this work, we present a novel dataset for assessing the accuracy of stereo visual odometry. The dataset has been acquired by a small-baseline stereo rig mounted on top of a moving car. The groundtruth is supplied by a consumer grade GPS device without an IMU. Synchronization and alignment between the GPS readings and the stereo frames are recovered after the acquisition. We show that the attained groundtruth accuracy allows useful conclusions to be drawn in practice. The presented experiments address the influence of camera calibration, baseline distance and zero-disparity features on the achieved reconstruction performance.

I. INTRODUCTION

Visual odometry [1] is a technique for estimating the egomotion [2] of a monocular or multiple camera system from a sequence of acquired images. The technique is interesting due to many applications such as autonomous navigation or driver assistance, but also because it forms the basis for more involved approaches which rely on partial or full 3D reconstruction of the scene. The term was coined due to its similarity with classic wheel odometry, which is a widely used localization technique in robotics [3]. Both techniques estimate the current location by integrating incremental motion along the path, and are therefore subject to cumulative errors along the way. However, while classic odometry relies on rotary wheel encoders, visual odometry recovers incremental motion by employing correspondences between pairs of images acquired along the path. Thus visual odometry is not affected by wheel slippage on uneven terrain or other poor terrain conditions. Additionally, its usage is not limited to wheeled vehicles. On the other hand, visual odometry can not be used in environments lacking enough textured static objects, such as in some tight indoor corridors and at sea.
Visual odometry is especially important in places with poor coverage of GNSS (Global Navigation Satellite System). signal, such as in tunnels, garages or in space. For example, NASA space agency uses visual odometry in Mars Exploration Rovers missions for precise rover navigation on Martian terrain [4], [5]. We consider a specific setup where a stereo camera system is mounted on top of the car in the forward driving direction. Our goal is to develop a testbed for assessing the accuracy of various visual odometry implementations, which shall further be employed in higher level modules such as lane detection, lane departure or traffic sign detection and recognition. We decided to acquire our own GPS-registered stereo vision CCVW 2013 Poster Session dataset since, to our best knowledge, none of the existing freely available datasets [6], [7] features a stereo-rig with intercamera distance less than 20 cm (this distance is usually termed baseline). Additionally, we would like to be able to evaluate the impact of our camera calibration to the accuracy of the obtained results. Thus in this work we present a novel GPSregistered dataset acquired with a small-baseline stereo-rig (12 cm), the setup employed for its acquisition, as well as the results of some preliminary research. In comparison with other similar work in this field [7], [8], we rely on low budget equipment for data acquisition. The groundtruth motion for our dataset has been acquired by a consumer grade GPS receiver. The employed GPS device doesnt have an inertial measurement unit, which means that we do not have access to groundtruth rotation and instead record only the WGS84 position in discrete time units. Additionally, hardware synchronization of the two sensors can not be performed due to limitations of the GPS device. Because of that, the alignment of the camera coordinates with respect to the WGS84 coordinates becomes harder to recover as will be explained later in the article. Thus our acquisition setup is much more easily assembled at the expense of more post-processing effort. However we shall see that the attained groundtruth accuracy is quite enough for drawing useful conclusions about several implementation details of the visual odometry. II. S ENSOR SETUP Our sensor setup consists of a stereo rig and a GPS receiver. The stereo rig has been mounted on top of the car, as shown in Fig. 1. The stereo rig (PointGrey Bumblebee2) Fig. 1. The stereo system Bumblebee2 mounted on the car roof. 43 Proceedings of the Croatian Computer Vision Workshop, Year 1 features a Sony ICX424 sensor (1/3”), 12 cm baseline and global shutter. It is able to acquire two greyscale images 640×480 pixels each. The shutters of the two cameras are synchronized, which means that the two images of the stereo pair are acquired during the exactly same time interval (this is very important for stereo analysis of dynamic scenes). Both images of the stereo pair are transferred over one IEEE 1394A (FireWire) connection. The firewire connector is plugged into a PC express card connected to a laptop computer. The camera requires 12V power over the firewire cable. The PC express card is unable to draw enough power from the notebook and therefore features an external 12V power connector which we attach to the cigarette lighter power plug by a custom cable. In order to avoid overloading of the laptop bus, we set the frame rate of the camera to 25 Hz. 
The acquired stereo pairs are in the form of 640×480 pairs of interleaved pixels (16 bit), which means that upon acquisition the images need to be detached by placing each odd byte into the left image and each even byte into the right image. The camera firmware places timestamps in first four pixels of each stereo frame. These timestamps contain the value of a highly accurate internal counter at the time when the camera shutter was closed. The employed GPS receiver (GeoChron SD logger) delivers location readings at 1 Hz. It is a consumer-grade device with a basic capability for multipath detection and mitigation. The GPS antenna was mounted on the car roof in close proximity to the camera. The GPS coordinates are converted from WGS84 to Cartesian coordinates in meters by a simple spherical transformation, which is sufficiently accurate given the size of the covered area. The camera and GPS are not mutually synchronized, so that the offset between camera time and GPS time has to be recovered in the postprocessing phase. III. DATASET ACQUISITION The dataset has been recorded along a circular path throughout the urban road network in the city of Zagreb, Croatia. The path length was 688 m, the path duration was 111.4 s, while the top speed of the car was about 50 km/h. The dataset consists of 111 GPS readings and 2786 recorded frames with timestamps. The scenery was not completely static, since the video contains occasional moving cars in both directions, as well as pedestrians and cyclists. The acquisition was conducted at the time of day with the largest number of theoretically visible satellites (14). For comparison, the least value of that number at that day was 8. In practice, our receiver established connection to 9.5 satellites along the track, on average. Thus, at 99.1% locations we had HDOP (horizontal dilution of precision) below 1.3 m while HDOP was less than 0,9 m at 57.1% locations. The obtained GPS accuracy has been qualitatively evaluated by plotting the recorded track over a rectified digital aerial image of the site, as shown in Fig.2. The figure shows that the GPS track follows the right side of the road accurately and consistently, except at bottom right where our car had to avoid parked cars and pedestrians. Thus, the global accuracy of the recorded GPS track appears to be in agreement with the HDOP figures stated above. Furthermore, the relative motion between the neighbouring points which we use in our quantitative experiments is much more accurate than that and approaches CCVW 2013 Poster Session September 19, 2013, Zagreb, Croatia Fig. 2. Projection of the acquired GPS track onto rectified aerial imagery. DGPS accuracy. We note that the aerial orthophoto from the figure has been provided by Croatian Geodetic Administration, while the conversion of WGS84 coordinates into Croatian projected coordinates (HTRS96) and visualization are carried out in Quantum GIS. For additional verification, the same experiment has been performed in Google Earth as well, where we obtained very similar results. IV. V ISUAL ODOMETRY As stated in the introduction, the visual odometry recovers incremental motion between subsequent images of a video sequence by exclusively relying on image correspondences. We shall briefly review the main steps of the technique in the case of a calibrated and rectified stereo system. 
Calibration of the stereo system consists of recovering the internal parameters of the two cameras, such as the field of view, as well as the exact position and orientation of the second camera with respect to the first one. Rectification consists in transforming the two acquired images so that they correspond to images which would be obtained by a system in which the viewing directions of the two cameras are mutually parallel and orthogonal to the baseline. Calibration and rectification significantly simplify the procedure of recovering inter-frame motion, as we outline in the following text. One typically starts by finding and matching the corresponding features in the two images. Because our system is rectified, the search for correspondences can be limited to the same row of the other image. Since the stereo system is calibrated, these correspondences can be easily triangulated and subsequently expressed in metric 3D coordinates. Then the established correspondences are tracked throughout the subsequent images. In order to save time, tracking and stereo matching typically use lightweight point features such as corners or blobs [9]–[11]. The new positions of previously triangulated points provide constraints for recovering the new location of the camera system. The new location can be recovered either analytically, by recovering pose from projections of known 3D points [12], [13], or by optimization [11], [14], [15]. The location estimation is usually performed only in some images of the acquired video sequence, which are often referred to as key-images.

Thus, we have seen that visual odometry is able to provide the camera motion between neighboring key-images. This motion has 6 degrees of freedom (3 rotation and 3 translation), and we represent it by a 4×4 transformation matrix which we denote by [Rt]. This matrix contains the parameters of the rotation matrix R and the translation vector t, as shown in equation (1):

         | r11  r12  r13  t1 |
[Rt] =   | r21  r22  r23  t2 |   (1)
         | r31  r32  r33  t3 |
         |  0    0    0    1 |

From a series of [Rt] matrices we can calculate the full path trajectory by cumulative matrix multiplication. This is shown in equation (2), where the camera location pvo(tCAM) is determined by multiplying the matrices from key-image 1 to key-image t:

pvo(tCAM) = Π i=1..t [Rt]i .   (2)

At this moment we must observe that the recovered camera locations pvo(tCAM) are expressed in the temporal and spatial coordinates of the 0th key-image, which is therefore acquired at time tCAM = 0 s. The coordinate axes x and y of the trajectory pvo(tCAM) are aligned with the corresponding axes of the 0th key-image, while the z axis is perpendicular to that image. The locations estimated by visual odometry will be represented in this coordinate system by accumulating all estimated transformations from the 0th frame. If we wish to relate these locations with GPS readings, we shall need to somehow recover the translation and rotation of the 0th keyframe with respect to the GPS coordinates. We shall achieve that by aligning the first few incremental motions with the corresponding GPS motions. However, before we do that, we first need to achieve the temporal synchronization between the two sensors.
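A minimal sketch of the trajectory accumulation in (1)–(2) is given below. This is our illustrative code (the types and names are assumptions), not part of the Libviso2 interface.

#include <array>
#include <vector>

// 4x4 homogeneous transformation [Rt] between two consecutive key-images.
using Mat4 = std::array<std::array<double, 4>, 4>;

Mat4 multiply(const Mat4& a, const Mat4& b) {
    Mat4 c{};
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            for (int k = 0; k < 4; ++k)
                c[i][j] += a[i][k] * b[k][j];
    return c;
}

// Accumulate the incremental motions (2) and return the position of every
// key-image expressed in the coordinate frame of the 0th key-image.
// Whether the new increment is multiplied from the left or from the right
// depends on the convention used for [Rt]; the product notation of (2) is
// reproduced here as a right multiplication.
std::vector<std::array<double, 3>> accumulateTrajectory(const std::vector<Mat4>& increments) {
    Mat4 pose{};                                   // start from the identity (0th key-image)
    for (int i = 0; i < 4; ++i) pose[i][i] = 1.0;
    std::vector<std::array<double, 3>> trajectory;
    trajectory.push_back({0.0, 0.0, 0.0});
    for (const Mat4& rt : increments) {
        pose = multiply(pose, rt);
        trajectory.push_back({pose[0][3], pose[1][3], pose[2][3]});   // translation column
    }
    return trajectory;
}

The resulting trajectory is exactly the quantity that is later synchronized and aligned with the GPS track.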
V. SENSOR SYNCHRONIZATION

As stated in Section II, the camera and the GPS receiver are not synchronized. Thus, we need to recover the time interval t0vo between the 0th video frame and the 0th GPS location, such that the camera time tCAM corresponds to t + t0vo in GPS time (t = 0 corresponds to the 0th GPS location). We estimate t0vo by comparing the absolute incremental displacements of the trajectories pgps(t) and pvo(tCAM) obtained by GPS and visual odometry, respectively. In order to do that, we first define ∆p(t, ∆t) as the incremental translation at time t:

∆p(t, ∆t) = p(t) − p(t − ∆t) .   (3)

We also define ∆s(t) as the absolute travelled distance during the previous interval of ∆tGPS = 1 s (the GPS frequency is 1 Hz):

∆s(t) = ‖∆p(t, ∆tGPS)‖ ,  t ∈ N .   (4)

Now, if we consider the time instants in which the GPS positions are defined (that is, integral time in seconds), we can pose the following optimization problem:

t̂0vo = argmin over t0vo of  Σ t=1..Tlast−1  (∆svo(t + t0vo) − ∆sgps(t))² .   (5)

The problem is well-posed since the absolute incremental displacements ∆svo and ∆sgps are agnostic with respect to the fact that the camera and GPS coordinate systems are still misaligned. We see that in order to solve this problem by optimization, we need to interpolate all locations obtained by visual odometry at the times of the GPS points for each considered time offset t0vo. However, that does not pose a computational problem due to the very low frequency of the GPS readings (there are only 111 GPS locations in our dataset). Thus the problem can be easily solved by any optimization algorithm, and so we can express the visual odometry locations in GPS time, pvo′, as:

pvo′(t) = pvo(t + t0vo) .   (6)

The interpolation is needed due to the fact that we capture images at approximately 25 Hz and GPS data at 1 Hz (cf. Section II). The time intervals between two subsequent images often differ from the expected 40 ms due to unpredictable bus congestions within the laptop computer (this is the reason why the camera records timestamps in the first place). We recover the locations of visual odometry "in-between" the acquired frames by the following procedure. We first accumulate timestamps in the frame sequence until we reach the two frames which are temporally closest to the desired GPS time. Finally we determine the desired location between these two frames using linear interpolation.

We can also relate the two trajectories by the absolute incremental rotation angle ∆φ between the corresponding two time instants. We can recover this angle by looking at three consecutive locations as follows:

∆φ(t) = arccos( ⟨∆p(t, ∆tGPS), ∆p(t + 1, ∆tGPS)⟩ / (‖∆p(t, ∆tGPS)‖ · ‖∆p(t + 1, ∆tGPS)‖) ) .   (7)

This procedure is illustrated in Fig. 3. Thus we propose two metrics suitable for relating the trajectories obtained by GPS and visual odometry: ∆s(t) and ∆φ(t). Note that in order to be able to determine these metrics for visual odometry at GPS times, one needs to apply the previously recovered offset t0vo and employ interpolation between the closest image frames.

Fig. 3. The absolute incremental rotation angle ∆φ can be determined from incremental translation vectors.

VI. ALIGNING THE SENSOR COORDINATES

Now that we have synchronized the data between the camera and the GPS, we need to estimate the 3D alignment of the reference coordinate system of the visual odometry with respect to the GPS coordinate system. In other words, we need to find a rigid transformation between the 0th GPS location and the location of the 0th camera frame.
After this is completed, we shall be able to illustrate the overall accuracy of the visual odometry results with respect to the GPS readings. 45 Proceedings of the Croatian Computer Vision Workshop, Year 1 September 19, 2013, Zagreb, Croatia 250 The translation between the two coordinate systems can be simply expressed as: (8) (10) 100 50 0 50 200 150 100 50 x (m) MSE = 2.912, MAE = 1.512 12 E XPERIMENTS AND RESULTS In our experiments we employ the library Libviso2 which recovers the 6 DOF motion of a moving stereo rig by minimizing reprojection error of sparse feature matches [11]. Libviso2 requires rectified images on input. Currently, we consider each third frame from the dataset because, with 25 fps, camera movement between two consecutive frames is very small with urban driving speeds and such small camera movements often add more noise than useful data. We calibrated our stereo rig on two different calibration datasets by means of the OpenCV module calib3d. In the first case the calibration pattern was printed on A4 paper, while in the second the calibration pattern was displayed on 21” LCD screen with HD resolution (1920x1080). In both cases, the raw calibration parameters have been employed to rectify the input images and to produce rectified calibration parameters which are supplied to Libviso2. The resulting dataset and calibration parameters is freely downloadable for research purposes from http://www.zemris.fer.hr/∼ssegvic/datasets/fer-bumble.tar.gz. A. A4 calibration We first present the results obtained with the A4 calibration dataset. The resulting trajectories pvo00 (t) and pgps (t) are compared in Fig.4. We note that the shape of the trajectory is mostly preserved, however there is a large deviation in scale. We compare the corresponding incremental absolute displacements ∆sgps (t) and ∆svo00 (t), and the scale error ∆sgps (t)/∆svo00 (t) In Fig.5. We observe that the two graphs are well aligned, and that the visual odometry generally undershoots in translation motion compared to the GPS groundtruth. CCVW 2013 Poster Session 100 2.0 8 6 4 0 0 20 40 60 time (s) 80 100 1.8 1.6 1.4 1.2 1.0 0 120 20 40 60 time (s) 80 100 120 Fig. 5. Comparison of incremental translations (left) and the resulting scale error (right) as obtained with GPS and visual odometry with the A4 calibration. Note that the scale error is not constant. Thus the results could not be improved by simply fixing the baseline recovered by the calibration procedure. B. LCD calibration We now present the results obtained with the LCD calibration dataset. The GPS trajectory pgps (t) and the resulting visual odometry trajectory pvo00 (t) are shown in Fig.6. By comparing this figure with Fig.4, we see that a larger calibration pattern produced a huge impact on final results. 250 200 GPS visual odometry 150 z (m) VII. 50 2.2 GPS visual odometry 10 2 We note that this approach would be underdetermined if we had only straight car motion at the beginning of video, and to avoid that we arranged that our dataset begins on the road turn. Experiments showed that this approach works well in practice. Better accuracy could be obtained by employing a GPS sensor capable of producing rotational readings, however that would be out of the scope of this work. 0 Fig. 4. Comparison between the GPS trajectory and the recovered visual odometry trajectory with the A4 calibration. scale error (gps / odometry) pvo00 (t) = RGPS · pvo0 (t) + TGPS . 
vo vo z (m) In order to recover rotation alignment, we consider incremental translation vectors for visual odometry (∆pvo0 (t)) and GPS (∆pgps (t)) between two consecutive GPS times, as determined in (3). We find the optimal rotation alignment of the visual odometry coordinate system RGPS by minimizing the followvo ing error function: 2 N X hR · ∆pvo0 (t), ∆pgps (t)i R̂GPS = argmin arccos vo k∆pvo0 (t)kk∆pgps (t)k R t=1 (9) In order to bypass the accumulated noise which grows over time, we choose N = 4. This problem is easily solved by any nonlinear optimization algorithm, and so we can express visual odometry locations in GPS Cartesian coordinates and GPS time pvo00 as: 150 distance (m) TGPS = pGP S (0 s) − pvo0 (0 s) . vo GPS visual odometry 200 100 50 0 50 200 150 100 50 x (m) 0 50 100 Fig. 6. Comparison between the GPS trajectory and the recovered visual odometry trajectory with the LCD calibration. C. Influence of distant features While analyzing Libviso2 source code, we noticed that it rounds all zero disparities to 1.0. This is necessary since otherwise there would be a division by zero in the triangulation 46 Proceedings of the Croatian Computer Vision Workshop, Year 1 September 19, 2013, Zagreb, Croatia 250 35 30 25 20 15 5 0 0 150 z (m) GPS rotation angles odometry rotation angles 40 10 GPS visual odometry 200 MSE = 24.05, MAE = 2.882 45 angle (deg) procedure. However, this means that all features with zero disparity are triangulated on a plane which is much closer to the camera than the infinity where it should be (note that the distance of that plane depends on the baseline). In order to investigate the influence of that decision to the recovered trajectory, we changed the magic number from 1.0 to 0.01. The effects of that change are shown in Fig.7, where we note a significant improvement with respect to Fig.6. 100 20 40 60 time (s) 80 100 120 Fig. 9. Comparison of incremental rotations as obtained with GPS and visual odometry with the LCD calibration and library correction. 50 to define mean square error (M SE) and mean absolute errors (M AE) for quantitative assessment of the achieved accuracy. 0 50 200 150 100 50 x (m) 0 50 100 Fig. 7. Comparison between the GPS trajectory and the recovered visual odometry trajectory with the LCD calibration and library correction. The absolute incremental translations and the resulting scale errors for this case are shown in Fig.8. The corresponding absolute incremental rotations are shown in Fig.9. We note a very nice alignment for incremental translation, and somewhat less successful alignment for incremental rotation. Note that large discrepancies in absolute incremental rotation at times 35 s and 97 s occur at low speeds (cf. Fig.8 (left)), when the equation (7) becomes less well-conditioned. MSE = 0.101, MAE = 0.249 14 1.6 scale error (gps / odometry) GPS visual odometry 12 distance (m) 10 8 6 4 2 0 0 1.5 1.4 1.3 1.2 1.1 1.0 0.9 20 40 60 time (s) 80 100 120 0.8 0 20 40 60 time (s) 80 100 120 Fig. 8. Comparison of incremental translations (left) and the resulting scale error (right) as obtained with GPS and visual odometry with the LCD calibration and library correction. Note that this issue could also have been solved by simply neglecting features with zero disparity. However, that would be wasteful since the features at infinity provide valuable constraints for recovering the rotational part of inter-frame motion. 
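The rotational alignment of Eq. (9) can be solved with a generic nonlinear optimizer over a rotation-vector parametrization. The sketch below uses SciPy and is only an illustration; the paper does not prescribe a particular solver, and N = 4 incremental vectors are assumed as stated in the text.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.transform import Rotation

def align_rotation(dp_vo, dp_gps):
    """Recover the rotation aligning the visual-odometry frame with the GPS
    frame by minimizing the sum of squared angles between corresponding
    incremental translation vectors, the error function of Eq. (9).

    dp_vo, dp_gps -- (N, 3) incremental translations; the paper uses N = 4.
    """
    def angle_cost(rotvec):
        rotated = Rotation.from_rotvec(rotvec).apply(dp_vo)
        cosines = np.sum(rotated * dp_gps, axis=1) / (
            np.linalg.norm(rotated, axis=1) * np.linalg.norm(dp_gps, axis=1))
        angles = np.arccos(np.clip(cosines, -1.0, 1.0))
        return np.sum(angles ** 2)

    res = minimize(angle_cost, x0=np.zeros(3), method="Nelder-Mead")
    return Rotation.from_rotvec(res.x).as_matrix()
```

With the recovered matrix and the translation of Eq. (8), the odometry locations can then be expressed in GPS Cartesian coordinates as in Eq. (10).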
We believe that this has been overlooked in the original library since Libviso was originally tested on a stereo setup with 4x larger baseline and 2x larger resolution [11]. Thus the original setup entails much less features with zero disparity, while the effect of these features to the reconstruction accuracy is not easily noticeable. D. Quantitative comparison of the achieved accuracy We assess the overall achieved accuracy of the recovered trajectory pvo00 (t) by relying on previously introduced incremental metrics ∆φ and ∆s. These metrics shall now be used CCVW 2013 Poster Session M SEtrans = M AEtrans = N 1 X (∆svo00 (t) − ∆sgps (t))2 N t=1 N 1 X |∆svo00 (t) − ∆sgps (t)| N t=1 M SErot = M AErot = N 1 X (∆φvo00 (t) − ∆φgps (t))2 N t=1 N 1 X |∆φvo00 (t) − ∆φgps (t)| N t=1 (11) (12) (13) (14) The obtained results are presented in Table I. The rows of the table correspond to the metrics M SErot , M AErot , M SEtrans and M AEtrans . The columns correspond to the original library with the A4 calibration (A4), the original library with the LCD calibration (LCD1), and the corrected library with the LCD calibration (LCD2). A considerable improvement is observed between A4 and LCD1, while the difference between LCD1 and LCD2 is still significant. TABLE I. MSE AND MAE ERRORS IN DEPENDENCE OF CAMERA CALIBRATION AND DISTANT FEATURES TRIANGULATION . cases: A4 LCD1 LCD2 M SEtrans (m ) 2.912 0.111 0.101 M AEtrans (m) 1.512 0.260 0.249 M SErot (deg ) 24.914 24.856 24.053 M AErot (deg) 2.967 2.897 2.882 2 2 E. Experiments on the artificial dataset The previous results show that, alongside camera calibration, the feature triangulation accuracy has a large impact on the accuracy of visual odometry. In this section we explore that finding on different baselines of the stereo rig. To do this, we develop an artificial model of a rectified stereo camera system. On input, the model takes camera intrinsic 47 Proceedings of the Croatian Computer Vision Workshop, Year 1 September 19, 2013, Zagreb, Croatia and extrinsic parameters as well as some parameters of the camera motion. Then it generates an artificial 3D point cloud and projects it to the image plane in every frame of camera motion. The point cloud is generated in a way so that its projections resemble features we encounter in real world while analysing imagery acquired from a driving car. On output, the model produces the feature tracks which are supplied as input to the library for visual odometry as input. Fig.10 shows the comparison of the straight trajectory in relation to different camera baseline setups. As one would expect, the obtained accuracy significantly drops as the camera baseline is decreased. Furthermore, Fig.11 shows the results on the same dataset after modification in treatment of zero-disparity features inside the library. The graphs in the figure show that for small baselines the modified library performs increasingly better than the original. This improvement occurs since the number of zero-disparity features increases as the baseline becomes smaller. significantly affect the reconstruction accuracy. Additionally, we have seen that features with zero disparity should be treated with care, especially in small-baseline setups. Finally, the most important conclusion is that small-baseline stereo systems can be employed as useful tools in SfM analysis of video from the driver’s perspective. Our future research shall be directed towards exploring the influence of other implementation details to the reconstruction accuracy. 
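For reference, the four accuracy measures of Eqs. (11)-(14) reduce to simple averages over the N GPS time instants. The short helper below assumes the incremental metrics have already been evaluated at GPS times using the offset and interpolation of Section V; the function name is illustrative.

```python
import numpy as np

def trajectory_error_metrics(ds_vo, ds_gps, dphi_vo, dphi_gps):
    """Mean square and mean absolute errors of the incremental translation
    (metres) and rotation (degrees) metrics, following Eqs. (11)-(14)."""
    ds_err = np.asarray(ds_vo) - np.asarray(ds_gps)
    dphi_err = np.asarray(dphi_vo) - np.asarray(dphi_gps)
    return {
        "MSE_trans": np.mean(ds_err ** 2),
        "MAE_trans": np.mean(np.abs(ds_err)),
        "MSE_rot": np.mean(dphi_err ** 2),
        "MAE_rot": np.mean(np.abs(dphi_err)),
    }
```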
The developed implementations shall be employed as a tool for improving the performance of several computer vision applications for driver assistance, including lane recognition, lane departure warning and traffic sign recognition. We also plan to collect a larger dataset corpus which would contain georeferenced videos acquired by stereo rigs with different geometries. ACKNOWLEDGMENT This work has been supported by research projects Vista (EuropeAid/131920/M/ACT/HR) and Research Centre for Advanced Cooperative Systems (EU FP7 #285939). X (m) R EFERENCES groundtruth b = 0.5 b = 0.2 b = 0.12 b = 0.09 0 Fig. 10. 50 100 Z (m) 150 [1] [2] 200 [3] Reconstructed trajectories with original Libviso2. [4] X (m) [5] groundtruth b = 0.5 b = 0.2 b = 0.12 b = 0.09 0 Fig. 11. 50 100 Z (m) 150 200 Reconstructed trajectories with modified Libviso2. VIII. C ONCLUSION AND FUTURE WORK We have proposed a testbed for assessing the accuracy of visual odometry with respect to the readings of an unsynchronized consumer-grade GPS sensor. The testbed allowed us to acquire an experimental GPS-registered dataset suitable for evaluating existing implementations in combination with our own stereo system. The acquired dataset has been employed to assess the influence of calibration dataset and some implementation details to the accuracy of the reconstructed trajectories. The obtained experimental results show that a consumer grade GPS system is able to provide useful groundtruth for assessing performance of visual odometry. This still holds even if the synchronization and alignment with the camera system is performed at the postprocessing stage. The experiments also show that the size and the quality of the calibration target may CCVW 2013 Poster Session [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] R. Bunschoten and B. Krose, “Visual odometry from an omnidirectional vision system,” Robotics and Automation, 2003. Proceedings. ICRA ’03., pp. 577–583, 2003. C. Silva and J. Santos-Victor, “Robust egomotion estimation from the normal flow using search subspaces,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, pp. 1026 – 1034, 1997. J. Borenstein, H. R. Everett, L. Feng, and D. Wehe, “Mobile robot positioning: Sensors and techniques,” Journal of Robotic Systems, vol. 14, no. 4, pp. 231–249, Apr. 1997. S. Goldberg, M. Maimone, and L. Matthies, “Stereo vision and rover navigation software for planetary exploration,” Aerospace Conference Proceedings, 2002. IEEE, vol. 5, 2002. C. Yang, M. Maimone, and L. Matthies, “Visual odometry on the mars exploration rovers,” Systems, Man and Cybernetics, 2005 IEEE International Conference, vol. 1, pp. 903–910, 2005. “Image sequence analysis test site (eisats),” http://www.mi.auckland.ac. nz/EISATS, the University of Auckland, New Zealand. A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The kitti dataset,” International Journal of Robotics Research, 2013. A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark suite,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2012. C. Harris and M. Stephens, “A combined corner and edge detector,” in Proceedings of the Alvey Vision Conference, 1988, pp. 147–152. J. Shi and C. Tomasi, “Good features to track,” in Proceedings of the Conference on Computer Vision and Pattern Recognition, Seattle, Washington, Jun. 1994, pp. 593–600. B. Kitt, A. Geiger, and H. 
Lategahn, “Visual odometry based on stereo image sequences with ransac-based outlier rejection scheme,” in Proceedings of the IEEE Intelligent Vehicles Symposium. Karlsruhe Institute of Technology, 2010. L. Quan and Z.-D. Lan, “Linear n-point camera pose determination,” IEEE Transactions on Pattern recognition and Machine Intelligence, vol. 21, no. 8, pp. 774–780, Aug. 1999. D. Nistér, O. Naroditsky, and J. R. Bergen, “Visual odometry for ground vehicle applications,” Journal of Field Robotics, vol. 23, no. 1, pp. 3–20, 2006. L. Morency and R. Gupta, “Robust real-time egomotion from stereo images,” Image Processing, 2003. ICIP 2003. Proceedings. 2003 International Conference on, pp. 719–22, 2003. C. Engels, H. Stewénius, and D. Nistér, “Bundle adjustment rules,” in Photogrammetric Computer Vision, 2006, pp. 1–6. 48 Proceedings of the Croatian Computer Vision Workshop, Year 1 September 19, 2013, Zagreb, Croatia Surface registration using genetic algorithm in reduced search space Vedran Hrgetić, Tomislav Pribanić Department of Electronic Systems and Information Processing Faculty of Electrical Engineering and Computing Zagreb, Croatia [email protected], [email protected] Abstract – Surface registration is a technique that is used in various areas such as object recognition and 3D model reconstruction. Problem of surface registration can be analyzed as an optimization problem of seeking a rigid motion between two different views. Genetic algorithms can be used for solving this optimization problem, both for obtaining the robust parameter estimation and for its fine-tuning. The main drawback of genetic algorithms is that they are time consuming which makes them unsuitable for online applications. Modern acquisition systems enable the implementation of the solutions that would immediately give the information on the rotational angles between the different views, thus reducing the dimension of the optimization problem. The paper gives an analysis of the genetic algorithm implemented in the conditions when the rotation matrix is known and a comparison of these results with results when this information is not available. Keywords – computer vision; reconstruction; genetic algorithm I. surface registration; 3D INTRODUCTION Surface registration is an important step in the process of reconstructing the complete 3D object. Acquisition systems can mostly give only a partial view of the object and for its complete reconstruction should capture from the multiple views. The task of the surface registration algorithms is to determine the corresponding surface parts in the pair of observed clouds of 3D points, and on that basis to determine the spatial translation and rotation between the two views. Various techniques for the surface registration are proposed and they can be generally divided into the two groups: coarse and fine registration methods [1]. In the coarse registration the CCVW 2013 Poster Session main goal is to compute an initial estimation of the rigid motion between the two views, while in the fine registration the goal is to obtain the most accurate solution by refining a known initial estimation of the solution. The problem of finding the required rigid motion can be viewed as solving a six dimensional optimization problem (translation on x-, y- and z-axis, and rotation about x-, y- and z-axis). Genetic algorithms (GA) can be used to robustly solve this hard optimization problem and their advantage is that they are applicable both as coarse and fine registration methods [1]. 
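To make the optimization view of registration concrete, the sketch below encodes a candidate solution as the six rigid-motion parameters and scores it by the mean closest-point distance between the transformed first cloud and the second cloud, the fitness used later in Section III. The Euler-angle convention and the use of SciPy's cKDTree for the closest-point queries are assumptions, not details taken from the paper.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.spatial.transform import Rotation

def rigid_motion(chromosome):
    """Chromosome = (tx, ty, tz, alpha, beta, psi) in millimetres / degrees.
    The x-y-z Euler convention used here is an assumed interpretation."""
    t = np.asarray(chromosome[:3], dtype=float)
    R = Rotation.from_euler("xyz", chromosome[3:], degrees=True).as_matrix()
    return R, t

def fitness(chromosome, source_cloud, target_tree):
    """Mean nearest-neighbour distance after applying the candidate rigid
    motion to the source cloud (lower is better); the paper uses the mean
    rather than Chow's median."""
    R, t = rigid_motion(chromosome)
    moved = source_cloud @ R.T + t
    dists, _ = target_tree.query(moved)
    return float(np.mean(dists))

# Usage: build the KD-tree over the second cloud once, then evaluate
# candidate chromosomes against it.
# target_tree = cKDTree(target_cloud)
# score = fitness(candidate, source_cloud, target_tree)
```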
In this work we propose a surface registration method assuming that the rotation is provided by an inertial sensor and the translation vector is still left to be found. We note that the registration method, based on the abovementioned assumption, has been successfully tested earlier [4]. However, in this work we specifically employ GA in that context. We justify our assumption by the fact that nowadays technologies have made quite affordable a large pallet of various inertial devices which reliably outputs data about the object orientation. For example, an inertial sensor can be explicitly used [2], or alternatively, there are also smart cameras which have an embedded on-board inertial sensor unit for the orientation detection in 3D space [3]. II. BACKGROUND Genetic algorithm is a metaheuristic based on the concept of natural selection and evolution (Fig. 1.). GA represents each candidate solution as an individual that is defined by its parameters (i.e. its genetic material) and each candidate solution is qualitatively evaluated by the fitness function. Better solutions are more likely to reproduce, and the system of reproduction is defined by the crossover and mutation. The use of genetic algorithms has the advantage of avoiding the local 49 Proceedings of the Croatian Computer Vision Workshop, Year 1 minima which is a common problem in registration, especially when the initial motion is not provided. Another advantage of genetic algorithms is that they work well in the presence of noise and outliers given by the non-overlapping regions. The main problem of GA is the time required to converge, which makes it unsuitable for online applications. Chow presented a dynamic genetic algorithm for surface registration where every candidate solution is described by a chromosome composed of the 6 parameters of the rigid motion that accurately aligns a pair of range images [5]. The chromosome is composed of the three components of the translation vector and the three angles of the rotation vector. In order to minimize the registration error, the median of Euclidean distances between corresponding points in two views is chosen as the fitness function. A similar method was proposed the same year by Silva [6]. The main advantage of this work is that the more robust fitness function is used and the initial guess is not required. September 19, 2013, Zagreb, Croatia corresponding points instead of mean for evaluation of solutions, initial tests have proven that mean gives better practical results. This can be explained by the different use of GA, where Chow uses GA as a fine registration method, we use GA as both coarse and fine registration methods where in the absence of a good initial solution using median can cause algorithm to converge to a local minima more easily. Finally, we evaluate the proposed GA method assuming that the rotation is known in advance. A. Implementation of genetic algorithm We use two genetic algorithms in a sequence. First is used as the coarse registration method and produces an initial estimate of the solution, while other is used as the fine registration method (Fig. 2.). Crossover and mutation are basic genetic operators that are used to generate the new population of solutions from the previous population. 
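When the rotation is supplied by an inertial sensor, only the three translation genes remain to be evolved. A minimal sketch of the reduced-search-space fitness follows, assuming the sensor rotation is available as a 3x3 matrix; names are illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree

def fitness_known_rotation(translation, R_sensor, source_cloud, target_tree):
    """Mean closest-point distance when the rotation R_sensor comes from the
    inertial sensor and only the translation (3 genes) is optimized by the GA."""
    moved = source_cloud @ R_sensor.T + np.asarray(translation, dtype=float)
    dists, _ = target_tree.query(moved)
    return float(np.mean(dists))

# With target_tree = cKDTree(target_cloud), the GA now searches a
# three-dimensional space, which Section IV reports as both faster and
# more accurate than the full six-dimensional search.
```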
In our work we use the crossover method that is accomplished by generating two random breaking points between six variables of the chromosome and mutation is defined as the addition or subtraction of a randomly generated parameter value to one of the variables of the chromosome. Probability for the mutation and the maximum value for which a variable can mutate dynamically change during the process of execution. Fig. 2. Implementation of genetic algorithm Population size and number of generations are two important factors of GA design. It is often cited in the literature that the population of 80 to 100 individuals is sufficient to solve more complex optimization problems [7]. Overpopulation may affect the ability of more optimal solutions to successfully reproduce (survival of the fittest) and thus prevent the convergence of algorithm towards the desired solution. If the population is too small then there’s a greater danger that the algorithm will converge to some local extreme. After initial testing we have chosen parameters for GA as displayed in Table 1. Fig. 1. Genetic algorithm TABLE I. SIZE OF POPULATION III. PROPOSED METHOD In this work we have used the same definition of the chromosome as was suggested by Chow in his paper [5]. The fitness function is defined as an overall mean of all the Euclidean distances between the corresponding points of the two clouds of 3D points for a given rigid motion. The objective of genetic algorithm is to find a candidate solution of the rigid motion that minimizes the overall mean distance between the corresponding points of the two views. Although Chow claims that it is better to use median of Euclidean distances between CCVW 2013 Poster Session GA population generations coarseGA 100 250 fineGA 50 250 CoarseGA function implements GA as a coarse registration method. Population is initialized by generating the initial correspondences in a way that the point that is closest to the center of mass from the first cloud of points is randomly matched with points in the second cloud of points. Translation 50 Proceedings of the Croatian Computer Vision Workshop, Year 1 vector for each such motion between views is determined. The angles of rotation vector are then initialized to randomly generated values where every solution is rotated only for one angle, or if we adopt limited space search, a priori known values of angles are inserted in the algorithm. In the coarseGA function elitism principle is implemented, transferring best two solutions directly to the next generation. The remaining generation is generated by the combined processes of crossover and mutation. Mutation probability for each gene is 16%. The algorithm is performed for fixed 250 generations. FineGA inherits the last population of coarseGA. In the each iteration fineGA keeps the two best solutions, while 44 new individuals are generated by the processes of crossover and mutation and four new individuals are generated by the process of mutation only. Starting mutation probability is 20% and it increases for 5% every 25 iterations, while the maximum value for which the parameter can change decreases by a factor of 0.8. A rationale why we dynamically adjust the mutation parameters is that as the population converge to a good solution, smaller adjustments of genes are needed and therefore mutation becomes primal source for improvement of genetic material. That means that a mutation needs to happen more often, but at the same time it should produce less radical changes in genetic material. 
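The genetic operators described above can be sketched directly: two-point crossover over the six genes, per-gene additive mutation, and the fine-GA schedule that raises the mutation probability by 5 percentage points every 25 generations while shrinking the maximum step by a factor of 0.8. The initial step size is an assumed tuning value not given in the paper, and the per-gene treatment of mutation follows the 16% per-gene probability quoted for the coarse GA.

```python
import numpy as np

rng = np.random.default_rng()

def crossover(parent_a, parent_b):
    """Two-point crossover: the segment between two random breaking points
    of the six-gene chromosome is exchanged between the parents."""
    i, j = sorted(rng.choice(np.arange(1, 6), size=2, replace=False))
    child_a, child_b = parent_a.copy(), parent_b.copy()
    child_a[i:j] = parent_b[i:j]
    child_b[i:j] = parent_a[i:j]
    return child_a, child_b

def mutate(chromosome, p_mut, max_step):
    """Add or subtract a random amount bounded by max_step to individual
    genes; p_mut is 16% per gene in the coarse GA."""
    child = chromosome.copy()
    for g in range(child.size):
        if rng.random() < p_mut:
            child[g] += rng.uniform(-max_step, max_step)
    return child

def fine_ga_schedule(generation, p0=0.20, step0=1.0, period=25):
    """Dynamic mutation parameters of the fine GA; step0 is an assumed
    initial step size."""
    k = generation // period
    return p0 + 0.05 * k, step0 * (0.8 ** k)
```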
Finding corresponding points in two clouds of 3D points is extremely time consuming operation, so to minimize a overall time of execution both downsampling of the 3D images and a KDtree search method are used. KDtree method for finding closest points in two clouds of points was suggested in earlier works [8] and has proven to be very fast compared to other methods. IV. RESULTS Software was implemented in MATLAB development environment. 13 3D test images were obtained by partial reconstruction of a model doll using a structured light scanner. An algorithm was tested both in situation where no a priori information about rotational angles is available and where exact values of angles are known. In order to come up with the known ground true values we have manually marked 3-4 corresponding points on each pair of views. That allowed us to compute a rough estimate for the ground truth data which we have further refined using the well-known iterative closest point (ICP) algorithm. ICP is known to be very accurate given a good enough initial solution [9]. That was certainly the case in our experiments since we have manually chosen corresponding points very carefully. Therefore in the present context we regard such registration data as ground truth values. The basic algorithm (i.e. assuming no rotation data are known in advance) has proven successful for the registration of neighboring point clouds, but insufficiently accurate for a good registration between mutually distant views. An additional problem is a significant time that is needed for algorithm to CCVW 2013 Poster Session September 19, 2013, Zagreb, Croatia produce the results. Improvement of the system, based on reduction of search space, has proven much more successful. Table 2. shows the average deviation of the results from the ground truth values when we search corresponding points in two more distant views (i.e. we compare views 3 and 1, 4 and 2, etc.). The first row represents the results when the rotation is not known (GA1) and the second row displays the results when the rotation is known (GA2). As algorithm finds three parameters more quickly than all six parameters, there is no need for GA to run through all 500 generations, which significantly decreases an average time of execution. What is more important, we get much better results. Tables 3. and 4. show results for views 4-2 and 13-11. The first row shows ground truth values, the second row shows values found by GA when the rotation is not known and the third row shows values when the rotation is known. The column named with % displays the percentage of the corresponding points between the two clouds of points. The results for the given views can also be seen on figures Fig. 3. – Fig. 6. Green and red images represent two different views, left part of the picture is before we applied the rigid motion between the views and the right part of the picture is after the rigid motion is applied and the images are rotated and translated for the values given by GA. TABLE II. STANDRAD DEVIATION AND TIME OF EXECUTION x/mm y/mm z/mm α/° β/° ψ/° time/min GA1 71 44 169 9 49 8 27 GA2 22 19 15 known known known 8 TABLE III. RESULTS FOR VIEWS 4-2 x y z α β ψ % ground 196 9 -318 0 57 3 56 GA1 177 -77 -99 5 20 -1 28 GA2 207 5 -308 known known known 56 TABLE IV. RESULTS FOR VIEWS 13-11 x y z α β ψ % ground 330 31 -299 0 67 0 31 GA1 118 62 141 -1 -57 -8 17 GA2 359 -12 -270 known known known 21 51 Proceedings of the Croatian Computer Vision Workshop, Year 1 September 19, 2013, Zagreb, Croatia V. 
Fig. 3. Views 4-2, unknown rotation CONCLUSION The surface registration is an important step in the process of 3D object reconstruction and can be seen as an optimization problem with six degrees of freedom. The objective of the registration algorithms is to find the spatial translation and the rotation between the views so that two clouds of 3D points overlap with each other properly. Genetic algorithm has proven to be a suitable method for solving this particular optimization problem and can be used both as a coarse and fine registration method. The tests have shown that reducing the search space of the optimization problem from six to three parameters leads to better results and faster execution of the algorithm. The above is important because today's acquisition systems allow us to deploy the solutions where we can get the information about the rotational angles between the different views. Therefore, our next research objective will be a 3D reconstruction system design where rotation data are readily available and the surface registration reduces to computing the translation part only. REFERENCES [1] J. Salvi,C Matabosch, D. Fofi, J. Forest, A review of recent image registration methods with accuracy evaluation, Image and Vision Computing 25, 2007, 578-596 Fig. 4. Views 4-2, assuming known rotation in advance [2] MTx Access date: August 2013. http://www.xsens.com [3] Shirmohammadi, B., Taylor, C.J., 2011. Self-localizing smart camera networks. ACM Transactions on Embedded Computing Systems, Vol. 8, pp. 1-26. [4] T. Pribanić, Y. Diez, S. Fernandez, J. Salvi. An Efficient Method for Surface Registration. International Conference on Computer Vision Theory and Applications, VISIGRAPP 2013, Barcelona Spain, 2013. 500-503 [5] K. C. Chow,H. T.Tsui, T. Lee, Surface registration using a dynamic genetic algorithm, Pattern Recognition 37 2004, 105-107 Fig. 5. Views 13-11, unknown rotation [6] L. Silva, O. Bellon, K. Boyer, Enhanced, robust genetic algorithms for multiview range image registration, in: 3DIM03. Fourth International Conference on 3-D Digital Imaging and Modeling, 2003, 268–275. [7] K. F. Man, K. S. Tang, S. Kwong, Genetic algorithms: concepts and applications, IEEE transactions on industrialelectronics, vol. 43, no. 5, 1996, 519-535 [8] M. Vanco, G. Brunnett, T. Schreiber, A hashing strategy for eOcient knearest neighbors computation, in: Proceedings of the International Conference Computer Graphics, 1999, pp. 120–128. [9] Rusinkiewicz, S., Levoy, M., 2001. Efficient variant of the ICP algorithm, in: 3rd International Conference on 3-D Digital Imaging and Modeling, pp. 145–152 Fig. 6. Views 13-11, assuming known rotation in advance CCVW 2013 Poster Session 52 Proceedings of the Croatian Computer Vision Workshop, Year 1 September 19, 2013, Zagreb, Croatia Filtering for More Accurate Dense Tissue Segmentation in Digitized Mammograms Mario Mustra, Mislav Grgic University of Zagreb, Faculty of Electrical Engineering and Computing, Unska 3, 10000 Zagreb, Croatia [email protected] Abstract - Breast tissue segmentation into dense and fat tissue is important for determining the breast density in mammograms. Knowing the breast density is important both in diagnostic and computer-aided detection applications. There are many different ways to express the density of a breast and good quality segmentation should provide the possibility to perform accurate classification no matter which classification rule is being used. 
Knowing the right breast density and having the knowledge of changes in the breast density could give a hint of a process which started to happen within a patient. Mammograms generally suffer from a problem of different tissue overlapping which results in the possibility of inaccurate detection of tissue types. Fibroglandular tissue presents rather high attenuation of X-rays and is visible as brighter in the resulting image but overlapping fibrous tissue and blood vessels could easily be replaced with fibroglandular tissue in automatic segmentation algorithms. Small blood vessels and microcalcifications are also shown as bright objects with similar intensities as dense tissue but do have some properties which makes possible to suppress them from the final results. In this paper we try to divide dense and fat tissue by suppressing the scattered structures which do not represent glandular or dense tissue in order to divide mammograms more accurately in the two major tissue types. For suppressing blood vessels and microcalcifications we have used Gabor filters of different size and orientation and a combination of morphological operations on filtered image with enhanced contrast. Keywords - Gabor Filter; Breast Density; CLAHE; Morphology I. INTRODUCTION Computer-aided diagnosis (CAD) systems have become an integral part of modern medical systems and their development should provide a more accurate diagnosis in shorter time. Their aim is to help radiologists, especially in screening examinations, when a large number of patients are examined and radiologist often spend very short time for readings of non-critical mammograms. Mammograms are Xray breast images of usually high resolution with moderate to high bit-depth which makes them suitable for capturing fine details. Because of the large number of captured details, computer-aided detection (CADe) systems have difficulties in detection of desired microcalcifications and lesions in the image. Since mammograms are projection images in grayscale, it is difficult to automatically differentiate types of CCVW 2013 Poster Session breast tissue because different tissue types can have the same or very similar intensity. The problem which occurs in different mammograms is that the same tissue type is shown with a different intensity and therefore it is almost impossible to set an accurate threshold based solely on the histogram of the image. To overcome that problem different authors have came up with different solutions. Some authors used statistical feature extraction and classification of mammograms into different categories according to their density. Other approaches used filtering of images and then extracting features from filter response images to try getting a good texture descriptor which can be classified easier and provide a good class separability. Among methods which use statistical feature extraction, Oliver et al. [1] obtained very good results using combination of statistical features extracted not directly from the image, but from the gray level co-occurrence matrices. We can say that they have used 2nd order statistics because they wanted to describe how much adjacent dense tissue exists in each mammogram. Images were later classified into four different density categories, according to BI-RADS [2]. Authors who used image filtering techniques tried to divide, as precisely as possible, breast tissue into dense and fat tissue. 
With the accurate division of dense and fat tissue in breasts it would be possible to quantify the results of breast density classification and classification itself would become trivial because we would have numerical result in number of pixels belonging to the each of two groups. However, the task of defining the appropriate threshold for dividing breast tissue into two categories is far from simple. Each different mammogram captured using the same mammography device is being captured with slightly different parameters which will affect the final intensity of the corresponding tissues in the different image. These parameters are also influenced by the physical property of each different breast, for example its size and amount of dense and fat tissue. In image acquisition process the main objective is to produce an image with very good contrast and no clipping in both low and high intensity regions. Reasons for different intensities for the corresponding tissue, if we neglect usage of different imaging equipment, are difference in the actual breast size, difference in the compressing force applied to the breast during capturing process, different exposure time and different anode current. Having this in mind authors tried to overcome that problem by 53 Proceedings of the Croatian Computer Vision Workshop, Year 1 applying different techniques which should minimize influence of capturing inconsistencies. Muhimmah and Zwiggelaar [3] presented an approach of multiscale histogram analysis having in mind that image resizing will affect the histogram shape because of detail removal when image is being downsized. In this way they were able to remove small bright objects from images and tried to get satisfactory results by determining which objects correspond to large tissue areas. Petroudi et al. [4] used Maximum Response 8 filters [5] to obtain a texton dictionary which was used in conjunction with the support-vector machine classifier to classify the breast into four density categories. Different equipment for capturing mammograms produces resulting images which have very different properties. The most common division is in two main categories: SFM (Screen Film Mammography) and FFDM (Full-Field Digital Mammography). Tortajada et al. [6] have presented a work in which they try to compare accuracy of the same classification method on the SFM and the FFDM images. Results which they have obtained show that there is a high correlation of automatic classification and expert readings and overall results are slightly better for FFDM images. Even inside the same category, such is FFDM, captured images can be very different. DICOM standard [7] recommends bit-depth of 12 bits for mammography modality and images are stored according to the recommendation. However if we later observe the histogram of those images, it is clear that the actual bit-depth is often much lower and is usually below 10 bits. In this paper we present a method which should provide a possibility for division of the breast tissue between parenchymal tissue and fatty tissue without influence of fibrous tissue, blood vessels and fine textural objects which surround the fibroglandular disc. Segmentation of dense or glandular tissue from the entire tissue will be made by setting different thresholds. Our goal is to remove tissue which interferes with dense tissue and makes the division less accurate because non-dense tissue is being treated as dense due to its high intensity when compared to the rest of the tissue. 
Gabor filters generally proved to be efficient in extracting features for the breast cancer detection from mammograms because of their sensitivity to edges in different orientations [8]. Therefore, for the removal of blood vessels, we have used Gabor filter bank which is sensitive to brighter objects which are rather narrow or have high spatial frequency. Output of the entire filter bank is an image which is created from superimposed filter responses of different orientations. Subtraction of the image which represents vessels and dense tissue boundaries from the original image produces a much cleaner image which can later be enhanced in order to equalize intensity levels of the corresponding tissue types among different images. In that way we will be able to distinct dense tissue from fat more accurately. The proposed method has been tested on mammograms from the mini-MIAS database [9] which are all digitized SFMs. This paper is organized as follows. In Section II we present the idea behind image filtering using Gabor filter bank and explain which setup we will use for filtering out blood vessels CCVW 2013 Poster Session September 19, 2013, Zagreb, Croatia and smaller objects. In Section III we present results of filtering with the appropriate filter and discuss results of region growing after contrast enhancement and application of morphological operations. Section IV draws the conclusions. II. GABOR FILTERS Gabor filters are linear filters which are most commonly used for the edge detection purposes as well as the textural feature extraction. Each filter can be differently constructed and it can vary in frequency, orientation and scale. Because of that Gabor filters provide a good flexibility and orientation invariantism. Gabor filter in a complex notation can be expressed as: (x cos θ + y sin θ )2 + γ 2 ( y cos θ − x sin θ )2 G = exp − 2σ 2 2π (x cos θ + y sin θ ) ⋅ exp i + ψ , λ ⋅ (1) where θ is the orientation of the filter, γ is the spatial aspect ratio, λ is the wavelength of the sinusoidal factor, σ is the sigma or width of the Gaussian envelope and ψ is the phase offset. This gives a good possibility to create different filters which are sensitive to different objects in images. To be able to cover all possible blood vessels and small linearly shaped objects it is necessary to use more than one orientation. In our experiment we have used 8 different orientations and therefore obtained the angle resolution of 22.5°. Figure 1 (a)-(h) shows 8 different filter orientations created using (1) with the angle resolution of 22.5° between each filter, from 0 to 157.5° respectively. Besides the orientation angle, one of the most commonly changed variables in (1) is the sinusoidal frequency. Different sinusoidal frequencies will provide different sensitivity of the used filter for different spatial frequencies of objects in images. If the chosen filter contains more wavelengths, filtered image will correspond more to the original image because filters will be sensitive to objects of a high spatial frequency, e.g. details. In the case of smaller number of wavelengths, filtered image will contain highly visible edges. Figure 2 shows different wavelengths of Gabor filter with the same orientation. (a) (b) (c) (d) (e) (f) (g) (h) Figure 1. Gabor filters of the same scale and wavelength with different orientations: (a) 0°; (b) 22.5°; (c) 45°; (d) 67.5°; (e) 90°; (f) 112.5°; (g) 135°; (h) 157.5°. 
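Equation (1) translates into a few lines of NumPy; the real part of the complex filter is what gets convolved with the image. The sketch below builds the 8-orientation bank of Fig. 1, with the kernel size as an assumed parameter; OpenCV's cv2.getGaborKernel produces an equivalent kernel.

```python
import numpy as np

def gabor_kernel(ksize, sigma, theta, lam, gamma=1.0, psi=0.0):
    """Real part of the Gabor filter of Eq. (1): theta is the orientation in
    radians, lam the sinusoidal wavelength, sigma the width of the Gaussian
    envelope, gamma the spatial aspect ratio and psi the phase offset."""
    half = ksize // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = y * np.cos(theta) - x * np.sin(theta)
    envelope = np.exp(-(xr ** 2 + (gamma ** 2) * (yr ** 2)) / (2.0 * sigma ** 2))
    carrier = np.cos(2.0 * np.pi * xr / lam + psi)
    return envelope * carrier

def gabor_bank(ksize, sigma, lam, n_orientations=8):
    """Bank of filters covering 0 to 157.5 degrees in 22.5-degree steps,
    as in Fig. 1."""
    thetas = np.arange(n_orientations) * np.pi / n_orientations
    return [gabor_kernel(ksize, sigma, t, lam) for t in thetas]
```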
54 Proceedings of the Croatian Computer Vision Workshop, Year 1 (a) September 19, 2013, Zagreb, Croatia (b) (a) (c) (b) (d) Figure 2. (a)-(d) Gabor filters which contains larger to smaller number of sinusoidal wavelengths respectively. There is of course another aspect of the filter which needs to be observed and that is the actual dimension of the filter. Dimension of the filter should be chosen carefully according to image size and the size of object which we want to filter out and should be less than 1/10 of the image size. III. IMAGE FILTERING Preprocessing of images is the first step which needs to be performed before filtering. Preprocessing steps include image registration, background suppression with the removal of artifacts and the pectoral muscle removal in the case of Medio-Lateral Oblique mammograms. For this step we have used manually drawn segmentation masks for all images in the mini-MIAS database. These masks were hand-drawn by an experienced radiologist and, because of their accuracy, can be treated as ground truth since there is no segmentation error which can occur as the output of automatic segmentation algorithms. The entire automatic mask extraction process has been described in [10] and steps for the image "mdb002" from the mini-MIAS database are shown in Figure 3. After the preprocessing we have proceeded with locating of the fibroglandular disc position in the each breast image. Fibroglandular disc is a region containing mainly two tissue types, dense or glandular and fat and according to their distribution it is possible to determine in which category according to the density some breast belong. CCVW 2013 Poster Session (c) Figure 3. (a) Original mini-MIAS image "mdb002"; (b) Registered image with removed background and pectoral muscle; (c) Extracted ROI from the same image which will be filtered using Gabor filter. Dense tissue mainly has higher intensity in mammograms because it presents higher attenuation for X-rays than fat tissue. Intensity also changes with the relative position towards edge of the breast because of the change in thickness. Since the fibroglandular disc is our region of interest, we have extracted only that part of the image. Entire preprocessing step done for all images in the mini-MIAS database is described in [11]. To extract ROI we have cropped part of the image according to the maximum breast tissue dimensions as shown in Figure 4. Actual ROI boundaries are chosen to be V and H for vertical and horizontal coordinates respectively: max(horizontal ) max(horizontal ) + max(horizontal ) V = : 2 2 (2) max(vertical ) : max(vertical ) H = 3 where max(horizontal) is the vertical coordinate of the maximal horizontal dimension, and max(vertical) is the horizontal coordinate of the maximal vertical dimension. 55 Proceedings of the Croatian Computer Vision Workshop, Year 1 September 19, 2013, Zagreb, Croatia (a) (b) Figure 6. (a) Enhanced ROI from "mdb001" with the applied threshold; (b) Enhanced ROI from "mdb006" with the applied threshold. Figure 4. Defining maximal breast tissue dimension for ROI extraction. With this approach we get a large part or entire fibroglandular disc isolated and there is no need for the exact segmentation of it. It would be good if we could eliminate fibrous tissue and blood vessels and treat our ROI as it is completely uniform in the case of low density breasts. To be able to perform that task we can choose an appropriate Gabor filter sensitive to objects that we want to remove. 
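Applying such a bank to the extracted ROI and superimposing the per-orientation responses yields an image of the narrow, bright structures (blood vessels and fibrous strands) which is then subtracted from the ROI, as outlined in Section I. The sketch below is one possible reading: the filter parameters are illustrative rather than the paper's values, and "superimposing" is interpreted here as a per-pixel maximum over orientations.

```python
import cv2
import numpy as np

def vessel_response(roi, ksize=31, sigma=4.0, lam=8.0, n_orientations=8):
    """Superimpose Gabor responses over 8 orientations (22.5 degrees apart)
    to obtain an image of narrow, high-spatial-frequency structures."""
    roi32 = roi.astype(np.float32)
    acc = np.zeros_like(roi32)
    for k in range(n_orientations):
        theta = k * np.pi / n_orientations
        kern = cv2.getGaborKernel((ksize, ksize), sigma, theta, lam,
                                  gamma=1.0, psi=0.0)
        acc = np.maximum(acc, cv2.filter2D(roi32, cv2.CV_32F, kern))
    return acc

def suppress_vessels(roi):
    """Subtract the superimposed vessel response from the ROI so that the
    smoother dense/fat tissue variations remain."""
    cleaned = roi.astype(np.float32) - vessel_response(roi)
    return np.clip(cleaned, 0, None)
```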
A good Gabor filter for detection of objects with high spatial frequency contains less sinusoidal wavelengths, like the ones showed in Figure 2 (c) and (d). Contrast Limited Adaptive Histogram Equalization (CLAHE) [12] is a method for local contrast enhancement which is suitable for equalization of intensities in each ROI that we observe. Contrast enhancement obtained using CLAHE method will provide better intensity difference between dense and fat tissue. CLAHE method is based on a division of the image into smaller blocks to allow better contrast enhancement and at the same time uses clipping of maximal histogram components to achieve a better enhancement of mid-gray components. If we observe the same ROI before and after applying CLAHE enhancement it is clear that fat tissue can be filtered out easier after the contrast enhancement. Figure 5 shows application of the contrast enhancement using CLAHE on "mdb001". (a) (b) Figure 5. (a) Original ROI from "mdb001"; (b) Same ROI after the contrast enhancement using CLAHE. CCVW 2013 Poster Session If we apply the threshold on the enhanced ROI we will get the result for "mdb001" and "mdb006" as shown in Figure 6 (a) and (b). These two images belong to the opposite categories according to the amount of dense tissue. The applied threshold is set to 60% of the mean image intensity. After applying threshold, images have visibly different properties according to the tissue type. It is not possible to apply the same threshold because different tissue type has different intensities. Contrast enhancement makes the detection of fibrous tissue and vessels easier especially after Gabor filtering, Figure 7 (a) and (b). After contrast enhancement and filtering images using Gabor filter to remove fibrous tissue we need to make a decision in which category according to density each breast belongs. For that we will use binary logic with different threshold applied to images. We will apply two thresholds, at 60% and 80% of the maximal intensity and calculate the area contained in both situations. For that we will use logical AND operator, Figure 8. From Figure 8 (e) and (f) we can see that combination of threshold images using the logical AND for low and high density give the correct solution. Figures 8 (e) and (f) show the result of (a) AND (c) and (b) AND (d). (a) (b) Figure 7. (a) ROI from "mdb001" after applying Gabor filter; (b) ROI from "mdb006" after applying Gabor filter. 56 Proceedings of the Croatian Computer Vision Workshop, Year 1 September 19, 2013, Zagreb, Croatia blood vessels or small fibrous tissue segments and contrast enhancement provides a comparability of the same tissue type in different mammograms. Our future work in this field will be development of automatic segmentation algorithms for dense tissue in order to achieve a quantitative breast density classification by knowing the exact amount of the dense tissue. ACKNOWLEDGMENT (a) (b) The work described in this paper was conducted under the research project "Intelligent Image Features Extraction in Knowledge Discovery Systems" (036-0982560-1643), supported by the Ministry of Science, Education and Sports of the Republic of Croatia. REFERENCES [1] (c) (d) (e) (f) Figure 8. 
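A compact sketch of the decision step follows: CLAHE enhancement of the vessel-suppressed ROI and the two thresholds combined with a logical AND. The CLAHE clip limit and tile grid are assumed settings, and the reference levels of the 60% and 80% thresholds (mean versus maximal intensity) follow one reading of the text above.

```python
import cv2
import numpy as np

def dense_tissue_mask(roi, clip_limit=2.0, tiles=(8, 8),
                      low_frac=0.6, high_frac=0.8):
    """CLAHE enhancement followed by two thresholds combined with a logical
    AND, as in Fig. 8.  The low threshold is taken relative to the mean
    intensity and the high one relative to the maximum; this split is an
    interpretation of the text."""
    roi8 = cv2.normalize(roi.astype(np.float32), None, 0, 255,
                         cv2.NORM_MINMAX).astype(np.uint8)
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=tiles)
    enhanced = clahe.apply(roi8)
    low_mask = enhanced >= low_frac * enhanced.mean()
    high_mask = enhanced >= high_frac * enhanced.max()
    return np.logical_and(low_mask, high_mask)
```

Counting the pixels of the resulting mask relative to the breast area then gives the kind of numerical density measure the paper aims for.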
(a) "mdb001" after filtering out blood vessels and threshold at 60% of maximal intensity; (b) "mdb006" after filtering out blood vessels and threshold at 60% of maximal intensity; (c) "mdb001", threshold at 80%; (d) "mdb006", threshold at 80%; (e) "mdb001" threshold at 60% AND threshold at 80%; (f) "mdb006" threshold at 60% AND threshold at 80%. IV. CONCLUSION In this paper we have presented a method which combines the usage of local contrast enhancement biased on the mid-gray intensities. The contrast enhancement has been achieved using CLAHE method. Gabor filters have been used for the removal of blood vessels and smaller portions of fibrous tissue which has similar intensity as dense tissue. Combination of different thresholds in conjunction with the logical AND operator provided a good setup for determining whether we have segmented a fat or dense tissue and therefore can give as a rule for segmenting each mammogram with different threshold, no matter what the overall tissue intensity is. The advantage of Gabor filter over classical edge detectors is in easy orientation changing and possibility to cover all possible orientations by superpositioning filter responses. Usage of Gabor filters improves the number of false positive results which come from CCVW 2013 Poster Session A. Oliver, J. Freixenet, R. Martí, J. Pont, E. Pérez, E.R.E. Denton, R. Zwiggelaar, "A Novel Breast Tissue Density Classification Methodology", IEEE Transactions on Information Technology in Biomedicine, Vol. 12, Issue 1, January 2008, pp. 55-65. [2] American College of Radiology, "Illustrated Breast Imaging Reporting and Data System (BI-RADS) Atlas", American College of Radiology, Third Edition, 1998. [3] I. Muhimmah, R. Zwiggelaar, "Mammographic Density Classification using Multiresolution Histogram Information", Proceedings of the International Special Topic Conference on Information Technology in Biomedicine, ITAB 2006, Ioannina - Epirus, Greece, 26-28 October 2006, 6 pages. [4] S. Petroudi, T. Kadir, M. Brady, "Automatic Classification of Mammographic Parenchymal Patterns: a Statistical Approach", Proceedings of the 25th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Cancun, Mexico, Vol. 1, 17-21 September 2003, pp. 798-801. [5] M. Varma and A. Zisserman, "Classifying images of materials: Achieving viewpoint and illumination independence", Proceedings of the European Conference on Computer Vision, Copenhagen, Denmark, 2002, pp. 255–271. [6] M. Tortajada, A. Oliver, R. Martí, M. Vilagran, S. Ganau, . Tortajada, M. Sentís, J. Freixenet, "Adapting Breast Density Classification from Digitized to Full-Field Digital Mammograms", Breast Imaging, Springer Berlin Heidelberg, 2012, pp. 561-568. [7] Digital Imaging and Communications in Medicine (DICOM), NEMA Publications, "DICOM Standard", 2011, available at: http://medical.nema.org/Dicom/2011/ [8] Y. Zheng, "Breast Cancer Detection with Gabor Features from Digital Mammograms", Algorithms 3, No. 1, 2010, pp. 44-62. [9] J. Suckling, J. Parker, D.R. Dance, S. Astley, I. Hutt, C.R.M. Boggis, I. Ricketts, E. Stamatakis, N. Cernaez, S.L. Kok, P. Taylor, D. Betal, J. Savage, "The Mammographic Image Analysis Society Digital Mammogram Database", Proceedings of the 2nd International Workshop on Digital Mammography, York, England, pp. 375-378, 10-12 July 1994. [10] M. Mustra, M. Grgic, R. Huzjan-Korunic, "Mask Extraction from Manually Segmented MIAS Mammograms", Proceedings ELMAR2011, Zadar, Croatia, September 2011, pp. 47-50. [11] M. 
Mustra, M. Grgic, J. Bozek, "Application of Gabor Filters for Detection of Dense Tissue in Mammograms", Proceedings ELMAR2012, Zadar, Croatia, September 2012, pp. 67-70. [12] K. Zuiderveld, "Contrast Limited Adaptive Histogram Equalization", Graphic Gems IV. San Diego, Academic Press Professional, 1994, pp. 474-485. 57 Proceedings of the Croatian Computer Vision Workshop, Year 1 September 19, 2013, Zagreb, Croatia Flexible Visual Quality Inspection in Discrete Manufacturing Tomislav Petković, Darko Jurić and Sven Lončarić University of Zagreb Faculty of Electrical and Computer Engineering Unska 3, HR-10000 Zagreb, Croatia Email: {tomislav.petkovic.jr, darko.juric, sven.loncaric}@fer.hr Abstract—Most visual quality inspections in discrete manufacturing are composed of length, surface, angle or intensity measurements. Those are implemented as end-user configurable inspection tools that should not require an image processing expert to set up. Currently available software solutions providing such capability use a flowchart based programming environment, but do not fully address an inspection flowchart robustness and can require a redefinition of the flowchart if a small variation is introduced. In this paper we propose an acquire-register-analyze image processing pattern designed for discrete manufacturing that aims to increase the robustness of the inspection flowchart by consistently addressing variations in product position, orientation and size. A proposed pattern is transparent to the end-user and simplifies the flowchart. We describe a developed software solution that is a practical implementation of the proposed pattern. We give an example of its real-life use in industrial production of electric components. I. I NTRODUCTION Automated visual quality inspection is the process of comparing individual manufactured items against some preestablished standard. It is frequently used both for small and high-volume assembly lines due to low costs and nondestructiveness. Also, ever increasing demands to improve quality management and process control within an industrial environment are promoting machine vision and visual quality inspection with the goal of increasing the product quality and production yields. When designing a machine vision system many different components (camera, lens, illumination and software) must be selected. A machine vision expert is required to select the system components based on requirements of the inspection task by specifying [1]: 1) 2) 3) 4) Camera: type (line or area), field of view, resolution, frame rate, sensor type, sensor spectral range, Lens: focal length, aperture, flange distance, sensor size, lens quality, Illumination: direction, spectrum, polarization, light source, mechanical adjustment elements, and Software: libraries to use, API ease of use, software structure, algorithm selection. Selected components are then used to measure length, surface, angle or intensity of the product. Based on measurements a decision about quality of an inspected product is made. Although software is a small part of an quality inspection CCVW 2013 Poster Session system usually, in addition to a machine vision expert, an image processing expert is required to select the software and to define the image processing algorithms that will be used. 
Nowadays, introduction of GigE Vision [2] and GenICam [3] standards significantly simplified integration of machine vision cameras and image processing libraries making vision solutions for the industry more accessible by eliminating the need for proprietary software solutions for camera interaction. Existing open source image processing libraries especially suitable for development of vision applications, such as OpenCV (Open Source Computer Vision Library, [4]), provide numerous algorithms that enable rapid development of visual quality inspection systems and simplify software selection, however, specific details of image processing chain for any particular visual inspection must be defined by an image processing expert on a case-by-case basis, especially if both robust and flexible solution is required. To eliminate the need for the image processing expert stateof-the art image processing algorithms are bundled together into specialized end-user configurable measurement tools that can be easily set-up. Such tools should be plugins or modules of a larger application where image acquisition, image display, user interface and process control parts are shared/reused. Usually a graphical user interface is used where flowcharts depict image processing pipeline, e.g. NI Vision Builder [5]. Such environments provide many state-of-the art image processing algorithms that can be chained together or are pre-assembled into measurement tools, but there is no universally accepted way of consistently addressing problems related to variations in product position, orientation and size in the acquired image. To further reduce the need for an image processing expert a registration step that would remove variation due to product position, orientation and size should be introduced. Included registration step would enable simpler deployment of more robust image processing solutions in discrete manufacturing1 . In this paper we propose an acquire-register-analyze image processing pattern that is especially suited to discrete manufacturing. Proposed acquire-register-analyze pattern aims to increase reproducibility of an image processing flowchart by consistently addressing variations in product position, orientation and size through the registration step that is implemented in a way transparent to the end-user. 1 Discrete manufacturing is production of any distinct items capable of being easily counted. 58 Proceedings of the Croatian Computer Vision Workshop, Year 1 II. I MAGE P ROCESSING PATTERNS For discrete manufacturing once an image is acquired processing is usually done in two steps [6], first is object localization that is followed by object scrutiny and measurement. More detailed structure of an image processing pipeline is given in [1], where typical image processing steps are listed: 1) 2) 3) 4) 5) 6) 7) image acquisition, image preprocessing, feature or object localization, feature extraction, feature interpretation, generation of results, and handling interfaces. A. Acquire-Analyze Pattern Seven typical image processing steps can be decomposed as follows: Image preprocessing step includes image correction techniques such as illumination equalization or distortion correction. Today this is usually done by the acquisition device itself, e.g. GenICam standard [3] requires tools that are sufficient for intensity or color correction. Feature or object localization is a first step of the image processing chain and is usually selected on a case-by-case basis to adjust for variations is object positioning. 
Feature extraction utilizes typical processing techniques such as edge detection, blob extraction, pattern matching and texture analysis. Results of the feature extraction step are interpreted in the next step to make actual measurements that are used to generate results and to handle interfaces of an assembly line. We call this image processing pattern acquire-analyze as feature or object localization must be repeated for (almost) every new feature of interest in the image2 . A structure of this pattern is shown in Fig. 1. Image acquisition image image processing chain Preprocessing image Feature localization regions of interest Feature extraction feature image Feature interpretation measurements Generation of results Fig. 1. Acquire-analyze image processing pattern. 2 This holds for discrete manufacturing. For other typed of manufacturing processes registration is not always necessary, e.g. in fabric production feature of interest is texture making registration unnecessary. CCVW 2013 Poster Session September 19, 2013, Zagreb, Croatia Shortcomings of acquire-analyze pattern in discrete manufacturing are: a) image processing elements must be chosen and chained together so variation in product placement does not affect the processing, b) processing can be sensitive to camera/lens repositioning or replacement during lifetime of the assembly line, and c) for complex inspections image processing chain requires an image processing expert. Processing can be more robust and the image processing chain easier to define if a registration step is introduced in a way that makes results of feature localization less dependent on variations in position. When introducing a registration step several requirements must be fulfilled. Firstly, registration must be introduced in a way transparent to the end user, and, secondly, registration must be introduced so unneeded duplication of the image data is avoided. B. Acquire-Register-Analyze Pattern A common usage scenario for visual inspection software in discrete manufacturing assumes the end-user who defines all required measurements and their tolerances in a reference or source image of the inspected product. Inspection software then must automatically map the image processing chain and required measurements from the reference or source image to the image of the inspected product, the target image, by registering two images [7]. End result of the registration step is: a) removal of variation due to changes in position of the product, b) no unnecessary image data is created as no image transforms are performed, instead image processing algorithms are mapped to the target image, c) as only image processing algorithms are mapped overall mapping can be significantly faster then if the target image data is mapped to the source image, and d) no image interpolation is needed which can lead to overall better performance as any interpolation artifacts are eliminated by the system design. In discrete manufacturing all products are solid objects so feature based global linear rigid transform is usually sufficient to map the defined image processing chain from the source to the target image; that is, compensating translation, rotation and scaling by using simple homography is sufficient. Required mapping transform is defined by a 4 × 4 matrix T [8]. This transform matrix must be transparently propagated through the whole processing chain. 
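One way to make the propagation of T concrete: each measurement block keeps its ROI in source-image coordinates and re-expresses it in the target image through the 4x4 homogeneous transform, so no pixel data is ever warped or interpolated. The class and function names in this sketch are illustrative and not taken from the described software.

```python
import numpy as np

def map_points(T, points_xy):
    """Map 2-D image points through the 4x4 homogeneous transform T
    recovered by the registration block (z = 0 for planar coordinates)."""
    pts = np.asarray(points_xy, dtype=float)
    homog = np.column_stack([pts, np.zeros(len(pts)), np.ones(len(pts))])
    mapped = homog @ T.T
    return mapped[:, :2] / mapped[:, 3:4]

class AngleMeasurementTool:
    """Minimal sketch of a flowchart block: its ROI is drawn once in the
    source image and re-expressed in target coordinates for every inspected
    product, so the target image itself is never resampled."""
    def __init__(self, roi_corners_source):
        self.roi_source = np.asarray(roi_corners_source, dtype=float)

    def roi_in_target(self, T):
        return map_points(T, self.roi_source)
```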
If a graphical user interface for defining the processing chain is used, this means that all processing tools must accept the transform matrix T as an input parameter, which can be optionally hidden from the end-user to achieve transparency.

We call this image processing pattern acquire-register-analyze, as feature or object localization is done once at the beginning of the pipeline to find the transform T. The structure of this pattern is shown in Fig. 2.

Fig. 2. Acquire-register-analyze image processing pattern.

III. SOFTWARE DESIGN AND IMPLEMENTATION

To test the viability and usefulness of the proposed acquire-register-analyze pattern a visual inspection software application was constructed. The software was written in the C++ and C# programming languages for the .NET platform and uses the Smartek GigEVisionSDK [9] for image acquisition, and OpenCV [4] and Emgu CV [10] for implementing the image processing tools. The graphical user interface (Fig. 3) and the logic to define an image processing chain were written from scratch using C#.

A. User Interface

The interface of the inspection software must be designed to make various runtime tasks simple. Typical runtime tasks in visual inspection are [1]: a) monitoring, b) changing the inspection pipeline, c) system calibration, d) adjustment and tweaking of parameters, e) troubleshooting, f) re-teaching, and g) optimizing. These various tasks require several different interfaces, so the test application was designed as a tabbed interface with four tabs: the first for immediate inspection monitoring and tweaking, the second for overall process monitoring, the third for system calibration, and the fourth for the definition of the image processing chain (Fig. 3).

Fig. 3. User interface for defining the image processing chain by flowchart.

The user interface for defining an image processing tool-chain must be composed so that the end-user is able to intuitively draw any required measurement in the reference image. The interface should also hide most of the image processing steps that are required to compute the measurement and that are not the expertise of the end-user, e.g. edge extraction, ridge extraction, object segmentation, thresholding, blob extraction, etc. This is achieved by representing all measurements by rectangular blocks. Inputs are always placed on the left side of the block, while outputs are always placed on the right side of the block. The name is always shown above the block and any optional control parameters are shown on the bottom side of the block. Several typical blocks are shown in Fig. 4.

Fig. 4. Example of four image processing tools. Inputs are always on the left side, outputs are always on the right side and optional parameters are on the bottom of the block.

B. Registration

Two blocks that are present in every inspection flowchart are the input and output blocks. For the acquire-register-analyze pattern the first block in the flowchart following the input block is a registration block that internally stores the reference or source image. For every target image the transform is automatically propagated to all following blocks. Note that the actual implementation of the registration is not fixed; any technique that enables recovery of a simple homography is acceptable. Using OpenCV [4], a simple normalized cross-correlation or a more complex registration based on key-point detectors such as FAST, BRISK or ORB can be used.
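As one possible concrete realization of such a registration block, the following minimal OpenCV sketch recovers the homography from ORB key-point matches. This is our example under the stated assumptions, not the implementation used in the described software; the function name and parameter values are hypothetical, and the recovered 3 × 3 homography would still have to be embedded into the 4 × 4 matrix convention used above before being propagated.

// Hypothetical sketch of a key-point based registration step: recover a
// homography H mapping source-image coordinates to target-image coordinates
// using ORB features, brute-force matching and RANSAC. Any other technique
// that yields such a transform (e.g. normalized cross-correlation) could be
// used instead, as noted in the text.
#include <opencv2/core.hpp>
#include <opencv2/features2d.hpp>
#include <opencv2/calib3d.hpp>
#include <vector>

bool registerSourceToTarget(const cv::Mat& source, const cv::Mat& target, cv::Mat& H) {
  cv::Ptr<cv::ORB> orb = cv::ORB::create(1000);            // detect up to 1000 key-points
  std::vector<cv::KeyPoint> kpS, kpT;
  cv::Mat descS, descT;
  orb->detectAndCompute(source, cv::noArray(), kpS, descS);
  orb->detectAndCompute(target, cv::noArray(), kpT, descT);
  if (descS.empty() || descT.empty()) return false;

  cv::BFMatcher matcher(cv::NORM_HAMMING, true);           // Hamming distance with cross-check
  std::vector<cv::DMatch> matches;
  matcher.match(descS, descT, matches);
  if (matches.size() < 4) return false;                    // a homography needs at least 4 matches

  std::vector<cv::Point2f> ptsS, ptsT;
  for (const cv::DMatch& m : matches) {
    ptsS.push_back(kpS[m.queryIdx].pt);
    ptsT.push_back(kpT[m.trainIdx].pt);
  }
  H = cv::findHomography(ptsS, ptsT, cv::RANSAC, 3.0);     // robust estimate, 3 px reprojection threshold
  return !H.empty();
}

In the described pattern the registration block would store the source image once, run such an estimation for every acquired target image, and propagate the resulting transform to all following blocks.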
Figs. 5 and 6 show the source and the target images with the displayed regions of interest where lines are extracted for the angle measurement tool. The user can transparently switch between the source and the target image, as the software automatically adjusts the processing tool-chain.

Fig. 5. Tool's ROIs for line extraction and angle measurement shown in the source image.

C. Annotations and ROIs

An additional problem when introducing registration transparently to the end-user are user-defined regions of interest (ROIs) and annotations that show measurement and inspection results on a non-destructive overlay. Every user-defined ROI is usually a rectangular part of the reference or source image that has a new local coordinate system and therefore requires an additional transform matrix, so that the image processing chain can be mapped first from the source to the target, and then from the target to the ROI. Composition of the transforms achieves the desired mapping. However, as all image processing tools are mapped to the input image or to the defined ROIs, and as there can be many different mapping transforms, the inspection result cannot be displayed as a non-destructive overlay unless a transform from the local coordinate system to the target image is known³. This transform, required to achieve a correct overlay display, can also be described by a 4 × 4 matrix D that must be computed locally and propagated every time a new ROI is introduced in the image processing chain. So, for the acquire-register-analyze pattern, the elements of the image processing chain can be represented as blocks that must accept T and D as inputs/outputs that are automatically connected by the software (Fig. 4), and those connections can be auto-hidden depending on the experience of the end-user.

³ Note that this is NOT a transform back to the source image.

IV. RESULTS

The developed prototype software using the proposed acquire-register-analyze pattern is in production use at the Elektro-Kontakt d.d. Zagreb plant. The plant manufactures electrical switches and energy regulators for electrical stoves (discrete manufacturing).

The observed variations in the acquired images that must be compensated using registration are caused by product movement due to positioning tolerances on a conveyor and by changes in the camera position⁴. Regardless of the cause, the variation in product placement with respect to the original position in the source image can be computed. Thus every used image processing chain starts with a registration block that registers the prerecorded source image to the target image and automatically propagates the registration results through the processing chain. The developed prototype software was accepted by the engineers and staff in the plant, has successfully completed the testing stage, and is now used in regular production.

⁴ After regular assembly line maintenance the camera position is never perfectly reproducible.

A. Contact Alignment Inspection

There are two automated rotary tables with 8 nests for welding, where two metal parts are welded to form the spring contact. There are variations in product placement between the nests and variations in the input image due to differences in camera placement above the two rotary tables (see Fig. 8). The image processing chain is composed of the registration step that automatically adjusts the position of the subsequent measurement tools, which measure user-defined distances and angles and compare them to the product tolerances. The proposed design enables one product reference image to be used, in which the required product dimensions are specified and which is effectively shared between production lines, thus reducing the engineering overhead as only one processing pipeline must be maintained.
Fig. 6. Tool's ROIs for line extraction and angle measurement shown in the target image.

Fig. 7. Variations in product placement (first and second assembly lines).

Fig. 8. Rotary table for contact welding.

B. Energy Regulator Inspection

There are four assembly lines where an energy regulator for electrical stoves is assembled. At least seventeen different product subtypes (Fig. 9) are manufactured. There is a variation in product placement due to positioning tolerances on the conveyor belt and due to variations in the cameras' positions across all four assembly lines. The defined inspection tool-chains must be shareable across the lines to make the system maintenance simple.

Fig. 9. Four different product variants (body and contact shape) of an energy regulator recorded on the same assembly line.

The large number of product variants requires a user interface that enables flexible adaptation and tweaking of the processing pipeline. Here the graphical programming environment (Fig. 3) was invaluable, as it enables the engineer to easily redefine the measurements directly on the factory floor. Furthermore, due to the registration step only one reference image per product subtype is required, so only seventeen different image processing chains must be tested and maintained.

V. CONCLUSION

In this paper we have presented an acquire-register-analyze image processing pattern that aims to increase the robustness of an image processing flowchart by consistently addressing variations in product position, orientation and size. The benefits of the proposed pattern are: a) no unnecessary image data is created from the input image by the application during image processing, b) the overall speed and throughput of the image processing pipeline is increased, and c) interpolation artifacts are avoided. We have demonstrated the feasibility of the proposed pattern through case studies and real-world use at the Elektro-Kontakt d.d. Zagreb plant.

ACKNOWLEDGMENT

The authors would like to thank Elektro-Kontakt d.d. Zagreb and Mr. Ivan Tabaković for their invaluable support.
REFERENCES

[1] A. Hornberg, Handbook of Machine Vision. Wiley-VCH, 2006.
[2] (2013, Aug.) GigE Vision standard. [Online]. Available: http://www.visiononline.org/vision-standards-details.cfm?type=5
[3] (2013, Aug.) GenICam standard. [Online]. Available: http://www.genicam.org/
[4] (2013, Aug.) OpenCV (Open Source Computer Vision Library). [Online]. Available: http://opencv.org/
[5] (2013, Aug.) NI Vision Builder for Automated Inspection (AI). [Online]. Available: http://www.ni.com/vision/vbai.htm
[6] E. R. Davies, Computer and Machine Vision: Theory, Algorithms, Practicalities, 4th ed. Academic Press, 2012.
[7] L. G. Brown, "A survey of image registration techniques," ACM Comput. Surv., vol. 24, no. 4, pp. 325–376, Dec. 1992. [Online]. Available: http://doi.acm.org/10.1145/146370.146374
[8] J. D. Foley, A. van Dam, S. K. Feiner, J. F. Hughes, and R. L. Phillips, Introduction to Computer Graphics, 1st ed. Addison-Wesley Professional, Aug. 1993.
[9] (2013, Aug.) GigEVisionSDK Camera Acquisition PC-Software. [Online]. Available: http://www.smartekvision.com/downloads.php
[10] (2013, Aug.) Emgu CV .Net wrapper for OpenCV. [Online]. Available: http://www.emgu.com/

Author Index

B
Banić, N. . . . 3
Brkić, K. . . . 9, 19

C
Cupec, R. . . . 31

F
Filko, D. . . . 31

G
Gold, H. . . . 25
Grgić, M. . . . 53

H
Hrgetić, V. . . . 49

I
Ivanjko, E. . . . 25

J
Jurić, D. . . . 58

K
Kalafatić, Z. . . . 9
Kitanov, A. . . . 31
Kovačić, K. . . . 25
Krešo, I. . . . 43

L
Lončarić, S. . . . 3, 58
Lopar, M. . . . 15

M
Muštra, M. . . . 53

N
Nyarko, E. K. . . . 31

P
Petković, T. . . . 58
Petrović, I. . . . 31
Pinz, A. . . . 9
Pribanić, T. . . . 49

R
Rašić, S. . . . 9
Ribarić, S. . . . 15

S
Šegvić, S. . . . 9, 19, 37, 43
Ševrović, M. . . . 43
Sikirić, I. . . . 19

Z
Zadrija, V. . . . 37