Proceedings of the 2nd Croatian Computer Vision Workshop (CCVW 2013)
Second Croatian
Computer Vision Workshop
September 19, 2013, Zagreb, Croatia
University of Zagreb
Center of Excellence
for Computer Vision
Proceedings of the Croatian Computer Vision Workshop
CCVW 2013, Year 1
Publisher: University of Zagreb Faculty of Electrical Engineering and Computing, Unska 3, HR-10000 Zagreb, Croatia
ISSN 1849-1227
CCVW 2013
Proceedings of the Croatian Computer Vision Workshop
Zagreb, Croatia, September 19, 2013
S. Lončarić, S. Šegvić (Eds.)
Organizing Institution
Center of Excellence for Computer Vision,
Faculty of Electrical Engineering and Computing,
University of Zagreb, Croatia
Technical Co-Sponsors
IEEE Croatia Section
IEEE Croatia Section Computer Society Chapter
IEEE Croatia Section Signal Processing Society Chapter
Smartek d.o.o.
AVL–AST d.o.o.
Peek promet d.o.o.
PhotoPay Ltd.
Proceedings of the Croatian Computer Vision Workshop
CCVW 2013, Year 1
Editors
Sven Lončarić ([email protected])
University of Zagreb Faculty of Electrical Engineering and Computing
Unska 3, HR-10000 Zagreb, Croatia
Siniša Šegvić ([email protected])
University of Zagreb Faculty of Electrical Engineering and Computing
Unska 3, HR-10000 Zagreb, Croatia
Production, Publishing and Cover Design
Tomislav Petković ([email protected])
University of Zagreb Faculty of Electrical Engineering and Computing
Unska 3, HR-10000 Zagreb, Croatia
University of Zagreb Faculty of Electrical Engineering and Computing
Unska 3, HR-10000 Zagreb, OIB: 57029260362
Copyright © 2013 by the University of Zagreb. This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
ISSN 1849-1227
Proceedings of the Croatian Computer Vision Workshop, Year 1
September 19, 2013, Zagreb, Croatia
On behalf of the Organizing Committee it is my pleasure to invite you to Zagreb for the 2nd Croatian Computer Vision
Workshop. The objective of the Workshop is to bring together professionals from academia and industry in the area
of computer vision theory and applications in order to foster research and encourage academia-industry collaboration
in this dynamic field. The Workshop program includes oral and poster presentations of original peer reviewed research
from Croatia and elsewhere. Furthermore, the program includes invited lectures by distinguished international researchers
presenting state-of-the-art in computer vision research. Workshop sponsors will provide perspective on needs and activities
of the industry. Finally, one session shall be devoted to short presentations of activities at Croatian research laboratories.
The Workshop is organized by the Center of Excellence for Computer Vision, which is located at the Faculty of Electrical
Engineering and Computing (FER), University of Zagreb. The Center joins eight research laboratories at FER and research
laboratories from six constituent units of the University of Zagreb: Faculty of Forestry, Faculty of Geodesy, Faculty of
Graphic Arts, Faculty of Kinesiology, Faculty of Mechanical Engineering and Naval Architecture, and Faculty of Transport
and Traffic Sciences.
Zagreb is a beautiful European city with many cultural and historical attractions, which I am sure all participants will enjoy. I look forward to meeting you all in Zagreb at the 2nd Croatian Computer Vision Workshop.
September 2013
Sven Lončarić, General Chair
The 2013 2nd Croatian Computer Vision Workshop (CCVW) is the result of the committed efforts of many volunteers.
All included papers are results of dedicated research. Without such contribution and commitment this Workshop would
not have been possible.
Program Committee members and reviewers have spent many hours reviewing submitted papers and providing extensive
reviews which will be an invaluable help in future work of collaborating authors. Managing the electronic submissions
of the papers, the preparation of the abstract booklet and of the online proceedings also required substantial effort and
dedication that must be acknowledged. The Local Organizing Committee members did an excellent job to guarantee a
successful outcome of the Workshop.
We are grateful to the Technical Co-Sponsors, who helped us guarantee the high scientific quality of the presentations, and to the Donators who financially supported this Workshop.
Organizing Committee
Reviewers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
CCVW 2013 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Oral Session . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
N. Banić, S. Lončarić
Using the Random Sprays Retinex Algorithm for Global Illumination Estimation . . . . . . . . . . .
K. Brkić, S. Rašić, A. Pinz, S. Šegvić, Z. Kalafatić
Combining Spatio-Temporal Appearance Descriptors and Optical Flow for Human Action Recognition in Video Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
M. Lopar, S. Ribarić
An Overview and Evaluation of Various Face and Eyes Detection Algorithms for Driver Fatigue
Monitoring Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
I. Sikirić, K. Brkić, S. Šegvić
Classifying Traffic Scenes Using The GIST Image Descriptor . . . . . . . . . . . . . . . . . . . . . . 19
Poster Session . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
K. Kovačić, E. Ivanjko, H. Gold
Computer Vision Systems in Road Vehicles: A Review . . . . . . . . . . . . . . . . . . . . . . . . . . 25
R. Cupec, E. K. Nyarko, D. Filko, A. Kitanov, I. Petrović
Global Localization Based on 3D Planar Surface Segments . . . . . . . . . . . . . . . . . . . . . . . . 31
V. Zadrija, S. Šegvić
Multiclass Road Sign Detection using Multiplicative Kernel . . . . . . . . . . . . . . . . . . . . . . . 37
I. Krešo, M. Ševrović, S. Šegvić
A Novel Georeferenced Dataset for Stereo Visual Odometry . . . . . . . . . . . . . . . . . . . . . . . 43
V. Hrgetić, T. Pribanić
Surface Registration Using Genetic Algorithm in Reduced Search Space . . . . . . . . . . . . . . . . 49
M. Muštra, M. Grgić
Filtering for More Accurate Dense Tissue Segmentation in Digitized Mammograms . . . . . . . . . . 53
T. Petković, D. Jurić, S. Lončarić
Flexible Visual Quality Inspection in Discrete Manufacturing . . . . . . . . . . . . . . . . . . . . . . 58
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Organizing Committee
General Chair
Sven Lončarić, University of Zagreb, Croatia
Technical Program Committee Chair
Siniša Šegvić, University of Zagreb, Croatia
Local Arrangements Chair
Zoran Kalafatić, University of Zagreb, Croatia
Publications Chair
Tomislav Petković, University of Zagreb, Croatia
Technical Program Committee
Bart Bijnens, Spain
Hrvoje Bogunović, USA
Mirjana Bonković, Croatia
Karla Brkić, Croatia
Andrew Comport, France
Robert Cupec, Croatia
Albert Diosi, Slovakia
Hrvoje Dujmić, Croatia
Ivan Fratrić, Switzerland
Mislav Grgić, Croatia
Sonja Grgić, Croatia
Edouard Ivanjko, Croatia
Zoran Kalafatić, Croatia
Stanislav Kovačič, Slovenia
Zoltan Kato, Hungary
Josip Krapac, Croatia
Matej Kristan, Slovenia
Sven Lončarić, Croatia
Lidija Mandić, Croatia
Thomas Mensink, Netherlands
Vladan Papić, Croatia
Tomislav Petković, Croatia
Ivan Petrović, Croatia
Thomas Pock, Austria
Tomislav Pribanić, Croatia
Arnau Ramisa, Spain
Gianni Ramponi, Italy
Anthony Remazeilles, Spain
Slobodan Ribarić, Croatia
Andres Santos, Spain
Damir Seršić, Croatia
Darko Stipaničev, Croatia
Marko Subašić, Croatia
Federico Sukno, Ireland
Reviewers
Albert Diosi, Australia
Andres Santos, Spain
Andrew Comport, France
Anthony Remazeilles, France
Arnau Ramisa, Spain
Axel Pinz, Austria
Bart Bijnens, Spain
Darko Stipaničev, Croatia
Edouard Ivanjko, Croatia
Emmanuel Karlo Nyarko, Croatia
Federico Sukno, Argentina
Hrvoje Bogunović, USA
Hrvoje Dujmić, Croatia
Ivan Fratrić, Croatia
Ivan Marković, Croatia
Josip Krapac, Croatia
Josip Ćesić, Croatia
Karla Brkić, Croatia
Lidija Mandić, Croatia
Marko Subašić, Croatia
Matej Kristan, Slovenia
Maxime Meilland, France
Mirjana Bonković, Croatia
Mislav Grgić, Croatia
Robert Cupec, Croatia
Siniša Šegvić, Croatia
Slobodan Ribarić, Croatia
Sonja Grgić, Croatia
Stanislav Kovačič, Slovenia
Thomas Mensink, The Netherlands
Thomas Pock, Austria
Tomislav Petković, Croatia
Tomislav Pribanić, Croatia
Vladan Papić, Croatia
Zoltan Kato, Hungary
Zoran Kalafatić, Croatia
Using the Random Sprays Retinex algorithm for
global illumination estimation
Nikola Banić and Sven Lončarić
University of Zagreb, Faculty of Electrical Engineering and Computing
Unska 3, 10000 Zagreb, Croatia
E-mail: {nikola.banic, sven.loncaric}
Abstract—In this paper the use of the Random Sprays Retinex (RSR) algorithm for global illumination estimation is proposed and its feasibility is tested. Like other algorithms based on the Retinex model, RSR provides a local illumination estimation and brightness adjustment for each pixel, and it is faster than other path-wise Retinex algorithms. As the assumption of uniform illumination holds in many cases, it should be possible to use the mean of the local illumination estimations of RSR as a global illumination estimation for images with (assumed) uniform illumination, which also allows the accuracy to be easily measured. We therefore propose a method for global illumination estimation based on local RSR results. To the best of our knowledge, this is the first time that the RSR algorithm has been used to obtain a global illumination estimation. For our tests we use a publicly available color constancy image database. The results are presented and discussed, and it turns out that the proposed method outperforms many existing unsupervised color constancy algorithms. The source code is available at constancy/.
Keywords—white balance, color constancy, Retinex, Random Sprays Retinex, sampling

A well-known feature of the human visual system (HVS) is its ability to recognize the color of objects under variable illumination, even though this color depends on the color of the light source [1]. An example of an object under different light sources can be seen in Fig. 1. This feature is called color constancy, and achieving it computationally can significantly enhance the quality of digital images. Even though the HVS generally has no problem with it, computational color constancy is an ill-posed problem, so color constancy algorithms have to rely on assumptions about the color distribution, uniform illumination, the presence of white patches, etc. Under the Lambertian and single light source assumptions, the dependence of the observed color of the light source e on the light source spectrum I(λ) and the camera sensitivity function ρ(λ), which are both unknown, can be written as

e_c = ∫ I(λ) ρ_c(λ) dλ    (1)

for each channel c, and it represents the illumination to be estimated.

Fig. 1: Same object under different illuminations.

Examples of color constancy algorithms include the gray-world algorithm [2], shades of gray [3], gray-edge [4], gamut mapping [5], algorithms using neural networks [6], algorithms using high-level visual information (HLVI) [7], probabilistic algorithms [8] and combinations of existing methods [9]. The result of all mentioned algorithms is a single vector representing the global illumination estimation, which is then used in chromatic adaptation to create the appearance of another desired illumination. Algorithms based on gamut mapping have the advantage of greater accuracy, but they need to be trained first. Simpler, unsupervised algorithms like Gray-world or Gray-edge are less accurate, but are easy to implement and have a low computation cost.

CCVW 2013
Oral Session
A number of algorithms like [10] estimate the illumination locally and then combine multiple local results into one global result, thus also producing a global illumination estimation.
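As a concrete illustration of this local-to-global scheme, the sketch below (our own minimal example, not the method of [10] nor the method proposed in this paper) estimates a global illumination vector with a gray-world style average and applies it through diagonal (von Kries) chromatic adaptation; all names and the synthetic image are ours:

```python
import numpy as np

def gray_world_estimate(img):
    """Estimate the global illumination direction as the per-channel mean
    (gray-world assumption: the average scene reflectance is achromatic)."""
    e = img.reshape(-1, 3).mean(axis=0)
    return e / np.linalg.norm(e)          # only the direction of e matters

def von_kries_correct(img, e):
    """Diagonal (von Kries) chromatic adaptation: divide each channel by
    its estimated illumination component, scaled so that a neutral
    estimate e = (1,1,1)/sqrt(3) leaves the image unchanged."""
    return np.clip(img / (e * np.sqrt(3)), 0.0, 1.0)

rng = np.random.default_rng(0)
# synthetic scene under a reddish illuminant
img = rng.random((4, 4, 3)) * np.array([1.0, 0.7, 0.5])
e = gray_world_estimate(img)
out = von_kries_correct(img, e)
```

After correction the per-channel means of the image are nearly equal, i.e. the reddish cast is removed.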
In this paper we propose a similar method based on Random Sprays Retinex (RSR) [11], an algorithm of the Retinex model, which deals with the locality of color perception [12], a phenomenon by which the HVS's perception of a color is influenced by the colors of adjacent areas in the scene. The algorithms of the Retinex model provide local white balancing and brightness adjustment, thus producing an enhanced image rather than a single vector. If the assumption of uniform illumination is taken, then a single illumination estimation vector can be created from the combined local estimations. RSR was chosen because it has the advantage of being faster than other path-wise Retinex algorithms [11].
The paper is structured as follows: in Section II a brief explanation of the Random Sprays Retinex algorithm is given, in Section III our proposed method for global illumination estimation is explained, and in Section IV the evaluation results are presented and discussed.
The Random Sprays Retinex algorithm was developed
by taking into consideration the mathematical description of
Fig. 2: Examples of images for various RSR parameters: (a) original image from the ColorChecker database,
(b) N = 1, n = 16, (c) N = 5, n = 20, (d) N = 20, n = 400.
Retinex provided in [13]. After simplifying the initial model, it can be proved that the lightness value of pixel i for a given channel can be calculated by using the formula

L(i) = (1/N) Σ_{k=1}^{N} I(i) / I(x_{H_k})    (2)

where I(i) is the initial intensity of pixel i, N is the number of paths and x_{H_k} is the index of the pixel with the highest intensity along the k-th path.
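A toy implementation of this formula can look as follows (our own sketch: real path-wise Retinex uses connected random walks over the image, while here a "path" is simplified to a random set of pixels ending at the target pixel):

```python
import numpy as np

def pathwise_lightness(I, N, path_len, rng):
    """Toy path-wise Retinex: for each pixel i, average over N random
    'paths' the ratio of I(i) to the highest intensity met on the path."""
    h, w = I.shape
    L = np.zeros_like(I, dtype=float)
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for _ in range(N):
                # simplification: a random set of pixels stands in for a path
                ys = rng.integers(0, h, size=path_len)
                xs = rng.integers(0, w, size=path_len)
                highest = max(I[ys, xs].max(), I[y, x])
                acc += I[y, x] / highest      # I(i) / I(x_Hk)
            L[y, x] = acc / N                 # (1/N) * sum over paths
    return L

rng = np.random.default_rng(42)
I = rng.random((8, 8)) + 0.1                  # single-channel test image
L = pathwise_lightness(I, N=10, path_len=20, rng=rng)
```

Because the highest intensity on a path is at least I(i), every lightness value lies in (0, 1].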
The next step towards RSR in [11] is to notice three reasons why paths should be replaced with something else: they are redundant, their ordering is completely unimportant, and they have an inadequate topological dimension. This leads to the use of 2-D objects as representations of the pixel neighbourhood that is taken into consideration when calculating the new pixel intensity. Random sprays are finally chosen as these 2-D objects, leading to several parameters that need to be tuned.
The value of the spray radius is set to be equal to the diagonal of the image. The identity function is taken as the radial density function. The minimal number of sprays (N) and the minimal number of pixels per spray (n), which represent a trade-off between result quality and computation cost, were determined to be 20 and 400, respectively [11]. Fig. 2(a) shows a test image from the ColorChecker image database
[8]. RSR-processed images for various parameters are shown in Fig. 2(b), Fig. 2(c) and Fig. 2(d).
A. Basic idea
From the original image and the RSR resulting image it is possible to calculate, for a pixel i, its relative intensity change for a given channel c using the equation

C_c(i) = I_c(i) / R_c(i)    (3)

where C_c(i) is the intensity change of pixel i for channel c, I_c(i) is the original intensity and R_c(i) is the intensity obtained by RSR. Considering the way it is calculated, the vector p(i) = [C_r(i), C_g(i), C_b(i)], composed of one pixel intensity change element for each channel, can be interpreted as the RSR local illumination estimation, and since it is not necessarily normalized, it also represents the local brightness adjustment. That means that p(i) can also be written as p(i) = w(i)p̂(i), where w(i) = ‖p(i)‖ is the norm of p(i) and p̂(i) is the unit vector with the same direction as p(i). The merged result of Eq. 3 for all channels, calculated by using the RSR result from Fig. 2(b), is shown in Fig. 3(a).
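In numpy, Eq. 3 and the decomposition p(i) = w(i)p̂(i) are a few lines (a sketch with our own names; R is a random stand-in array, not an actual RSR output):

```python
import numpy as np

rng = np.random.default_rng(1)
I = rng.random((6, 6, 3)) + 0.05    # original image
R = rng.random((6, 6, 3)) + 0.05    # stand-in for the RSR-processed image

# Eq. 3: per-pixel, per-channel relative intensity change;
# each C[y, x] is the vector p(i) = [Cr(i), Cg(i), Cb(i)].
C = I / R

# decompose each p(i) into a weight w(i) and a unit direction p_hat(i)
w = np.linalg.norm(C, axis=2)        # w(i) = ||p(i)||
p_hat = C / w[..., None]             # unit vector with the same direction
```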
Fig. 3: (a) Local intensity changes for the image shown in Fig. 2(a). (b) Local intensity changes after blurring the original and the RSR image.
As in some cases, like the one shown in Fig. 3(a), there is a higher level of visible noise when visualising the intensity changes, it might be good to lessen the noise in some way before further using the calculated changes. A good way of doing so is to apply a modification of Eq. 3:

C_{c,k}(i) = (I_c ∗ k)(i) / (R_c ∗ k)(i)    (4)

where k is a chosen kernel and ∗ is the convolution operator. By applying Eq. 4 with an averaging kernel of size 25 × 25 instead of Eq. 3, the result shown in Fig. 3(a) turns into the one shown in Fig. 3(b).
A simple way of obtaining the global illumination estimation now is to calculate the vector

e = Σ_{i=1}^{M} p(i) = Σ_{i=1}^{M} w(i) p̂(i)    (5)

where e is the final global illumination estimation vector and M is the number of pixels in the image. The division by M is omitted since we are only interested in the direction of e and not in its norm. It is obvious that vectors p(i) with a greater corresponding value of w(i) will have a greater impact on the final direction of vector e, so w(i) can also be interpreted as a weight. A simple alternative to Eq. 5 is to omit the weight w(i) and use only the unit vectors:

e = Σ_{i=1}^{M} p̂(i).    (6)
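Eq. 4 and Eq. 5 can be sketched together in a few lines (our own illustration: the box blur is written with plain numpy cumulative sums, I and R are stand-in arrays, and the 5 × 5 kernel size is an arbitrary choice for the example):

```python
import numpy as np

def box_blur(img, k):
    """Average each channel over a k x k window (edge padding keeps the
    window size constant at the borders)."""
    pad = k // 2
    out = np.empty_like(img, dtype=float)
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    h, w = img.shape[:2]
    for c in range(img.shape[2]):
        # 2-D moving average via a summed-area table
        cs = padded[:, :, c].cumsum(0).cumsum(1)
        cs = np.pad(cs, ((1, 0), (1, 0)))
        out[:, :, c] = (cs[k:k + h, k:k + w] - cs[:h, k:k + w]
                        - cs[k:k + h, :w] + cs[:h, :w]) / (k * k)
    return out

rng = np.random.default_rng(2)
I = rng.random((10, 10, 3)) + 0.05     # original image (stand-in)
R = rng.random((10, 10, 3)) + 0.05     # RSR output (stand-in)

C = box_blur(I, 5) / box_blur(R, 5)    # Eq. 4 with a 5x5 averaging kernel
e = C.reshape(-1, 3).sum(axis=0)       # Eq. 5: e = sum_i p(i)
e = e / np.linalg.norm(e)              # only the direction matters
```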
B. Pixel sampling
By looking at Fig. 3(b), it is clear that the difference between the intensity changes of spatially close pixels is not large. This can be taken advantage of by not calculating p(i) for all pixels, but only for some of them. For that purpose the row step r and the column step c are introduced, meaning that only every c-th pixel in every r-th row is processed. By doing so, the computation cost is reduced, which can have an important impact on speed when greater values for r and c are used.
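The row/column stepping amounts to strided indexing; in numpy (array shape and step values are our own example):

```python
import numpy as np

I = np.zeros((100, 120, 3))   # stand-in image

r, c = 10, 10                 # row step and column step
sampled = I[::r, ::c]         # only every c-th pixel in every r-th row

# the number of processed pixels drops by a factor of roughly r * c
```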
C. Parameters
The proposed method inherits all parameters from RSR, the two most important being the number of sprays N and the size of individual sprays n. Additional parameters include the filter kernel type and size, and r and c, the row and column steps for choosing the pixels for which to estimate the illumination. As all of these parameters potentially influence the final direction, tuning is necessary. Due to the many possible combinations of parameters, the parameters inherited from RSR and not mentioned in this subsection are set to the values they were tuned to in [11]. Also, in order to simplify further testing, only the averaging kernel was used, due to its simplicity, its low computation cost and the absence of any apparent disadvantage compared to other kernels in several simple tests (it even provided greater accuracy than Gaussian kernels).
D. Method’s name
As described in the previous subsection, the proposed method is designed to have an adjustable pixel sampling rate. As this allows the proposed method to "fly" over some pixels without visiting them, we decided to name it Color Sparrow (CS).
A. Image database and testing method
For testing the proposed method and tuning the parameters, the publicly available re-processed version [14] of the ColorChecker database described in [8] was used. It contains 568 different linear RGB images taken both outdoors and indoors, each containing a GretagMacbeth color checker used to calculate the groundtruth illumination data, which is provided together with the images. The positions of the whole color checker and of its individual color patches in each image are also provided, which allows the color checker to be simply masked out during the evaluation. As the error measure for white balancing, the angle between the RGB value of the groundtruth illumination and the estimated illumination of an image was chosen. It should be
Fig. 4: Performance of several parameter settings with respect to different kernel sizes. (a) Mean angle error. (b) Median angle error.
noted that because the images in the Color Checker database
are of medium variety [15], it might be necessary to retune
the parameters by using some other, larger databases in order
to obtain parameter values that would be a good choice for
images outside the Color Checker database.
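The angular error measure used in the evaluation can be written as follows (a standard formulation; the function name is ours):

```python
import numpy as np

def angular_error_deg(gt, est):
    """Angle in degrees between the groundtruth and estimated
    illumination vectors; invariant to the scale of either vector."""
    gt = np.asarray(gt, dtype=float)
    est = np.asarray(est, dtype=float)
    cos = np.dot(gt, est) / (np.linalg.norm(gt) * np.linalg.norm(est))
    # clip guards against round-off pushing the cosine outside [-1, 1]
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

# identical directions give zero error; scaling does not matter
err = angular_error_deg([1.0, 1.0, 1.0], [2.0, 2.0, 2.0])
```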
Earlier in this paper two ways were proposed to estimate the direction of the global illumination, represented by Eq. 5 and Eq. 6. As in the performed experiments the former equation slightly outperformed the latter, all results mentioned in this paper were obtained by using Eq. 5, which means that there was no normalization of the local intensity changes.
B. Tuning the kernel size
As explained before, the kernel type was fixed to the averaging kernel. Fig. 4 shows two aspects of the influence of the kernel size on the method's performance. All parameter settings have their r and c parameters set to 10 and N set to 1, while the value of n varies. The reason to use only one value for N is that raising it has an insignificant impact on the performance. The graphs in Fig. 4 show a clear performance difference between the application of Eq. 3, which uses no kernel, and Eq. 4. It is interesting to note that different kernel sizes have only a slight impact on the mean angular error, while the impact on the median error is greater. While the mean rises from 3.7° to 3.8° only when using the 205 × 205 kernel, the median rises from 2.8° to 2.9° when using kernels of size greater than 25 × 25. Therefore the kernel size should be in the range from 5 × 5 to 25 × 25 inclusive, and in further tests the kernel size 5 × 5 is used.
C. Tuning N and n parameters
Both Fig. 5 and Fig. 4 show the mean and median angle between the groundtruth global illumination and the global illumination estimation of CS, calculated on Shi's images with various values of the parameter n for several fixed combinations of the other parameters. It can be seen that the lowest mean angular error of 3.7° and the lowest median error of 2.8° are achieved for values of n between 200 and 250, and that this is almost invariant to the values of the other parameters. For that reason the result of tuning parameter n is the value 225. As in the previous tuning, using greater values of parameter N had only a slight impact in this case as well, and was therefore omitted. Performance for distinct image sizes was also tested by simply shrinking the original Shi's images, and there was no significant difference in the results.
D. Tuning r and c parameters
In order to retain the lowest achieved median and mean errors, the r and c parameter values can be raised up to 50 while at the same time setting n to 225, as shown in Fig. 5. It is interesting to mention that even setting r and c to 200 raises the mean angular error only to 3.8°. However, as the median is less stable, the values of r and c should not exceed 50 in order to avoid a significant loss of accuracy, as after this point the median angle error rises to 2.9°.
Fig. 5: Performance of several parameter settings with respect to different sampling rates. (a) Mean angle error. (b) Median angle error.

TABLE I: Performance of different color constancy methods in terms of mean (◦), median (◦), trimean (◦) and max (◦) angular error, for the methods: do nothing, Shades of gray, Intersection-based Gamut, Pixel-based Gamut and the proposed method.

E. Comparison to other algorithms
Table I shows the performance of various methods. For the Standard Deviation Weighted Gray-world (SDWGW) method the images were subdivided into 100 blocks. Color Sparrow was used with parameters N = 1, n = 225, r = 50, c = 50 and with an averaging kernel of size 5 × 5. The performance of the other mentioned methods was taken from [16], where the parameter values are also provided. The method described in [17] is not included because the comparison would not be fair, since that method was not designed for images recorded under an assumed single light source. It can be seen that CS outperforms many methods and is therefore a suitable choice for white balancing.

F. Speed comparison
As speed is an important property of white balancing algorithms, especially in digital cameras that have lower computation power, a speed test was also performed for CS. To put the result into context, a speed test was also performed for the Gray-world algorithm, because it is one of the simplest and fastest white balancing algorithms. For both speed tests a C++ implementation of the algorithms was used, on the Windows 7 operating system on a computer with an i5-2500K CPU, with only one core used. Color Sparrow was again used with parameters N = 1, n = 225, r = 50, c = 50 and with an averaging kernel of size 5 × 5. The algorithms were run several times on 100 images from Shi's version of the ColorChecker database. The average time over several runs of the Gray-world algorithm was 2.9 s for 100 images, while the average time for CS was 3.03 s. These results show that even though CS is slower, it performs almost as fast as Gray-world, but with greater accuracy.
Color Sparrow is a relatively fast and accurate method for white balancing. Even though at its core it is a modification of RSR, it calculates only the global illumination estimation, and it also outmatches several other white balancing methods. It has the advantage of being unsupervised and of performing well at lower sampling rates, allowing a lower computation cost. This leads to the conclusion that using RSR for global illumination estimation has good potential as a fast and accurate unsupervised color constancy method. However, as the tests used in this paper were relatively simple, more exhaustive tests need to be performed in order to see if the accuracy can be improved even further. In the future it would be good to test the proposed method on other color constancy databases and to perform experiments with other types of areas around particular pixels.
The authors acknowledge the advice of Arjan Gijsenij on
proper testing and comparison of color constancy algorithms.
This work has been supported by the IPA2007/HR/16IPO/001-040514 project "VISTA - Computer Vision Innovations for Safe Traffic."
[1] M. Ebner, Color constancy. Wiley, 2007, vol. 6.
[2] G. Buchsbaum, “A spatial processor model for object colour perception,” Journal of the Franklin institute, vol. 310, no. 1, pp. 1–26, 1980.
[3] G. D. Finlayson and E. Trezzi, "Shades of gray and colour constancy," in Proc. IS&T/SID 12th Color Imaging Conference, 2004, pp. 37–41.
[4] J. Van De Weijer, T. Gevers, and A. Gijsenij, “Edge-based color
constancy,” Image Processing, IEEE Transactions on, vol. 16, no. 9,
pp. 2207–2214, 2007.
[5] G. Finlayson, S. Hordley, and I. Tastl, “Gamut constrained illuminant
estimation,” International Journal of Computer Vision, vol. 67, no. 1,
pp. 93–109, 2006.
[6] V. C. Cardei, B. Funt, and K. Barnard, “Estimating the scene illumination chromaticity by using a neural network,” JOSA A, vol. 19, no. 12,
pp. 2374–2386, 2002.
[7] J. Van De Weijer, C. Schmid, and J. Verbeek, “Using high-level visual
information for color constancy,” in Computer Vision, 2007. ICCV 2007.
IEEE 11th International Conference on. IEEE, 2007, pp. 1–8.
[8] P. V. Gehler, C. Rother, A. Blake, T. Minka, and T. Sharp, “Bayesian
color constancy revisited,” in Computer Vision and Pattern Recognition,
2008. CVPR 2008. IEEE Conference on. IEEE, 2008, pp. 1–8.
[9] M. Šavc, D. Zazula, and B. Potočnik, “A Novel Colour-Constancy
Algorithm: A Mixture of Existing Algorithms,” Journal of the Laser
and Health Academy, vol. 2012, no. 1, 2012.
[10] H.-K. Lam, O. Au, and C.-W. Wong, “Automatic white balancing using
standard deviation of RGB components,” in Circuits and Systems, 2004.
ISCAS’04. Proceedings of the 2004 International Symposium on, vol. 3.
IEEE, 2004, pp. III–921.
[11] E. Provenzi, M. Fierro, A. Rizzi, L. De Carli, D. Gadia, and D. Marini,
“Random spray retinex: a new retinex implementation to investigate the
local properties of the model,” Image Processing, IEEE Transactions on,
vol. 16, no. 1, pp. 162–171, 2007.
[12] E. H. Land, J. J. McCann et al., “Lightness and retinex theory,” Journal
of the Optical society of America, vol. 61, no. 1, pp. 1–11, 1971.
[13] E. Provenzi, L. De Carli, A. Rizzi, and D. Marini, “Mathematical
definition and analysis of the Retinex algorithm,” JOSA A, vol. 22,
no. 12, pp. 2613–2621, 2005.
[14] L. Shi and B. Funt. (2013, May) Re-processed Version of the Gehler Color Constancy Dataset of 568 Images. [Online]. Available:
[15] A. Gijsenij, T. Gevers, and J. Van De Weijer, “Computational color constancy: Survey and experiments,” Image Processing, IEEE Transactions
on, vol. 20, no. 9, pp. 2475–2489, 2011.
[16] A. Gijsenij, T. Gevers, and J. van de Weijer. (2013, May) Color Constancy — Research Website on Illuminant Estimation. [Online]. Available:
[17] A. Gijsenij, R. Lu, and T. Gevers, “Color constancy for multiple light
sources,” Image Processing, IEEE Transactions on, vol. 21, no. 2, pp.
697–707, 2012.
Combining spatio-temporal appearance descriptors
and optical flow for human action recognition in
video data
Karla Brkić∗, Srđan Rašić∗, Axel Pinz‡, Siniša Šegvić∗ and Zoran Kalafatić∗
∗University of Zagreb, Faculty of Electrical Engineering and Computing, Unska 3, 10000 Zagreb, Croatia
Email: [email protected]
‡ Graz University of Technology, Inffeldgasse 23/II, A-8010 Graz, Austria
Email: [email protected]
Abstract—This paper proposes combining spatio-temporal appearance (STA) descriptors with optical flow for human action
recognition. The STA descriptors are local histogram-based
descriptors of space-time, suitable for building a partial representation of arbitrary spatio-temporal phenomena. Because of
the possibility of iterative refinement, they are interesting in the
context of online human action recognition. We investigate the use
of dense optical flow as the image function of the STA descriptor
for human action recognition, using two different algorithms
for computing the flow: the Farnebäck algorithm and the TVL1 algorithm. We provide a detailed analysis of the influencing
optical flow algorithm parameters on the produced optical flow
fields. An extensive experimental validation of optical flow-based
STA descriptors in human action recognition is performed on
the KTH human action dataset. The encouraging experimental
results suggest the potential of our approach in online human
action recognition.
Human action recognition is one of the central topics of interest in computer vision, given its wide applicability in human-computer interfaces (e.g. Kinect [1]), assistive technologies [2] and video surveillance [3]. Although action recognition can be done from static images, the focus of current research is on using video data. Videos offer insight into the dynamics of the observed behavior, but at the price of increased storage and processing requirements. Efficient descriptors that compactly represent the video data of interest are therefore a necessity.
This paper proposes combining the spatio-temporal appearance (STA) descriptors [4] with two variants of dense optical flow for human action recognition. STA descriptors are a family of histogram-based descriptors that efficiently integrate temporal information from frame to frame, producing a fixed-length representation that is refined as new information arrives. As such, they are suitable for building a model of an action online, while the action is still happening, i.e. before all frames of the action are available. The topic of building a refinable action model is under-researched and of great practical importance, especially in video surveillance and assistive technologies, where it is important to raise an alarm as soon as possible if an action is classified as dangerous.
Originally, STA descriptors were built on a per-frame basis, by calculating and histogramming the values of an arbitrary image function (e.g. hue, gradient) over a regular grid within a region of interest in the frame. We propose extending this concept to pairs of frames, so that optical flow computed between pairs of frames is used as an image function whose values are histogrammed within the STA framework. We
consider two variants of dense optical flow, the Farnebäck
optical flow [5], and the TV-L1 optical flow [6].
There is a large body of research concerning human action
recognition, covered in depth by a considerable number of
survey papers (e.g. [7], [8], [9], [10]).
Poppe [8] divides methods for representing human actions
into two categories: local and global. In local representation
methods, the video containing the action is represented as a
collection of mutually independent patches, usually calculated
around space-time interest points. In global representation
methods, the whole video is used in building the representation. An especially popular category of local representation methods is based on a generalization of the 2D bag-of-visual-words framework [11] to spatio-temporal data. They model
local appearance, often using optical flow, and commonly use
histogram-based features similar to HOG [12] or SIFT [13].
In this overview we focus on these methods, as they bear the
most resemblance to our approach.
The standard processing chain used by the bag-of-visual-words methods is summarized in [14]. It includes finding spatio-temporal interest points, extracting local spatio-temporal volumes around these points, representing them as features and using these features in a bag-of-words framework for classification. The interest point detectors and features used are most commonly generalizations of well-known 2D detectors and features. For instance, one of the earliest spatio-temporal interest point detectors, proposed by Laptev and Lindeberg [15], is a spatio-temporal extension of the Harris corner
detector [16]. Another example is the Kadir-Brady saliency
[17], extended to the temporal domain by Oikonomopoulos et
al. [18]. Willems et al. [19] introduce a dense scale-invariant
spatio-temporal interest point detector, a spatio-temporal counterpart of the Hessian saliency measure. However, there are
space-time-specific interest point detectors as well, such as
the one proposed by Dollár et al. [20].
Representations of spatio-temporal volumes extracted
around interest points are typically histogram-based. For example, Dollár et al. [20] propose three methods of representing
volumes as features: (i) by simply flattening the grayscale
values in the volume into a vector, (ii) by calculating the
histogram of grayscale values in the volume or (iii) by dividing
the volume into a number of regions, constructing a local
histogram for each region and then concatenating all the histograms. Laptev et al. [21] propose dividing each volume into
a regular grid of cuboids using 24 different grid configurations,
and representing each cuboid by normalized histograms of
oriented gradient and optical flow. Willems et al. [19] build
on the work of Bay et al. [22] and represent spatio-temporal
volumes using a spatio-temporal generalization of the SURF
descriptor. In the generalization, they approximate Gaussian
derivatives of the second order with their box filter equivalents
in space-time, and use integral video for efficient computation.
Wang et al. [14] present a detailed performance comparison
of several human action recognition methods outlined here.
The actions are represented using combinations of six different
local feature descriptors, and either three different spatio-temporal interest point detectors or dense sampling instead
of using interest points. Three datasets are used for the experiments: the KTH action dataset [23], the UCF sports action
dataset [24] and the highly complex Hollywood2 action dataset
[25]. The best results are 92.1% for the KTH dataset (Harris3D
+ HOF), 85.6% for the UCF sports dataset (dense sampling
+ HOG3D), and 47.4% for the Hollywood2 dataset (dense
sampling + HOG/HOF). Experiments indicate that regular
sampling of space-time outperforms interest point detectors.
Also note that histograms of optical flow (HOF) are the best-performing features in two of the three datasets.
The use of STA descriptors [4] can offer a somewhat different perspective on the problem of human action recognition.
To build an STA descriptor, one needs a person detector and
a tracker, as the STA algorithm assumes that bounding boxes
around the object of interest are known. Although this is a
shortcoming when compared with the outlined bag-of-visual-words methods, which do not need any kind of information about the position of the human, the STA descriptors come with an out-of-the-box capability of building descriptions of partial actions, and are therefore worth considering. The bag-of-visual-words methods could also be generalized to support
partial actions, but the generalization is not straightforward.
A. Spatio-temporal appearance descriptors
Spatio-temporal appearance (STA) descriptors [4] are fixed-length descriptors that represent a series of temporally related
regions of interest at a given point in time. Two variants exist:
STA descriptors of the first order (STA1) and STA descriptors
of the second order (STA2). Let us assume that we have a
series of regions of interest defined as bounding boxes of a
human performing an action. The action need not be complete:
assume that the total duration of the action is T , and we have
seen t < T frames. To build an STA descriptor, one first
divides each available region of interest into a regular grid of
rectangular patches. The size of the grid, m × n (m is the
number of grid columns and n is the number of grid rows), is
a parameter of the algorithm. One then calculates an arbitrary
image function (e.g. hue, gradient) for all the pixels of each
patch, and represents the distribution of the values of this
function over the patch by a k1-bin histogram. Therefore, for each available region of interest, one obtains an m × n grid of k1-bin histograms, called grid histograms. Let g(θ) denote an m × n × k1 vector that contains the concatenated histogram frequencies of all the m × n histograms of the grid at time θ, called the grid vector. The STA1 descriptor at time t is
obtained by weighted averaging of grid vectors,

STA1(t) = Σ_{θ=1}^{t} α_θ g(θ) . (1)
As a weighted average, the STA1 descriptor is an adequate
representation of simpler spatio-temporal phenomena, but fails
to capture the dynamics of complex behaviors such as human
actions. When averaging, the information on the distribution of
relative bin frequencies is lost. The STA2 descriptor solves this
problem, by explicitly modeling the distribution of each grid
histogram bin value over time. Let the vector ci , called the
component vector, be a vector of values of the i-th component
g(θ) (i) of the grid vector g(θ) up to and including time t,
1 ≤ θ ≤ t:
ci = g(1) (i), g(2) (i), g(3) (i), . . . , g(t) (i) .
The STA2 descriptor at time t is obtained by histogramming the m × n × k1 component vectors, so that each component vector is represented by a k2-bin histogram, called the STA2 histogram. To obtain the final descriptor, one concatenates the bin frequencies of the m × n × k1 STA2 histograms into a feature vector,

STA2(t) = ( Hk2(c1), Hk2(c2), . . . , Hk2(cmnk1) ) . (3)

Here the notation Hk2(c) indicates a function that builds a k2-bin histogram of the values contained in the vector c and returns a vector of histogram bin frequencies. Note that the STA1 descriptor has a length of m × n × k1 components, while the STA2 descriptor has a length of m × n × k1 × k2 components.
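To make the construction above concrete, the STA1 and STA2 descriptors can be sketched in NumPy. This is a simplified sketch assuming uniform averaging weights αθ = 1/t and image-function values normalized to [0, 1]; the function names are our own, not from [4].

```python
import numpy as np

def grid_vector(values, m, n, k1, rng=(0.0, 1.0)):
    """Build the grid vector g(theta): an (m*n*k1,) vector of concatenated
    k1-bin histogram frequencies over an m x n grid (m columns, n rows).
    `values` is a 2D array of image-function values over the region of interest."""
    h, w = values.shape
    rows = np.array_split(np.arange(h), n)
    cols = np.array_split(np.arange(w), m)
    parts = []
    for r in rows:
        for c in cols:
            patch = values[np.ix_(r, c)]
            hist, _ = np.histogram(patch, bins=k1, range=rng)
            parts.append(hist / max(patch.size, 1))  # relative bin frequencies
    return np.concatenate(parts)

def sta1(grid_vectors):
    """STA1(t): weighted average of the grid vectors (uniform weights here)."""
    return np.mean(grid_vectors, axis=0)

def sta2(grid_vectors, k2):
    """STA2(t): histogram each component's trajectory over time with k2 bins."""
    G = np.asarray(grid_vectors)          # shape (t, m*n*k1)
    out = []
    for i in range(G.shape[1]):
        hist, _ = np.histogram(G[:, i], bins=k2, range=(0.0, 1.0))
        out.append(hist / G.shape[0])
    return np.concatenate(out)
```

For an m × n grid with k1 grid-histogram bins, `sta1` yields m·n·k1 components and `sta2` yields m·n·k1·k2, matching the descriptor lengths stated above.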
B. Farnebäck and TV-L1 optical flow
We consider the following two algorithms for estimating dense optical flow: the algorithm of Farnebäck [5] and the TV-L1 algorithm [6].
Farnebäck [5] proposes an algorithm for estimating dense
optical flow based on modeling the neighborhoods of each
pixel by quadratic polynomials. The idea is to represent the
image signal in the neighborhood of each pixel by a 3D
surface, and determine optical flow by finding where the
surface has moved in the next frame. The optimization is not
done on a pixel level, but rather on a neighborhood level, so that the optimum displacement is found both for the pixel and its neighbors.
The TV-L1 optical flow of Zach et al. [6] is based on a robust formulation of the classical Horn and Schunck approach [26]. It allows for discontinuities in the optical flow field and robustness to image noise. The algorithm efficiently minimizes a functional containing a data term using the L1 norm and a regularization term using the total variation of the flow [27].
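For illustration, both flow fields can be computed with OpenCV, the library used for the experiments below; this is a minimal sketch on a synthetic image pair. The parameter values are illustrative, and since the TV-L1 factory function has moved between OpenCV versions, the sketch probes for it rather than assuming one location.

```python
import numpy as np

try:
    import cv2
except ImportError:  # OpenCV not installed; skip the demo gracefully
    cv2 = None

def dense_flows(prev, nxt):
    """Return (farneback_flow, tvl1_flow), each an HxWx2 float array.
    tvl1_flow is None if no TV-L1 implementation is available."""
    fb = cv2.calcOpticalFlowFarneback(
        prev, nxt, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.1, flags=0)
    creator = getattr(cv2, "createOptFlow_DualTVL1", None)
    if creator is None:  # newer builds keep TV-L1 in the contrib module
        creator = getattr(getattr(cv2, "optflow", None),
                          "createOptFlow_DualTVL1", None)
    tvl1 = creator().calc(prev, nxt, None) if creator is not None else None
    return fb, tvl1

if cv2 is not None:
    # synthetic pair: a bright square shifted two pixels to the right
    prev = np.zeros((64, 64), np.uint8)
    prev[20:40, 20:40] = 255
    nxt = np.roll(prev, 2, axis=1)
    fb_flow, tvl1_flow = dense_flows(prev, nxt)
```

Each returned field assigns a (dx, dy) displacement vector to every pixel of the first image, which is exactly the image function fed into the STA framework below.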
C. Combining STAs and optical flow
The STA descriptors are built using grids of histograms of values of an arbitrary image function. Optical flow is an image function that considers pairs of images, and assigns a vector to each pixel of the first image. The main issue to be addressed in building spatio-temporal appearance descriptors that use optical flow is how to build histograms of vectors. This issue is well-known, and addressed in e.g. the HOG [12] and SIFT [13] descriptors. The idea is to divide the 360° interval of possible vector orientations into the desired number of bins, and then count the number of vectors falling into each bin. In HOG and SIFT descriptors, the vote of each vector is additionally weighted by its magnitude, so that vectors with greater magnitudes bear more weight. When building optical flow histograms, it is interesting to consider both the optical flow orientation histograms weighted by magnitude, and the “raw” optical flow orientation histograms, where there is no weighting by magnitude, i.e. all orientation votes are considered equal, as depending on the data the magnitude information can often be noisy.

Note that when combining STA descriptors with optical flow, there is an inevitable lack of flow information for the last frame of the sequence, as there are no subsequent frames. If one wishes to represent the entire sequence, the problem of the lack of flow can be easily solved by building the STA descriptors in the frame before last.

In our experiments, we consider the standard benchmark KTH action dataset and first investigate the effect of optical flow parameters on the produced optical flow and the descriptivity of the derived STA descriptors. Using the findings from these experiments, we then use the best parameters of the flow and optimize the parameters of the STA descriptors to arrive at the final performance estimate. Our experiments include only the STA descriptors of the second order (STA2), as STA1 descriptors are of insufficient complexity to capture the dynamics of human actions.

A. The KTH action dataset

The KTH human action dataset [23] consists of videos of 25 actors performing six actions in four different scenarios. The actions are walking, jogging, running, boxing, hand waving and hand clapping, and the scenarios are outdoors, outdoors with scale variation, outdoors with different clothes and indoors. The videos are grayscale and of a low resolution (160 × 120 pixels). There is some variation in the viewpoint and in the speed of performing the actions among actors, while the background is mostly static and homogeneous. A few example frames are shown in Fig. 1.

Fig. 1: A few example frames from the KTH dataset: (a) walking, (b) jogging, (c) running, (d) boxing, (e) waving, (f) clapping.

Fig. 2: An illustration of the optical flow coloring scheme that we use.

The sequences from the KTH action dataset do not come with per-frame bounding box annotations. Therefore, in this paper we use the publicly available annotations of Lin et al. [28]. These annotations were obtained automatically, using a HOG detector, and are quite noisy, often only partially enclosing the human.

B. Optimizing the parameters of optical flow

In order to compute the optical flow fields necessary for building the STA descriptors, we used the implementations of the Farnebäck and TV-L1 optical flow from OpenCV 2.4.5 [29]. Both implementations have a number of parameters that can influence the output optical flow and, in the end, classification performance. To evaluate the influence of individual optical flow parameters on the overall descriptivity of the representation, we set up a simple test environment based on a support vector machine (SVM) classifier. We fixed the parameters of the STA descriptor, using a grid of 8 × 6 patches, the number of grid histogram bins k1 = 8 and the number of STA2 histogram bins k2 = 5. We then performed a series of experiments where we built a number of STA2 descriptors of the data, varying the values of a single optical flow parameter for each descriptor built. The parameter evaluation procedure can be summarized
as follows:
(i) For the given parameter set of the optical flow algorithm,
calculate the STA2 descriptors over all sequences of the
KTH action dataset.
(ii) Train the SVM classifier using 25-fold cross-validation, so that in each iteration the sequences of one person are used for testing, and the sequences of the remaining 24 persons for training.
(iii) Record cross-validation performance.
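Step (ii) above is a leave-one-person-out protocol: grouping the sequences by actor yields the 25 folds. A minimal sketch of the fold generation follows; the sequence records and actor ids are toy placeholders, and the classifier training itself is omitted.

```python
from collections import defaultdict

def person_folds(sequences):
    """Yield (person, train, test) splits where each test set holds all
    sequences of exactly one person (leave-one-person-out cross-validation)."""
    by_person = defaultdict(list)
    for seq in sequences:
        by_person[seq["person"]].append(seq)
    for person, test_set in sorted(by_person.items()):
        train_set = [s for p, group in by_person.items()
                     if p != person for s in group]
        yield person, train_set, test_set

# toy example: 3 persons, each with 2 sequences (the real dataset has 25 persons)
seqs = [{"person": p, "action": a} for p in range(3) for a in ("walk", "box")]
folds = list(person_folds(seqs))
```

With the full dataset the same generator produces the 25 folds of the protocol, one per actor.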
We used an OpenCV implementation of a linear SVM classifier, with termination criteria of either 10^5 iterations or an error tolerance of 10^−12. Other classifier parameters were
set to OpenCV defaults, as the goal was not to optimize
performance, but to use it as a comparison measure for
different optical flow parameters. The optical flow histograms
were built both with and without magnitude-based weighting
of orientation votes. The obtained results were slightly better
when using weighting, so the use of weighting is assumed in
all experiments presented here.
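The two histogram variants, magnitude-weighted and “raw”, can be sketched as follows; this is a NumPy sketch of the standard construction, not the exact implementation used in the experiments.

```python
import numpy as np

def flow_orientation_histogram(flow, bins, weight_by_magnitude=True):
    """Histogram flow-vector orientations over [0, 360) degrees.
    `flow` is an (H, W, 2) array of per-pixel (dx, dy) displacements."""
    dx, dy = flow[..., 0].ravel(), flow[..., 1].ravel()
    angles = np.degrees(np.arctan2(dy, dx)) % 360.0  # orientation of each vector
    mags = np.hypot(dx, dy)
    # weighted variant: each vote counts proportionally to its magnitude;
    # "raw" variant: every vote counts equally
    weights = mags if weight_by_magnitude else None
    hist, _ = np.histogram(angles, bins=bins, range=(0.0, 360.0), weights=weights)
    total = hist.sum()
    return hist / total if total > 0 else hist
```

Setting `weight_by_magnitude=False` gives the “raw” variant in which every vector casts an equal vote regardless of its length.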
1) Farnebäck optical flow: For the Farnebäck algorithm,
we evaluated the influence of three parameters: the averaging
window size w, the size of the pixel neighborhood considered
when finding the polynomial expansion in each pixel, s, and the
standard deviation σ of the Gaussian used to smooth derivatives in the polynomial expansion. The remaining parameters
were set to their default OpenCV values.
Visualizations of the computed optical flow when the parameters are varied, along with the corresponding obtained
SVM recognition rates, are shown in Fig. 3, on an example
frame of a running action. In this figure, we adhere to the color
scheme for the visualization of optical flow proposed by Sánchez
et al. [27], where each amplitude and direction of optical flow
is assigned a different color (see Fig. 2). Subfigures 3 (a)-(c)
show the influence of the change of parameter w, with fixed
values of s and σ, subfigures 3 (d)-(f) show the influence
of the change of parameter s, with fixed values of w and σ,
and subfigures 3 (g)-(i) show the influence of the change of
parameter σ, with fixed values of w and s. It can be seen that
the changes along any of the three parameter axes significantly
impact the derived optical flow field. Still, the recognition rate
in all cases is close to 80%, except when the averaging window
size is set to 1, resulting in a recognition rate of 69%. As
illustrated in Fig. 3 (a), the optical flow in that case is present
only in pixels very close to the human contour, which seems to
be insufficient information to build an adequately descriptive
representation. Using too large an averaging window (Fig. 3 (c))
results in noise, and seems to cause a drop in performance.
The size of the considered pixel neighborhood does not seem
to have a considerable impact on performance, indicating the
robustness of the Farnebäck method. Increasing σ smooths the
derived optical flow, but this blurring lowers performance, so
the values of σ should be kept low.
2) TV-L1 optical flow: For the TV-L1 optical flow, we
consider the following parameters: the weight parameter for
the data term λ, the tightness parameter θ, and the time step
of the numerical scheme τ .
(a) w = 1 (69.4%)
(b) w = 2 (81.1%)
(c) w = 5 (79.7%)
(d) s = 3 (80.9%)
(e) s = 5 (81.1%)
(f) s = 7 (81.0%)
(g) σ = 1.1 (81.1%)
(h) σ = 1.3 (79.4%)
(i) σ = 1.7 (78.0%)
Fig. 3: Obtained Farnebäck optical flow fields and corresponding recognition rates (in parentheses) for varying values of: (a)-(c) parameter w (s = 5, σ = 1.1); (d)-(f) parameter s (w = 2, σ = 1.1); (g)-(i) parameter σ (w = 2, s = 5).
Fig. 4 shows visualizations of the computed optical flow
when the parameters are varied, along with the corresponding
obtained SVM recognition rates. Subfigures 4 (a)-(c) show the
influence of the change of parameter λ, with fixed values of
θ and τ, subfigures 4 (d)-(f) show the influence of the change
of parameter θ, with fixed values of λ and τ , and subfigures
4 (g)-(i) show the influence of the change of parameter τ ,
with fixed values of λ and θ. The computed optical flow is
noticeably crisper than when using the Farnebäck method, and
there is significantly more noise in the background. Setting
a mid-range value of the parameter λ, which influences the
smoothness of the derived optical flow, seems to provide
the optimum balance between reducing background noise and
retaining optical flow information. Parameter θ should be kept
low, as increasing it results in blurring of the derived flow and
loss of information on the contour of the human. Parameter τ
does not seem to significantly impact performance, but a mid-range value performs best. Overall, Fig. 4 suggests that STA
descriptors benefit from a crisp optical flow around the contour
of the human and as little background noise as possible.
C. Optimizing the parameters of STA descriptors
Based on our analysis of the properties of individual optical
flow parameters for the Farnebäck and the TV-L1 algorithm,
we were able to select a good parameter combination that
should result in solid classification performance. Still, one
should also consider optimizing the parameters of the STA
descriptors, which were fixed in the previous experiment (grid
size of 8 × 6, k1 = 8, k2 = 5). Additionally, one should
optimize the parameters of the used classifier in order to
obtain the optimum performance estimate. We considered
simultaneously optimizing STA and classifier parameters with
the goal of finding the best cross-validation performance
on the KTH action dataset, using STA2 descriptors built
using either Farnebäck or TV-L1 optical flow. We fixed the
parameters of the Farnebäck and the TV-L1 algorithms to
the best-performing ones, as found in the previous section
(Farnebäck: w = 2, s = 5, σ = 1.1, TV-L1: λ = .05,
θ = .1, τ = .15). A Cartesian product of the following STA
parameter values was considered: m = {3, 6}, n = {6, 8},
k1 = {4, 5, 8}, k2 = {5, 8}. For each parameter combination,
we performed 25-fold cross-validation multiple times, doing
an exhaustive search over classifier parameter space to obtain
optimum classifier parameters. Due to the heavy computational load involved, we switched from the SVM to a random forest classifier, because it is faster to train and, in our experiments, offered only slightly lower performance than the SVM. We optimized the
number of trees, the number of features and depth of the
random forest classifier. A custom implementation based on
the Weka library was used [30]. We repeated the experiments
for both Farnebäck and TV-L1-based STA descriptors.
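The sweep over the Cartesian product of the STA parameter values above amounts to 2 × 2 × 3 × 2 = 24 configurations; its enumeration can be sketched as follows, where the scoring function is a placeholder standing in for the cross-validated recognition rate.

```python
from itertools import product

def sta_parameter_grid():
    """Enumerate the STA2 parameter combinations considered in the text."""
    m_vals, n_vals = (3, 6), (6, 8)
    k1_vals, k2_vals = (4, 5, 8), (5, 8)
    return [dict(m=m, n=n, k1=k1, k2=k2)
            for m, n, k1, k2 in product(m_vals, n_vals, k1_vals, k2_vals)]

def best_configuration(grid, evaluate):
    """Return the configuration maximizing a given score."""
    return max(grid, key=evaluate)

grid = sta_parameter_grid()
# placeholder score: prefer shorter descriptors (m*n*k1*k2 components),
# echoing the stated preference for shorter representations; the real sweep
# would score each configuration by cross-validation performance
best = best_configuration(grid, lambda c: -c["m"] * c["n"] * c["k1"] * c["k2"])
```
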
The best recognition rate obtained when using Farnebäck-based descriptors was 82.4%, achieved for an 8 × 6 grid, grid
histograms of 8 bins and STA2 histograms of 5 bins. The same
recognition rate is obtained when using 8 STA2 histogram
bins, but we favor shorter representations. The confusion table
for the best-performing Farnebäck-based descriptor is shown
in Table I. Notice how the confusion centers around two
TABLE I: The confusion table for the best-performing classifier that uses STA2 feature vectors based on the Farnebäck optical flow to represent KTH action videos. Vertical axis: the correct class label, horizontal axis: distribution over predicted labels.

Fig. 4: Obtained TV-L1 optical flow fields and corresponding recognition rates (in parentheses) for varying values of: (a)-(c) λ (τ = 0.25, θ = 0.3); (d)-(f) θ (τ = 0.25, λ = 0.15); (g)-(i) τ (λ = 0.15, θ = 0.3). Subfigure values: (a) λ = .01 (80.7%), (b) λ = .05 (82.2%), (c) λ = .30 (79.4%), (d) θ = .1 (81.9%), (e) θ = .3 (80.1%), (f) θ = .5 (79.5%), (g) τ = .05 (79.6%), (h) τ = .15 (80.1%), (i) τ = .35 (79.4%).
TABLE II: The confusion table for the best-performing classifier that uses STA2 feature vectors based on the TV-L1 optical
flow to represent KTH action videos. Vertical axis: the correct
class label, horizontal axis: distribution over predicted labels.
groups: boxing, clapping and waving, and jogging, running
and walking. Clapping, waving and boxing are visually similar,
as they all include arm movement and static legs. Although
in the KTH sequences the boxing action is filmed with the
person facing sideways, so the arm movement should occur
only on one side of the bounding box, due to noisy annotations
it is common that the bounding box of a clapping or a waving
action includes only one arm of a person, resulting in a similar
motion pattern. The jogging, running, and walking actions are
also similar, due to the movement of the legs and the general
motion of the body. The greatest confusion is between jogging
and running, which is understandable given the variations of
performing these actions among actors. Some actors run very
similarly to the jogging of others, and vice versa.
For the TV-L1-based STA descriptors, the best obtained
recognition rate was 81.6%, obtained for a 3 × 6 grid, with
8-bin grid histograms and 5-bin STA2 histograms. As less
flow information is generated when using TV-L1, the STA
descriptors seem to benefit from using a coarser grid than
in the Farnebäck case. The confusion table for the best-performing TV-L1-based descriptor is shown in Table II.
Again, the most commonly confused classes can be grouped
into two groups: boxing, clapping and waving, and jogging,
running and walking. The Farnebäck-based descriptors seem
to be better in separating examples among these two groups
(e.g. when using Farnebäck-based descriptors no jogging actions were classified as boxing, clapping or waving, while when using TV-L1-based descriptors 10 of them were).
In this paper we proposed an extension of the STA descriptors with optical flow and applied the concept to the problem of
human action recognition. A detailed experimental evaluation
of two different optical flow algorithms has been provided,
with an in-depth study of the properties of individual parameters of each algorithm. We obtained encouraging performance
rates, with the descriptors based on the Farnebäck optical
flow performing slightly better than the descriptors based on
TV-L1. The obtained results suggest that a combination of
STA descriptors and optical flow could be used as a feasible
representation in a setting that requires partial action models.
In [31], similar performance (83.4%) on the KTH action
dataset was obtained using gradient-based STA descriptors,
and that approach generalized well to a partial action setting.
Although better performance rates have been obtained on the
KTH dataset [32], our approach is simple, easily extended
to other applications, and suitable for building a refinable
representation. Therefore, we believe that it merits further investigation.
In future work, we plan to obtain better annotations of
humans in video in hopes of improving the overall performance, explore the suitability of optical flow-based STAs to
partial action data, and train an SVM classifier with optimized
parameters and compare it with the random forest classifier.
This research has been supported by the Research Centre
for Advanced Cooperative Systems (EU FP7 #285939).
[1] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore,
A. Kipman, and A. Blake, “Real-Time Human Pose Recognition
in Parts from Single Depth Images,” in Proc. CVPR, 2011.
[2] R. Manduchi, S. Kurniawan, and H. Bagherinia, “Blind guidance
using mobile computer vision: A usability study,” in ACM
SIGACCESS Conference on Computers and Accessibility (ASSETS),
2010.
[3] C. Chen and J.-M. Odobez, “We are not contortionists: Coupled adaptive
learning for head and body orientation estimation in surveillance video,”
in Proc. CVPR, 2012, pp. 1544–1551.
[4] K. Brkić, A. Pinz, S. Šegvić, and Z. Kalafatić, “Histogram-based description of local space-time appearance,” in SCIA, ser.
Lecture Notes in Computer Science, A. Heyden and F. Kahl,
Eds., vol. 6688. Springer, 2011, pp. 206–217.
[5] G. Farnebäck, “Two-frame motion estimation based on polynomial
expansion,” in Proceedings of the 13th Scandinavian Conference on
Image Analysis, ser. SCIA’03. Berlin, Heidelberg: Springer-Verlag,
2003, pp. 363–370.
[6] C. Zach, T. Pock, and H. Bischof, “A duality based approach for real-time TV-L1 optical flow,” in Proceedings of the 29th DAGM Conference on Pattern Recognition. Berlin, Heidelberg: Springer-Verlag, 2007.
[7] V. Krüger, D. Kragić, A. Ude, and C. Geib, “The Meaning of Action: A
Review on action recognition and mapping,” Advanced Robotics, vol. 21,
no. 13, pp. 1473–1501, 2007.
[8] R. Poppe, “A survey on vision-based human action recognition,” Image
and Vision Computing, vol. 28, no. 6, pp. 976 – 990, 2010.
[9] D. Weinland, R. Ronfard, and E. Boyer, “A survey of vision-based methods for action representation, segmentation and recognition,” Computer
Vision and Image Understanding, vol. 115, no. 2, pp. 224 – 241, 2011.
[10] J. Aggarwal and M. Ryoo, “Human activity analysis: A review,” ACM
Computing Surveys, vol. 43, no. 3, pp. 16:1–16:43, 2011.
[11] G. Csurka, C. R. Dance, L. Fan, J. Willamowski, and C. Bray, “Visual
categorization with bags of keypoints,” in Proceedings of the European
Conference on Computer Vision (ECCV), 2004, pp. 1–22.
[12] N. Dalal and B. Triggs, “Histograms of oriented gradients for human
detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, 2005, pp. 886–893.
[13] D. G. Lowe, “Distinctive image features from scale-invariant
keypoints,” International Journal of Computer Vision, vol. 60, no. 2,
pp. 91–110, 2004.
[14] H. Wang, M. M. Ullah, A. Kläser, I. Laptev, and C. Schmid, “Evaluation
of local spatio-temporal features for action recognition,” in BMVC, 2009.
[15] I. Laptev and T. Lindeberg, “Space-time interest points,” in Proc. ICCV,
2003, pp. 432–439.
[16] C. Harris and M. Stephens, “A combined corner and edge detector,” in
Proceedings of Fourth Alvey Vision Conference, 1988, pp. 147–151.
[17] T. Kadir and M. Brady, “Saliency, scale and image description,”
International Journal of Computer Vision, vol. 45, no. 2, pp. 83–105,
2001.
[18] A. Oikonomopoulos, I. Patras, and M. Pantić, “Spatiotemporal salient
points for visual recognition of human actions,” Transactions on
Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 36, no. 3, pp.
710–719, 2005.
[19] G. Willems, T. Tuytelaars, and L. Van Gool, “An efficient dense and
scale-invariant spatio-temporal interest point detector,” in Proceedings
of the European Conference on Computer Vision (ECCV). Berlin,
Heidelberg: Springer-Verlag, 2008, pp. 650–663.
[20] P. Dollár, V. Rabaud, G. Cottrell, and S. Belongie, “Behavior recognition
via sparse spatio-temporal features,” in VS-PETS, 2005.
[21] I. Laptev, M. Marszałek, C. Schmid, and B. Rozenfeld, “Learning
realistic human actions from movies,” in Proc. CVPR, 2008.
[22] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, “Speeded-up
robust features (SURF),” Computer Vision and Image Understanding,
vol. 110, no. 3, pp. 346–359, 2008.
[23] C. Schüldt, I. Laptev, and B. Caputo, “Recognizing human actions: A
local SVM approach,” in Proceedings of the International Conference
on Pattern Recognition, 2004, pp. 32–36.
[24] M. D. Rodriguez, J. Ahmed, and M. Shah, “Action MACH: A spatio-temporal maximum average correlation height filter for action recognition,” in Proc. CVPR, 2008.
[25] M. Marszałek, I. Laptev, and C. Schmid, “Actions in context,” in
Proc. CVPR, 2009.
[26] B. K. P. Horn and B. G. Schunck, “Determining optical flow,” Artificial
Intelligence, vol. 17, pp. 185–203, 1981.
[27] J. Sánchez Pérez, E. Meinhardt-Llopis, and G. Facciolo, “TV-L1 Optical
Flow Estimation,” Image Processing On Line, vol. 3, pp. 137–150, 2013.
[28] Z. Lin, Z. Jiang, and L. S. Davis, “Recognizing actions by shape-motion
prototype trees,” in Proc. ICCV, 2009, pp. 444–451.
[29] G. Bradski, “The OpenCV Library,” Dr. Dobb’s Journal of Software
Tools, 2000.
[30] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H.
Witten, “The WEKA data mining software: an update,” ACM SIGKDD
Explorations Newsletter, vol. 11, no. 1, pp. 10–18, 2009.
[31] K. Brkić, “Structural analysis of video by histogram-based description of
local space-time appearance,” Ph.D. dissertation, University of Zagreb,
Faculty of Electrical Engineering and Computing, 2013.
[32] S. Sadanand and J. J. Corso, “Action bank: A high-level representation
of activity in video,” Proc. CVPR, vol. 0, pp. 1234–1241, 2012.
Proceedings of the Croatian Computer Vision Workshop, Year 1
September 19, 2013, Zagreb, Croatia
An Overview and Evaluation of Various Face and
Eyes Detection Algorithms for Driver Fatigue
Monitoring Systems
Markan Lopar and Slobodan Ribarić
Department of Electronics, Microelectronics, Computer and Intelligent Systems
Faculty of Electrical Engineering and Computing, University of Zagreb
Zagreb, Croatia
[email protected], [email protected]
Abstract—In this work, various methods and algorithms for face and eyes detection are examined in order to decide which of them are applicable for use in a driver fatigue monitoring system. In the case of face detection, the standard Viola-Jones face detector has shown the best results, while the method of finding the eye centers by means of gradients has proven to be the most appropriate in the case of eyes detection. The latter method also has potential for retrieving the behavioral parameters needed for estimating the level of driver fatigue. This possibility will be examined in future work.
Keywords—driver fatigue monitoring; face detection; eyes
detection; behavioral parameters
The driver’s loss of attention due to drowsiness or fatigue is one of the major contributors to road accidents. According to the National Highway Traffic Safety Administration [1], more than 100,000 crashes per year in the United States are caused by drowsy driving. This is a conservative estimate, because drowsy driving is underreported as a cause of crashes, and accidents caused by driver inattention are not accounted for. According to French statistics for 2003 [2], more accidents were caused by inattention due to fatigue or sleep deprivation than by alcohol or drug intoxication. Therefore, significant efforts have been made over the past two decades to develop robust and reliable safety systems intended to reduce the number of accidents, as well as to introduce new, or improve existing, techniques and technologies for the development of such systems.
Different techniques are used in driver fatigue monitoring systems. These techniques are divided into four categories [3]. The first category includes intrusive techniques, which are mostly based on monitoring biomedical signals and therefore require physical contact with the driver. The second category includes non-intrusive techniques based on visual assessment of the driver’s bio-behavior from face images. The third category includes methods based on driver performance, which monitor vehicle behaviors such as course, steering angle, speed, braking, etc. Finally, the fourth category combines techniques from the three categories above.
The computer vision based techniques from the second category are particularly effective, because drowsiness can be detected by observing facial features and visual bio-behavior such as head position, gaze, eye openness, eyelid movements, and mouth openness. This category encompasses a wide range of methods and algorithms with respect to data acquisition (standard, IR and stereo cameras), image processing and feature extraction (Gabor wavelets, Gaussian derivatives, Haar-like masks, etc.), and classification (neural networks, support vector machines, knowledge-based systems, etc.). A widely accepted parameter measured by most methods is the percentage of eye closure over time [4], commonly referred to as PERCLOS. This measure was established in [5] as the proportion of time in a specific time interval during which the eyes are at least 80% closed, not including eye blinks. There are also other parameters that can be measured, such as the average eye closure speed (AECS) proposed by Ji and Yang [6].
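Given a per-frame eye-openness signal, the PERCLOS definition above reduces to counting frames that belong to sustained closures; a minimal sketch, assuming a frame-level openness value in [0, 1] is available. The function name, the 0.2 openness threshold and the blink-length cutoff are illustrative assumptions, not values from [4], [5]:

```python
def perclos(openness, closed_thresh=0.2, fps=30, max_blink_s=0.4):
    """Fraction of frames in which the eyes are at least 80% closed
    (openness <= 0.2), excluding closures short enough to be blinks.

    openness -- per-frame eye openness in [0, 1]; all thresholds are
    illustrative assumptions."""
    max_blink = int(max_blink_s * fps)   # closures up to this length count as blinks
    closed_frames = 0
    run = 0                              # length of the current closed-eye run
    for o in list(openness) + [1.0]:     # sentinel flushes the last run
        if o <= closed_thresh:
            run += 1
        else:
            if run > max_blink:          # only sustained closures count
                closed_frames += run
            run = 0
    return closed_frames / len(openness)
```

With a 30 fps camera and a sliding one-minute window, this is the kind of statistic an inference module could threshold.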
The initial task in the development of a driver fatigue monitoring system is to build a module for reliable real-time detection and tracking of facial features such as the face, eyes, mouth, gaze, and head movements. This paper examines some of the existing techniques for face and eyes detection.
Over roughly the past twenty years, many efforts have been made to develop feature detection systems for the purpose of driver fatigue monitoring. One widely popular method is to use active illumination for eye detection. Grace et al. [7] used a special camera to capture images of a driver at different wavelengths. Exploiting the fact that the human retina reflects different amounts of light at different wavelengths, they subtract two images of a driver taken in this way in order to obtain an image which contains non-zero pixels at the eye locations. Many variations of this image subtraction technique have been implemented since. One notable implementation is
presented by Ji and Yang [6]. They use an active infra-red illumination system which consists of a CCD camera with two sets of infra-red LEDs distributed evenly and symmetrically along the circumferences of two coplanar, concentric rings, where the center of both rings coincides with the camera’s optical axis. The light from the two rings enters the eye at different angles, resulting in images with a bright pupil effect when the inner ring is turned on, and with a dark pupil effect when the outer ring is switched on. The resulting images are subtracted in order to obtain the eye positions. This method of eye detection has been used by many other researchers [8, 9].
This work is supported by the EU-funded IPA project “VISTA – Computer Vision Innovations for Safe Traffic”.
CCVW 2013
Oral Session
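The bright/dark pupil scheme reduces to a per-pixel image difference followed by a threshold; a minimal numpy sketch on synthetic images (the threshold value and the function name are illustrative assumptions):

```python
import numpy as np

def pupil_candidates(bright, dark, thresh=50):
    """Subtract the dark-pupil image from the bright-pupil image;
    only the pupils, which strongly reflect the inner-ring IR light,
    survive the threshold. Returns a boolean mask of candidate pixels."""
    diff = bright.astype(np.int16) - dark.astype(np.int16)
    return diff > thresh

# Synthetic example: a flat face, with two pupils that are bright
# only under inner-ring illumination.
bright = np.full((120, 160), 80, dtype=np.uint8)
dark = bright.copy()
bright[40:44, 50:54] = 255     # left pupil
bright[40:44, 100:104] = 255   # right pupil

mask = pupil_candidates(bright, dark)
```

The connected components of `mask` would then be taken as eye position candidates.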
While the abovementioned approaches use active illumination and as such rely on special equipment, there are also methods that employ other strategies. Smith, Shah, and da Vitoria Lobo [10] present a system which relies on estimation of global motion and color statistics to track a person’s head and facial features. For eye and lip detection they employ color predicates, histogram-like structures whose cells are labeled as either members or non-members of a desired color range [11]. Another work that uses color information for facial feature detection is presented by Rong-ben, Ke-you, Shu-ming, and Jiang-wei [12]. Their method is based on the assumption that, in the RGB color space, the red and green components of human skin follow a planar Gaussian distribution. The R and G values of each pixel are examined, and if they each fall within three standard deviations of the respective mean, the pixel is considered to belong to skin. Once the face is localized, Gabor wavelets are applied to extract the eye features. The same procedure was later applied to mouth detection [13].
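The per-pixel skin test of [12] can be sketched as follows; the Gaussian means and standard deviations below are placeholder assumptions, not the parameters fitted in the cited work:

```python
import numpy as np

def skin_mask(rgb, mean_r=150.0, std_r=25.0, mean_g=100.0, std_g=20.0):
    """Label a pixel as skin if its R and G values each fall within
    three standard deviations of the respective skin-model mean.
    The model parameters here are illustrative placeholders."""
    r = rgb[..., 0].astype(np.float64)
    g = rgb[..., 1].astype(np.float64)
    return (np.abs(r - mean_r) <= 3 * std_r) & (np.abs(g - mean_g) <= 3 * std_g)
```

The largest connected region of the resulting mask would then be taken as the face candidate.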
D’Orazio, Leo, Spagnolo, and Guaragnella [14] present a method for eye detection using an operator based on the Circle Hough Transform [15]. The operator is applied to the entire image, and the result is a maximum that indicates the region possibly containing an eye. The second eye is then searched for in two opposite directions compatible with the range of possible eye positions, considering the distance and orientation between the eyes. The obtained candidates are subjected to a similarity test, evaluated by calculating the mean absolute error over mirrored domains. If this similarity measure falls below a certain threshold, the regions are considered the best match for the eye candidates.
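The mirrored-domain similarity test amounts to a mean absolute error between one eye patch and the horizontally mirrored other patch; a sketch, assuming equal-sized grayscale patches (the threshold value is also an assumption):

```python
import numpy as np

def eye_pair_similarity(left_patch, right_patch):
    """Mean absolute error between the left-eye patch and the
    horizontally mirrored right-eye patch; since a face is roughly
    symmetric, a true eye pair yields a small value."""
    mirrored = right_patch[:, ::-1]
    return np.mean(np.abs(left_patch.astype(np.float64)
                          - mirrored.astype(np.float64)))

def is_eye_pair(left_patch, right_patch, thresh=20.0):
    # The threshold value is an illustrative assumption.
    return eye_pair_similarity(left_patch, right_patch) < thresh
```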
Ribarić, Lovrenčić, and Pavešić [3] have presented a driver fatigue monitoring system which combines a neural-network-based feature extraction module with a knowledge-based inference module. The feature extraction module consists of several components that perform various tasks such as face detection, detection of in-plane and out-of-plane head rotations, and estimation of eye and mouth openness. The face detection procedure combines appearance-based and neural network approaches. The first step of the algorithm applies a hybrid method based on some features of the HMAX model [16] and the Viola-Jones face detector [17]: linear template matching is performed using fixed-size Haar-like masks, and then a nonlinear MAX operator is applied in order to gain invariance to transformations of the input image. The features obtained by applying the MAX operator are used as inputs to a three-layer perceptron in the second step of the face detection algorithm. These features are chosen from a fixed-size window that slides through the feature maps, and the value of a single output neuron for each position of the sliding window is recorded in an activation map. The activation map is then binarized, and the center of gravity of the largest region with non-zero values is considered to be the center of the detected face. The estimation of the angle of in-plane rotation is implemented using a five-layer convolutional neural network. The inputs are the features obtained in the face detection routine before applying the MAX operator. The output layer has 36 neurons which correspond to different rotation angles. The estimation of the angle of out-of-plane rotation is carried out in a similar manner, except that the pan and tilt angles are estimated from the horizontal and vertical offsets between the center of the face and the point that lies at the intersection of the nose symmetry axis and the line connecting the eyes. The locations of the eyes and mouth are determined on the basis of the center of the face, as well as the angles of in-plane and out-of-plane rotation. Two separate convolutional neural networks are employed to determine the eye and mouth openness.
Two more approaches which deal with the problem of eye detection are presented here. Though they were originally not discussed in the context of driver fatigue monitoring, they can nevertheless be applied for that purpose. The first is the work of Valenti and Gevers [18]. Their approach is based on the observation that eyes are characterized by a radially symmetric brightness pattern, and isophotes are used to infer the center of circular patterns. The center is obtained by a voting mechanism which weights important votes more heavily to reinforce center estimates. The second method is proposed by Timm and Barth [19]. It is a relatively simple method based on the fact that all gradients of a circle intersect at its center. Thus the eye centers can be found at the locations where most of the image gradients intersect. Moreover, this method is robust enough to detect partially occluded circular objects, e.g. half-open eyes. This is a great advantage over the Circle Hough Transform, which would fail in this case.
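The idea can be expressed as finding the candidate center that maximizes the mean squared dot product between unit displacement vectors and unit image gradients; a brute-force sketch on a synthetic dark disc, not the optimized implementation of [19]:

```python
import numpy as np

def eye_center(gray):
    """Brute-force gradient-intersection objective:
    c* = argmax_c mean_i (d_i . g_i)^2, where d_i is the unit vector
    from candidate center c to pixel x_i and g_i is the unit image
    gradient at x_i. Real implementations prune candidates heavily."""
    gy, gx = np.gradient(gray.astype(np.float64))
    mag = np.hypot(gx, gy)
    ys, xs = np.nonzero(mag > 0.1 * mag.max())       # keep strong gradients only
    gxn = gx[ys, xs] / mag[ys, xs]
    gyn = gy[ys, xs] / mag[ys, xs]
    best, best_score = (0, 0), -1.0
    h, w = gray.shape
    for cy in range(h):
        for cx in range(w):
            dy, dx = ys - cy, xs - cx
            norm = np.hypot(dx, dy)
            keep = norm > 0                          # skip the candidate pixel itself
            dots = (dx[keep] * gxn[keep] + dy[keep] * gyn[keep]) / norm[keep]
            score = np.mean(dots ** 2)
            if score > best_score:
                best_score, best = score, (cy, cx)
    return best

# Synthetic "iris": a dark disc centered at (15, 15) on a bright background.
yy, xx = np.mgrid[0:31, 0:31]
img = np.where((yy - 15) ** 2 + (xx - 15) ** 2 <= 36, 0.0, 255.0)
```

Because both vectors are unit length, the score is bounded by 1 and is attained only where the gradients actually converge, which is why partial occlusion merely removes some votes rather than breaking the estimate.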
The aim of this work is to test and compare some methods for face and eyes detection and to decide which of them are feasible for implementation in a driver fatigue monitoring system. Particular attention has been paid to accuracy of feature detection and the ability to operate in real time. The selected methods are tested on video sequences captured with a web camera in dark and bright ambient conditions and with varied background scenery. The tests are performed on a machine with two cores, each running at 2.4 GHz. Since extensive quantitative measurements are not available at the moment, the obtained results are presented in descriptive form only. These results were acquired by visual inspection: the accuracy is estimated by observing the frequency of correctly detected features, and feasibility for real-time processing is evaluated with regard to how smoothly the respective algorithm runs.
A. Face Detection
Face detectors are implemented using several methods. These include face detection by means of searching for ellipses, detection using SIFT [20] and SURF [21] features, application of the Viola-Jones face detector [17], and use of the feature extraction module described in [3].
Since the human head is approximately elliptical in shape, it is convenient to detect it by searching for ellipses in a given image. Ellipses can be found by the Generalized Hough Transform [22], but this procedure is computationally intensive. Prakash and Rajesh [23] recently presented a method for ellipse detection which reduces the whole problem to an application of the Circle Hough Transform, and this method is used in this work. Nevertheless, it is not fast enough to operate in real time. Also, since there are many false positives, it is necessary to implement some learning procedure, which additionally slows down detection.
SIFT and SURF are well-known algorithms for object matching. Both methods perform well in terms of matching facial features between a template and a target image, with SURF running slightly faster. Unfortunately, both have one major drawback: they are very sensitive to even small changes in illumination. An attempt was made to circumvent this problem by applying histogram equalization and histogram fitting [24], but with little success, since detection then runs much slower.
The Viola-Jones detector is a well-known face detector that has found use in many applications involving face detection. It performs very well in terms of both accuracy and the speed needed for real-time processing. The facial feature extractor described in [3], which implements some features of the Viola-Jones detector, also performs the task of face detection very well.
The advantages and drawbacks of each face detector are
summarized in TABLE I.
TABLE I
Ellipse search. Advantages: matches the shape of the human head well. Drawbacks: very slow, high number of false positives.
SIFT. Advantages: an excellent method for object matching. Drawbacks: very high sensitivity to changes in illumination.
SURF. Advantages: an excellent method for object matching, faster than SIFT. Drawbacks: very high sensitivity to changes in illumination.
Viola-Jones. Advantages: accurate and very fast. Drawbacks: small number of false positives, needs learning.
Ribarić, Lovrenčić, and Pavešić [3]. Advantages: accurate and fast. Drawbacks: small number of false positives, long training.
B. Eyes Detection
Four methods are used for eye detection. These include the Viola-Jones eye detector, the algorithm described by Valenti and Gevers [18], the approach proposed by Timm and Barth [19], and the feature extractor by Ribarić, Lovrenčić, and Pavešić [3]. All of these methods rely on a previously detected face, and for the purpose of face detection the Viola-Jones detector is used, since it performs best. The exception is the method by Ribarić, Lovrenčić, and Pavešić, since it uses its own face detection procedure.
The Viola-Jones eye detector performs reasonably well, but it produces a slightly higher number of false positives, especially in the regions of the mouth corners.
The detector described by Valenti and Gevers performs excellently regarding both speed and accuracy. It never fails to find the eyes, but its localization of the eye centers is not perfect, since more often than not it misdetects an eye corner as the eye center.
The detector by Timm and Barth performs best of all four eye detectors. It runs slightly slower than the detector by Valenti and Gevers, but it almost never fails to accurately locate the eye center.
Finally, the feature extractor by Ribarić, Lovrenčić, and Pavešić is not as good at eye detection as at face detection. The cause lies in the dependency chain of the detected features. A detected face is a prerequisite for in-plane rotation detection, out-of-plane rotation detection needs the previously detected in-plane rotation, and finally eye and mouth detection takes place after the out-of-plane rotation is detected. Thus the detection error grows at each step of this cascade and is largest at the end of the chain, i.e. for eye and mouth detection.
The advantages and drawbacks of each eyes detector are
summarized in TABLE II.
TABLE II
Viola-Jones. Advantages: accurate and very fast. Drawbacks: has more false positives than the face detector, needs learning.
Valenti and Gevers [18]. Advantages: reliable and very fast. Drawbacks: misdetects the eye corners as the eye centers.
Timm and Barth [19]. Advantages: very accurate in eye center detection. Drawbacks: slower than the detector by Valenti and Gevers.
Ribarić, Lovrenčić, and Pavešić [3]. Advantages: fast detector. Drawbacks: lower accuracy due to feature dependency.
In this work we have tested some methods for face and eye detection in order to decide which of them are best suited for implementation in a driver fatigue monitoring system. The Viola-Jones face detector has proven to be the fastest and most accurate among the face detectors, while the algorithm proposed by Timm and Barth outperformed the rest of the eye detectors.
The eye detection method described by Timm and Barth also has one interesting property that may be utilized in driver fatigue monitoring systems. Namely, by simply counting the number of intersecting gradients and comparing it to some reference value, it is possible to determine the degree of eye openness and to calculate behavioral parameters such as PERCLOS. This possibility will definitely be examined in the future.
The future work also includes research and development of methods for mouth detection and determination of the degree of mouth openness. The use of a thermovision camera for image acquisition is also planned, in order to examine whether some facial features can be obtained from images acquired in this way. Thermovision cameras are at the moment too expensive to be used in commercial driver fatigue monitoring systems, but their potential usefulness may and should nevertheless be examined in research activities.
After the facial features are obtained, a number of behavioral parameters, such as PERCLOS, the degree of mouth openness, the degree of head rotation, and gaze estimation, will be extracted from the facial features. These parameters will be used in an inference module which should decide whether the driver is fatigued or not, and issue an appropriate warning or alarm signal if necessary. In any case, the task of monitoring driver fatigue is a challenging one, and much effort is needed to deal with it successfully.
[1] Research on Drowsy Driving, National Highway Traffic Safety Administration.
[2] M. Kutila, “Methods for Machine Vision Based Driver Monitoring Applications”, VTT Technical Research Centre of Finland, pp. 1-70.
[3] S. Ribarić, J. Lovrenčić, N. Pavešić, “A Neural-Network-Based System for Monitoring Driver Fatigue”, MELECON 2010 – 15th IEEE Mediterranean Electrotechnical Conference, pp. 1356-1361, Apr. 2010.
[4] R. Knipling, P. Rau, “PERCLOS: A valid psychophysiological measure of alertness as assessed by psychomotor vigilance”, Office of Motor Carriers, FHWA, Tech. Rep. FHWA-MCRT-98-006, Oct. 1998, Washington, DC.
[5] W. W. Wierwille, L. A. Ellsworth, S. S. Wreggit, R. J. Fairbanks, C. L. Kirn, “Research on Vehicle-Based Driver Status/Performance Monitoring: Development, Validation, and Refinement of Algorithms for Detection of Driver Drowsiness”, National Highway Traffic Safety Administration Final Report DOT HS 808 247, 1994.
[6] Q. Ji, X. Yang, “Real-Time Eye, Gaze, and Face Pose Tracking for Monitoring Driver Vigilance”, Real-Time Imaging, vol. 8, pp. 357-377, 2002.
[7] R. Grace, V. E. Byrne, D. M. Bierman, J.-M. Legrand, D. Gricourt, B. K. Davis, J. J. Staszewski, B. Carnahan, “A Drowsy Driver Detection System for Heavy Vehicles”, Proc. of the 17th AIAA/IEEE/SAE Digital Avionics Systems Conference (DASC), vol. 2, issue 36, pp. 1-8, Nov. 1998, Bellevue, WA.
[8] H. Gu, Q. Ji, Z. Zhu, “Active Facial Tracking for Fatigue Detection”, Proc. of the 6th IEEE Workshop on Applications of Computer Vision, pp. 137-142, Dec. 2002, Orlando, FL.
[9] Q. Ji, Z. Zhu, P. Lan, “Real-Time Nonintrusive Monitoring and Prediction of Driver Fatigue”, IEEE Transactions on Vehicular Technology, vol. 53, no. 4, pp. 1052-1068, July 2004.
[10] P. Smith, M. Shah, N. da Vitoria Lobo, “Determining Driver Visual Attention with One Camera”, IEEE Transactions on Intelligent Transportation Systems, vol. 4, no. 4, pp. 205-218, Dec. 2003.
[11] R. Kjeldsen, J. Kender, “Finding Skin in Color Images”, Proc. of the 2nd International Conference on Automatic Face and Gesture Recognition, pp. 312-317, Oct. 1996, Killington, VT.
[12] W. Rong-ben, G. Ke-you, S. Shu-ming, C. Jiang-wei, “A Monitoring Method of Driver Fatigue Behavior Based on Machine Vision”, IEEE Intelligent Vehicles Symposium, pp. 110-113, Jun. 2003.
[13] W. Rongben, G. Lie, T. Bingliang, J. Lisheng, “Monitoring Mouth Movement for Driver Fatigue or Distraction with One Camera”, Proc. of the 7th International IEEE Conference on Intelligent Transportation Systems, pp. 314-319, Oct. 2004, Washington, DC.
[14] T. D’Orazio, M. Leo, P. Spagnolo, C. Guaragnella, “A Neural System for Eye Detection in a Driver Vigilance Application”, Proc. of the 7th International IEEE Conference on Intelligent Transportation Systems, pp. 320-325, Oct. 2004, Washington, DC.
[15] R. O. Duda, P. E. Hart, “Use of the Hough Transform to detect lines and curves in pictures”, Comm. ACM, vol. 15, pp. 11-15, 1972.
[16] T. Serre, L. Wolf, T. Poggio, “Object Recognition with Features Inspired by Visual Cortex”, IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), vol. 2, pp. 994-1000, Jun. 2005, San Diego, CA.
[17] P. Viola, M. Jones, “Robust Real-Time Object Detection”, 2nd International Workshop on Statistical and Computational Theories of Vision – Modeling, Learning, Computing, and Sampling, Jul. 2001.
[18] R. Valenti, T. Gevers, “Accurate Eye Center Location Through Invariant Isocentric Patterns”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, pp. 1785-1798, Sept. 2012.
[19] F. Timm, E. Barth, “Accurate eye centre localisation by means of gradients”, Int. Conference on Computer Vision Theory and Applications (VISAPP), vol. 1, pp. 125-130, Mar. 2011, Algarve, Portugal.
[20] D. G. Lowe, “Object recognition from local scale-invariant features”, Proc. of the Seventh IEEE International Conference on Computer Vision, vol. 2, pp. 1150-1157, Sept. 1999, Kerkyra, Greece.
[21] H. Bay, A. Ess, T. Tuytelaars, L. Van Gool, “SURF: Speeded up robust features”, Computer Vision and Image Understanding (CVIU), vol. 110, no. 3, pp. 346-359, 2008.
[22] D. H. Ballard, “Generalizing the Hough Transform to detect arbitrary shapes”, Pattern Recognition, vol. 13, no. 2, pp. 111-122, 1981.
[23] J. Prakash, K. Rajesh, “Human Face Detection and Segmentation using Eigenvalues of Covariance Matrix, Hough Transform and Raster Scan Algorithms”, World Academy of Science, Engineering and Technology 15, 2008.
[24] R. C. Gonzalez, R. E. Woods, “Digital Image Processing”, 3rd ed., Prentice Hall, 2007.
Classifying traffic scenes using the GIST image descriptor
Ivan Sikirić
Karla Brkić, Siniša Šegvić
Mireo d.d.
Buzinski prilaz 32, 10000 Zagreb
e-mail: [email protected]
University of Zagreb
Faculty of Electrical Engineering and Computing
e-mail: [email protected]
Abstract—This paper investigates classification of traffic
scenes in a very low bandwidth scenario, where an image should
be coded by a small number of features. We introduce a novel
dataset, called the FM1 dataset, consisting of 5615 images of
eight different traffic scenes: open highway, open road, settlement, tunnel, tunnel exit, toll booth, heavy traffic, and overpass. We evaluate the suitability of the GIST descriptor as a representation of these images, first by exploring the descriptor space using PCA and k-means clustering, and then by using an SVM classifier and recording its 10-fold cross-validation performance on the introduced FM1 dataset. The obtained recognition rates are very encouraging, indicating that the GIST descriptor alone could be sufficiently descriptive even when very high performance is required.
This paper aims to improve current fleet management systems by introducing visual data obtained by cameras mounted
inside vehicles. Fleet management systems track the position
and monitor the status of many vehicles in a fleet, to improve
the safety of their cargo and to prevent their unauthorized
and improper use. They also accumulate this data for the
purpose of generating various reports which are then used
to optimise the use of resources and minimise the expenses.
Fleet management systems usually consist of a central server
to which hundreds of simple clients report their status. Clients
are typically inexpensive embedded systems placed inside a
vehicle, equipped with a GPS chip, a GPRS modem and
optional additional sensors. Their purpose is to inform the
server of the vehicle’s current position, speed, fuel level, and
other relevant information at regular intervals. Our contribution is the introduction of visual cues into the vehicle’s status report. The server could use these cues to infer the properties of the vehicle’s surroundings, which would help it in further decision making. For example, the server could infer the location of the vehicle (e.g. open road, tunnel, gas station) or the cause of stopping (e.g. congestion, traffic lights, road works). In cases of missing or inaccurate GPS data, the fleet management server could use the location inferred from visual data to determine which of several possible vehicle routes is the correct one. In some cases the server is not even aware of the loss of GPS precision and fails to discard improbable data. This usually occurs in closed environments and near tall objects. Detecting such scenarios using visual data would be very beneficial. Also, detecting the cause of losing the GPS signal and the cause of vehicle stopping is important in systems that offer real-time tracking of vehicles used for transporting valuables.
We aim to solve this problem by learning a classification model
for each scene or scenario.
Transmitting the entire image taken from the camera in every status report would raise the size of the transmitted data by several orders of magnitude (typically the status size is less than a hundred bytes), which would be too expensive. This can be resolved by computing the descriptor of the image on the client itself and transmitting only the descriptor to the server for further analysis. We chose the GIST descriptor by Oliva and Torralba [1] because it describes the shape of the scene using low-dimensional feature vectors and performs well in scene classification setups.
Our second contribution is the introduction of a new dataset, called the FM1 dataset, containing 5615 traffic scenes and associated labels. Using this dataset we perform an experimental evaluation of the proposed method and report some preliminary results. The remainder of the paper is organized as follows:
In the next section we give an overview of previous related
work, followed by a brief description of the GIST descriptor.
We then describe the introduced dataset in detail, and explore
the data using well known techniques. After that we define
the classification problem, describe the classification setup
and present the results. We conclude the paper by giving
an overview of contributions and discussing some interesting
future directions.
Although image/scene classification is a very active topic in computer vision research, work specific to road scene classification is limited. Bosch et al. [2] divide image classification approaches into low-level and semantic approaches. Low-level approaches, e.g. [3], [4], model the image using low-level features, either globally or by dividing the image into sub-blocks. Semantic approaches aim to add a level of understanding of what is in the image. Bosch et al. [2] identify three subtypes of semantic approaches: (i) methods based on semantic objects, e.g. [5], [6], where objects in the image are detected to aid scene classification, (ii) methods based on local semantic concepts, such as the bag-of-visual-words approach [7], [8], [9], where meaningful features are learned from the training data, and (iii) methods based on semantic properties, such as the GIST descriptor [1], [10], where an image is described by a set of statistical properties, such as roughness or naturalness. One might argue that the GIST descriptor should be categorized as a low-level approach, as it uses low-level features to obtain the representation. However,
unlike low-level approaches [3], [4], where the local features themselves are the representation, the GIST descriptor merely uses low-level features to quantify higher-level semantic properties of the scene.
Ess et al. [11] propose a segmentation-based method for urban traffic scene understanding. An image is first divided into patches and roughly segmented, assigning each patch one of thirteen object labels, including car, road, etc. This representation is used to construct a feature set fed to a classifier that distinguishes between different road scene categories.
Tang and Breckon [12] propose extracting a set of color,
edge and texture-based features from predefined regions of
interest within road scene images. There are three predefined
regions of interest: (i) a rectangle near the center of the image,
sampling the road surface, (ii) a tall rectangle on the left side of
the image, sampling the road side, and (iii) a wide rectangle
on the bottom of the image, sampling the road edge. Each
predefined region of interest has its own set of preselected
color features, including various components of RGB, HSV
and YCrCb color spaces. The texture features are based on
grey-level co-occurrence matrix statistics and Gabor filters.
Additionally, edge-based features are extracted in the road edge
region. A training set of 800 examples of four road scene categories is introduced: motorway, offroad, trunkroad and urban
road. Testing is performed on approximately 600 test image frames. The k-NN and artificial neural network classifiers are considered. The best obtained recognition rate is 86% when considering all four classes as separate categories, improving to 90% when the classes are merged into two categories: off-road and urban.
Mioulet et al. [13] consider using Gabor features for road
scene classification, using the dataset of Tang and Breckon
[12]. Gabor features are extracted within the same three regions
of interest used by Tang and Breckon. Grayscale histograms
are built from Gabor result images and concatenated over all
three regions of interest to form the final descriptor. A random
forest classifier is trained and evaluated on the dataset from [12], consisting of four image classes: motorway, offroad, trunkroad and urban road, achieving a 10-fold cross-validation recognition rate of 97.6%.
In this paper we are limited by our target application, which imposes constraints on bandwidth and processing time. Therefore, sophisticated approaches that require a lot of processing power to obtain the scene feature vector, such as the segmentation-based method of Ess et al. [11] or the many-feature method of Tang and Breckon [12], are unsuitable for our problem. The closest to our application is the work of Mioulet et al. [13], where a simple Gabor feature-based approach performs better on the same dataset than the various preselected features of Tang and Breckon. We take the idea of Mioulet et al. a step further by using a Gabor feature-based GIST descriptor, which we hope will be an improvement over using raw Gabor features. Furthermore, as our goal is classifying road scenes into a much larger number of classes than are available in the dataset of Tang and Breckon [12], we introduce a new dataset with 8 scene categories.
The GIST descriptor [1], [14] focuses on the shape of
scene itself, on the relationship between the outlines of the
surfaces and their properties, and ignores the local objects in
the scene and their relationships. The representation of the
structure of the scene, termed spatial envelope is defined, as
well as its five perceptual properties: naturalness, openness,
roughness, expansion and ruggedness, which are meaningful
to human observers. The degrees of those properties can be
examined using various techniques, such as Fourier transform
and PCA. The contribution of spectral components at different
spatial locations to spatial envelope properties is described
with a function called the windowed discriminant spectral template
(WDST), whose parameters are obtained during learning.
The implementation we used first preprocesses the input
image by converting it to grayscale, normalizing the intensities
and locally scaling the contrast. The resulting image is then
split into a grid on several scales, and the response of each
cell is computed using a series of Gabor filters. All of the cell
responses are concatenated to form the feature vector.
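The pipeline just described (intensity normalization, Gabor filtering at several orientations, averaging over a spatial grid, concatenation) can be sketched in a few lines. This is a minimal NumPy illustration of the idea only, not the authors' MATLAB implementation; the kernel size, wavelength and sigma below are arbitrary choices, and a circular FFT convolution stands in for proper filtering:

```python
import numpy as np

def gabor_kernel(size, theta, wavelength, sigma):
    """Real part of a Gabor filter oriented at angle theta (radians)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)    # coordinates rotated by theta
    yr = -x * np.sin(theta) + y * np.cos(theta)
    return np.exp(-(xr ** 2 + yr ** 2) / (2 * sigma ** 2)) * \
           np.cos(2 * np.pi * xr / wavelength)

def gist_like_descriptor(gray, orientations=4, grid=4):
    """Concatenate mean Gabor energies over a grid x grid spatial layout.

    Returns a vector of length orientations * grid * grid.
    """
    gray = (gray - gray.mean()) / (gray.std() + 1e-8)   # intensity normalization
    h, w = gray.shape
    feats = []
    for k in range(orientations):
        kern = gabor_kernel(9, k * np.pi / orientations, 4.0, 2.0)
        # circular convolution via FFT; adequate for a sketch
        resp = np.abs(np.fft.ifft2(np.fft.fft2(gray) *
                                   np.fft.fft2(kern, s=(h, w))))
        for i in range(grid):
            for j in range(grid):
                cell = resp[i * h // grid:(i + 1) * h // grid,
                            j * w // grid:(j + 1) * w // grid]
                feats.append(cell.mean())
    return np.array(feats)
```

With more orientations, more scales and a finer grid, the same scheme yields descriptors of the length used in this paper.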
We believe this descriptor will perform very well in the
context of traffic scene classification. Expressing the degree of
naturalness of the scene should enable us to differentiate urban
from open road environments (even more so in case of unpaved
roads). The degree of openness should differentiate the open
roads from tunnels and other types of closed environments,
and is especially useful in the context of fleet management,
where closed environments usually result in the loss of GPS
signal. The texture and shape of large surfaces are also taken into
account, and this could help separate highways from other
types of roads.
Since we want to build a classification model for traffic scenes
and scenarios relevant to the fleet management problems, we
introduce a novel dataset of labeled traffic scenes, called the
FM1 dataset.
The data was acquired using a camera mounted on the
windshield of a personal vehicle. Videos were recorded at 30
frames per second, at a resolution of 640×480. Several
drives were made on Croatian roads, ranging in length from
37 to 60 minutes, for a total of about 190 minutes (see Table
I). All videos were recorded on a clear sunny day, and the
largest share of the footage was recorded on highways.
The camera, its position and orientation were not changed
between videos. We plan to vary the camera and its position in
the later versions of the dataset, as well as to include nighttime
images, and images taken during moderate to heavy rain (when
the windshield wipers are operating).
A typical frame of a video is shown in Figure 1. Some
parts of the image are not part of the traffic scene, but rather
the interior of the vehicle, such as the camera mount visible in
the upper right corner, the dashboard in the bottom part, and
occasional reflections of car interior visible on the windshield.
The windshield itself can be dirty, and various artefacts can
appear on it depending on the position of the sun.

Fig. 1: A typical traffic scene. Note the camera mount (1), the dashboard (2), the reflection of the car interior (3) and the speck of dirt (4).

TABLE I: Overview of traffic videos (number of extracted images per video)

The camera mount and the dashboard are in identical positions in all of the
images, so they should not influence the classification results.
However, we cannot expect this to always be the case in the
future, and plan to develop a system that will be robust enough
to accommodate such changes.
For each recorded video, every 60th frame was extracted,
i.e. one every two seconds. This ensured there was no bias in
the selection process, and that subsequently selected images
would not be too similar.
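The sampling rule is simple enough to state in code; a minimal sketch, assuming the frames are available as an indexable sequence (the video decoding itself would be done with a video library):

```python
def subsample(frames, step=60):
    """Keep every step-th frame; at 30 fps, step=60 yields one frame per 2 s."""
    return [f for i, f in enumerate(frames) if i % step == 0]
```

For example, `subsample(range(180))` keeps frames 0, 60 and 120.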
Choosing the classes for this preliminary classification
evaluation was not an easy task. Perhaps the most obvious
classes are "tunnel" and "open highway", but even they are
not clearly defined. Should we insist the image be classified
as an open highway if there is a traffic jam, and almost the entire
scene is obstructed by a large truck directly in front of us? We
can introduce the "heavy traffic" class to accommodate this
case. What if it is not a truck, but a car instead, or if it is a
bit further away, so more of the highway is visible? There are
many cases in which it is impossible to pinpoint the exact moment
when a scene stops being a member of one class
and becomes a member of another. Perhaps these issues
can be resolved by allowing each scene to be assigned multiple
class labels, which we plan to explore at later stages of our
work.
Before choosing the class labels, we first analyzed the
data using PCA and clustering algorithms, to see if there are
obvious and easily separable classes in the dataset.
Fig. 2: Data points projected into 2D (note the three clusters)
A. Exploring the data
The first step in data analysis was computing the descriptor
for every image. We used the MATLAB implementation of the
GIST descriptor provided by Oliva and Torralba [1], using the
default parameters, which produces a feature vector of length
512. Since the resolution of the original image is 640×480,
this represents a substantial dimensionality reduction.
PCA was used to find the principal components of the
data, which was then projected onto the plane determined by
the first two principal components. The results can be seen
in Figure 2. We can see most of the data points belong to
one of three clusters, with some data points appearing to be
outliers. Around 3300 points belong to cluster 1, 90% of
them corresponding to scenes of open highway. The rest of
the points in cluster 1 correspond to various types of scenes,
but not a single one is a scene inside a tunnel. Around 1400
data points belong to cluster 2, 70% of them corresponding
to scenes of open highway, and 20% to scenes of other
types of open road. Around 820 points belong to cluster 3,
45% of them corresponding to scenes inside tunnels, 35%
to the open highway, and 6% to scenes at or under
the toll booth. More than 95% of tunnel scenes belong to this
cluster, and most of the images in this cluster depict closed
environments.
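The projection used here can be sketched with a plain SVD-based PCA; this is a minimal NumPy illustration of the technique, not the exact tool used in the analysis:

```python
import numpy as np

def pca_project_2d(X):
    """Project the rows of X onto the first two principal components."""
    Xc = X - X.mean(axis=0)                 # center the data
    # right singular vectors of the centered data are the principal axes
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                    # (n_samples, 2) coordinates
```

Applied to the 5615 descriptors of length 512, this yields the 2D point cloud of Figure 2.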
The next step was using clustering algorithms to discover
easily separable classes in the dataset. K-means clustering was
used on the dataset with values for K varying from 2 to 20.
For the case K = 2, one cluster contained almost all of the
tunnels and toll booth scenes, as was expected, but it also
contained about one third of the open road scenes. For the
case K = 3, one cluster contained most of the tunnels and
toll booth scenes, but very few open road scenes. The second
cluster contained almost all of the scenes with a large number of
other vehicles, i.e. scenes of heavy traffic. For K = 5, we can
notice a good separation of non-highway open roads scenes.
Raising K further shows that similar types of scenes in
urban environments often end up in the same cluster. Some
Fig. 3: Examples of selected classes: (a) highway, (b) road, (c) tunnel, (d) exit, (e) settlement, (f) overpass, (g) booth, (h) traffic
TABLE II: Selected classes
class label    scene description
highway        an open highway
road           an open non-highway road
tunnel         in a tunnel, or directly in front of it, but not at the tunnel exit
exit           directly at the tunnel exit (extremely bright image)
settlement     in a settlement (e.g. visible buildings)
overpass       in front of, or under, an overpass (the overpass is dominant in the scene)
booth          directly in front of, or at, the toll booth
traffic        many vehicles are visible in the scene, or completely obstruct the view
TABLE III: Distribution of classes across videos (class counts per video)
clusters contained very similar images, which suggested some
easily separable classes, such as "a scene on a highway with
a rocky formation on the right side of the road", but most of
those classes were not considered useful to our application.
In conclusion, the clustering approach gave us some ideas
about easily separable and useful class labels to choose for
our final classification experiment. Once we obtain more data,
the clustering approach should be reapplied, probably in the
form of hierarchical clustering for easier analysis.
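The K-means exploration above can be sketched with a plain Lloyd's-algorithm implementation; this is illustrative only (any standard clustering library would serve equally well):

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's algorithm: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):          # guard against empty clusters
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels
```

Running it for K = 2, 3, 5, ... on the descriptor matrix reproduces the kind of exploration described above.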
B. Selected classes
The set of classes we chose is listed in Table II. Some
of the classes were chosen because the data analysis indicated
they would be easy to classify, and one of the functions of
fleet management is to simply archive any data that might
be required for purposes yet unknown. As was discussed
in the introduction, we are very interested in detecting the
environments in which the loss of GPS signal precision is
likely, or the vehicle is likely to stop or drive slowly. The
tunnel is a class which is both easy to classify and often causes
loss of GPS signal. The tunnel exit was separated into its
own class because the camera reaction to the sudden increase
in sunlight is very slow, which results in extremely bright
images (a similar problem is not encountered during tunnel
entry). The settlement is an environment in which we are likely
to encounter tall objects (which may or may not be visible
in the scene), so the loss of GPS signal precision is more
likely to occur than on an open road, but less likely than in
a tunnel. The overpass is chosen because going under it can
cause a slight loss of GPS precision. Toll booths are included
because going through them can cause a loss of GPS precision,
and also because the vehicle must always stop at a toll booth.
The heavy traffic scenario is interesting because it can cause the
vehicle to stop or drive very slowly, and because presence of
many vehicles can obstruct the view of the camera enough that
the proper classification of the location becomes impossible.
Intersection is a class which would be very interesting to
investigate, but unfortunately we did not have enough samples,
and their variability was too great. The distribution of class
instances across videos is shown in Table III.
We used the data mining tool Weka 3 [15] to train several
types of general-purpose classifiers, performing grid-search
optimisation of parameters for each of them. The best results
were obtained using an SVM classifier with soft margin C = 512
and an RBF kernel with γ = 0.125. The testing method was
stratified cross-validation with 10 folds, using the entire dataset
(5615 feature vectors of length 512). Since the dataset is
heavily biased towards the highway class, we expect most other
classes to be often confused with a highway.

TABLE IV: Detailed accuracy by class (TP rate, FP rate, ROC area, and weighted averages)

TABLE V: Confusion matrix

Fig. 4: Examples of scenes misclassified as highway: (a) road, (b) settlement, (c) traffic, (d) tunnel, (e) overpass, (f) booth
The achieved recognition rate was 97.3%. The detailed
accuracy by class is shown in Table IV, while the confusion
matrix is shown in Table V.
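The stratified fold assignment behind this evaluation protocol can be sketched as follows; this is a minimal illustration of how stratification preserves per-class ratios in every fold, not the Weka internals (the SVM itself would come from a standard library):

```python
import numpy as np

def stratified_folds(y, n_folds=10, seed=0):
    """Assign a fold index to every sample, preserving class ratios per fold."""
    rng = np.random.default_rng(seed)
    folds = np.empty(len(y), dtype=int)
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)
        rng.shuffle(idx)
        # deal this class's samples round-robin across the folds
        folds[idx] = np.arange(len(idx)) % n_folds
    return folds
```

Training on nine folds and testing on the held-out fold, repeated ten times, gives the cross-validated recognition rate reported above.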
We can see the performance is very good across all classes,
except the overpass, which is often confused with a highway.
This is probably because most of the overpass scenes were in
fact on a highway, and the overpass was not equally dominant
in all of them, as well as because of the low number of class
instances. The settlement class is often confused with highway
and road scenes. We plan to resolve this by introducing more
data, because instances of this class vary in appearance more
than those of any other class. We also note that highway is often
confused with a plain road, and vice versa, which is to
be expected for classes with similar appearance. We do not
consider this confusion to be problematic, as fleet management
can usually use GPS data to correctly infer the type of the
road. Some examples of images misclassified as a highway
are shown in Figure 4.
Our preliminary results show that the GIST descriptor
alone is sufficiently descriptive for the purpose of classification
of many types of traffic scenes. This indicates the viability of
the proposed method as an improvement to current fleet
management systems, and invites further research.
Further efforts should be directed towards expanding the
traffic scenes dataset to include images taken at night and in
bad weather, as well as images obtained by different cameras
mounted at other positions and angles. Also, the set of classes
should be expanded to include many other interesting scenes
and scenarios. Poor classification of the overpass class should
be addressed, at first by adding more data. It is possible that
in some cases the overpass cannot be considered an integral
part of the scene, but rather its attribute. There are also other
types of attributes of the traffic scenes that we might want
to consider, e.g. traffic lights. We plan to introduce additional
image features and detectors of such attributes.
One of our future goals would be to reduce or eliminate the
need for labeling. We are considering using automatic labeling
in cases where we can infer the details of the vehicle’s environment
with a high degree of probability using non-visual cues.
Acknowledgment

The authors would like to thank Josip Krapac for his very helpful input during our work on this paper.
This research has been supported by the research projects Research Centre for Advanced Cooperative Systems (EU FP7 #285939) and Vista (EuropeAid/131920/M/ACT/HR).

References

[1] A. Oliva and A. Torralba, “Modeling the shape of the scene: A holistic representation of the spatial envelope,” Int. J. Comput. Vision, vol. 42, pp. 145–175, May 2001.
[2] A. Bosch, X. Muñoz, and R. Martí, “Review: Which is the best way to organize/classify images by content?,” Image Vision Comput., vol. 25, pp. 778–791, June 2007.
[3] N. Serrano, A. E. Savakis, and J. Luo, “Improved scene classification using efficient low-level features and semantic cues,” Pattern Recognition, vol. 37, no. 9, pp. 1773–1784, 2004.
[4] A. Vailaya, M. A. T. Figueiredo, A. K. Jain, and H. J. Zhang, “Image classification for content-based indexing,” IEEE Trans. on Image Processing, vol. 10, pp. 117–130, January 2001.
[5] J. Luo, A. E. Savakis, and A. Singhal, “A bayesian network-based framework for semantic image understanding,” Pattern Recogn., vol. 38, pp. 919–934, June 2005.
[6] J. Fan, Y. Gao, H. Luo, and G. Xu, “Statistical modeling and conceptualization of natural images,” Pattern Recogn., vol. 38, pp. 865–885, June 2005.
[7] F.-F. Li and P. Perona, “A bayesian hierarchical model for learning natural scene categories,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 524–531, IEEE Computer Society, 2005.
[8] A. Bosch, A. Zisserman, and X. Muñoz, “Scene classification via pLSA,” in Proc. European Conference on Computer Vision (ECCV), pp. 517–530, Springer-Verlag, 2006.
[9] S. Lazebnik, C. Schmid, and J. Ponce, “Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2169–2178, IEEE Computer Society, 2006.
[10] A. Oliva and A. B. Torralba, “Scene-centered description from spatial envelope properties,” in Proc. Second International Workshop on Biologically Motivated Computer Vision (BMCV), pp. 263–272, Springer-Verlag, 2002.
[11] A. Ess, T. Mueller, H. Grabner, and L. van Gool, “Segmentation-based urban traffic scene understanding,” in Proc. British Machine Vision Conference, pp. 84.1–84.11, BMVA Press, 2009.
[12] I. Tang and T. Breckon, “Automatic road environment classification,” IEEE Transactions on Intelligent Transportation Systems, vol. 12, pp. 476–484, June 2011.
[13] L. Mioulet, T. Breckon, A. Mouton, H. Liang, and T. Morie, “Gabor features for real-time road environment classification,” in Proc. IEEE International Conference on Industrial Technology, pp. 1117–1121, February 2013.
[14] M. Douze, H. Jégou, H. Sandhawalia, L. Amsaleg, and C. Schmid, “Evaluation of gist descriptors for web-scale image search,” in Proc. ACM International Conference on Image and Video Retrieval (CIVR), pp. 19:1–19:8, ACM, 2009.
[15] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, “The weka data mining software: an update,” SIGKDD Explor. Newsl., vol. 11, pp. 10–18, Nov. 2009.
Computer Vision Systems in Road Vehicles:
A Review
Kristian Kovačić, Edouard Ivanjko and Hrvoje Gold
Department of Intelligent Transportation Systems
Faculty of Transport and Traffic Sciences
University of Zagreb
Email: [email protected], [email protected], [email protected]
Abstract—The number of road vehicles has increased significantly
in recent decades. This trend was accompanied by a build-up of road
infrastructure and the development of various control systems that
increase road traffic safety, road capacity and travel comfort.
Significant progress has been made in traffic safety, and today’s
systems increasingly include cameras and computer vision
methods. Cameras are used as part of the road infrastructure or
in vehicles. In this paper a review of computer vision systems
in vehicles from the standpoint of traffic engineering is given.
Safety problems of road vehicles are presented, the current state of
the art of in-vehicle vision systems is described, and open problems
with future research directions are discussed.
Recent decades can be characterized by a significant increase in the number of road vehicles accompanied by a build-up of road infrastructure. Simultaneously, various traffic control
systems have been developed in order to increase road traffic
safety, road capacity and travel comfort. However, even as
technology has significantly advanced, traffic accidents still cause
a large number of human fatalities and injuries.
Statistics show that there are about 1.2 million fatalities and
50 million injuries related to road traffic accidents per year
worldwide [1]. There were also 3 961 pedestrian fatalities in
2005 in the EU. Most accidents happened because drivers
were not aware of pedestrians. Furthermore, there have been
approximately 500 casualties a year in the EU because of the driver’s
lack of visibility in the blind zone. Data published in [1]
estimate that there were around 244 000 police-reported
crashes and 224 fatalities related to lane switching in 1991.
Statistics published by the Croatian Bureau of Statistics [2]
reveal that in 2011 there were 418 killed and 18 065 injured
persons participating in traffic in Croatia.
To reduce the number of annual traffic accidents, a large
number of different systems have been researched and developed. Such systems are part of the road infrastructure or of road vehicles (horizontal and vertical signalization, variable message
signs, driver support systems, etc.). Vehicle manufacturers have
implemented systems such as lane detection and lane departure warning, parking assistants, collision avoidance, adaptive
cruise control, etc. The main task of a lane departure warning system is to make the
driver aware of leaving the current lane. These systems are
partially based on computer vision algorithms that can detect
and track pedestrians or surrounding vehicles [1].
Advances in computing power and cheap video cameras
have enabled today’s traffic safety systems to increasingly include
CCVW 2013
Poster Session
cameras and computer vision methods. Cameras are used as
part of road infrastructure or in vehicles. They enable monitoring of traffic infrastructure, detection of incident situations,
tracking of surrounding vehicles, etc.
The goal of this paper is to review existing
computer vision systems from the standpoint of
traffic engineering, unlike most reviews, which are done from the
standpoint of computer science. The emphasis is on
computer vision methods that can be used in systems built
into vehicles to assist the driver and increase traffic safety.
This paper is organized as follows. The second section gives
a review of road vehicle safety problems, followed by the
third section, which describes vision system requirements for
use in vehicles. The fourth section reviews the most commonly
used image processing methods, followed by the fifth section,
which describes existing systems. The paper ends with a discussion
of open problems and a conclusion.
According to the survey study results published in [3],
annually about 100 000 crashes in the USA result directly from
driver fatigue. It is the main cause of about 30% of all severe
traffic accidents. In the case of French statistics, a lack of
attention due to driver fatigue or sleepiness was a factor in
one in three accidents, while alcohol, drugs or distraction was
a factor in one in five accidents in 2003 [4].
Based on the accident records from [3], the most common
causes of traffic accidents are:
- Frontal crashes, where vehicles driving in opposite directions have collided;
- Lane departure collisions, where a lane-changing vehicle collided with a vehicle from an adjacent lane;
- Crashes with surrounding vehicles while parking, passing through an intersection or a narrow alley, etc.;
- Failures to see or recognize the road signalization, consequently causing a traffic accident due to inappropriate driving.
A. Frontal crashes
According to the data published in [3], 30 452 of total
2 041 943 vehicles crashes recorded in the USA in the period
from 2005 to 2008 occurred because of too close following
of a vehicle in a convoy. About 5.5% of accidents happened
with vehicles moving in the opposite direction. Although there are
various types of systems that can help to reduce the number of
accidents caused by a frontal vehicle crash, computer vision
gives the ability to prevent a possible or imminent vehicle crash
from happening. A system can detect a dangerous situation using
one or more frontal cameras and afterwards produce an appropriate
response, haptic or audio.
When a possible frontal crash situation occurs, the time
available to prevent the crash is usually very short. The efficiency
of crash avoidance systems depends on how early a
possible crash is detected prior to its occurrence. The less
time is spent detecting a possible frontal crash,
the more time is available for driver warnings
or evasive actions. Such actions need to be
taken in order to prevent or alleviate the crash (preparation of
airbags, autonomous braking, etc.).
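A common first-order measure behind such warnings (not given explicitly in the systems discussed here, but widely used in practice) is the constant-velocity time-to-collision estimate; a minimal sketch:

```python
def time_to_collision(distance_m, closing_speed_mps):
    """Constant-velocity time-to-collision estimate used to trigger warnings."""
    if closing_speed_mps <= 0:
        return float('inf')    # not closing in on the obstacle
    return distance_m / closing_speed_mps
```

For example, an obstacle 40 m ahead with a closing speed of 20 m/s gives a time-to-collision of 2 s; a warning threshold is then a design choice of the specific system.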
B. Lane departure
Statistics published in [3] show that 10.8% of accidents
happened because vehicles overran the lane line and 22.2%
because vehicles went over the edge of the road. Of all
vehicles that participated in accidents, 3.1% were intentionally
changing lanes or overtaking another vehicle.
The number of accidents can be significantly reduced by using
a computer vision system for lane departure warning. If the
driver is distracted, such a system can accentuate
the dangerous action (lane changing) and focus the driver’s
attention. When such a system is built into a vehicle, drivers
also tend to be more careful not to cross the lane
markers, because they otherwise receive some kind of warning
(vibrating driver seat, audio signal or graphical notice).
C. Surrounding vehicles
Surrounding vehicles represent a threat if the driver is not
aware of them. In the case of driver fatigue, the driver’s capability of tracking
surrounding vehicles is significantly reduced. This can cause an
accident in events such as lane changing, driving in a convoy,
parking and similar situations.
Depending on the complexity of a system that tracks surrounding vehicles, appropriate warnings and actions can be
performed based on the current vehicle environment state or the state
that is estimated (predicted) for the near future. A
system that tracks surrounding vehicles can also be used in lane
departure warning or parking assistant systems. The main
problem of such a system is false detections of surrounding
vehicles. Vehicles come in many different shapes and colors.
This can cause false vehicle estimations and wrong driver
actions, endangering other drivers and road users [5].
D. Recognition of road signalization
Road signalization (horizontal and vertical) represents the
key element in traffic safety. The driver should always detect and
recognize both road signalization groups correctly, especially
traffic signs. When the driver is unfocused or tired,
horizontal and vertical signalization is mostly ignored. According to the survey [3], 3.8% of total vehicle crashes happened
when drivers did not adjust their driving to road signalization
recommendations.

Fig. 1. General in-vehicle driver support system architecture.

This number of accidents could be reduced if
a convenient road signalization detection and driver assistant
system had been used. Such a system could warn the driver when
performing a dangerous or illegal maneuver. According
to data in [3], there were 109 932 accidents caused by speeding
and ignoring speed limits.
In some cases drivers are overloaded with information and
have a hard time prioritizing information importance. For example, when driving through an area for the first time, drivers
are usually more concentrated on route information traffic
signs. Systems for road sign detection and recognition can
improve driving safety and reduce the above-mentioned accident
percentage by drawing the driver’s attention to warning signs.
To define an appropriate vision system architecture, some
requirements related to the application area need to be considered.
Because of the specific field of use, the computer vision
system has to work in real time. The obtained data should be
available promptly after a possible critical situation has been
detected. The time window for computation depends on the vehicle
speed and shortens as the vehicle speed increases. This
can represent a significant problem when small and cheap
embedded computers are used. To overcome this problem, algorithms that support parallel computing, multi-core processor
platforms or dedicated computer vision hardware platforms
are used. Alternatively, special intelligent video cameras with
preinstalled local image processing software can be used. Such
cameras can perform extraction of basic image features, object
recognition and tracking.
The second important requirement is the capability to adapt to
rapid changes of the environment monitored by the cameras. Such
changes can be caused by fog, rain, snow, and illumination
changes related to night and day or entering and exiting
tunnels. In cases where the whole driver support or vehicle control
system uses more than one sensor type (e.g. optical cameras and
ultrasonic sensors, lidars), system adaptiveness to environment
changes is better than in cases with one sensor type. Furthermore, the whole system needs to be resistant to various physical
influences (e.g. vibrations, acceleration/deceleration) [5]. In
Fig. 1 the general in-vehicle vision system architecture is given.
To detect and track vehicles or other kinds of objects
(pedestrians, lane markings, traffic signs, etc.), different
sensors are often used. Sensors can be divided into two categories:
active and passive. Most active sensors are time-of-flight
sensors, i.e. they measure the distance between a specific object
and the sensor. Millimeter-wave radars and laser sensors are
mostly used as active sensors in the automotive industry. Active
sensors are therefore mainly used for surrounding vehicle detection
and tracking.

Fig. 2. Same image in visible spectrum (a) and in far-infrared spectrum (b).
Various cameras are mostly used as passive sensors. Basic
camera characteristics include the working spectrum, resolution, field of view, etc. In traffic applications the camera spectrum is
crucial, since night-time cameras use the far, mid and near infrared
spectrum, while daylight cameras use the visible spectrum. A far-infrared camera is suitable for detection of pedestrians
at night, as shown in Fig. 2.
Computer vision systems in vehicles can also be based on stereo
vision. Stereo vision is a technique that enables extraction
of 3D data from a pair of 2D images. Usually the whole
overlapping image region, or only a specific part
of it, is translated from a 2D to a 3D coordinate system.
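For a rectified stereo pair, the depth of a point follows from its disparity via the standard pinhole relation Z = f·B/d; a minimal sketch (symbol names are ours, not from a specific system):

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Pinhole stereo: Z = f * B / d for a rectified camera pair.

    disparity_px: horizontal pixel shift of the same point between the images;
    focal_px: focal length in pixels; baseline_m: camera separation in metres.
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive (point at infinity otherwise)")
    return focal_px * baseline_m / disparity_px
```

For example, with an 800-pixel focal length and a 0.5 m baseline, a 20-pixel disparity corresponds to a depth of 20 m; note how depth resolution degrades as disparity shrinks with distance.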
Image processing consists of basic operations that provide
features for high-level information extraction. The simplest image
processing transformations are done by point operators. Such
operators can perform color transformation, compositing and
matting (the process of separating the image into two layers,
foreground and background), histogram equalization, tonal
adjustment and other operations. Point operators do not use
surrounding points in their calculations. In contrast, neighborhood
operators (linear filtering, non-linear filtering, etc.) also use information from surrounding pixels. For image enhancement,
linear and non-linear filters (moving average, Bartlett filter,
median filter, morphology, etc.) are used. Non-linear filters can
also be used in shape recognition, for example as
part of vehicle recognition methods [6].
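As an example of a non-linear neighborhood operator, a brute-force median filter can be sketched as follows (illustrative only; real implementations use much faster algorithms):

```python
import numpy as np

def median_filter(img, k=3):
    """Non-linear smoothing: each pixel becomes the median of its k x k window."""
    pad = k // 2
    padded = np.pad(img, pad, mode='edge')   # replicate the border pixels
    h, w = img.shape
    out = np.empty((h, w), dtype=float)
    for i in range(h):
        for j in range(w):
            out[i, j] = np.median(padded[i:i + k, j:j + k])
    return out
```

Unlike a moving average, the median discards isolated outliers entirely, which is why it is the standard choice against impulse ("salt-and-pepper") noise.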
An affine transformation is a type of transformation that
preserves parallelism of lines and ratios of distances along a line
when applied to points or parts of an image. As a consequence, a specific geometric shape in an image can be
rotated, scaled, translated, sheared, etc. This transformation is usually
used for basic operations on a whole specific image region.
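An affine transformation of 2D points is simply x' = A x + t; a minimal sketch:

```python
import numpy as np

def affine_transform(points, A, t):
    """Apply x' = A @ x + t to an (n, 2) array of 2D points."""
    return points @ np.asarray(A, dtype=float).T + np.asarray(t, dtype=float)
```

For example, A = [[0, -1], [1, 0]] with t = [0, 0] rotates every point by 90 degrees; since the same A and t act on every point, parallel lines remain parallel.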
Distance estimation between vehicles is important in traffic
applications. To calculate the distance between the camera and
a point in space represented by a specific pixel in the image,
perspective (projection) transformation is performed. It is computed by applying a perspective matrix transformation on every
pixel in the image. Perspective transformation assigns a third
component (z coordinate) to every pixel depending on the x-y
components in the image. In Fig. 3 the result of a perspective image
transformation is given. The pixel components in the original image
are the x and y coordinates. After the perspective transformation a new
pixel component (z coordinate) is calculated and swapped with
the y coordinate. Thus, pixels near the bottom of the image
represent points in 3D space that are closer to the camera, and
pixels near the top of the image are further away.
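In homogeneous coordinates, such a perspective mapping amounts to multiplying each pixel coordinate by a 3×3 matrix and dehomogenizing; a minimal sketch:

```python
import numpy as np

def apply_homography(H, x, y):
    """Map pixel (x, y) through a 3x3 projective transform and dehomogenize."""
    p = H @ np.array([x, y, 1.0])
    return p[0] / p[2], p[1] / p[2]    # divide by the homogeneous coordinate
```

With H equal to the identity matrix the pixel is unchanged; a matrix with a non-trivial last row produces exactly the depth-dependent scaling that makes distant road points crowd towards the top of the image.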
Fig. 3. Original image (left) and same image with perspective projection
transformation applied (right) [7].
The Hough transformation is used to extract parameters of
image features such as lines. The main advantage of this method is its
relatively high tolerance to noise and its ability to ignore small
gaps in shapes. After the required preprocessing (edge detection)
is done on an image, the Hough transformation translates every
pixel of a feature in the image from the Cartesian to the polar
coordinate system. The original Hough transform extracts
only line parameters, although various modified
versions have been developed to extract
corner, circle and ellipse parameters.
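The voting scheme behind the line-detecting Hough transform can be sketched as follows; this is a simplified accumulator (production implementations bin rho more carefully and smooth the accumulator before peak-picking):

```python
import numpy as np

def hough_lines(edge_mask, n_theta=180, n_rho=200):
    """Vote edge pixels into a (rho, theta) accumulator; return it with the axes."""
    h, w = edge_mask.shape
    diag = np.hypot(h, w)
    thetas = np.linspace(0, np.pi, n_theta, endpoint=False)
    rhos = np.linspace(-diag, diag, n_rho)
    acc = np.zeros((n_rho, n_theta), dtype=int)
    ys, xs = np.nonzero(edge_mask)
    for x, y in zip(xs, ys):
        rho = x * np.cos(thetas) + y * np.sin(thetas)   # one rho per theta
        ri = np.searchsorted(rhos, rho)                 # nearest rho bin (approx.)
        acc[np.clip(ri, 0, n_rho - 1), np.arange(n_theta)] += 1
    return acc, rhos, thetas
```

A peak in the accumulator at (rho, theta) corresponds to a line x·cos(theta) + y·sin(theta) = rho; lane markings in a road image typically produce two such dominant peaks.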
For object detection, Paul Viola and Michael Jones proposed an approach based on so-called Haar-like features, which bear some
resemblance to Haar wavelet functions [8]. Haar-like-feature
methods are used where fast object detection is required [9],
[10]. The approach divides a detection window
(window of interest) into smaller regions, whose size depends
on the size of the object to be detected. For each
region, a sum of pixel intensities within the region is computed.
A summed-area table (integral image) is used
to compute these sums quickly. Possible object candidates
can be found by comparing the differences between sums of adjacent
regions. For more accurate object detection, many regions
need to be tested. Applications such as vehicle detection
require the use of more complex methods with computational
demands similar to the Haar-like-feature method.
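The integral-image trick can be illustrated as follows: after one pass over the image, `region_sum` evaluates any rectangle sum in O(1), and the difference of two adjacent sums is a two-rectangle Haar-like feature (the helper names and toy image are assumptions for illustration):

```python
def integral_image(img):
    """Summed-area table: ii[y][x] = sum of img[0..y][0..x]."""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        row = 0
        for x in range(w):
            row += img[y][x]
            ii[y][x] = row + (ii[y - 1][x] if y > 0 else 0)
    return ii

def region_sum(ii, x0, y0, x1, y1):
    """Sum of pixels in the rectangle [x0..x1] x [y0..y1] in O(1)."""
    s = ii[y1][x1]
    if x0 > 0:
        s -= ii[y1][x0 - 1]
    if y0 > 0:
        s -= ii[y0 - 1][x1]
    if x0 > 0 and y0 > 0:
        s += ii[y0 - 1][x0 - 1]
    return s

# A two-rectangle Haar-like feature: difference between adjacent regions.
img = [[1, 1, 5, 5],
       [1, 1, 5, 5]]
ii = integral_image(img)
feature = region_sum(ii, 2, 0, 3, 1) - region_sum(ii, 0, 0, 1, 1)
print(feature)  # -> 16
```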
One of the algorithms used for simplification of extracted
shapes is the Ramer-Douglas-Peucker algorithm [11]. It reduces the number of points in a curve that is approximated
by a series of points. Complex curves with many points
can be too computationally (time) demanding to process;
simplifying the curve with this algorithm allows faster and
simpler processing.
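A compact sketch of the Ramer-Douglas-Peucker recursion (a generic implementation, not the exact variant used in [11]): points farther than `epsilon` from the chord between the endpoints are kept, and both halves are simplified recursively.

```python
import math

def rdp(points, epsilon):
    """Ramer-Douglas-Peucker: simplify a polyline, keeping points whose
    perpendicular distance to the endpoint chord exceeds epsilon."""
    if len(points) < 3:
        return points[:]
    (x0, y0), (x1, y1) = points[0], points[-1]
    dx, dy = x1 - x0, y1 - y0
    norm = math.hypot(dx, dy) or 1.0
    # find the interior point farthest from the chord
    dmax, imax = 0.0, 0
    for i in range(1, len(points) - 1):
        px, py = points[i]
        d = abs(dy * (px - x0) - dx * (py - y0)) / norm
        if d > dmax:
            dmax, imax = d, i
    if dmax > epsilon:
        left = rdp(points[:imax + 1], epsilon)
        right = rdp(points[imax:], epsilon)
        return left[:-1] + right
    return [points[0], points[-1]]

poly = [(0, 0), (1, 0.1), (2, -0.1), (3, 5), (4, 6), (5, 7), (6, 8.1), (7, 9)]
print(rdp(poly, 1.0))  # -> [(0, 0), (2, -0.1), (3, 5), (7, 9)]
```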
Since the first intelligent vehicle was introduced in the
mid 1970s [12], computer vision systems have been developed
as one of its main sensor systems. Currently there are
many projects researching and developing computer
vision systems for road vehicles. Project goals are related to the
development of driver support systems, driver state recognition,
and road infrastructure recognition modules for autonomous
vehicles.
Computer vision systems in vehicles are part of autonomous vehicle parking systems, adaptive cruise control,
lane departure warning, driver fatigue detection, obstacle and
traffic sign detection, and others. Currently, only high-class
vehicles are equipped with several such systems, but they are
increasingly being included in middle-class vehicles as well.
A. Lane detection and departure warning
A lane detection system is not only a useful driver
aid, but also an important component of autonomous vehicles.
To solve the lane detection and departure warning problems,
the key element is to detect road lane boundaries (horizontal road
signalization and the width of the whole road). In the work described
in [13], lane detection is done in two parts. The first part includes
image enhancement and edge detection. The second part deals with
lane feature extraction and road shape estimation from the
processed image. A more detailed flow diagram of this
algorithm is shown in Fig. 4.
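The edge-detection step of such a pipeline can be sketched with plain Sobel gradient-magnitude thresholding (a generic operator, not necessarily the exact method of [13]):

```python
def sobel_edges(img, thresh=100):
    """Simple edge map: mark pixels whose Sobel gradient magnitude
    exceeds a threshold. Border pixels are left unmarked."""
    h, w = len(img), len(img[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (img[y - 1][x + 1] + 2 * img[y][x + 1] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y][x - 1] - img[y + 1][x - 1])
            gy = (img[y + 1][x - 1] + 2 * img[y + 1][x] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y - 1][x] - img[y - 1][x + 1])
            if (gx * gx + gy * gy) ** 0.5 > thresh:
                edges[y][x] = 1
    return edges

# A vertical step edge between dark (0) and bright (255) columns.
img = [[0, 0, 255, 255]] * 4
edges = sobel_edges(img)
print(edges[1])  # -> [0, 1, 1, 0]
```

The resulting binary edge map is the typical input to the Hough-based lane feature extraction described in the previous section.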
The authors of [14] developed an algorithm that extracts lane features based on estimation methods. Besides
determining the shape of the main (current) lane, the algorithm can
estimate the position and shape of the surrounding
lanes. The applied lane detection principle is based on two methods. On a straight lane, the Hough and simplified perspective
transformations mentioned in the previous section are used first.
These transformations enable detection of the center lane.
To detect surrounding lanes, further processing needs to be
done, including affine transformations and
other calculations that retrieve the relative vehicle position. If the
lane is a curve with a large radius, this method can also be used.
In the second method, for a curved lane with a small radius,
edge features are first extracted from the original image. After
extracting these features, a complete perspective transformation
is performed based on the camera's installation parameters. By
recognizing points on the center lane, curve estimation is
performed. The positions of other adjacent lanes are then calculated
from the known parameters of the current lane curve. Problems with
this approach are: (i) inability to detect lane markers due
to poor visibility or object interference (vehicles in front or
other objects blocking the camera's field of view); and (ii) errors
in estimating the current vehicle position due to the curvature of
the road [14].
B. Driver fatigue detection
The process of falling asleep at the steering wheel can be
characterized by a gradual decline in alertness from a normal
state due to monotonous driving conditions or other environmental factors. This diminished alertness leads to a state
of fuzzy consciousness followed by the onset of sleep [15].
If drivers in this state received a warning one
second before a possible accident situation, about 90% of such accidents could be prevented [9]. Various techniques have been
developed to help reduce the number of accidents caused
by driver fatigue. These driver fatigue detection techniques
can be divided into three groups: (i) methods that analyze the
driver's current state based on eyelid and head movement,
gaze, and other facial expressions; (ii) methods based on driver
performance, with a focus on the vehicle's behavior, including
position and headway; and (iii) methods based on a combination
of the driver's current state and driver performance [15].
In the work described in [9], two CCD cameras are used. The first
camera is fixed and is used to track the driver's head position.
Its output is a low-resolution image (e.g., 320×240),
which allows the use of fast image processing algorithms.
Once the driver's head position is known, processing of the second
camera's image can focus only on that region. The second camera
produces a high-resolution image (typically 720×576), which raises
the level of detail of the driver's head. This image is used in
further analysis to determine the driver's performance from the
mouth position, eyelid opening frequency, and diameter of the
opened eyelid [9]. Experimental results show that
at least 30 frames (high-resolution images) can be processed
per second. Such a frame rate represents a good basis for further
research and development of driver fatigue detection systems.
The review [15] briefly describes an approach to driver fatigue detection that
first detects the eye regions and then extracts information
about driver fatigue from them. To
achieve this, the driver's face needs to be found first. After finding
the face, the eye locations are estimated by finding the darkest
pixels on the face. When the eye locations are known, the software tries
to extract further information for driver fatigue detection. To
test the algorithm, different test drives were made. When
the driver's head was directed straight into the camera, no
false detections were observed. Once the head is turned by 30
degrees or more, the system starts to produce false detections.
Besides the described techniques, additional features
such as the blinking rate of the driver's eyes or the position of the eye's pupil
are processed to improve driver fatigue detection [9].
C. Vehicle detection
There are several sensors (active and passive) that can
be used for vehicle detection. A lot of research has been done
on vehicle detection using optical cameras, and it still
presents a challenge. The main problem of vehicle detection based
on video image processing is the significant variability in vehicle
and environment appearance (vehicle size and shape, color,
sunlight, snow, rain, dust, fog, etc.). Other problems arise
from the need for system robustness and the requirements for fast
processing [7].
Fig. 4. Flowchart of image processing for lane detection [13].
Most algorithms for vehicle detection in computer vision
systems are based on two steps: (i) finding all candidates in an
image that could be vehicles; and (ii) performing tests that
verify the presence of a vehicle. Finding all vehicle candidates
in an image is mostly done by three types of methods: (i)
knowledge-based methods; (ii) stereo-vision methods; and (iii)
motion-based methods.
Knowledge-based methods work on the presumption of
knowing specific parameters of the object (vehicle) being detected (vehicle position, shape, color, texture, shadow, lighting,
etc.). Symmetry can be used as one of the most important
vehicle properties: viewed from the front or rear, almost every
vehicle is nearly symmetrical, and this is exploited
in vehicle detection algorithms. Besides symmetry, the color of
the road or lane markers can also be used in vehicle detection.
Using color, moving objects (vehicles) can be segmented
from the background (road). Other knowledge-based methods
use the vehicle shadow, corners (the vehicle's rectangular shape), vertical/horizontal edges, texture, vehicle lights, etc. [7].
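A toy version of such a symmetry cue, scoring how mirror-symmetric an intensity patch is about its vertical axis (the scoring scheme is an illustrative assumption, not a method from [7]):

```python
def vertical_symmetry(img):
    """Score in [0, 1]: how mirror-symmetric a patch is about its
    vertical axis (1 = perfectly symmetric). Pixel values in [0, 255]."""
    h, w = len(img), len(img[0])
    diff = total = 0
    for y in range(h):
        for x in range(w // 2):
            diff += abs(img[y][x] - img[y][w - 1 - x])
            total += 255
    return 1.0 - diff / total if total else 1.0

patch = [[10, 50, 50, 10],
         [20, 90, 90, 20]]
print(vertical_symmetry(patch))  # -> 1.0
```

In a detector, windows with a high symmetry score would be promoted as vehicle-rear candidates for further verification.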
Stereo-vision methods are based on disparity maps or
inverse perspective mapping. In a disparity map, the positions of
pixels representing the same object in both camera images are
compared. The third component of the pixel vector (the z coordinate) can
be calculated from the change in pixel position. The inverse
perspective mapping method is based on transforming the
image and removing the perspective effect from it. Stereo-vision systems mostly use two cameras; systems
with more than two cameras are more accurate, but also more
computationally expensive. In [16], a computer vision
system with three cameras is described. It can detect objects
as small as 9 cm at distances of up to 110 m. Such
a system has a high level of accuracy compared to a
system with two cameras.
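For a rectified stereo pair, the disparity-to-depth relation is the classic pinhole formula z = f·B/d, with focal length f in pixels, baseline B, and disparity d in pixels. A minimal sketch (the function name and example numbers are illustrative):

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Classic rectified-stereo relation: z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

# With a 700 px focal length and a 0.5 m baseline, a 5 px disparity
# corresponds to a point 70 m away.
print(depth_from_disparity(700, 0.5, 5))  # -> 70.0
```

The formula also shows why wide baselines (or more cameras, as in [16]) help: at long range, depth changes by many meters per pixel of disparity.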
Motion-based methods use optical flow for image processing. In computer vision, optical flow represents the pattern of
apparent motion of objects, surfaces, and edges in the camera's
field of view. Optical flow calculated from an image is shown
in Fig. 5, where each arrow's length and direction represent the average
speed and direction (optical flow) of a subimage. Algorithms
for this method divide the image into smaller parts. Each part
is compared with previously captured frames to
calculate its relative direction and speed. With this
method, static and dynamic image parts can be separated and
processed further [17].
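A toy block-matching estimate of such per-block motion, minimizing the sum of absolute differences (SAD) over a small search window (real systems use pyramidal or gradient-based flow; the names and parameters here are illustrative):

```python
def block_flow(prev, curr, bx, by, size, radius):
    """Estimate the (dx, dy) motion of a size x size block at (bx, by)
    by minimising the sum of absolute differences over a search window."""
    def sad(dx, dy):
        s = 0
        for y in range(size):
            for x in range(size):
                s += abs(prev[by + y][bx + x] - curr[by + y + dy][bx + x + dx])
        return s

    best, best_cost = (0, 0), sad(0, 0)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            c = sad(dx, dy)
            if c < best_cost:
                best_cost, best = c, (dx, dy)
    return best

# A bright square at (2, 2) in `prev` moves one pixel right in `curr`.
prev = [[0] * 8 for _ in range(8)]
curr = [[0] * 8 for _ in range(8)]
for y in range(2, 4):
    for x in range(2, 4):
        prev[y][x] = 255
        curr[y][x + 1] = 255
print(block_flow(prev, curr, 2, 2, 2, 2))  # -> (1, 0)
```

Repeating this for every block yields the arrow field of Fig. 5; blocks whose motion differs from the background's are the moving-object candidates.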
Once the optical flow of the sub-images is calculated, moving
object (vehicle) candidates can be recognized. A verification
method decides which candidates will be accepted as real
objects. Verification is done by calibrating a speed threshold to
the most effective value; this filtering reduces false detections.
For further verification, methods also used in
knowledge-based vehicle detection (appearance-
and template-based) are applied. These kinds of methods recognize
vehicle features such as edges, texture, and shadow.
Fig. 5. Original image (a) and computed optical flow (b) [7].
D. Real time traffic sign detection
Traffic sign detection is performed in two phases. The first
phase is usually done by color segmentation. Since traffic signs
have edges of a specific color, color segmentation represents a
good basis for easier extraction of traffic signs from the image
background. In [10], color-based segmentation is done in
two steps: color quantization followed by region-of-interest
analysis. The RGB color space is usually used as the color model,
although YIQ, YUV, L*a*b and CIE can be used as well. After
color segmentation, shape-based segmentation is performed
for the final detection of circles, ellipses, and triangles. The second
phase is traffic sign classification, where various methods such as
template matching, linear discriminant analysis, support vector
machines (SVM), artificial neural networks, and other machine
learning methods can be used [10]. The AdaBoost [18] and SURF [19]
algorithms are also used for traffic sign detection.
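A toy color-thresholding pass for red sign borders; the thresholds are illustrative assumptions, not the quantization scheme of [10]:

```python
def red_mask(img_rgb, r_min=120, dominance=1.5):
    """Binary mask of 'red enough' pixels: R above r_min and clearly
    dominating both G and B (thresholds are illustrative only)."""
    h, w = len(img_rgb), len(img_rgb[0])
    mask = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            r, g, b = img_rgb[y][x]
            if r >= r_min and r >= dominance * max(g, b, 1):
                mask[y][x] = 1
    return mask

img = [[(200, 30, 30), (90, 90, 90)],
       [(180, 60, 50), (10, 10, 200)]]
print(red_mask(img))  # -> [[1, 0], [1, 0]]
```

Connected regions of the mask would then be passed to the shape-based segmentation stage described above.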
The traffic sign detection approach described in [11] performs
recognition in two main steps. The first step is a traffic
sign detection algorithm whose goal is to localize a traffic sign
in the image with minimum noise. For detection, preprocessing with color-based thresholding is used first. This
method is fast and often suitable for use in real-time systems.
After preprocessing, the traffic sign shape is approximated
with the Ramer-Douglas-Peucker algorithm. The localized traffic
sign image is then preprocessed and converted
to a binary image with a resolution of 30×30 pixels. The second step
is traffic sign recognition based on an SVM, often a
C-SVM with a linear kernel.
The traffic sign detection and recognition system
described in [20] is based on the analysis of a video sequence
rather than a single image. The authors use a cascade of
boosted classifiers over Haar-like features (usually pixel intensity
values) to detect traffic signs. The OpenCV framework is
used to ease the implementation. The main disadvantages of
traffic sign detection based on a boosted classifier cascade over
Haar-like features are: (i) precision lower than
required; and (ii) poor traffic sign localization
accuracy, since the estimated position in the image often deviates
from the true location. The framework for detecting, tracking, and
recognizing traffic signs proposed in [20] is shown in Fig. 6.
Fig. 6. Flowchart of sign detection and recognition algorithm [20].
Since traffic sign detection is performed on a sequence of
images obtained from a moving vehicle, a region trajectory
where the signs are most probably located can be constructed.
The detection system learns from each detection. If a significant
number of learning samples (traffic sign images) is used, the
system can achieve a precision of up to 54% with a recall of 96%.
The main constraint of computer vision systems in vehicles is
their limitation by hardware specifications. Processing an
image through a large number of consecutive methods for
data recognition and extraction can be very CPU demanding.
A trade-off between the quality and cost of every step in image
processing should be made in order to optimize the system as much
as possible. The two most widely used approaches in image processing
are extraction of basic image features using the Hough
transform and learning classifiers for whole-object
recognition. Implementation of classifiers such as those based
on Haar-like features allows algorithms to use fewer
system resources; the drawback of this method is its lower
accuracy. In contrast, the Hough method is much more
precise, but uses more system resources. Methods that perform
complex calculations on all image pixels should be avoided and
used only where they are necessary.
One of the factors that affects the accuracy of vehicle detection,
tracking, and recognition of objects in an outdoor environment
is environmental conditions. Weather conditions such as
snow, rain, and fog can significantly reduce the accuracy of a computer
vision system. Poor quality of horizontal road signalization and an
obstructed field of view due to various obstacles (often a
problem with traffic signs) present an additional problem for
vision-based systems. Significant robustness is required
to overcome this problem. To reduce the influence of these
factors on the system, new cameras and appropriate
image processing algorithms still need to be developed.
Computer vision software used for vehicle detection and
object recognition consists of many subsystems. Incoming
images from one or several cameras need to be preprocessed
into the desired format. The obtained image is then further processed
for high-level object detection and recognition.
Detecting an object in a scene can be done by multiple
methods (e.g., the Hough and Haar-like-feature methods). Using
these methods alone is not enough to make a system reliable
for object detection and recognition. Under bad environmental
conditions, the image can contain various kinds of noise (snow, fog,
and rain in the image). Besides environmental conditions, high
vehicle velocity can also affect the image (blur).
Current systems used in more expensive vehicles
are parts of driver support systems based on simplified image
processing; examples are lane detection and detection of speed
limit signs. Problems related to robustness to different
outdoor environmental conditions and to real-time processing
of high-level feature recognition still exist.
The authors wish to thank Nikola Bakarić for his valuable
comments during the writing of this paper. This research has been
supported by the project VISTA, financed from the Innovation
Investment Fund Grant Scheme, and by EU COST action TU1102.
[1] C. Hughes, R. O'Malley, D. O'Cualain, M. Glavin, and E. Jones, New Trends and Developments in Automotive System Engineering. Intech, 2011, ch. Trends towards automotive electronic vision systems for mitigation of accidents in safety critical situations, pp. 493–512.
[2] E. Omerzo, S. Kos, A. Veledar, K. Šakić Pokrivač, L. Pilat, M. Pavišić, A. Belošević, and Š. V. Njegovan, Deaths in Traffic Accidents, 2011. Croatian Bureau of Statistics, 2012.
[3] National Highway Traffic Safety Administration, "National motor vehicle crash causation survey," U.S. Department of Transportation, Tech. Rep., Apr 2008.
[4] S. Ribarić, J. Lovrenčić, and N. Pavešić, "A neural-network-based system for monitoring driver fatigue," in MELECON 2010 15th IEEE Mediterranean Electrotechnical Conference, 2010, pp. 1356–1361.
[5] M. Bertozzi, A. Broggi, M. Cellario, A. Fascioli, P. Lombardi, and M. Porta, "Artificial vision in road vehicles," in Proceedings of the IEEE, vol. 90, no. 7, 2002, pp. 1258–1271.
[6] R. Szeliski, Computer Vision: Algorithms and Applications. Springer-Verlag London Limited, 2011.
[7] Z. Sun, G. Bebis, and R. Miller, "On-road vehicle detection: A review," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 5, pp. 694–711, May 2006.
[8] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," in Computer Vision and Pattern Recognition, vol. 1, 2001, pp. I-511–I-518.
[9] L. Lingling, C. Yangzhou, and L. Zhenlong, "Yawning detection for monitoring driver fatigue based on two cameras," in Proceedings of the 12th International IEEE Conference on Intelligent Transportation Systems, St. Louis, Missouri, USA, 4–7 Oct 2009, pp. 12–17.
[10] C. Long, L. Qingquan, L. Ming, and M. Qingzhou, "Traffic sign detection and recognition for intelligent vehicle," in Proceedings of the IEEE Intelligent Vehicles Symposium, 2011, pp. 908–913.
[11] D. Soendoro and I. Supriana, "Traffic sign recognition with color-based method, shape-arc estimation and SVM," in Proceedings of the International Conference on Electrical Engineering and Informatics, 2011.
[12] S. Tsugawa, "Vision-based vehicles in Japan: the machine vision systems and driving control systems," IEEE Transactions on Industrial Electronics, vol. 41, no. 4, pp. 398–405, August 1994.
[13] T. Quoc-Bao and L. Byung-Ryong, "New lane detection algorithm for autonomous vehicles using computer vision," in Proceedings of the International Conference on Control, Automation and Systems, ICCAS 2008, Seoul, Korea, 14–17 Oct 2008, pp. 1208–1213.
[14] Y. Jiang, F. Gao, and G. Xu, "Computer vision-based multiple-lane detection on straight road and in a curve," in Proceedings of the International Conference on Image Analysis and Signal Processing, Huaqiao, China, 9–11 April 2010, pp. 114–117.
[15] W. Qiong, W. Huan, Z. Chunxia, and Y. Jingyu, "Driver fatigue detection technology in active safety systems," in Proceedings of the 2011 International Conference RSETE, 2011, pp. 3097–3100.
[16] T. Williamson and C. Thorpe, "A trinocular stereo system for highway obstacle detection," in 1999 International Conference on Robotics and Automation (ICRA '99), 1999.
[17] M. Bertozzi, A. Broggi, A. Fascioli, and R. Fascioli, "Stereo inverse perspective mapping: Theory and applications," Image and Vision Computing, vol. 16, pp. 585–590, 1998.
[18] X. Baro, S. Escalera, J. Vitria, O. Pujol, and P. Radeva, "Traffic sign recognition using evolutionary AdaBoost detection and forest-ECOC classification," IEEE Transactions on Intelligent Transportation Systems, vol. 10, no. 1, pp. 113–126, 2009.
[19] D. Ding, J. Yoon, and C. Lee, "Traffic sign detection and identification using SURF algorithm and GPGPU," in SoC Design Conference (ISOCC), 2012 International, 2012, pp. 506–508.
[20] S. Šegvić, K. Brkić, Z. Kalafatić, and A. Pinz, "Exploiting temporal and spatial constraints in traffic sign detection from a moving vehicle," Machine Vision and Applications, pp. 1–17, 2011. [Online]. Available:
Global Localization Based on 3D Planar Surface
Segments Detected by a 3D Camera
Robert Cupec, Emmanuel Karlo Nyarko, Damir Filko,
Andrej Kitanov, Ivan Petrović
Faculty of Electrical Engineering
J. J. Strossmayer University of Osijek
Osijek, Croatia
[email protected]
Faculty of Electrical Engineering and Computing
University of Zagreb
Zagreb, Croatia
[email protected]
Abstract—Global localization of a mobile robot using planar
surface segments extracted from depth images is considered. The
robot’s environment is represented by a topological map
consisting of local models, each representing a particular location
modeled by a set of planar surface segments. The discussed
localization approach segments a depth image acquired by a 3D
camera into planar surface segments which are then matched to
model surface segments. The robot pose is estimated by the
Extended Kalman Filter using surface segment pairs as
measurements. The reliability and accuracy of the considered
approach are experimentally evaluated using a mobile robot
equipped with a Microsoft Kinect sensor.
Keywords—global localization; Extended Kalman Filter
The ability to determine its own location is vital to any
mobile machine that is expected to execute tasks involving
autonomous navigation in a particular environment.
The basic robot localization problem can be defined as
determining the robot pose relative to a reference coordinate
system defined in its environment. This problem can be
divided into two sub-tasks: initial global localization and local
pose tracking. Global localization is the ability to determine
the robot's pose in an a-priori or previously learned map, given
no other information than that the robot is somewhere on the
map. Local pose tracking, on the other hand, compensates for
small, incremental errors in the robot's odometry given the
initial robot pose, thereby keeping the robot localized
over time. In this paper, global localization is considered.
There are two main classes of vision-based global
localization approaches, appearance-based approaches and
feature-based approaches.
In appearance-based approaches, each location in a robot's
operating environment is represented by a camera image.
Robot localization is performed by matching descriptors
assigned to each of these images to the descriptor computed
from the current camera image. The location corresponding to
the image which is most similar to the currently acquired
image according to a particular descriptor similarity measure
is returned by the localization algorithm as the solution. The
appearance-based techniques have recently been very
intensively explored and some impressive results have been
reported [1], [2].
In feature-based approaches, the environment is modeled
by a set of 3D geometric features such as point clouds [3],
points with assigned local descriptors [4], line segments [5],
[6], surface patches [7] or planar surface segments [8], [9],
[10], where all features have their pose relative to a local or a
global coordinate system defined. Localization is performed
by searching for a set of model features with similar properties
and geometric arrangement to that of the set of features
currently detected by the applied sensor. The robot pose is
then obtained by registration of these two feature sets, i.e. by
determining the rigid body transformation which maps one
feature set onto the other.
An advantage of the methods based on registration of sets
of geometric features over the appearance-based techniques is
that they provide an accurately estimated robot pose relative to the
environment, which can be used directly for visual odometry or
by a SLAM system.
In this paper, we consider a feature-based approach which
relies on an active 3D perception sensor. An advantage of
using an active 3D sensor compared to systems which
rely on a 'standard' RGB camera is its lower sensitivity to
changes in lighting conditions. A common approach for
registration of 3D point clouds obtained by 3D cameras is
Iterative Closest Point (ICP) [11], [12], [13]. Since this
method requires a relatively accurate initial pose estimate, it
can be used for local pose tracking and visual odometry, but it
is not appropriate for global localization. Furthermore, ICP is
not suitable for applications where significant changes in the
scene are expected. Hence, we use a multi hypothesis
Extended Kalman Filter (EKF).
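Each pose hypothesis can be refined with the standard EKF measurement update. A generic numpy sketch with a toy direct-observation model (the paper's actual measurement model uses surface segment pairs, which is omitted here; the helper name is an assumption):

```python
import numpy as np

def ekf_update(x, P, z, h, H, R):
    """One EKF measurement update: state x, covariance P, measurement z,
    measurement function h, its Jacobian H, measurement noise R."""
    y = z - h(x)                     # innovation
    S = H @ P @ H.T + R              # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)   # Kalman gain
    x_new = x + K @ y
    P_new = (np.eye(len(x)) - K @ H) @ P
    return x_new, P_new

# Toy example: directly observe a 2D position with small noise.
x = np.zeros(2)
P = np.eye(2) * 10.0
H = np.eye(2)
z = np.array([1.0, 2.0])
R = np.eye(2) * 0.1
x, P = ekf_update(x, P, z, lambda s: H @ s, H, R)
print(np.round(x, 2))
```

With a large prior uncertainty and a precise measurement, the updated state moves almost all the way to the measurement, mirroring how a surface-pair measurement sharpens a weak initial pose hypothesis.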
Localization methods based on registration of 3D planar
surface segments extracted from depth images obtained by a
3D sensor are proposed in [10] and [14]. In [14], a highly
efficient method for registration of planar surface segments is
proposed and its application for pose tracking is considered. In
this paper, the approach proposed in [14] is adapted for global
localization and its performance is analyzed. The environment
model which is used for localization is a topological map
consisting of local metric models. Each local model consists
of planar surface segments represented in the local model
reference frame. Such a map can be obtained by driving a
robot with a camera mounted on it along a path the robot
would follow while executing its regular tasks.
The rest of the paper is structured as follows. In Section II,
the global localization problem is defined and a method for
registration of planar surface segments is described which can
be used for global localization. An experimental analysis of
the proposed approach is given in Section III. Finally, the
paper is concluded with Section IV.
The global localization problem considered in this paper
can be formulated as follows. Given an environment map
consisting of local models M1, M2, ..., MN representing
particular locations in the considered environment together
with spatial relations between them and a camera image
acquired somewhere in this environment, the goal is to
identify the camera pose at which the image is acquired. The
term 'image' here denotes a depth image or a point cloud
acquired by a 3D camera such as the Microsoft Kinect sensor.
Let SM,i be the reference frame assigned to a local model Mi.
The localization method described in this section returns the
index i of the local model Mi representing the current robot
location together with the pose of the camera reference frame
SC relative to SM,i. The camera pose can be represented by
vector w = [φᵀ tᵀ]ᵀ, where φ is a 3-component vector
describing the orientation and t is a 3-component vector
describing the position of SC relative to SM,i. Throughout the
paper, the symbol R(φ) is used to denote the rotation matrix
corresponding to the orientation vector φ.
The basic structure of the proposed approach is the
standard feature-based localization scheme consisting of the
following steps:
feature detection,
feature matching,
hypothesis generation,
selection of the best hypothesis.
Features used by the considered approach are planar
surface segments obtained by segmentation of a depth image.
These features are common in indoor scenes, thus making our
approach particularly suited for this type of environment.
The surface registration algorithm considered in this paper
is basically the same as the one proposed in [14]. The only
difference is that instead of implementing visual odometry by
registration between the currently acquired image and the
previous image, global localization is achieved by registration
between the currently acquired image and every local model
Mi in the map, where the initial pose estimate is set to the zero
vector with a high uncertainty of the position and orientation.
The proposed algorithm returns the pose hypothesis with the
highest consensus measure [14] as the final result.
A. Detection and Representation of Planar Surface Segments
Depth images acquired by a 3D camera are segmented into
sets of 3D points representing approximately planar surface
segments using a split-and-merge algorithm similar to the one in [15],
which consists of an iterative Delaunay triangulation method
followed by region merging. Instead of the region growing
approach used in the merging stage of the algorithm proposed
in [15], we applied the hierarchical approach proposed in [16],
which produces less fragmented surfaces while keeping
relevant details. Combining these two approaches achieves fast
detection of dominant planar surfaces. The result
is a segmentation of a depth image into connected sets of
approximately coplanar 3D points, each representing a segment
of a surface in the scene captured by the camera. An example
of image segmentation into planar surface segments is shown in
Fig. 1.
The parameters of the plane supporting a surface segment
are determined by least-squares fitting of a plane to the
supporting points of the segment. Each surface segment is
assigned a reference frame with the origin in the centroid tF of
the supporting point set and the z-axis parallel to the supporting
plane normal. The orientations of the x and y-axes are defined by
the eigenvectors of the covariance matrix Σp computed from
the positions of the supporting points of the considered surface
segment within its supporting plane. The purpose of assigning
reference frames to surface segments is to provide a
framework for the surface segment matching and EKF-based pose
estimation explained in the following.
Fig. 1. An example of image segmentation into planar surface segments: (a) RGB
image; (b) depth image obtained by Kinect, where darker pixels represent points
closer to the camera, while black points represent points of undefined depth; (c)
extracted planar surface segments delineated by green lines and (d) 3D model
consisting of dominant planar surface segments.
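The least-squares plane fit described above can be sketched via the eigendecomposition of the supporting-point covariance: the eigenvector with the smallest eigenvalue gives the plane normal (the z-axis of the segment frame SF), while the other two span the plane. A generic sketch that omits the paper's uncertainty model:

```python
import numpy as np

def fit_plane(points):
    """Least-squares plane through a 3D point set: returns (centroid, normal).
    The normal is the eigenvector of the point covariance matrix with the
    smallest eigenvalue; the other two eigenvectors span the plane."""
    pts = np.asarray(points, dtype=float)
    centroid = pts.mean(axis=0)
    cov = np.cov((pts - centroid).T)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    normal = eigvecs[:, 0]
    return centroid, normal

# Points lying on the plane z = 0.
pts = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0), (0.5, 0.5, 0)]
c, n = fit_plane(pts)
print(np.round(np.abs(n), 3))  # -> [0. 0. 1.]
```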
variable r is computed as the uncertainty of the surface
segment centroid tF in the direction of the segment normal n,
computed by
Let the true plane be defined by the equation

Fn^T ⋅ Fp = Fρ ,                                                  (1)

where Fn is the unit normal of the plane represented in the
surface segment reference frame SF, Fρ is the distance of the
plane from the origin of SF and Fp is an arbitrary point
represented in SF. In an ideal case, where the measured plane
is identical to the true plane, the true plane normal is identical
to the z-axis of SF, which means that Fn = [0, 0, 1]^T, while
Fρ = 0. In a general case, however, the true plane normal
deviates from the z-axis of SF, and this deviation is described
by the random variables sx and sy, representing the deviation in
the directions of the x and y-axis of SF respectively, as illustrated
in Fig. 2 for the x direction. The unit normal vector of the true
plane can then be written as

Fn = [sx, sy, 1]^T / sqrt(sx^2 + sy^2 + 1) .                      (2)

Furthermore, let the random variable r represent the distance
of the true plane from the origin of SF, i.e.

Fρ = r .                                                          (3)

The uncertainty of the supporting plane parameters can be
described by the disturbance vector q = [sx, sy, r]^T. We use a
Gaussian uncertainty model, where the disturbance vector q is
assumed to be normally distributed with zero mean and
covariance matrix Σq. Covariance matrix Σq is a diagonal
matrix with the variances σ²sx, σ²sy and σ²r on its diagonal,
describing the uncertainties of the components sx, sy and r
respectively. These variances are computed from the
uncertainties of the supporting point positions, which are
determined using a triangulation uncertainty model analogous
to the one proposed in [17]. Let this uncertainty be represented
by the 3×3 covariance matrix ΣC(p) assigned to each point
position vector p obtained by a 3D camera. In order to achieve
high computational efficiency, the centroid tF of a surface
segment is used as the representative supporting point and it is
assumed that the uncertainties of all supporting points of this
surface segment are similar to the centroid uncertainty. The
variance σ²r describing the uncertainty of the disturbance r is
then computed by

σ²r = Fn^T ⋅ ΣC(tF) ⋅ Fn .                                        (4)

The variances σ²sx and σ²sy describing the uncertainty of the
segment plane normal are estimated using a simple model.
The considered surface segment is represented by a flat 3D
ellipsoid centered in the segment reference frame SF, as
illustrated in Fig. 3. Assuming that the orientation of the
surface segment is computed from four points at the ellipse
perimeter which lie on the axes x and y of SF, as illustrated in
Fig. 3, the uncertainty of the surface segment normal can be
computed from the position uncertainties of these four points.
According to this model, the variances σ²sx and σ²sy can be
computed by

σ²sx ≈ σ²r / (λ1 + σ²r) ,    σ²sy ≈ σ²r / (λ2 + σ²r) ,            (5)

where λ1 and λ2 are the two largest eigenvalues of the
covariance matrix Σp. Alternatively, a more elaborate
uncertainty model can be used, such as those proposed in [6]
and [10].
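These variance computations can be condensed into a few lines. The following is a minimal sketch (ours, not the authors' code): it evaluates σ²r as the quadratic form of the unit normal over the centroid covariance, and then derives the normal-deviation variances from the two largest eigenvalues of the segment point covariance; the saturation form of the latter is our reading of the four-point model, and all names are illustrative.

```python
import math

def plane_uncertainty(n, sigma_c, lam1, lam2):
    """Sketch of the plane-parameter uncertainty model described above.
    n          -- unit normal of the segment plane, e.g. [0.0, 0.0, 1.0]
    sigma_c    -- 3x3 covariance of the segment centroid (nested lists)
    lam1, lam2 -- two largest eigenvalues of the point covariance Sigma_p
    Returns (var_r, var_sx, var_sy)."""
    # var_r = n^T * Sigma_C(t_F) * n
    tmp = [sum(sigma_c[i][j] * n[j] for j in range(3)) for i in range(3)]
    var_r = sum(n[i] * tmp[i] for i in range(3))
    # normal-deviation variances from the four-point ellipse model
    var_sx = var_r / (lam1 + var_r)
    var_sy = var_r / (lam2 + var_r)
    return var_r, var_sx, var_sy
```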
Finally, a scene surface segment is denoted in the
following by the symbol F associated with the quadruplet

F = ( CRF, CtF, Σq, Σp ) ,                                        (6)

where CRF and CtF are respectively the rotation matrix and
translation vector defining the pose of SF relative to the
camera coordinate system SC. Analogously, a local model
surface segment is represented by

F′ = ( MRF′, MtF′, Σq′, Σp′ ) ,                                   (7)

where index M denotes the local model reference frame.
B. Initial Feature Matching
The pose estimation process starts by forming a set of
surface segment pairs (F, F’), where F is a planar surface
Fig. 2. Displacement of the true plane from the measured plane.
CCVW 2013
Poster Session
Fig. 3. Plane uncertainty model.
Proceedings of the Croatian Computer Vision Workshop, Year 1
September 19, 2013, Zagreb, Croatia
segment detected in the currently acquired image and F’ is a
local model surface segment. A pair (F, F’) represents a
correct correspondence if both F and F’ represent the same
surface in the robot’s environment. The surface segments
detected in the currently acquired image are transformed into
the local model reference frame using an initial estimate of the
camera pose relative to this frame and every possible pair
(F, F’) of surface segments is evaluated according to the
coplanarity and overlap criteria explained in [14]. These two
criteria take into account the uncertainty of the assumed
camera pose. In the experiments reported in this paper, the
initial robot pose estimate used for feature matching is set to
zero vector with the uncertainty described by a diagonal
covariance matrix Σw,match = diag([σφ², σt², σt²]), where
σφ = 20° and σt = 1 m describe the uncertainty of the robot
orientation and position in xy-plane of the world reference
frame respectively. This uncertainty is propagated to the
camera reference frame, taking into account the uncertainty of
the camera inclination due to the uneven floor. The deviation
of the floor surface from a perfectly flat surface is modeled by
zero-mean Gaussian noise with standard deviation σf = 5 mm.
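The actual coplanarity and overlap criteria of [14] propagate the full pose covariance; as a rough illustration of the idea, the snippet below is a hypothetical coarse gate that accepts a segment pair only if the two planes, both expressed in the local model frame, could plausibly coincide given fixed bounds in the spirit of σφ = 20° and σt = 1 m. All names and thresholds are illustrative.

```python
import math

def coarse_pair_gate(n_scene, rho_scene, n_model, rho_model,
                     max_angle_deg=20.0, max_dist=1.0):
    """Hypothetical coarse gate for a candidate pair (F, F'): accept when
    the plane normals and plane distances agree within loose bounds.
    Not the uncertainty-aware criteria of [14]."""
    dot = sum(a * b for a, b in zip(n_scene, n_model))
    angle = math.degrees(math.acos(max(-1.0, min(1.0, dot))))
    return angle <= max_angle_deg and abs(rho_scene - rho_model) <= max_dist
```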
C. Hypothesis Generation
Given a sequence of correct pairs (F, F’), the camera pose
relative to a local model reference frame can be computed
using the EKF approach. Starting from the initial pose
estimate, the pose information is corrected using the
information provided by each pair (F, F’) in the sequence.
After each correction step, the pose uncertainty is reduced.
This general approach is applied e.g. in [18] and [5]. Some
specific details related to our implementation are given in the
following.
Let (F, F′) be a pair of corresponding planar surface
segments. Given a vector F′p representing the position of a
point relative to SF′, the same point is represented in SF by

Fp = CRF^T ( R^T(φ) ( MRF′ F′p + MtF′ − t ) − CtF ) ,             (8)

where w = [φ^T t^T]^T is the estimated camera pose relative to
a local model reference frame. By substituting (8) into (1) we
obtain

Fn′^T ⋅ F′p = Fρ′ ,                                               (9)

where

Fn′ = MRF′^T R(φ) CRF Fn ,                                        (10)

Fρ′ = Fρ + Fn^T ⋅ CRF^T ( CtF + R^T(φ) ( t − MtF′ ) ) .           (11)

Vector Fn′ and value Fρ′ are the normal of F represented in
SF′ and the distance of the plane supporting F from the origin
of SF′ respectively. The deviation of the plane supporting the
scene surface segment from the plane containing the local
model surface segment can be described by the difference
between the plane normals and their distances from the origin
of SF′. Assuming that F and F′ represent the same surface, the
following equations hold:

Fn′ = F′n′ ,                                                      (12)

Fρ′ = F′ρ′ ,                                                      (13)

where F′n′ and F′ρ′ are the parameters of the plane supporting
F′ represented in the reference frame SF′. Since Fn′ and F′n′ are
unit vectors with two degrees of freedom, it is appropriate to
compare only two of their components. We choose the first two
components to formulate the coplanarity constraint

[ (F′n′ − Fn′)x, (F′n′ − Fn′)y, Fρ′ − F′ρ′ ]^T = 0 ,              (14)

where (·)x and (·)y denote the first two vector components.
Note that the vector on the left side of equation (14) is actually
a function of the disturbance vectors q and q′ representing the
uncertainty of the parameters of the planes supporting F and
F′ respectively, the pose w and the estimated poses of F and F′
relative to SC and SM respectively. Hence, (14) can be written as

h ( F, F′, w, q, q′ ) = 0 .                                       (15)
Equation (15) represents the measurement equation from
which EKF pose update equations can be formulated using the
general approach described in [18].
This EKF-based procedure will give a correct pose
estimate assuming that the sequence of surface segment pairs
used in the procedure represent correct correspondences. Since
the initial correspondence set usually contains many false
pairs, a number of pose hypotheses are generated from
different pair sequences and the most probable one is selected
as the final solution. We use the efficient hypothesis
generation method described in [14]. The result of the
described pose estimation procedure is a set of pose
hypotheses ranked according to a consensus measure
explained in [14]. Each hypothesis consists of the index of a
local model to which the current depth image is matched and
the camera pose relative to the reference frame of this model.
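A single correction step of this kind follows the standard EKF update for an implicit measurement h(...) = 0. The sketch below is self-contained and illustrative (not the authors' implementation): it assumes the residual h and its Jacobian H with respect to the pose have already been evaluated, with R the measurement noise covariance obtained by propagating the plane-parameter uncertainties through h; all names are ours.

```python
def mat_mul(A, B):
    """Multiply two matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def mat_add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def mat_t(A):
    return [list(row) for row in zip(*A)]

def mat_inv(M):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    n = len(M)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(M)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        div = aug[col][col]
        aug[col] = [v / div for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                f = aug[r][col]
                aug[r] = [v - f * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def ekf_correct(w, P, h_val, H, R):
    """One EKF correction step for the implicit measurement h(...) = 0.
    w: pose estimate, P: its covariance, h_val: residual at the current
    estimate, H: Jacobian of h w.r.t. w, R: measurement noise covariance."""
    S = mat_add(mat_mul(mat_mul(H, P), mat_t(H)), R)
    K = mat_mul(mat_mul(P, mat_t(H)), mat_inv(S))
    w_new = [wi + sum(K[i][j] * (0.0 - h_val[j]) for j in range(len(h_val)))
             for i, wi in enumerate(w)]
    KH = mat_mul(K, H)
    I_KH = [[(1.0 if i == j else 0.0) - KH[i][j] for j in range(len(w))]
            for i in range(len(w))]
    return w_new, mat_mul(I_KH, P)
```

After each such correction the pose covariance shrinks, which is exactly the behaviour described above.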
In this section, the results of an experimental evaluation of
the proposed approach are reported. We implemented our
system in the C++ programming language using the OpenCV
library [19] and executed it on a 3.40 GHz Intel Pentium 4
Dual Core CPU with 2 GB of RAM. The algorithm is experimentally
evaluated using 3D data provided by a Microsoft Kinect
sensor mounted on a wheeled mobile robot Pioneer 3DX also
equipped with a laser range finder SICK LMS-200. For the
purpose of this experiment, two datasets were generated by
manually driving the mobile robot on two different occasions
through a section of a previously mapped indoor environment
of the Department of Control and Computer Engineering,
Faculty of Electrical Engineering and Computing (FER),
University of Zagreb. The original depth images of resolution
640 × 480 were subsampled to 320 × 240. Fig. 4 shows the
previously mapped indoor environment generated with the aid
of SLAM using data from the laser range finder, together with
the trajectory of the robot when generating the first dataset.

Fig. 4. The map of the Department of Control and Computer Engineering
(FER, Zagreb) obtained using SLAM and data from a laser range finder, and
the trajectory of the wheeled mobile robot while generating the images used
in creating the topological map.
The first dataset consists of 444 RGB-D images recorded
along with the odometry data. The corresponding ground truth
data as to the exact pose of the robot in the global coordinate
frame of the map was determined using laser data and Monte
Carlo localization. These images were used to create the
environment model – a database of local metric models with
topological links. This environment model or topological map
consisted of 142 local models, generated such that the local
model of the first image was automatically added to the map
and every consecutive image or local model added to the map
satisfied at least one of the following conditions: (1) the
translational distance between the candidate image and the
latest added local model in the map was at least 0.5 m or (2)
the difference in orientation between the candidate image and
the latest added local model in the map was at least 15°.
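The local-model selection rule above can be expressed compactly. This is an illustrative sketch (ours, not the authors' code) under the assumption that robot poses are planar triples (x, y, heading in degrees):

```python
import math

def add_local_model(last_pose, cand_pose, min_dist=0.5, min_angle_deg=15.0):
    """Keyframe rule used to build the topological map: a new local model
    is added when the candidate image is at least 0.5 m or 15 degrees away
    from the latest added local model."""
    dx = cand_pose[0] - last_pose[0]
    dy = cand_pose[1] - last_pose[1]
    # wrap the heading difference into [-180, 180) before taking |.|
    dang = abs((cand_pose[2] - last_pose[2] + 180.0) % 360.0 - 180.0)
    return math.hypot(dx, dy) >= min_dist or dang >= min_angle_deg
```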
The trajectory of the robot during the generation of the
second sequence was not the same as for the first sequence,
but it covered approximately the same area. With the aid of
odometry information from the robot encoders, the second
sequence was generated by recording RGB-D images at every
0.5 m or 5° difference in orientation between consecutive
images. The corresponding ground truth data was determined
using laser data and Monte Carlo localization and recorded as
well. This second dataset consisted of a total of 348 images.
The global localization procedure was performed for all
the images in both datasets with the topological map serving
as the environment model. Thus, all 792 images were tested,
among which 142 were used for model building. As explained
in Section II, each generated hypothesis provides the index of
the local model from the topological map as well as the
relative robot pose corresponding to the test image with
respect to the local model. This robot pose is referred to herein
as a calculated pose. By comparing the calculated pose of the
test image to the corresponding ground truth data, the
accuracy of the proposed approach can be determined. For
each test image, a correct hypothesis is considered to be one
where: (1) the translational distance between the calculated
pose and the ground truth data is at most 0.2 m and (2) the
absolute difference in orientation between the calculated pose
and the ground truth is at most 2°. Using this criterion, the
effectiveness of the proposed global localization method can
be assessed not only on the basis of the accuracy of the
solutions, but also on the minimum number of hypotheses that
need to be generated in order to obtain at least one correct
hypothesis. Examples of images from both datasets are given
in Fig. 5.

Fig. 5. Examples of images used in the global localization experiment.

An overview of the results of the initial global
localization experiment is given in Table I. Of the 792 test
images, the proposed approach was not able to generate any
hypothesis in 30 cases. In all 30 cases, the scenes were
deficient in the information needed to estimate all six degrees
of freedom (DoF) of the robot's pose. Such situations normally
arise when the camera is directed towards a wall or a door at a
close distance, e.g. while the robot is turning around.

TABLE I. NUMBER OF TEST IMAGES PER OUTCOME
No hypothesis: 30
No correct hypothesis: 44
Correct hypothesis: 718

In 44 cases, no correct hypothesis was generated by the
proposed approach. There were two main reasons for such a
situation: (1) the topological map did not contain a local
model covering the scene of the test image; (2) the existence
of repetitive structures in the indoor environment. An example
of such a situation is the pair of scenes shown in the last
column of Fig. 5, where one can notice the similar repeating
doorways on the left side of the corridor.
The accuracy of the proposed approach is determined
using the 718 images with a correct hypothesis. The results
are shown statistically in Table II as well as in Fig. 6, in terms
of the absolute error in position and orientation between the
correct pose and the corresponding ground truth pose of the
test sequence images, as well as the index of the first correct
hypothesis. The error bounds, as well as the number of highest
ranked hypotheses containing at least one correct hypothesis
for 99% of the samples, are denoted in Fig. 6. The proposed
approach generated at least one correct pose hypothesis in
90% of the cases. On average, the first correct hypothesis is
the 4th ranked hypothesis. For the highest ranked correct
hypotheses, the error in position was on average
approximately 37 mm, while the difference in orientation was
on average approximately 0.6°. For 99% of these hypotheses,
the pose error was at most 164 mm and 1.9°.

TABLE II. ERROR OF THE HIGHEST RANKED CORRECT HYPOTHESES
Error (mm): 37 on average, at most 164 for 99% of the samples
Error (°): 0.6 on average, at most 1.9 for 99% of the samples
Index of the first correct hypothesis: 4 on average

Fig. 6. Normalized cumulative histogram of the error in position (top-left),
error in orientation (top-right) and index of the first correct hypothesis
(bottom).
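The correctness criterion (translational error at most 0.2 m, absolute orientation error at most 2°) and the "index of the first correct hypothesis" statistic can be sketched as follows; planar poses (x, y, heading in degrees) are assumed and all names are illustrative:

```python
import math

def hypothesis_correct(calc, gt, max_dist=0.2, max_angle_deg=2.0):
    """Correctness test from the evaluation: translational error at most
    0.2 m and absolute orientation error at most 2 degrees."""
    dist = math.hypot(calc[0] - gt[0], calc[1] - gt[1])
    dang = abs((calc[2] - gt[2] + 180.0) % 360.0 - 180.0)
    return dist <= max_dist and dang <= max_angle_deg

def first_correct_rank(ranked_hypotheses, gt):
    """1-based index of the first correct hypothesis in a ranked list,
    or None if no generated hypothesis is correct."""
    for rank, pose in enumerate(ranked_hypotheses, start=1):
        if hypothesis_correct(pose, gt):
            return rank
    return None
```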
In this paper a global localization approach based on the
environment model consisting of planar surface segments is
discussed. Planar surface segments detected in the local
environment of the robot are matched to the planar surface
segments in the model and the robot pose is estimated by
registration of planar surface segment sets based on EKF. The
result is a list of hypotheses ranked according to a measure of
their plausibility. The considered approach is experimentally
evaluated using depth image sequences acquired by a
Microsoft Kinect sensor mounted on a mobile robot. The
analyzed approach generated at least one correct pose
hypothesis in 90% of the test cases.
REFERENCES
[1] M. Cummins and P. Newman, "Highly scalable appearance-only SLAM – FAB-MAP 2.0," in Proceedings of Robotics: Science and Systems, Seattle, USA, 2009.
[2] M. Milford, "Vision-based place recognition: how low can you go?," The International Journal of Robotics Research, vol. 32, no. 7, 2013.
[3] S. Thrun, W. Burgard, and D. Fox, "Probabilistic Robotics," The MIT Press, 2005.
[4] S. Se, D. G. Lowe, and J. J. Little, "Vision-based global localization and mapping for mobile robots," IEEE Transactions on Robotics, vol. 21, no. 3, 2005, pp. 364–375.
[5] A. Kosaka and A. Kak, "Fast vision-guided mobile robot navigation using model-based reasoning and prediction of uncertainties," CVGIP: Image Understanding, vol. 56, 1992, pp. 271–329.
[6] O. Faugeras, "Three-Dimensional Computer Vision: A Geometric Viewpoint," Cambridge, Massachusetts: The MIT Press, 1993.
[7] J. Stückler and S. Behnke, "Multi-Resolution Surfel Maps for Efficient Dense 3D Modeling and Tracking," Journal for Visual Communication and Image Representation, 2013.
[8] D. Cobzas and H. Zhang, "Mobile Robot Localization using Planar Patches and a Stereo Panoramic Model," in Proceedings of Vision Interface, 2001, pp. 94–99.
[9] M. F. Fallon, H. Johannsson, and J. J. Leonard, "Efficient scene simulation for robust Monte Carlo localization using an RGB-D camera," in Proceedings of the IEEE International Conference on Robotics and Automation, 2012, pp. 1663–1670.
[10] K. Pathak, A. Birk, N. Vaskevicius and J. Poppinga, "Fast Registration Based on Noisy Planes with Unknown Correspondences for 3D Mapping," IEEE Transactions on Robotics, vol. 26, no. 3, 2010, pp. 424–441.
[11] P. Besl and N. McKay, "A Method for Registration of 3-D Shapes," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 14, no. 2, 1992.
[12] S. Rusinkiewicz and M. Levoy, "Efficient Variants of the ICP Algorithm," in Proceedings of the 3rd Int. Conf. on 3D Digital Imaging and Modeling (3DIM), 2001.
[13] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges and A. Fitzgibbon, "KinectFusion: Real-Time Dense Surface Mapping and Tracking," in Proceedings of the 10th Int. Symp. on Mixed and Augmented Reality (ISMAR), 2011.
[14] R. Cupec, E. K. Nyarko, D. Filko and I. Petrović, "Fast Pose Tracking Based on Ranked 3D Planar Patch Correspondences," in Proceedings of the IFAC Symposium on Robot Control, Dubrovnik, Croatia, 2012.
[15] F. Schmitt and X. Chen, "Fast segmentation of range images into planar regions," in Proceedings of the IEEE Computer Society Conf. on Computer Vision and Pattern Recognition (CVPR), 1991, pp. 710–711.
[16] M. Garland, A. Willmott and P. S. Heckbert, "Hierarchical Face Clustering on Polygonal Surfaces," in Proceedings of the ACM Symposium on Interactive 3D Graphics, 2001.
[17] L. Matthies and S. A. Shafer, "Error Modeling in Stereo Navigation," IEEE Journal of Robotics and Automation, vol. 3, 1987, pp. 239–248.
[18] N. Ayache and O. D. Faugeras, "Maintaining Representations of the Environment of a Mobile Robot," IEEE Trans. on Robotics and Automation, vol. 5, no. 6, 1989, pp. 804–819.
[19] G. Bradski and A. Kaehler, "Learning OpenCV," O'Reilly, 2008.
Multiclass Road Sign Detection using Multiplicative Kernel
Valentina Zadrija
Mireo d. d.
Zagreb, Croatia
[email protected]

Siniša Šegvić
Faculty of Electrical Engineering and Computing
University of Zagreb
Zagreb, Croatia
[email protected]
Abstract—We consider the problem of multiclass road sign
detection using a classification function with a multiplicative kernel
comprised of two kernels. We show that the problems of detection
and within-foreground classification can be jointly solved by
using one kernel to measure object-background differences and
another one to account for within-class variations. The main idea
behind this approach is that road signs from different foreground
variations can share features that discriminate them from
backgrounds. The classification function training is accomplished
using SVM, thus feature sharing is obtained through support
vector sharing. Training yields a family of linear detectors, where
each detector corresponds to a specific foreground training
sample. The redundancy among detectors is alleviated using kmedoids clustering. Finally, we report detection and classification
results on a set of road sign images obtained from a camera on a
moving vehicle.
Keywords—multiclass object detection; object classification;
road sign; SVM; multiplicative kernel; feature sharing; clustering
I. INTRODUCTION
Road sign detection and classification is an exciting field of
computer vision. There are various applications of the road
sign detection and classification in driving assistance systems,
autonomous intelligent vehicles and automated traffic
inventories. The latter one is of particular interest to us since
traffic inventories include periodical on-site assessment carried
out by trained safety expert teams. Current road condition is
then manually compared against the reference state in the
inventory. In current practice, the process is tedious, error
prone and costly in terms of expert time. Recently, there have
been several attempts to at least partially automate the process
[1], [2]. This paper presents an attempt to partially automate
this process in terms of road sign detection and classification
using the multiplicative kernel.
In general, detection and classification are typically two
separate processes. The most common detection method is the
sliding window approach, where each window in the original
image is evaluated by a binary classifier in order to determine
whether the window contains an object. When the prospective
object locations are known, classification stage is applied only
on the resulting windows. One common classification approach
includes partitioning the object space into subclasses and then
training a dedicated classifier for each subclass. This approach
is known as one-versus-all classification.
In this paper, we focus on detection and classification of
ideogram-based road signs as a one-stage process. We employ
a jointly learned family of linear detectors obtained through
Support Vector Machine (SVM) learning with multiplicative
kernel as presented in [3]. In its original form, SVM is a binary
classifier, but multiplicative kernel formulation enables the
multiclass detection. The multiplicative kernel is defined as a
product of two kernels, namely the between-class kernel kx and
the within-class kernel kv, as described in Section IV. The
between-class kernel kx is dedicated to detection, i.e. the
foreground-background classification. The within-class kernel kv is used for
within-foreground classification, i.e. to discriminate between
various object subclasses. The result of SVM training is a set of
support vectors and corresponding weights which are then used
to generate a family of detectors. The detectors are obtained by
plugging a specific within-class state into the multiplicative kernel, as
described in Section IV-B. The key point here is that all
detectors share the same support vectors, but weights of
specific support vectors vary depending on the within-class
state value.
We evaluate the described approach on a set of predefined
road sign subclasses defined in Section III and present
experimental results in Section V.
II. RELATED WORK
In recent years, much work has been done in the area
both road sign detection aspect as well as general multiclass
object detection and feature sharing.
The approach presented in [4] tries to solve the multi-view face
detection problem. The foreground class is partitioned into
subclasses according to variations in face orientation. For each
subclass, a corresponding detector is learned. This approach
exhibits several problems when used with a larger number of
subclasses. More specifically, the number of features, the number of
false positives and the total training time grow linearly with the
number of subclasses.
This work has been supported by research projects Vista
(EuropeAid/131920/M/ACT/HR) and Research Centre for Advanced
Cooperative Systems (EU FP7 #285939).
Fig. 1. Distribution of samples with respect to road sign subclasses S = {1, 2,
3, 4, 5} for the training dataset. Below each subclass label v, representative
subclass members are shown. The road signs shown for subclass v=1 are only
a subset of the subclass, which contains 9 different road signs.
Further, the authors in [5] focus on feature sharing using the
JointBoost procedure. In contrast to [4], where the number of
features grows linearly with the number of subclasses, the
authors have experimentally shown that the number of features
grows logarithmically with respect to the number of subclasses.
Additionally, they showed that a jointly trained detector
requires a significantly smaller number of features in order to
achieve the same performance as an independently trained detector.
The approach presented in [6] deals with the problem of multi-view
detection using the so-called Vector Boosted Tree
procedure. At the detection time, the input is examined by a
sequence of nodes starting from the root node. If the input is
classified by the detector of current node as a member of the
object class, it is passed to the node children. Otherwise, it is
rejected as a background. The drawback of this approach is that
it requires the user to predefine the tree structure and choose
the position to split.
Similar to [6], the approach presented in [7] also employs a
classifier structured in a form of a tree. However, in contrast to
[6], the tree classifier is constructed automatically - the node
splits are achieved through unsupervised clustering. The
algorithm is iterative, i.e. at the beginning it starts with an
empty tree and training samples which are assigned weights.
By adding a node to the tree, the sample weights are modified
accordingly. Additionally, if a split is achieved, the parent
classifiers of all nodes along the way to the root node are
updated as well.
The concept of feature sharing is also explored through
shape-based hierarchical compositional models [8], [9], [10].
These models are used for object categorization [8], [10], but
also for multi-view detection, [9]. Different object categories
can share parts or appearance. Parts on lower hierarchy levels
are combined into larger parts on higher levels. In general,
parts on lower levels are shared amongst various object
categories, while those in higher levels are more category
specific. This is similar to the approach employed in this paper;
here, however, feature sharing is obtained through
single-level support vector sharing.
The approach presented in [1] describes a road sign
detection and recognition system based on sharing features.
The detection subsystem is a two stage process comprised of
color-based segmentation and SVM shape detection. The
recognition subsystem comprises GentleBoost algorithm and
Rotation-Scale-Translation Invariant template matching.
In this paper, we focus on detection and classification of
ideogram-based road signs. The class of all road signs exhibits
various foreground variations with respect to the sign shape but
also on presence or absence of thick red rim and ideogram
type. We aim to jointly train detectors to discriminate road
signs from backgrounds as well as to produce foreground
variation estimates, i.e. subclass labels. We will use terms
"foreground variation" and "subclass" interchangeably in the
rest of the paper denoting the same concept. Fig. 1 depicts
within-class road sign subclasses which we aim to estimate.
The class of all road signs is comprised of five
variations denoted by a label v from S = {1, 2, 3, 4, 5}.
Subclass v=1 includes triangular warning signs with a thick red rim.
A small subset of the subclass members is shown in Fig. 1, as
the subclass contains nine different road signs in total. These signs
belong to the category A according to the Vienna Convention
[11]. Subclass v=2 contains circular "End of no overtaking
zone" sign which belongs to the category C of informative
signs. On the other hand, members of subclass v=3 "Priority
road" and "End of Priority Road" are rhomb-shaped. Subclass
v=4 includes square-shaped signs which are a subset of
informative category C road signs. Finally, the subclass v=5
contains circular "Speed Limit" signs characterized by the
thick red rim, which belong to the category B of prohibitory
signs. We discuss the described road-sign variations in our
dataset and the motivation for our approach as follows.
First, we discuss motivation for partitioning road signs into
subclasses according to the distribution shown in Fig. 1. The
dataset is extracted from the video recorded with a camera
mounted on top of a moving vehicle. Video sequences were
recorded at daytime, under different weather conditions [2].
Further, as Fig. 1 shows, the distribution of signs in the dataset
is unbalanced, i.e. certain variations like triangular signs are
characterized by a large number of instances, while some
others have a small number of occurrences. In particular, the
"End of Priority Road" sign shown as a part of the subclass
v=3 in Fig. 1 has only nine instances in the training dataset. In the
approach where we build a single detector for a particular
subclass, it is clear that a detector trained with only nine samples
would have a very poor detection rate. However, the "End of
Priority Road" sign shares the same shape as the "Priority
Road" sign. If we were to group them into a single subclass, we
could exploit foreground-variation feature sharing. Other
heterogeneous subclasses are designed with the same
motivation. Note that the subclass v=5 is also a heterogeneous
subclass, i.e. it contains various speed limit signs which share
the thick red rim and the zero digit, since the speed limits are
usually multiples of ten.
Second, according to the subclasses defined in Fig. 1,
observe that signs belonging to different subclasses also share
similarities. For example, the "End of no overtaking zone"
(subclass v=2) sign and "End of Priority Road" (subclass v=3)
both share the same distinctive crossover mark. This similarity
could improve discrimination capability of both signs with
respect to the background class. This suggests that if we solve
detection and classification problem for all subclasses together,
we could benefit from within-class feature sharing.
Therefore, due to the described characteristics of the dataset
distribution, as well as the nature of sign similarities, we
decided to employ a method presented in [3] where a
classification function is learned jointly for all within-class
variations. The aim of this approach is to form the
classification function which could exploit the fact that
different variations share features against backgrounds, but at
the same time provide within-class discrimination.
The overall road sign detection and classification process is
shown in Fig. 2. The detailed description is as follows.
For a given feature vector x ∈ ℝn computed for an image
patch, the goal is to decide whether it represents an instance of
a road sign class and, if so, to produce the corresponding
subclass estimate v from S = {1, 2, 3, 4, 5}. Let xi denote the
feature vector of the i-th road sign training sample belonging to
the subclass with label vi. The feature vectors are given as
HOG features [15]. According to the above defined parameters,
the classification function C(x, i) is defined as follows:

C(x, i) > 0, if x is a foreground sample from the same subclass as xi,
C(x, i) < 0, otherwise.                                           (1)
This corresponds to the non-parametric approach presented
in [3]. The parametric approach [3] is simpler; however, it
requires each subclass v to be described with a specific
parameter. The role of the parameter is to describe the members
of the subclass in a unique way. In the multi-view domain, the parameter
typically corresponds to the view angle or object pose.
However, in the road sign domain the subclasses are
heterogeneous and cannot be described with a single
parameter. For this reason, the classification function employs
the foreground sample feature vector xi in order to describe a
specific subclass. Since the road sign subclasses are designed
with the goal that signs within the subclass are similar, it is to
be expected that they will also exhibit similar feature vector
values xi.
In order to satisfy requirements of within-class feature
sharing against the background class, as well as within-class
discrimination, the classification function C(x,i) is represented
as a product of two kernels:
C(x, i) = Σs αs ⋅ kv(xs, xi) ⋅ kx(xs, x) ,                        (2)

where αs denotes the Lagrange multiplier of the s-th
support vector [12], xs the particular support vector, xi the i-th
foreground training sample, kv(xs, xi) the within-class kernel
and kx(xs, x) the between-class kernel. In addition, the
product of the first two terms in equation (2) can be
summarized into a single term αs'(i), which denotes the weight
of the s-th support vector xs for the foreground sample xi:

αs'(i) = αs ⋅ kv(xs, xi) .                                        (3)
As a result, the support vectors for which the within-class
kernel kyields higher values will have a larger influence on
Fig. 2. Training and detector construction outline
the classification function. In this way, we achieve the within-class feature sharing as well as the within-class discrimination.
A. Classification Function Training
Training of the classification function (2) is achieved using
SVM. The training samples take the form of tuples (x, i). Each
foreground training sample x is assigned its corresponding
sample index i. Background training samples x are obtained
from image patches without road signs. Each background
training sample x can be associated with any index of a
foreground training sample in order to form a valid tuple. More
specifically, a background sample x is a negative with respect to
all foreground samples. The number of such combinations is
huge and corresponds to

#(NB) ⋅ #(NF) ,                                                   (4)

where #(NB) corresponds to the total number of background
samples and #(NF) to the total number of foreground
training samples. Due to combinatorial complexity, including
all negative samples in SVM training would not be practical.
Therefore, the bootstrap training is employed as a hard
negative mining technique. In bootstrap training, only #(NB)
negatives are initially included in training. These samples are
assigned foreground sample indices in a random fashion. After
each training round, all negative samples are evaluated by the
classification function (2). False positives are added to the
negative set and the SVM training is repeated. This is an
iterative process which converges when there are no more false
positives to add.
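The bootstrap loop above can be sketched as follows. This is an illustrative skeleton, not the authors' code: `train_svm` and `classify` are placeholder callables standing in for the actual SVM machinery, and only the structure of the hard negative mining loop is the point.

```python
import random

def bootstrap_train(train_svm, classify, positives, negatives, seed=0):
    """Sketch of bootstrap (hard negative mining) training.
    train_svm(positives, active_negatives) -> model
    classify(model, x, i) -> score (positive means false positive here)."""
    rng = random.Random(seed)
    # all (background sample, foreground index) pairings: #(NB) * #(NF)
    all_pairs = [(x, i) for x in negatives for i in range(len(positives))]
    # start with one randomly assigned foreground index per background
    active = [(x, rng.randrange(len(positives))) for x in negatives]
    while True:
        model = train_svm(positives, active)
        # false positives among the not-yet-included negative pairings
        hard = [p for p in all_pairs
                if p not in active and classify(model, *p) > 0]
        if not hard:  # converged: no more false positives to add
            return model
        active.extend(hard)
```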
B. Individual Detector Construction
C(x,i) is learned as a function of a foreground variation
parameter i, rather than learning separate detectors for each i.
Individual detectors w(x,i) are obtained from the classification
function by plugging specific foreground sample values xi
into (2) and (3):

w(x, i) = Σs αs'(i) · kx(xs, x)    (5)
Note that with a fixed foreground variation i and a known set
of support vectors xs, we can precompute the within-class
kernel values k(xs, xi) and consequently the support vector weights
αs'(i). Therefore, the within-class kernel k is not evaluated at
detection time. Rather, at detection time, we only evaluate the
between-class kernel kx. This fact affects our choice of within-class
and between-class kernels.

First, we discuss the within-class kernel k. Road sign
subclasses are difficult to separate, and it is therefore important
for k to be able to separate nonlinear problems. Therefore, a
Gaussian RBF kernel was chosen for that purpose
k ( x, xi )  exp(   D( x, xi )) 
where D(x,xi) denotes Euclidian distance and 
corresponding parameter. Due to the fact that RBF is evaluated
only during training and detector construction, this choice
doesn't impose performance penalty during detection.
Secondly, since the between-class kernel kx is evaluated
during detection, it is important for kx to be fast. Therefore, we
chose linear kernel for that purpose. By substituting the linear
kernel formulation kx(x, xi) = xiT·x into (5), we obtain the final
form of our detectors:

w(x, i) = Σs αs'(i) · (xsT·x) = w(i)T·x    (7)
In this way, the detection is achieved by applying a simple
dot product between the image patch and the detector weights
denoted as w(i). Note that all detectors share the same set of
support vectors. In this way, feature sharing among various
detectors is achieved.
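A minimal sketch of this detector construction, assuming the feature vectors are rows of an array `support_vecs`; the function and parameter names are illustrative:

```python
import numpy as np

def build_detector(alphas, support_vecs, x_i, gamma=0.1):
    """Construct one linear detector w(i) for foreground sample x_i.
    alphas: (S,) SVM coefficients; support_vecs: (S, D) shared support
    vectors; gamma is the parameter of the within-class RBF kernel (6)."""
    # within-class kernel k(xs, xi), evaluated once, offline, as in eq. (3)
    d = np.linalg.norm(support_vecs - x_i, axis=1)
    alpha_prime = alphas * np.exp(-gamma * d)
    # the linear between-class kernel collapses the sum into one weight vector
    return alpha_prime @ support_vecs  # w(i), applied as a dot product at runtime
```

Since every detector reuses the same `support_vecs`, the only per-detector state is the weight vector, which is how the feature sharing materializes.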
C. Detection Approach
In the detection process, we employ the well known sliding
window technique. Each window is evaluated by a family of
linear detectors w(x,i) constructed according to (7). From all
detector responses, the one with maximum value is chosen as a
result. If this value is positive, the window is classified as a
road sign belonging to the subclass of xi. Otherwise, the
window is discarded as a background. Note that this is similar
to the k-nearest neighbors (k-NN) method, with parameter k=1,
i.e. the object is simply assigned to the class of the nearest
neighbor selected among all detectors [13].
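The per-window decision rule can be sketched as follows; the array `W`, stacking the detector weight vectors w(i) as rows, and the mapping `subclass_of` are illustrative names:

```python
import numpy as np

def classify_window(patch_feat, W, subclass_of):
    """Evaluate one sliding window against all linear detectors (rows of W)
    and keep the maximum response, 1-NN style; non-positive maxima are
    rejected as background."""
    responses = W @ patch_feat
    best = int(np.argmax(responses))
    if responses[best] <= 0:
        return None  # discard the window as background
    return subclass_of[best]
```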
However, in the 1-NN approach, the number of evaluated
detectors is significant, i.e. corresponds to the number of
foreground samples #(NF). Evaluating all #(NF) detectors at
the detection stage would make the detection extremely slow.
In addition, since foreground samples belonging to the same
subclass are similar, there may be redundancy among detectors.
In order to identify a representative set from a family of total
#(NF) detectors, we use the k-medoids clustering technique.
The k-medoids technique is chosen due to its simplicity and
also because it is less prone to outlier influence than, for
instance, k-means method. Clustering yields a set of k < #(NF)
medoids which are then used in the detection phase. In
clustering, each detector w(x, i) is represented with a vector of
its support vector weights:
α'(i) = [α1'(i), α2'(i), …, αSV'(i)]

where αs'(i), s ∈ {1, …, SV}, denotes the particular support
vector weight defined in (3), while SV denotes the total number of
support vectors. As a distance measure, we used Dist (i, j )
defined as follows:
Dist (i, j )  1 
 ' (i)   ' ( j )
 ' (i)   ' ( j )
The appropriate number of medoids k is chosen in an
iterative process, where we gradually decrease the number of
medoids and measure the clustering quality. Initially, the
target number of medoids, i.e. cluster centers k, is set to 50% of
the initial number of detectors, i.e. foreground samples. This
number was chosen as a rule of thumb. Then, we apply
clustering according to the chosen number of medoids k. In
order to measure clustering quality, we compute corresponding
silhouette value [14]. The silhouette value provides an estimate
of how well the obtained medoids represent the data in their
corresponding clusters. In each iteration, the target number of
clusters is decreased by a certain factor and the above process
is repeated. Final clustering outline is chosen as the one which
yields the best silhouette value.
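The cosine-style detector distance and the silhouette score that guide this selection can be sketched in a few lines. The paper uses Matlab's PAM implementation, so this numpy version is only an illustrative stand-in:

```python
import numpy as np

def cosine_dist_matrix(A):
    """Pairwise distance between detectors represented by rows of A
    (their support vector weight vectors): Dist(i, j) = 1 - cos(a'(i), a'(j))."""
    An = A / np.linalg.norm(A, axis=1, keepdims=True)
    return 1.0 - An @ An.T

def mean_silhouette(D, labels):
    """Mean silhouette value [14] of a clustering, given a precomputed
    distance matrix D and integer cluster labels."""
    scores = []
    for i in range(len(labels)):
        same = labels == labels[i]
        same[i] = False
        a = D[i, same].mean() if same.any() else 0.0           # intra-cluster
        b = min(D[i, labels == c].mean()                        # nearest other
                for c in set(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))
```

Sweeping k downward from 50% of the detector count, running PAM at each k, and keeping the k with the best mean silhouette reproduces the selection loop described above.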
In this section, we describe the evaluation of the above
described method according to the foreground variation
distribution presented in Section III. In our experiments, we
used three disjoint datasets, described below [2].
As we already stressed, the distribution shown in Fig. 1 is
unbalanced, i.e. subclass v=1 contains at least three times more
samples than other subclasses. In such unbalanced datasets, there is
a possibility for specific road sign variation with smaller
number of samples to be treated as a noise, e.g. subclass v=3.
In order to test this hypothesis, we experimented with the
number of samples per class. Namely, we observed detection
performance for NPOS=200, 300 and 400 samples per subclass.
For the subclasses with the number of samples lower than
NPOS, we simply use all the samples for the subclass, e.g.
subclasses v=2, v=3 and NPOS=400. The training samples are
extracted randomly from training dataset pool comprising 2153
road signs. Negative dataset contains 4000 image patches
extracted randomly from images of background outdoor
scenes. In order to monitor clustering performance, we use a
test dataset comprised of 3000 cropped images. Details of
the test dataset are given in Section V-B.
A. Training Approach
The training of detectors is achieved as described in Section
IV. As features, we used HOG vectors [15] computed from
training images cropped and scaled to 24 x 24 pixels. Cell size
is set to 4 pixels, where each cell is normalized within a block
of four cells. In order to increase performance, we use block
overlapping with the block stride set to the size of a single cell. The
training is achieved using SVMlight [16] with multiplicative
kernel.

Fig. 3. Examples of detection and classification: (a) correct classification,
(b) correct classification and false positive.

In contrast to [3], where the parameter γ of the RBF kernel
(6) is set to a fixed value, we perform cross validation on the
training dataset in each bootstrapping round in order to obtain
the best γ value. We compared both training approaches, and the
one with the optimized γ value yields a lower number of support
vectors. This suggests a better mapping in the transformed
feature space. More specifically, with a training set comprised
of 1325 positive samples and 4000 negative samples, the
training with the fixed γ value yields a set of 857 support
vectors. On the other hand, the training with the optimized γ
decreases the number of support vectors by 10%.
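As a sanity check of the HOG configuration stated above (24×24 window, 4-pixel cells, 2×2-cell blocks, one-cell block stride), the descriptor length works out to 900 dimensions, assuming the usual 9 orientation bins (the bin count is not stated in the paper):

```python
def hog_length(win=24, cell=4, block_cells=2, stride_cells=1, bins=9):
    """Descriptor length for a square window with overlapping HOG blocks."""
    cells = win // cell                                 # 6x6 cells per window
    blocks = (cells - block_cells) // stride_cells + 1  # 5x5 overlapping blocks
    return blocks * blocks * block_cells * block_cells * bins

# 5 * 5 * 4 * 9 = 900 features per window
```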
The training yields a family of #(NF) detectors. The #(NF)
corresponds to the number of foreground training samples
which depends on the number of training samples per subclass
NPOS (Table I). For performance reasons, we use k-medoids
clustering in order to select a representative set of detectors
from total #(NF) detectors. The clustering is implemented in
Matlab according to the Partitioning Around Medoids method
[17]. In each clustering iteration, we monitor the silhouette
value and the validation results on the test dataset comprised
of cropped images. Note that this dataset is disjoint from the
test dataset used for detection. Interestingly, better silhouette
values correspond to a smaller number of false negatives obtained
from validation on the test data. The results of clustering are
summarized in Table I, depending on NPOS and #(NF).
Resulting number of detectors k corresponds approximately to
30% of total training samples #(NF). Lower k values lead to
poor validation results and small silhouette values.
B. Detection and Classification Results
The test dataset for detection evaluation contains 1038
images in 720x576 resolution. From the total 1038 images, we
selected 200 images and used them for performance evaluation.
In these images, there were 214 physical road signs.
Table II shows the detection and foreground estimation
results for the three case studies. We report the detection rate
D, the classification rate C, the false positive rate FP and the
false positive rate per image FP/I. D is defined as a number of
detected signs with respect to the total number of signs, while
C and FP correspond to the number of correct classifications
and false detections with respect to the total number of signs,
respectively. FP/I corresponds to the number of false detections
with respect to the number of images. Columns denoted with the Δ
sign show differences in the above metrics with respect to the
configuration denoted by NPOS=200.
We started the experiment with NPOS=200 samples per
subclass. This configuration achieves overall detection rate of
90% with false positive rate of 43%. Next, NPOS=300 achieves
4% rise in detection rate giving total 94%, and 3% rise in
classification rate, i.e. 93%. However, it is characterized with a
large false positive rate of 55%. In sliding window detection,
some computer vision libraries like OpenCV [18] employ false
positive detection policy where an object must exhibit at least n
detections in order to be accounted as a result. This is
understandable, since sliding window technique exhibits
multiple responses around single object. In this work, we didn't
experiment with this property, however we believe that it
would decrease false positive rate. Finally, the configuration
NPOS=400 exhibits worse results with respect to NPOS=300.
This supports our hypothesis that the unbalanced dataset
NPOS=400 treats subclasses with a lower number of samples as
a noise giving the lower overall detection rate.
Table III depicts D and C rates, as well as the FP rate per
specific subclass v for configurations NPOS=300 and NPOS=200,
respectively. The subclass v=1 achieves better results when a
larger number of samples is used. This is understandable, since
this subclass comprises a large number of different road signs.
Interestingly, the subclass v=3 which has only 98 samples (8%
of the total samples for subclass v=1) achieves detection and
classification rate of 100% in all case studies. Subclasses v=2
and v=4 achieve lower detection and classification rates with
respect to other subclasses. Subclass v=2 is circle-shaped but
lacks the red rim that would allow it to share features with subclass v=5. On
the other hand, subclass v=4 is rectangle-shaped and gains less
benefit from feature sharing with other subclasses. Subclass
v=5 obtains similar results for NPOS=300 and NPOS=200. FP
distribution per subclass for NPOS=200 shows that subclass v=4
exhibits the largest number of false positives, i.e. 58%.
Examples of false positives classified as members of subclass
v=4 include building windows. On the other hand, NPOS=300
yields a rather balanced FP distribution, where subclasses v=1,
4 and 5 obtain FP rate of approximately 30%. Examples of
detection and classification are given in Fig. 3a and Fig. 3b.
Fig. 3a illustrates an example of a correct classification, where
"Speed Limit" sign is classified as a member of subclass v=5
(orange dotted line), while the "Children" sign is classified as a
member of subclass v=1 (green dotted line). Fig. 3b illustrates
correct classification, as well as two false positives. The
"Priority Road" sign is correctly classified as a subclass v=3
(cyan dotted line). The "Weight Limit" sign was not present in
training data, however, due to similarity with "Speed Limit"
sign, it is classified as a member of subclass v=5 (orange
dotted line). The latter one indicates within-class feature
sharing. The triangle-like object is incorrectly classified as a
member of subclass v=1 (green dotted line).
In this paper, we considered a road-sign detection technique
based on a multiplicative kernel. One of the major challenges
was a poorly balanced dataset, where triangular warning signs
have at least three times more instances than other subclasses.
Our approach is based on a premise that different sign
subclasses share features which discriminate them from
backgrounds. Therefore, instead of learning a dedicated
detector for each subclass, we trained single classification
function for all subclasses using SVM [3]. Individual detectors
are afterwards constructed from a shared set of support vectors.
A major benefit of this approach with respect to separately
trained detectors lies in feature sharing, which enhances the
detection rate for subclasses with a lower number of samples.
In comparison to partition based approaches, this approach
does not require the training samples to be labeled with
subclass parameters in order to learn classification function.
This fact has proven to be useful for our domain, since the road
sign subclasses are heterogeneous and it is hard to describe a
subclass with a single parameter. Instead, each training sample
is labeled with its corresponding HOG feature vector. In this
way we obtain #(NF) subclass parameters and consequently
#(NF) detectors from the classification function, where #(NF)
corresponds to the number of foreground samples. However,
due to performance issues, we perform clustering in order to
obtain a representative set of detectors from a set of total #(NF)
detectors. The reduced set of detectors is used in detection.
Using the described method, we achieved the best detection
rate of 94% at a relatively high false positive rate of 55%. We
experimented with different numbers of samples per subclass
in order to observe the effect on detection rate. The obtained
results showed that detectors trained on a limited number of
samples, i.e. 300 samples per subclass, obtain better detection
results than those trained on a larger number of samples. Due to the
fact that this method has shown promising results in road sign
domain, in future work we plan to explore its applicability in
the domain of multiview vehicle detection.
The authors wish to thank Josip Krapac for useful
suggestions on early versions of this paper.
[1] J.-Y. Wu, C.-C. Tseng, C.-H. Chang, J.-J. Lien, J.-C. Chen, and C. T. Tu, "Road sign recognition system based on GentleBoost with sharing features," in System Science and Engineering (ICSSE), 2011 International Conference on, 2011, pp. 410–415.
[2] S. Šegvić, K. Brkić, Z. Kalafatić, and A. Pinz, "Exploiting temporal and spatial constraints in traffic sign detection from a moving vehicle," Machine Vision and Applications, pp. 1–17, 2011.
[3] Q. Yuan, A. Thangali, V. Ablavsky, and S. Sclaroff, "Learning a family of detectors via multiplicative kernels," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 33, no. 3, pp. 514–530, 2011.
[4] P. Viola and M. Jones, "Fast multi-view face detection," in Proc. of IEEE Conf. Computer Vision and Pattern Recognition, 2003.
[5] A. Torralba, K. Murphy, and W. Freeman, "Sharing visual features for multiclass and multiview object detection," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 29, no. 5, pp. 854–869, 2007.
[6] C. Huang, H. Ai, Y. Li, and S. Lao, "High-performance rotation invariant multiview face detection," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 29, no. 4, pp. 671–686, 2007.
[7] B. Wu and R. Nevatia, "Cluster boosted tree classifier for multi-view, multi-pose object detection," in Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, 2007, pp. 1–8.
[8] S. Fidler and A. Leonardis, "Towards scalable representations of object categories: Learning a hierarchy of parts," in Computer Vision and Pattern Recognition, 2007. CVPR '07. IEEE Conference on, 2007, pp.
[9] L. Zhu, Y. Chen, A. Torralba, W. Freeman, and A. Yuille, "Part and appearance sharing: Recursive compositional models for multiview," in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, 2010, pp. 1919–1926.
[10] Z. Si and S.-C. Zhu, "Unsupervised learning of stochastic and-or templates for object modeling," in ICCV Workshops, 2011, pp. 648–
[11] Inland Transport Committee, "Convention on road signs and signals," Economic Commission for Europe, 1968.
[12] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics). Secaucus, NJ, USA: Springer-Verlag New York, Inc., 2006.
[13] D. Bremner, E. Demaine, J. Erickson, J. Iacono, S. Langerman, P. Morin, and G. Toussaint, "Output-sensitive algorithms for computing nearest-neighbour decision boundaries," in Algorithms and Data Structures, ser. Lecture Notes in Computer Science, F. Dehne, J.-R. Sack, and M. Smid, Eds. Springer Berlin Heidelberg, 2003, vol. 2748, pp. 451–461.
[14] P. Rousseeuw, "Silhouettes: a graphical aid to the interpretation and validation of cluster analysis," J. Comput. Appl. Math., vol. 20, no. 1, pp. 53–65, Nov. 1987.
[15] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1, 2005, pp. 886–893.
[16] T. Joachims, "Making large-scale support vector machine learning practical," in Advances in Kernel Methods, B. Schölkopf, C. J. C. Burges, and A. J. Smola, Eds. Cambridge, MA, USA: MIT Press, 1999, pp. 169–184.
[17] S. Theodoridis and K. Koutroumbas, Pattern Recognition, 4th ed. Academic Press, 2008.
[18] G. Bradski, "The OpenCV Library," Dr. Dobb's Journal of Software Tools, 2000.
A novel georeferenced dataset
for stereo visual odometry
Ivan Krešo
University of Zagreb Faculty of
Electrical Engineering and Computing
Zagreb, HR-10000, Croatia
Email: [email protected]

Marko Ševrović
University of Zagreb Faculty of
Transport and Traffic Engineering
Zagreb, HR-10000, Croatia
Email: [email protected]

Siniša Šegvić
University of Zagreb Faculty of
Electrical Engineering and Computing
Zagreb, HR-10000, Croatia
Email: [email protected]
Abstract—In this work, we present a novel dataset for assessing the accuracy of stereo visual odometry. The dataset
has been acquired by a small-baseline stereo rig mounted on
the top of a moving car. The groundtruth is supplied by a
consumer grade GPS device without IMU. Synchronization and
alignment between GPS readings and stereo frames are recovered
after the acquisition. We show that the attained groundtruth
accuracy allows us to draw useful conclusions in practice. The
presented experiments address the influence of camera calibration,
baseline distance and zero-disparity features on the achieved
reconstruction performance.
Visual odometry [1] is a technique for estimating egomotion [2] of a monocular or multi-camera system from
a sequence of acquired images. The technique is attractive
due to many interesting applications such as autonomous
navigation or driver assistance, but also because it forms the
basis for more involved approaches which rely on partial or full
3D reconstruction of the scene. The term was coined as such
due to similarity with classic wheel odometry which is a widely
used localization technique in robotics [3]. Both techniques
estimate the current location by integrating incremental motion
along the path, and are therefore subject to cumulative errors
along the way. However, while the classic odometry relies
on rotary wheel encoders, the visual odometry recovers incremental motion by employing correspondences between pairs
of images acquired along the path. Thus the visual odometry
is not affected by wheel slippage in uneven terrain or other
poor terrain conditions. Additionally, its usage is not limited
to wheeled vehicles. On the other hand, the visual odometry
can not be used in environments lacking enough textured
static objects such as in some tight indoor corridors and at
sea. Visual odometry is especially important in places with
poor coverage of the GNSS (Global Navigation Satellite System)
signal, such as in tunnels, garages or in space. For example,
NASA space agency uses visual odometry in Mars Exploration
Rovers missions for precise rover navigation on Martian terrain
[4], [5].
We consider a specific setup where a stereo camera system
is mounted on top of the car in the forward driving direction.
Our goal is to develop a testbed for assessing the accuracy of
various visual odometry implementations, which shall further
be employed in higher level modules such as lane detection,
lane departure or traffic sign detection and recognition. We
decided to acquire our own GPS-registered stereo vision
CCVW 2013
Poster Session
dataset since, to our best knowledge, none of the existing
freely available datasets [6], [7] features a stereo rig with inter-camera distance less than 20 cm (this distance is usually termed the
baseline). Additionally, we would like to be able to evaluate
the impact of our camera calibration to the accuracy of the
obtained results. Thus in this work we present a novel GPS-registered dataset acquired with a small-baseline stereo rig (12
cm), the setup employed for its acquisition, as well as the
results of some preliminary research.
In comparison with other similar work in this field [7],
[8], we rely on low budget equipment for data acquisition. The
groundtruth motion for our dataset has been acquired by a consumer-grade GPS receiver. The employed GPS device doesn't
have an inertial measurement unit, which means that we do not
have access to groundtruth rotation and instead record only the
WGS84 position in discrete time units. Additionally, hardware
synchronization of the two sensors can not be performed due
to limitations of the GPS device. Because of that, the alignment of the camera coordinates with respect to the WGS84
coordinates becomes harder to recover as will be explained
later in the article. Thus our acquisition setup is much more
easily assembled at the expense of more post-processing effort.
However we shall see that the attained groundtruth accuracy
is quite enough for drawing useful conclusions about several
implementation details of the visual odometry.
Our sensor setup consists of a stereo rig and a GPS
receiver. The stereo rig has been mounted on top of the car,
as shown in Fig. 1. The stereo rig (PointGrey Bumblebee2)
Fig. 1. The stereo system Bumblebee2 mounted on the car roof.
Proceedings of the Croatian Computer Vision Workshop, Year 1
features a Sony ICX424 sensor (1/3”), 12 cm baseline and
global shutter. It is able to acquire two greyscale images
640×480 pixels each. The shutters of the two cameras are
synchronized, which means that the two images of the stereo
pair are acquired during exactly the same time interval (this
is very important for stereo analysis of dynamic scenes). Both
images of the stereo pair are transferred over one IEEE 1394A
(FireWire) connection. The firewire connector is plugged into
a PC express card connected to a laptop computer. The camera
requires 12V power over the firewire cable. The PC express
card is unable to draw enough power from the notebook and
therefore features an external 12V power connector which we
attach to the cigarette lighter power plug by a custom cable. In
order to avoid overloading of the laptop bus, we set the frame
rate of the camera to 25 Hz. The acquired stereo pairs are in
the form of 640×480 pairs of interleaved pixels (16 bit), which
means that upon acquisition the images need to be detached by
placing each odd byte into the left image and each even byte
into the right image. The camera firmware places timestamps
in first four pixels of each stereo frame. These timestamps
contain the value of a highly accurate internal counter at the
time when the camera shutter was closed.
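The deinterleaving step can be sketched as follows. Assigning the first byte of each 16-bit pair to the left image follows the paper's odd/even description, but the exact byte order within a pair is an assumption:

```python
import numpy as np

def deinterleave(frame_bytes, h=480, w=640):
    """Split one interleaved 16-bit stereo frame into two 8-bit images."""
    buf = np.frombuffer(frame_bytes, dtype=np.uint8).reshape(h, w, 2)
    left = buf[:, :, 0].copy()   # odd-numbered bytes -> left image
    right = buf[:, :, 1].copy()  # even-numbered bytes -> right image
    return left, right
```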
The employed GPS receiver (GeoChron SD logger) delivers location readings at 1 Hz. It is a consumer-grade device
with a basic capability for multipath detection and mitigation.
The GPS antenna was mounted on the car roof in close
proximity to the camera. The GPS coordinates are converted
from WGS84 to Cartesian coordinates in meters by a simple
spherical transformation, which is sufficiently accurate given
the size of the covered area. The camera and GPS are not
mutually synchronized, so that the offset between camera time
and GPS time has to be recovered in the postprocessing phase.
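The spherical conversion mentioned above can be sketched with an equirectangular projection around a reference point; the function name and the mean Earth radius constant are illustrative:

```python
import math

R_EARTH = 6371000.0  # mean Earth radius in metres (spherical model)

def wgs84_to_local(lat, lon, lat0, lon0):
    """Project a WGS84 reading (degrees) to planar metres around the
    reference point (lat0, lon0); adequate for a track a few hundred
    metres across, as in this dataset."""
    x = math.radians(lon - lon0) * R_EARTH * math.cos(math.radians(lat0))
    y = math.radians(lat - lat0) * R_EARTH
    return x, y
```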
The dataset has been recorded along a circular path
throughout the urban road network in the city of Zagreb,
Croatia. The path length was 688 m, the path duration was
111.4 s, while the top speed of the car was about 50 km/h. The
dataset consists of 111 GPS readings and 2786 recorded frames
with timestamps. The scenery was not completely static, since
the video contains occasional moving cars in both directions,
as well as pedestrians and cyclists.
The acquisition was conducted at the time of day with
the largest number of theoretically visible satellites (14). For
comparison, the least value of that number at that day was 8.
In practice, our receiver established connection to 9.5 satellites
along the track, on average. Thus, at 99.1% of the locations we had
HDOP (horizontal dilution of precision) below 1.3 m, while
HDOP was less than 0.9 m at 57.1% of the locations.
The obtained GPS accuracy has been qualitatively evaluated by plotting the recorded track over a rectified digital aerial
image of the site, as shown in Fig. 2. The figure shows that the
GPS track follows the right side of the road accurately and
consistently, except at bottom right where our car had to avoid
parked cars and pedestrians. Thus, the global accuracy of the
recorded GPS track appears to be in agreement with the HDOP
figures stated above. Furthermore, the relative motion between
the neighbouring points which we use in our quantitative
experiments is much more accurate than that and approaches
CCVW 2013
Poster Session
September 19, 2013, Zagreb, Croatia
Fig. 2. Projection of the acquired GPS track onto rectified aerial imagery.
DGPS accuracy. We note that the aerial orthophoto from the
figure has been provided by Croatian Geodetic Administration,
while the conversion of WGS84 coordinates into Croatian
projected coordinates (HTRS96) and visualization are carried
out in Quantum GIS. For additional verification, the same
experiment has been performed in Google Earth as well, where
we obtained very similar results.
As stated in the introduction, the visual odometry recovers
incremental motion between subsequent images of a video
sequence by exclusively relying on image correspondences.
We shall briefly review the main steps of the technique in the
case of a calibrated and rectified stereo system. Calibration of
the stereo system consists of recovering internal parameters of
the two cameras such as the field of view, as well as exact
position and orientation of the second camera with respect to
the first one. Rectification consists in transforming the two
acquired images so they correspond to images which would
be obtained by a system in which the viewing directions of
the two cameras are mutually parallel and orthogonal to the
baseline. Calibration and rectification significantly simplify the
procedure of recovering inter-frame-motion as we outline in
the following text.
One typically starts by finding and matching the corresponding features in the two images. Because our system
is rectified, the search for correspondences can be limited to
the same row of the other image. Since the stereo system is
calibrated, these correspondences can be easily triangulated
and subsequently expressed in metric 3D coordinates. Then
the established correspondences are tracked throughout the
subsequent images. In order to save time, tracking and stereo
matching typically use lightweight point features such as
corners or blobs [9]–[11]. The new positions of previously
triangulated points provide constraints for recovering the new
location of the camera system. The new location can be recovered either analytically, by recovering pose from projections of
known 3D points [12], [13], or by optimization [11], [14],
[15]. The location estimation is usually performed only in
some images of the acquired video sequence, which are often
referred to as key-images.
Proceedings of the Croatian Computer Vision Workshop, Year 1
September 19, 2013, Zagreb, Croatia
Thus, we have seen that the visual odometry is able to
provide the camera motion between the neighboring key-images. This motion has 6 degrees of freedom (3 for rotation
and 3 for translation), and we represent it by a 4×4
transformation matrix which we denote by [Rt]. This matrix
contains the parameters of the rotation matrix R and the
translation vector t, as shown in equation (1):

         | r11 r12 r13 t1 |
  [Rt] = | r21 r22 r23 t2 |    (1)
         | r31 r32 r33 t3 |
         |  0   0   0  1  |
From a series of [Rt] matrices we can calculate the full
path trajectory by cumulative matrix multiplication. This is
shown in equation (2), where the camera location pvo(tCAM)
is determined by multiplying the matrices from key-image 1
to key-image t:

  pvo(tCAM) = ∏i=1..t [Rt]i    (2)
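The cumulative multiplication of equation (2) can be sketched as follows, extracting the camera position from the translation part of each accumulated pose:

```python
import numpy as np

def accumulate_trajectory(rts):
    """Compose per-key-image 4x4 [Rt] motions by cumulative matrix
    multiplication; returns camera positions expressed in the
    coordinate frame of the 0th key-image."""
    pose = np.eye(4)
    positions = [pose[:3, 3].copy()]
    for rt in rts:
        pose = pose @ rt
        positions.append(pose[:3, 3].copy())
    return np.array(positions)
```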
At this moment we must observe that the recovered camera
locations pvo (tCAM ) are expressed in the temporal and spatial
coordinates of the 0th key-image which is therefore acquired
at time tCAM = 0 s. The coordinate axes x and y of the
trajectory pvo (tCAM ) are aligned with the corresponding axes
of the 0th key-image, while the z axis is perpendicular to
that image. The locations estimated by visual odometry will
be represented in this coordinate system by accumulating all
estimated transformations from the 0th frame. If we wish to
relate these locations with GPS readings, we shall need to
somehow recover the translation and rotation of the 0th key-frame with respect to the GPS coordinates. We shall achieve
that by aligning the first few incremental motions with the
corresponding GPS motions. However, before we do that, we
first need to achieve the temporal synchronization between the
two sensors.
As stated in section II, the camera and GPS receiver are
not synchronized. Thus, we need to recover the time interval
t0vo between the 0th video frame and the 0th GPS location,
such that the camera time tCAM corresponds to t + t0vo in
GPS time (t = 0 corresponds to the 0th GPS location). We
estimate t0vo by comparing absolute incremental displacements
of trajectories pgps (t) and pvo (tCAM ) obtained by GPS and
visual odometry, respectively. In order to do that, we first
define ∆p(t, ∆t) as the incremental translation at time t:
∆p(t, ∆t) = p(t) − p(t − ∆t) .
We also define ∆s(t) as the absolute travelled distance during
the previous interval of ∆tGPS =1 s (GPS frequency is 1 Hz):
∆s(t) = k∆p(t, ∆tGPS )k, t ∈ N .
Now, if we consider the time instants in which the GPS
positions are defined (that is, integral time in seconds), we
can pose the following optimization problem:
t0vo = argmin over t0 of  Σt=1..Tlast−1 (∆svo(t + t0) − ∆sgps(t))²
The problem is well-posed since the absolute incremental
displacements ∆svo and ∆sgps are agnostic with respect
to the fact that the camera and GPS coordinate systems are
still misaligned. We see that in order to solve this problem by
optimization, we need to interpolate all locations obtained by
visual odometry at times of GPS points for each considered
time offset t0vo . However, that does not pose a computational
problem due to very low frequency of the GPS readings (there
are only 111 GPS locations in our dataset). Thus the problem
can be easily solved by any optimization algorithm, and so we
can express visual odometry locations in GPS time pvo0 as:
pvo0 (t) = pvo (t + t0vo ) .
The interpolation is needed due to the fact that we capture images at approximately 25 Hz and GPS data at 1 Hz (cf. section II). The time intervals between two subsequent images often differ from the expected 40 ms due to unpredictable bus congestion within the laptop computer (this is the reason why the camera records timestamps in the first place). We recover the visual odometry locations "in-between" the acquired frames by the following procedure. We first accumulate timestamps in the frame sequence until we reach the two frames which are temporally closest to the desired GPS time. Finally, we determine the desired location between these two frames using linear interpolation.
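This timestamp-based interpolation can be sketched as follows (a hypothetical helper written for illustration, not the paper's code):

```python
import numpy as np

def position_at(t_query, timestamps, positions):
    """Linearly interpolate the trajectory between the two acquired frames
    whose timestamps are temporally closest to the query (GPS) time."""
    i = np.searchsorted(timestamps, t_query)   # first frame at or after t_query
    i = int(np.clip(i, 1, len(timestamps) - 1))
    t0, t1 = timestamps[i - 1], timestamps[i]
    w = (t_query - t0) / (t1 - t0)             # interpolation weight in [0, 1]
    return (1.0 - w) * positions[i - 1] + w * positions[i]
```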
We can also relate the two trajectories by the absolute incremental rotation angle ∆φ between two corresponding time instants. We can recover this angle by looking at three consecutive locations as follows:
∆φ(t) = arccos [ ⟨∆p(t, ∆tGPS), ∆p(t + 1, ∆tGPS)⟩ / ( ‖∆p(t, ∆tGPS)‖ ‖∆p(t + 1, ∆tGPS)‖ ) ] .
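The same computation takes a few lines of numpy (illustrative only; the clip call guards arccos against rounding just outside [−1, 1]):

```python
import numpy as np

def incremental_rotation_angles(p):
    """Absolute incremental rotation angles (degrees) between consecutive
    incremental translation vectors of a trajectory p (N x dim array)."""
    d = np.diff(p, axis=0)                      # incremental translations ∆p
    cos = np.sum(d[:-1] * d[1:], axis=1) / (
        np.linalg.norm(d[:-1], axis=1) * np.linalg.norm(d[1:], axis=1))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
```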
This procedure is illustrated in Fig. 3. Thus we propose two
metrics suitable for relating the trajectories obtained by GPS
and visual odometry: ∆s(t) and ∆φ(t). Note that in order to
be able to determine these metrics for visual odometry at GPS
times, one needs to apply the previously recovered offset t0vo
and employ interpolation between the closest image frames.
Fig. 3. The absolute incremental rotation angle ∆φ can be determined from
incremental translation vectors.
Now that we have synchronized data between camera and
GPS, we need to estimate the 3D alignment of the reference
coordinate system of the visual odometry with respect to the
GPS coordinate system. In other words we need to find a rigid
transformation between the 0th GPS location and the location
of the 0th camera frame. After this is completed, we shall be
able to illustrate the overall accuracy of the visual odometry
results with respect to the GPS readings.
The translation between the two coordinate systems can be simply expressed as:
TGPSvo = pgps(0 s) − pvo′(0 s) .
[Plot annotation: MSE = 2.912, MAE = 1.512]
In our experiments we employ the Libviso2 library, which recovers the 6 DOF motion of a moving stereo rig by minimizing the reprojection error of sparse feature matches [11]. Libviso2 requires rectified images on input. Currently, we consider every third frame of the dataset because, at 25 fps and urban driving speeds, the camera movement between two consecutive frames is very small, and such small movements often add more noise than useful data. We calibrated our stereo
rig on two different calibration datasets by means of the
OpenCV module calib3d. In the first case the calibration pattern was printed on A4 paper, while in the second it was displayed on a 21" LCD screen with HD resolution (1920×1080). In both cases, the raw calibration parameters were employed to rectify the input images and to produce the rectified calibration parameters which are supplied to Libviso2. The resulting dataset and calibration parameters are freely available for research purposes from ∼ssegvic/datasets/fer-bumble.tar.gz.
A. A4 calibration
We first present the results obtained with the A4 calibration
dataset. The resulting trajectories pvo′′(t) and pgps(t) are compared in Fig. 4. We note that the shape of the trajectory is mostly preserved; however, there is a large deviation in scale. We compare the corresponding incremental absolute displacements ∆sgps(t) and ∆svo′′(t), and the scale error ∆sgps(t)/∆svo′′(t), in Fig. 5. We observe that the two graphs are well aligned, and that the visual odometry generally undershoots the translational motion compared to the GPS groundtruth.
CCVW 2013
Poster Session
Fig. 5. Comparison of incremental translations (left) and the resulting scale
error (right) as obtained with GPS and visual odometry with the A4 calibration.
Note that the scale error is not constant. Thus the results could
not be improved by simply fixing the baseline recovered by
the calibration procedure.
B. LCD calibration
We now present the results obtained with the LCD calibration dataset. The GPS trajectory pgps(t) and the resulting visual odometry trajectory pvo′′(t) are shown in Fig. 6. Comparing this figure with Fig. 4, we see that the larger calibration pattern has a large impact on the final results.
We note that this approach would be underdetermined if we had only straight car motion at the beginning of the video; to avoid that, we arranged that our dataset begins on a road turn. Experiments showed that this approach works well in practice. Better accuracy could be obtained by employing a GPS sensor capable of producing rotational readings; however, that would be out of the scope of this work.
Fig. 4. Comparison between the GPS trajectory and the recovered visual
odometry trajectory with the A4 calibration.
pvo′′(t) = RGPSvo · pvo′(t) + TGPSvo .
In order to recover the rotation alignment, we consider the incremental translation vectors for visual odometry (∆pvo′(t)) and GPS (∆pgps(t)) between two consecutive GPS times, as determined in (3). We find the optimal rotation alignment RGPSvo of the visual odometry coordinate system by minimizing the following error function:
E(R) = Σ_{t=1}^{N} ( 1 − ⟨R · ∆pvo′(t), ∆pgps(t)⟩ / ( ‖∆pvo′(t)‖ ‖∆pgps(t)‖ ) ) .
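The paper solves this by nonlinear optimization; as a side note, for normalized direction vectors the same summed-cosine criterion is also maximized by the closed-form Kabsch/SVD construction, sketched here as an alternative (our illustration, not the paper's code):

```python
import numpy as np

def align_rotation(d_vo, d_gps):
    """Rotation R maximizing the summed cosine similarity between the
    (normalized) VO and GPS incremental translation directions."""
    a = d_vo / np.linalg.norm(d_vo, axis=1, keepdims=True)
    b = d_gps / np.linalg.norm(d_gps, axis=1, keepdims=True)
    H = a.T @ b                                  # correlation of unit directions
    U, _, Vt = np.linalg.svd(H)
    s = np.ones(H.shape[0])
    s[-1] = np.sign(np.linalg.det(Vt.T @ U.T))   # enforce a proper rotation
    return Vt.T @ np.diag(s) @ U.T               # det(R) = +1
```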
In order to bypass the accumulated noise which grows over time, we choose N = 4. This problem is easily solved by any nonlinear optimization algorithm, and so we can express the visual odometry locations pvo′′ in GPS Cartesian coordinates and GPS time.
Fig. 6. Comparison between the GPS trajectory and the recovered visual
odometry trajectory with the LCD calibration.
C. Influence of distant features
While analyzing Libviso2 source code, we noticed that
it rounds all zero disparities to 1.0. This is necessary since
otherwise there would be a division by zero in the triangulation
[Fig. 9 plot: GPS and odometry rotation angles vs. time; MSE = 24.05, MAE = 2.882]
procedure. However, this means that all features with zero disparity are triangulated onto a plane which is much closer to the camera than the infinity where they should be (note that the distance of that plane depends on the baseline). In order to investigate the influence of that decision on the recovered trajectory, we changed the magic number from 1.0 to 0.01. The effects of that change are shown in Fig. 7, where we note a significant improvement with respect to Fig. 6.
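The effect is easy to see numerically from the rectified-stereo triangulation relation Z = f·b/d (the focal length and baseline values below are made-up illustrative numbers, not our rig's calibration):

```python
import numpy as np

def triangulated_depth(disparity_px, f_px, baseline_m, floor_px):
    """Depth from rectified-stereo disparity, Z = f * b / d, with the
    disparity clamped from below to avoid division by zero."""
    return f_px * baseline_m / np.maximum(disparity_px, floor_px)

d = np.array([0.0, 0.0, 5.0])        # two features at infinity, one nearby
z_clamp_1 = triangulated_depth(d, 700.0, 0.12, floor_px=1.0)     # original magic number
z_clamp_001 = triangulated_depth(d, 700.0, 0.12, floor_px=0.01)  # corrected value
# clamping at 1.0 px places "infinite" points only f*b = 84 m away;
# a 0.01 px floor pushes them 100x farther, approximating infinity
```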
Fig. 9. Comparison of incremental rotations as obtained with GPS and visual
odometry with the LCD calibration and library correction.
These metrics are used below to define the mean square error (MSE) and the mean absolute error (MAE) for a quantitative assessment of the achieved accuracy.
Fig. 7. Comparison between the GPS trajectory and the recovered visual
odometry trajectory with the LCD calibration and library correction.
The absolute incremental translations and the resulting scale errors for this case are shown in Fig. 8, and the corresponding absolute incremental rotations in Fig. 9. We note a very good alignment for incremental translation, and a somewhat less successful alignment for incremental rotation. Note that the large discrepancies in absolute incremental rotation at times 35 s and 97 s occur at low speeds (cf. Fig. 8, left), when equation (7) becomes less well-conditioned.
[Fig. 8 plot: incremental distances and scale error vs. time; MSE = 0.101, MAE = 0.249]
Fig. 8. Comparison of incremental translations (left) and the resulting scale error (right) as obtained with GPS and visual odometry with the LCD calibration and library correction.
Note that this issue could also have been solved by simply neglecting features with zero disparity. However, that would be wasteful, since features at infinity provide valuable constraints for recovering the rotational part of the inter-frame motion. We believe that this has been overlooked in the original library since Libviso was originally tested on a stereo setup with a 4× larger baseline and a 2× larger resolution [11]. The original setup thus produces far fewer features with zero disparity, so the effect of these features on the reconstruction accuracy is not easily noticeable.
D. Quantitative comparison of the achieved accuracy
We assess the overall achieved accuracy of the recovered trajectory pvo′′(t) by relying on the previously introduced incremental metrics ∆φ and ∆s, which are now used to define the following error measures:
MSEtrans = (1/N) Σ_{t=1}^{N} (∆svo′′(t) − ∆sgps(t))² ,
MAEtrans = (1/N) Σ_{t=1}^{N} |∆svo′′(t) − ∆sgps(t)| ,
MSErot = (1/N) Σ_{t=1}^{N} (∆φvo′′(t) − ∆φgps(t))² ,
MAErot = (1/N) Σ_{t=1}^{N} |∆φvo′′(t) − ∆φgps(t)| .
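All four definitions reduce to one small helper (illustrative sketch, our naming):

```python
import numpy as np

def mse_mae(x_vo, x_gps):
    """Mean square error and mean absolute error between incremental
    VO and GPS metrics (either displacements ∆s or angles ∆φ)."""
    e = np.asarray(x_vo, dtype=float) - np.asarray(x_gps, dtype=float)
    return float(np.mean(e ** 2)), float(np.mean(np.abs(e)))
```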
The obtained results are presented in Table I. The rows of the table correspond to the metrics MSErot, MAErot, MSEtrans and MAEtrans. The columns correspond to the original library with the A4 calibration (A4), the original library with the LCD calibration (LCD1), and the corrected library with the LCD calibration (LCD2). A considerable improvement is observed between A4 and LCD1, while the difference between LCD1 and LCD2 is still significant.
[Table I: MSEtrans (m²), MAEtrans (m), MSErot (deg²) and MAErot (deg) for the A4, LCD1 and LCD2 configurations]
E. Experiments on the artificial dataset
The previous results show that, alongside camera calibration, the feature triangulation accuracy has a large impact on the accuracy of visual odometry. In this section we explore that finding for different baselines of the stereo rig. To do this, we develop an artificial model of a rectified stereo camera system. On input, the model takes the camera intrinsic
and extrinsic parameters as well as some parameters of the
camera motion. Then it generates an artificial 3D point cloud
and projects it to the image plane in every frame of camera
motion. The point cloud is generated in a way so that its
projections resemble features we encounter in real world while
analysing imagery acquired from a driving car. On output, the model produces feature tracks which are supplied as input to the visual odometry library. Fig. 10 shows the reconstructed straight trajectory for different camera baseline setups. As one would expect, the obtained accuracy drops significantly as the camera baseline is decreased. Furthermore, Fig. 11 shows the results on the same dataset after the modification in the treatment of zero-disparity features inside the library. The graphs in the figure show that for small baselines the modified library performs increasingly better than the original. This improvement occurs because the number of zero-disparity features increases as the baseline becomes smaller.
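The last observation can be reproduced with a toy version of such a model: project a point cloud into a rectified pair and count the features whose disparity d = f·b/Z quantizes to zero pixels (the focal length and the depth range are assumed values for illustration, not the paper's settings):

```python
import numpy as np

def zero_disparity_fraction(depths_m, f_px, baseline_m):
    """Fraction of features whose disparity d = f * b / Z rounds to
    zero pixels in a rectified stereo pair."""
    return float(np.mean(np.round(f_px * baseline_m / depths_m) == 0))

rng = np.random.default_rng(0)
depths = rng.uniform(5.0, 500.0, size=2000)   # assumed road-scene depth range
fractions = [zero_disparity_fraction(depths, 700.0, b)
             for b in (0.5, 0.2, 0.12, 0.09)]
# the fraction of zero-disparity features grows as the baseline shrinks
```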
significantly affect the reconstruction accuracy. Additionally,
we have seen that features with zero disparity should be treated
with care, especially in small-baseline setups. Finally, the most
important conclusion is that small-baseline stereo systems can
be employed as useful tools in SfM analysis of video from the
driver’s perspective.
Our future research shall be directed towards exploring the
influence of other implementation details to the reconstruction
accuracy. The developed implementations shall be employed
as a tool for improving the performance of several computer
vision applications for driver assistance, including lane recognition, lane departure warning and traffic sign recognition. We
also plan to collect a larger dataset corpus which would contain georeferenced videos acquired by stereo rigs with different baselines.
This work has been supported by research projects Vista
(EuropeAid/131920/M/ACT/HR) and Research Centre for Advanced Cooperative Systems (EU FP7 #285939).
Fig. 10. Reconstructed trajectories with original Libviso2 for baselines b = 0.5, 0.2, 0.12 and 0.09.
Fig. 11. Reconstructed trajectories with modified Libviso2 for baselines b = 0.5, 0.2, 0.12 and 0.09.
We have proposed a testbed for assessing the accuracy of
visual odometry with respect to the readings of an unsynchronized consumer-grade GPS sensor. The testbed allowed us to
acquire an experimental GPS-registered dataset suitable for
evaluating existing implementations in combination with our
own stereo system. The acquired dataset has been employed to assess the influence of the calibration dataset and of some implementation details on the accuracy of the reconstructed trajectories. The obtained experimental results show that a consumer-grade GPS system is able to provide useful groundtruth for assessing the performance of visual odometry. This still holds even if the synchronization and alignment with the camera system are performed at the postprocessing stage. The experiments also
show that the size and the quality of the calibration target may
[1] R. Bunschoten and B. Krose, "Visual odometry from an omnidirectional vision system," in Proc. IEEE Int. Conf. on Robotics and Automation (ICRA '03), 2003, pp. 577–583.
[2] C. Silva and J. Santos-Victor, "Robust egomotion estimation from the normal flow using search subspaces," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1026–1034, 1997.
[3] J. Borenstein, H. R. Everett, L. Feng, and D. Wehe, "Mobile robot positioning: Sensors and techniques," Journal of Robotic Systems, vol. 14, no. 4, pp. 231–249, Apr. 1997.
[4] S. Goldberg, M. Maimone, and L. Matthies, "Stereo vision and rover navigation software for planetary exploration," in IEEE Aerospace Conference Proceedings, vol. 5, 2002.
[5] C. Yang, M. Maimone, and L. Matthies, "Visual odometry on the Mars exploration rovers," in IEEE Int. Conf. on Systems, Man and Cybernetics, vol. 1, 2005, pp. 903–910.
[6] "Image sequence analysis test site (EISATS)," nz/EISATS, the University of Auckland, New Zealand.
[7] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, "Vision meets robotics: The KITTI dataset," International Journal of Robotics Research, 2013.
[8] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," in Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
[9] C. Harris and M. Stephens, "A combined corner and edge detector," in Proceedings of the Alvey Vision Conference, 1988, pp. 147–152.
[10] J. Shi and C. Tomasi, "Good features to track," in Proceedings of the Conference on Computer Vision and Pattern Recognition, Seattle, Washington, Jun. 1994, pp. 593–600.
[11] B. Kitt, A. Geiger, and H. Lategahn, "Visual odometry based on stereo image sequences with RANSAC-based outlier rejection scheme," in Proceedings of the IEEE Intelligent Vehicles Symposium, 2010.
[12] L. Quan and Z.-D. Lan, "Linear N-point camera pose determination," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 8, pp. 774–780, Aug. 1999.
[13] D. Nistér, O. Naroditsky, and J. R. Bergen, "Visual odometry for ground vehicle applications," Journal of Field Robotics, vol. 23, no. 1, pp. 3–20.
[14] L. Morency and R. Gupta, "Robust real-time egomotion from stereo images," in Proc. Int. Conf. on Image Processing (ICIP 2003), pp. 719–722.
[15] C. Engels, H. Stewénius, and D. Nistér, "Bundle adjustment rules," in Photogrammetric Computer Vision, 2006, pp. 1–6.
Surface registration using genetic algorithm
in reduced search space
Vedran Hrgetić, Tomislav Pribanić
Department of Electronic Systems and Information Processing
Faculty of Electrical Engineering and Computing
Zagreb, Croatia
[email protected],
[email protected]
Abstract – Surface registration is a technique that is used in
various areas such as object recognition and 3D model
reconstruction. The problem of surface registration can be analyzed as an optimization problem of seeking a rigid motion between two different views. Genetic algorithms can be used for solving this optimization problem, both for obtaining a robust parameter estimate and for its fine-tuning. The main drawback of genetic algorithms is that they are time-consuming, which makes them unsuitable for online applications. Modern acquisition systems enable solutions that immediately provide information on the rotational angles between the different views, thus reducing the dimension of the optimization problem. The paper gives an analysis of a genetic algorithm implemented under the condition that the rotation matrix is known, and a comparison of these results with the results when this information is not available.
Keywords – computer vision;
reconstruction; genetic algorithm
Surface registration is an important step in the process of reconstructing a complete 3D object. Acquisition systems can mostly give only a partial view of the object, and for its complete reconstruction one should capture it from multiple views. The task of surface registration algorithms is to determine the corresponding surface parts in a pair of observed 3D point clouds, and on that basis to determine the spatial translation and rotation between the two views.
Various techniques for surface registration have been proposed, and they can generally be divided into two groups: coarse and fine registration methods [1]. In coarse registration the
main goal is to compute an initial estimation of the rigid
motion between the two views, while in the fine registration
the goal is to obtain the most accurate solution by refining a
known initial estimation of the solution.
The problem of finding the required rigid motion can be viewed as solving a six-dimensional optimization problem (translation along the x-, y- and z-axes, and rotation about the x-, y- and z-axes). Genetic algorithms (GA) can be used to robustly solve this hard optimization problem, and their advantage is that they are applicable both as coarse and as fine registration methods [1].
In this work we propose a surface registration method
assuming that the rotation is provided by an inertial sensor and
the translation vector is still left to be found. We note that the
registration method, based on the abovementioned
assumption, has been successfully tested earlier [4]. However,
in this work we specifically employ GA in that context. We justify our assumption by the fact that today's technology has made affordable a wide palette of inertial devices which reliably output data about object orientation. For example, an inertial sensor can be used explicitly [2]; alternatively, there are smart cameras with an embedded on-board inertial sensor unit for orientation detection in 3D space [3].
Genetic algorithm is a metaheuristic based on the concept of
natural selection and evolution (Fig. 1.). GA represents each
candidate solution as an individual that is defined by its
parameters (i.e. its genetic material) and each candidate solution
is qualitatively evaluated by a fitness function. Better solutions are more likely to reproduce, and the reproduction process is defined by crossover and mutation. The use of genetic algorithms has the advantage of avoiding local minima, which is a common problem in registration, especially when an initial motion estimate is not provided. Another advantage of genetic algorithms is that they work well in the presence of noise and of the outliers given by the non-overlapping regions. The main problem of GA is the time required to converge, which makes it unsuitable for online applications.
Chow presented a dynamic genetic algorithm for surface
registration where every candidate solution is described by a
chromosome composed of the 6 parameters of the rigid motion
that accurately aligns a pair of range images [5]. The
chromosome is composed of the three components of the
translation vector and the three angles of the rotation vector. In
order to minimize the registration error, the median of
Euclidean distances between corresponding points in two
views is chosen as the fitness function. A similar method was proposed the same year by Silva [6]. The main advantage of that work is that a more robust fitness function is used and an initial guess is not required.
corresponding points instead of the mean for the evaluation of solutions, initial tests have shown that the mean gives better practical results. This can be explained by the different use of GA: whereas Chow uses GA as a fine registration method, we use GA as both a coarse and a fine registration method, and in the absence of a good initial solution using the median can cause the algorithm to converge to a local minimum more easily. Finally, we evaluate the proposed GA method assuming that the rotation is known in advance.
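A compact sketch of such a fitness function (the overall mean of nearest-neighbour Euclidean distances under a candidate rigid motion) may clarify the setup; the brute-force closest-point search stands in for the KD-tree used later in the paper, and all names and the rotation convention are our own illustration:

```python
import numpy as np

def rotation_xyz(rx, ry, rz):
    """Rotation matrix from x-, y-, z-axis angles (radians), R = Rz Ry Rx."""
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def fitness(chromosome, src, dst):
    """Mean Euclidean distance from each transformed source point to its
    closest point in the destination cloud (lower is better)."""
    t, angles = chromosome[:3], chromosome[3:]
    moved = src @ rotation_xyz(*angles).T + t
    d2 = np.sum((moved[:, None, :] - dst[None, :, :]) ** 2, axis=2)
    return float(np.mean(np.sqrt(d2.min(axis=1))))
```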
A. Implementation of genetic algorithm
We use two genetic algorithms in a sequence. The first is used as the coarse registration method and produces an initial estimate of the solution, while the other is used as the fine registration method (Fig. 2.).
Crossover and mutation are the basic genetic operators that are used to generate a new population of solutions from the previous population. In our work, crossover is accomplished by generating two random breaking points between the six variables of the chromosome, and mutation is defined as the addition or subtraction of a randomly generated value to one of the variables of the chromosome. The mutation probability and the maximum value by which a variable can mutate change dynamically during execution.
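The two operators can be sketched as follows (a minimal illustration under our own naming; the paper's exact implementation may differ):

```python
import numpy as np

rng = np.random.default_rng(7)

def crossover(parent_a, parent_b):
    """Two random breaking points among the six chromosome variables;
    the child takes the middle segment from the second parent."""
    i, j = sorted(rng.choice(np.arange(1, 6), size=2, replace=False))
    child = parent_a.copy()
    child[i:j] = parent_b[i:j]
    return child

def mutate(chromosome, p_mut, max_step):
    """Add or subtract a random value (up to max_step) to each variable
    independently with probability p_mut."""
    out = chromosome.copy()
    for k in range(out.size):
        if rng.random() < p_mut:
            out[k] += rng.uniform(-max_step, max_step)
    return out
```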
Fig. 2. Implementation of genetic algorithm
Population size and the number of generations are two important factors of GA design. It is often cited in the literature that a population of 80 to 100 individuals is sufficient to solve more complex optimization problems [7]. Overpopulation may affect the ability of the fitter solutions to successfully reproduce (survival of the fittest) and thus prevent the convergence of the algorithm towards the desired solution. If the population is too small, there is a greater danger that the algorithm will converge to some local extreme. After initial testing we have chosen the parameters for GA as displayed in Table 1.
Fig. 1. Genetic algorithm
In this work we have used the same definition of the chromosome as was suggested by Chow in his paper [5]. The fitness function is defined as the overall mean of all Euclidean distances between the corresponding points of the two clouds of 3D points for a given rigid motion. The objective of the genetic algorithm is to find a candidate solution of the rigid motion that minimizes the overall mean distance between the corresponding points of the two views. Although Chow claims that it is better to use the median of the Euclidean distances between
The coarseGA function implements GA as a coarse registration method. The population is initialized by generating the initial correspondences in such a way that the point closest to the center of mass of the first point cloud is randomly matched with points in the second point cloud. The translation
vector for each such motion between the views is determined. The angles of the rotation vector are then initialized to randomly generated values, where every solution is rotated about only one angle; or, if we adopt the reduced search space, the a priori known values of the angles are inserted into the algorithm. The coarseGA function implements the elitism principle, transferring the best two solutions directly to the next generation. The rest of the generation is created by the combined processes of crossover and mutation. The mutation probability for each gene is 16%. The algorithm runs for a fixed 250 generations.
FineGA inherits the last population of coarseGA. In each iteration fineGA keeps the two best solutions, while 44 new individuals are generated by the processes of crossover and mutation, and four new individuals are generated by mutation only. The starting mutation probability is 20% and it increases by 5% every 25 iterations, while the maximum value by which a parameter can change decreases by a factor of 0.8. The rationale for dynamically adjusting the mutation parameters is that, as the population converges to a good solution, smaller adjustments of the genes are needed, and mutation therefore becomes the primary source of improvement of the genetic material. That means that a mutation needs to happen more often, but at the same time it should produce less radical changes in the genetic material.
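The schedule above can be written down directly (the starting maximum step size is not given in the paper, so step0 below is an assumed placeholder):

```python
def mutation_schedule(iteration, p0=0.20, step0=1.0):
    """fineGA mutation schedule: the probability grows by 5 percentage
    points every 25 iterations, while the maximum mutation step shrinks
    by a factor of 0.8 (step0 is an assumed starting value)."""
    k = iteration // 25
    return p0 + 0.05 * k, step0 * 0.8 ** k
```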
Finding corresponding points in two clouds of 3D points is an extremely time-consuming operation, so to minimize the overall execution time both downsampling of the 3D images and a KD-tree search method are used. The KD-tree method for finding the closest points in two point clouds was suggested in earlier works [8] and has proven to be very fast compared to other approaches.
The software was implemented in the MATLAB development environment. Thirteen 3D test images were obtained by partial reconstruction of a model doll using a structured light scanner. The algorithm was tested both in the situation where no a priori information about the rotational angles is available and where the exact values of the angles are known. In order to come up with the known ground-truth values, we manually marked 3–4 corresponding points on each pair of views. That allowed us to compute a rough estimate of the ground-truth data, which we further refined using the well-known iterative closest point (ICP) algorithm. ICP is known to be very accurate given a good enough initial solution [9]. That was certainly the case in our experiments, since we chose the corresponding points manually and very carefully. Therefore, in the present context we regard such registration data as ground-truth values. The basic algorithm (i.e. assuming no rotation data are known in advance) has proven successful for the registration of neighboring point clouds, but insufficiently accurate for a good
registration between mutually distant views. An additional problem is the significant time needed for the algorithm to produce the results. The improvement of the system, based on the reduction of the search space, has proven much more successful.
Table 2. shows the average deviation of the results from the ground-truth values when we search for corresponding points in two more distant views (i.e. we compare views 3 and 1, 4 and 2, etc.). The first row represents the results when the rotation is not known (GA1) and the second row displays the results when the rotation is known (GA2). As the algorithm finds three parameters more quickly than all six, there is no need for the GA to run through all 500 generations, which significantly decreases the average execution time. More importantly, we get much better results. Tables 3. and 4. show the results for views 4-2 and 13-11. The first row shows the ground-truth values, the second row shows the values found by GA when the rotation is not known, and the third row shows the values when the rotation is known. The column labeled % displays the percentage of corresponding points between the two point clouds. The results for the given views can also be seen in Fig. 3. – Fig. 6. Green and red images represent two different views; the left part of each picture shows the situation before the rigid motion between the views is applied, and the right part shows the situation after the rigid motion is applied and the images are rotated and translated by the values given by GA.
Fig. 3. Views 4-2, unknown rotation
Surface registration is an important step in the process of 3D object reconstruction and can be seen as an optimization problem with six degrees of freedom. The objective of the registration algorithms is to find the spatial translation and rotation between the views so that the two clouds of 3D points properly overlap each other. The genetic algorithm has proven to be a suitable method for solving this particular optimization problem and can be used both as a coarse and as a fine registration method. The tests have shown that reducing the search space of the optimization problem from six to three parameters leads to better results and faster execution of the algorithm. This is important because today's acquisition systems allow us to deploy solutions where the information about the rotational angles between the different views is readily available. Therefore, our next research objective will be the design of a 3D reconstruction system where rotation data are readily available and the surface registration reduces to computing the translation part only.
[1] J. Salvi, C. Matabosch, D. Fofi, J. Forest, A review of recent image registration methods with accuracy evaluation, Image and Vision Computing 25, 2007, 578–596.
Fig. 4. Views 4-2, assuming known rotation in advance
[2] MTx Access date: August 2013.
[3] Shirmohammadi, B., Taylor, C.J., 2011. Self-localizing smart camera
networks. ACM Transactions on Embedded Computing Systems, Vol. 8, pp.
[4] T. Pribanić, Y. Diez, S. Fernandez, J. Salvi. An Efficient Method for
Surface Registration. International Conference on Computer Vision Theory
and Applications, VISIGRAPP 2013, Barcelona Spain, 2013. 500-503
[5] K. C. Chow,H. T.Tsui, T. Lee, Surface registration using a dynamic
genetic algorithm, Pattern Recognition 37 2004, 105-107
Fig. 5. Views 13-11, unknown rotation
[6] L. Silva, O. Bellon, K. Boyer, Enhanced, robust genetic algorithms for
multiview range image registration, in: 3DIM03. Fourth International
Conference on 3-D Digital Imaging and Modeling, 2003, 268–275.
[7] K. F. Man, K. S. Tang, S. Kwong, Genetic algorithms: concepts and
applications, IEEE transactions on industrialelectronics, vol. 43, no. 5,
1996, 519-535
[8] M. Vanco, G. Brunnett, T. Schreiber, A hashing strategy for efficient k-nearest neighbors computation, in: Proceedings of the International Conference Computer Graphics, 1999, pp. 120–128.
[9] Rusinkiewicz, S., Levoy, M., 2001. Efficient variant of the ICP
algorithm, in: 3rd International Conference on 3-D Digital Imaging and
Modeling, pp. 145–152
Fig. 6. Views 13-11, assuming known rotation in advance
Filtering for More Accurate Dense Tissue
Segmentation in Digitized Mammograms
Mario Mustra, Mislav Grgic
University of Zagreb, Faculty of Electrical Engineering and Computing, Unska 3, 10000 Zagreb, Croatia
[email protected]
Abstract - Breast tissue segmentation into dense and fat tissue
is important for determining the breast density in mammograms.
Knowing the breast density is important both in diagnostic and
computer-aided detection applications. There are many different
ways to express the density of a breast and good quality
segmentation should provide the possibility to perform accurate
classification no matter which classification rule is being used.
Knowing the right breast density and having the knowledge of
changes in the breast density could give a hint of a process which
started to happen within a patient. Mammograms generally suffer from the problem of overlapping tissue, which can result in inaccurate detection of tissue types. Fibroglandular tissue presents rather high attenuation of X-rays and appears brighter in the resulting image, but overlapping fibrous tissue and blood vessels can easily be mistaken for fibroglandular tissue by automatic segmentation algorithms.
Small blood vessels and microcalcifications are also shown as
bright objects with similar intensities as dense tissue but do have
some properties which make it possible to suppress them in the
final results. In this paper we try to divide dense and fat tissue by
suppressing the scattered structures which do not represent
glandular or dense tissue in order to divide mammograms more
accurately in the two major tissue types. For suppressing blood
vessels and microcalcifications we have used Gabor filters of
different size and orientation and a combination of
morphological operations on the filtered image with enhanced contrast.
Keywords - Gabor Filter; Breast Density; CLAHE; Morphology
Computer-aided diagnosis (CAD) systems have become an
integral part of modern medical systems and their
development should provide a more accurate diagnosis in
shorter time. Their aim is to help radiologists, especially in
screening examinations, when a large number of patients is
examined and radiologists often spend very little time reading
non-critical mammograms. Mammograms are X-ray breast images of
usually high resolution with moderate to high bit-depth, which
makes them suitable for capturing fine details. Because of the large number of captured details,
computer-aided detection (CADe) systems have difficulties in
detection of desired microcalcifications and lesions in the
image. Since mammograms are projection images in
grayscale, it is difficult to automatically differentiate types of
breast tissue because different tissue types can have the same
or very similar intensity. The problem which occurs in
different mammograms is that the same tissue type is shown
with a different intensity and therefore it is almost impossible
to set an accurate threshold based solely on the histogram of
the image. To overcome that problem different authors have
come up with different solutions. Some authors used statistical
feature extraction and classification of mammograms into
different categories according to their density. Other
approaches filtered the images and then extracted
features from the filter response images to obtain a good
texture descriptor which can be classified more easily and provides
good class separability. Among methods which use statistical
feature extraction, Oliver et al. [1] obtained very good results
using a combination of statistical features extracted not directly
from the image, but from the gray level co-occurrence
matrices. We can say that they have used 2nd order statistics
because they wanted to describe how much adjacent dense
tissue exists in each mammogram. Images were later classified
into four different density categories, according to BI-RADS
[2]. Authors who used image filtering techniques tried to
divide, as precisely as possible, breast tissue into dense and fat
tissue. With the accurate division of dense and fat tissue in
breasts it would be possible to quantify the results of breast
density classification and classification itself would become
trivial because we would have a numerical result: the number of
pixels belonging to each of the two groups. However, the task
of defining the appropriate threshold for dividing breast tissue
into two categories is far from simple. Each different
mammogram captured using the same mammography device
is captured with slightly different parameters which will
affect the final intensity of the corresponding tissues in
different images. These parameters are also influenced by the
physical property of each different breast, for example its size
and amount of dense and fat tissue. In image acquisition
process the main objective is to produce an image with very
good contrast and no clipping in both low and high intensity
regions. Reasons for different intensities for the corresponding
tissue, if we neglect usage of different imaging equipment, are
difference in the actual breast size, difference in the
compressing force applied to the breast during capturing
process, different exposure time and different anode current.
Having this in mind authors tried to overcome that problem by
applying different techniques which should minimize
influence of capturing inconsistencies. Muhimmah and
Zwiggelaar [3] presented an approach of multiscale histogram
analysis, having in mind that image resizing will affect the
histogram shape because of detail removal when the image is
downsized. In this way they were able to remove small
bright objects from images and tried to get satisfactory results
by determining which objects correspond to large tissue areas.
Petroudi et al. [4] used Maximum Response 8 filters [5] to
obtain a texton dictionary which was used in conjunction with
the support-vector machine classifier to classify the breast into
four density categories. Different equipment for capturing
mammograms produces resulting images which have very
different properties. The most common division is in two main
categories: SFM (Screen Film Mammography) and FFDM
(Full-Field Digital Mammography). Tortajada et al. [6] have
presented a work in which they compare the accuracy of the
same classification method on SFM and FFDM
images. The results they have obtained show a
high correlation between automatic classification and expert
readings, and overall results are slightly better for FFDM
images. Even inside the same category, such as FFDM,
captured images can be very different. DICOM standard [7]
recommends bit-depth of 12 bits for mammography modality
and images are stored according to the recommendation.
However, if we later observe the histogram of those images, it
is clear that the actual bit-depth is often much lower and is
usually below 10 bits.
In this paper we present a method which should provide a
possibility for division of the breast tissue between
parenchymal tissue and fatty tissue without influence of
fibrous tissue, blood vessels and fine textural objects which
surround the fibroglandular disc. Segmentation of dense or
glandular tissue from the entire tissue will be made by setting
different thresholds. Our goal is to remove tissue which
interferes with dense tissue and makes the division less
accurate because non-dense tissue is being treated as dense
due to its high intensity when compared to the rest of the
tissue. Gabor filters generally proved to be efficient in
extracting features for the breast cancer detection from
mammograms because of their sensitivity to edges in different
orientations [8]. Therefore, for the removal of blood vessels,
we have used a Gabor filter bank which is sensitive to brighter
objects which are rather narrow or have high spatial
frequency. Output of the entire filter bank is an image which is
created from superimposed filter responses of different
orientations. Subtraction of the image which represents vessels
and dense tissue boundaries from the original image produces
a much cleaner image which can later be enhanced in order to
equalize intensity levels of the corresponding tissue types
among different images. In that way we will be able to distinguish
dense tissue from fat more accurately. The proposed method
has been tested on mammograms from the mini-MIAS
database [9] which are all digitized SFMs.
This paper is organized as follows. In Section II we present
the idea behind image filtering using Gabor filter bank and
explain which setup we will use for filtering out blood vessels
and smaller objects. In Section III we present results of
filtering with the appropriate filter and discuss results of
region growing after contrast enhancement and application of
morphological operations. Section IV draws the conclusions.
Gabor filters are linear filters which are most commonly
used for edge detection as well as textural feature extraction.
Each filter can be constructed differently
and it can vary in frequency, orientation and scale. Because of
that, Gabor filters provide good flexibility and orientation
invariance. A Gabor filter in complex notation can be
expressed as:
G(x, y) = exp(−((x cos θ + y sin θ)² + γ²(y cos θ − x sin θ)²) / (2σ²)) · exp(i(2π(x cos θ + y sin θ)/λ + ψ)),  (1)
where θ is the orientation of the filter, γ is the spatial aspect
ratio, λ is the wavelength of the sinusoidal factor, σ is the
sigma or width of the Gaussian envelope and ψ is the phase
offset. This gives a good possibility to create different filters
which are sensitive to different objects in images. To be able
to cover all possible blood vessels and small linearly shaped
objects it is necessary to use more than one orientation. In our
experiment we have used 8 different orientations and therefore
obtained the angle resolution of 22.5°. Figure 1 (a)-(h) shows
8 different filter orientations created using (1) with the angle
resolution of 22.5° between each filter, from 0° to 157.5°.
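As a concrete illustration, the complex Gabor function of (1) can be sampled directly on a pixel grid. The following NumPy sketch builds such a bank of 8 orientations; the kernel size, σ and λ values are illustrative choices, not the parameters used in the paper:

```python
import numpy as np

def gabor_kernel(size, wavelength, theta, sigma, gamma=0.5, psi=0.0):
    """Sample the complex Gabor function of Eq. (1) on a size x size grid."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)   # coordinates rotated by theta
    yr = y * np.cos(theta) - x * np.sin(theta)
    envelope = np.exp(-(xr**2 + gamma**2 * yr**2) / (2.0 * sigma**2))
    carrier = np.exp(1j * (2.0 * np.pi * xr / wavelength + psi))
    return envelope * carrier

# A bank of 8 orientations with 22.5 degree angular resolution.
bank = [gabor_kernel(31, wavelength=8.0, theta=np.deg2rad(a), sigma=6.0)
        for a in np.arange(0.0, 180.0, 22.5)]
```

Filtering the ROI with every kernel and superimposing the magnitude responses yields the vessel/edge image that is subtracted from the original.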
Besides the orientation angle, one of the most commonly
changed variables in (1) is the sinusoidal frequency. Different
sinusoidal frequencies will provide different sensitivity of the
used filter for different spatial frequencies of objects in
images. If the chosen filter contains more wavelengths, the filtered
image will correspond more to the original image because the
filter will be sensitive to objects of a high spatial frequency,
e.g. details. In the case of a smaller number of wavelengths, the
filtered image will contain highly visible edges. Figure 2
shows Gabor filters of different wavelengths with the same orientation and scale.
Figure 1. Gabor filters of the same scale and wavelength with different
orientations: (a) 0°; (b) 22.5°; (c) 45°; (d) 67.5°; (e) 90°; (f) 112.5°; (g) 135°;
(h) 157.5°.
Figure 2. (a)-(d) Gabor filters containing a larger to smaller number of
sinusoidal wavelengths, respectively.
There is, of course, another aspect of the filter which needs to
be observed, and that is the actual dimension of the filter.
The dimension of the filter should be chosen carefully according
to the image size and the size of the objects which we want to
filter out, and should be less than 1/10 of the image size.
Preprocessing of images is the first step which needs to be
performed before filtering. Preprocessing steps include image
registration, background suppression with the removal of
artifacts and the pectoral muscle removal in the case of
Medio-Lateral Oblique mammograms. For this step we have
used manually drawn segmentation masks for all images in the
mini-MIAS database. These masks were hand-drawn by an
experienced radiologist and, because of their accuracy, can be
treated as ground truth since there is no segmentation error
which can occur as the output of automatic segmentation
algorithms. The entire automatic mask extraction process has
been described in [10] and steps for the image "mdb002" from
the mini-MIAS database are shown in Figure 3.
After the preprocessing we have proceeded with locating
the fibroglandular disc in each breast image. The
fibroglandular disc is a region containing mainly two tissue
types, dense or glandular and fat, and according to their
distribution it is possible to determine the density category to
which a breast belongs.
Figure 3. (a) Original mini-MIAS image "mdb002"; (b) Registered image
with removed background and pectoral muscle; (c) Extracted ROI from the
same image which will be filtered using Gabor filter.
Dense tissue mainly has higher intensity in mammograms
because it presents higher attenuation for X-rays than fat
tissue. Intensity also changes with the relative position
towards edge of the breast because of the change in thickness.
Since the fibroglandular disc is our region of interest, we have
extracted only that part of the image. The entire preprocessing step,
done for all images in the mini-MIAS database, is described in
[11]. To extract the ROI we have cropped a part of the image
according to the maximum breast tissue dimensions as shown
in Figure 4. Actual ROI boundaries are chosen to be V and H
for vertical and horizontal coordinates respectively:
V = max(horizontal)/2 : max(horizontal)/2 + max(horizontal),
H = max(vertical)/2 : max(vertical),  (2)
where max(horizontal) is the vertical coordinate of the
maximal horizontal dimension, and max(vertical) is the
horizontal coordinate of the maximal vertical dimension.
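The two coordinates used in (2) can be recovered from a binary breast mask by simple projections; a minimal NumPy sketch (the function name and the toy mask are illustrative):

```python
import numpy as np

def roi_anchor(mask):
    """Return (row of maximal breast width, column of maximal breast
    height) from a binary tissue mask, i.e. the coordinates
    max(horizontal) and max(vertical) used to place the ROI."""
    widths = mask.sum(axis=1)    # number of tissue pixels in every row
    heights = mask.sum(axis=0)   # number of tissue pixels in every column
    return int(widths.argmax()), int(heights.argmax())
```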
Figure 6. (a) Enhanced ROI from "mdb001" with the applied threshold; (b)
Enhanced ROI from "mdb006" with the applied threshold.
Figure 4. Defining maximal breast tissue dimension for ROI extraction.
With this approach we get a large part of or even the entire
fibroglandular disc isolated and there is no need for its exact
segmentation. It would be good if we could eliminate fibrous
tissue and blood vessels and treat our ROI as if it were
completely uniform in the case of low density breasts. To be
able to perform that task we can choose an appropriate Gabor
filter sensitive to objects that we want to remove. A good
Gabor filter for detection of objects with high spatial frequency
contains fewer sinusoidal wavelengths, like the ones shown in
Figure 2 (c) and (d).
Contrast Limited Adaptive Histogram Equalization
(CLAHE) [12] is a method for local contrast enhancement
which is suitable for equalization of intensities in each ROI
that we observe. Contrast enhancement obtained using
CLAHE method will provide better intensity difference
between dense and fat tissue. The CLAHE method is based on a
division of the image into smaller blocks to allow better
contrast enhancement and at the same time uses clipping of
the maximal histogram components to achieve a better
enhancement of mid-gray components. If we observe the same
ROI before and after applying CLAHE enhancement it is clear
that fat tissue can be filtered out easier after the contrast
enhancement. Figure 5 shows application of the contrast
enhancement using CLAHE on "mdb001".
Figure 5. (a) Original ROI from "mdb001"; (b) Same ROI after the contrast
enhancement using CLAHE.
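The tile-based clipping that CLAHE performs can be sketched as follows. This is a deliberately simplified version (illustrative tile count and clip limit; full CLAHE additionally blends neighbouring tile mappings bilinearly, which this sketch omits):

```python
import numpy as np

def clahe_simplified(img, tiles=8, clip_fraction=0.01):
    """Per-tile clipped histogram equalization (simplified CLAHE).

    Unlike full CLAHE, no bilinear blending is done between tiles,
    so tile borders may remain visible.
    """
    img = img.astype(np.uint8)
    out = np.empty_like(img)
    h, w = img.shape
    th, tw = h // tiles, w // tiles
    for i in range(tiles):
        for j in range(tiles):
            r0, c0 = i * th, j * tw
            r1 = h if i == tiles - 1 else r0 + th
            c1 = w if j == tiles - 1 else c0 + tw
            tile = img[r0:r1, c0:c1]
            hist = np.bincount(tile.ravel(), minlength=256).astype(float)
            # Clip the tallest bins and redistribute the excess, which
            # limits contrast amplification in near-uniform regions.
            limit = max(1.0, clip_fraction * tile.size)
            excess = np.maximum(hist - limit, 0.0).sum()
            hist = np.minimum(hist, limit) + excess / 256.0
            cdf = hist.cumsum()
            lut = np.round(255.0 * cdf / cdf[-1]).astype(np.uint8)
            out[r0:r1, c0:c1] = lut[tile]
    return out
```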
CCVW 2013
Poster Session
If we apply the threshold on the enhanced ROI we will get
the result for "mdb001" and "mdb006" as shown in Figure 6
(a) and (b). These two images belong to the opposite
categories according to the amount of dense tissue. The
applied threshold is set to 60% of the mean image intensity.
After applying threshold, images have visibly different
properties according to the tissue type. It is not possible to
apply the same threshold because different tissue types have
different intensities. Contrast enhancement makes the
detection of fibrous tissue and vessels easier, especially after
Gabor filtering (Figure 7 (a) and (b)).
After contrast enhancement and filtering images using
Gabor filter to remove fibrous tissue we need to make a
decision in which category according to density each breast
belongs. For that we will use binary logic with different
thresholds applied to the images. We will apply two thresholds, at
60% and 80% of the maximal intensity and calculate the area
contained in both situations. For that we will use logical AND
operator, Figure 8. From Figure 8 (e) and (f) we can see that
combination of threshold images using the logical AND for
low and high density give the correct solution. Figures 8 (e)
and (f) show the result of (a) AND (c) and (b) AND (d).
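The decision logic described above can be sketched as follows (threshold fractions from the text; the function name is illustrative):

```python
import numpy as np

def density_masks(roi):
    """Binary masks at 60% and 80% of the maximal ROI intensity,
    plus their logical AND, as combined in Fig. 8 (e) and (f)."""
    peak = float(roi.max())
    low = roi >= 0.6 * peak
    high = roi >= 0.8 * peak
    both = np.logical_and(low, high)
    return low, high, both
```

Since every pixel above 80% of the maximum is also above 60%, the AND mask coincides with the 80% mask; the density decision therefore rests on comparing the areas (pixel counts) of the individual masks.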
Figure 7. (a) ROI from "mdb001" after applying Gabor filter; (b) ROI from
"mdb006" after applying Gabor filter.
Figure 8. (a) "mdb001" after filtering out blood vessels and threshold at 60%
of maximal intensity; (b) "mdb006" after filtering out blood vessels and
threshold at 60% of maximal intensity; (c) "mdb001", threshold at 80%;
(d) "mdb006", threshold at 80%; (e) "mdb001" threshold at 60% AND
threshold at 80%; (f) "mdb006" threshold at 60% AND threshold at 80%.
In this paper we have presented a method which combines
Gabor filtering with local contrast enhancement biased toward
mid-gray intensities. The contrast enhancement has been achieved using
the CLAHE method. Gabor filters have been used for the removal
of blood vessels and smaller portions of fibrous tissue which
have similar intensity as dense tissue. The combination of different
thresholds in conjunction with the logical AND operator
provided a good setup for determining whether we have
segmented fat or dense tissue and therefore gives us a rule
for segmenting each mammogram with a different threshold, no
matter what the overall tissue intensity is. The advantage of
Gabor filters over classical edge detectors is the ease of changing
orientation and the possibility to cover all possible orientations by
superimposing filter responses. Usage of Gabor filters
reduces the number of false positive results which come from
blood vessels or small fibrous tissue segments, and contrast
enhancement provides comparability of the same tissue type
in different mammograms. Our future work in this field will be the
development of automatic segmentation algorithms for dense
tissue in order to achieve a quantitative breast density
classification by knowing the exact amount of dense tissue.
The work described in this paper was conducted under the
research project "Intelligent Image Features Extraction in
Knowledge Discovery Systems" (036-0982560-1643),
supported by the Ministry of Science, Education and Sports of
the Republic of Croatia.
[1] A. Oliver, J. Freixenet, R. Martí, J. Pont, E. Pérez, E.R.E. Denton, R.
Zwiggelaar, "A Novel Breast Tissue Density Classification
Methodology", IEEE Transactions on Information Technology in
Biomedicine, Vol. 12, Issue 1, January 2008, pp. 55-65.
[2] American College of Radiology, "Illustrated Breast Imaging Reporting
and Data System (BI-RADS) Atlas", American College of Radiology,
Third Edition, 1998.
[3] I. Muhimmah, R. Zwiggelaar, "Mammographic Density Classification
using Multiresolution Histogram Information", Proceedings of the
International Special Topic Conference on Information Technology in
Biomedicine, ITAB 2006, Ioannina - Epirus, Greece, 26-28 October
2006, 6 pages.
[4] S. Petroudi, T. Kadir, M. Brady, "Automatic Classification of
Mammographic Parenchymal Patterns: a Statistical Approach",
Proceedings of the 25th Annual International Conference of the IEEE
Engineering in Medicine and Biology Society, Cancun, Mexico, Vol. 1,
17-21 September 2003, pp. 798-801.
[5] M. Varma and A. Zisserman, "Classifying images of materials:
Achieving viewpoint and illumination independence", Proceedings of
the European Conference on Computer Vision, Copenhagen, Denmark,
2002, pp. 255–271.
[6] M. Tortajada, A. Oliver, R. Martí, M. Vilagran, S. Ganau, . Tortajada,
M. Sentís, J. Freixenet, "Adapting Breast Density Classification from
Digitized to Full-Field Digital Mammograms", Breast Imaging, Springer
Berlin Heidelberg, 2012, pp. 561-568.
[7] Digital Imaging and Communications in Medicine (DICOM), NEMA
[8] Y. Zheng, "Breast Cancer Detection with Gabor Features from Digital
Mammograms", Algorithms 3, No. 1, 2010, pp. 44-62.
[9] J. Suckling, J. Parker, D.R. Dance, S. Astley, I. Hutt, C.R.M. Boggis, I.
Ricketts, E. Stamatakis, N. Cernaez, S.L. Kok, P. Taylor, D. Betal, J.
Savage, "The Mammographic Image Analysis Society Digital
Mammogram Database", Proceedings of the 2nd International Workshop
on Digital Mammography, York, England, pp. 375-378, 10-12 July
[10] M. Mustra, M. Grgic, R. Huzjan-Korunic, "Mask Extraction from
Manually Segmented MIAS Mammograms", Proceedings ELMAR-2011, Zadar, Croatia, September 2011, pp. 47-50.
[11] M. Mustra, M. Grgic, J. Bozek, "Application of Gabor Filters for
Detection of Dense Tissue in Mammograms", Proceedings ELMAR-2012, Zadar, Croatia, September 2012, pp. 67-70.
[12] K. Zuiderveld, "Contrast Limited Adaptive Histogram Equalization",
Graphic Gems IV. San Diego, Academic Press Professional, 1994, pp.
Flexible Visual Quality Inspection
in Discrete Manufacturing
Tomislav Petković, Darko Jurić and Sven Lončarić
University of Zagreb
Faculty of Electrical Engineering and Computing
Unska 3, HR-10000 Zagreb, Croatia
Email: {tomislav.petkovic.jr, darko.juric, sven.loncaric}
Abstract—Most visual quality inspections in discrete manufacturing are composed of length, surface, angle or intensity
measurements. Those are implemented as end-user configurable
inspection tools that should not require an image processing
expert to set up. Currently available software solutions providing
such capability use a flowchart based programming environment,
but do not fully address the robustness of the inspection flowchart and
can require a redefinition of the flowchart if a small product variation is introduced.
In this paper we propose an acquire-register-analyze image
processing pattern designed for discrete manufacturing that
aims to increase the robustness of the inspection flowchart by
consistently addressing variations in product position, orientation
and size. The proposed pattern is transparent to the end-user
and simplifies the flowchart. We describe a developed software
solution that is a practical implementation of the proposed
pattern. We give an example of its real-life use in industrial
production of electric components.
Automated visual quality inspection is the process of
comparing individual manufactured items against some pre-established standard. It is frequently used both for small-
and high-volume assembly lines due to low costs and non-destructiveness. Also, ever increasing demands to improve
quality management and process control within an industrial
environment are promoting machine vision and visual quality
inspection with the goal of increasing the product quality and
production yields.
When designing a machine vision system many different
components (camera, lens, illumination and software) must be
selected. A machine vision expert is required to select the
system components based on requirements of the inspection
task by specifying [1]:
Camera: type (line or area), field of view, resolution,
frame rate, sensor type, sensor spectral range,
Lens: focal length, aperture, flange distance, sensor
size, lens quality,
Illumination: direction, spectrum, polarization, light
source, mechanical adjustment elements, and
Software: libraries to use, API ease of use, software
structure, algorithm selection.
Selected components are then used to measure length, surface,
angle or intensity of the product. Based on these measurements
a decision about the quality of an inspected product is made.
Although software is usually a small part of a quality inspection
system, in addition to a machine vision expert, an
image processing expert is required to select the software and
to define the image processing algorithms that will be used.
Nowadays, the introduction of the GigE Vision [2] and GenICam
[3] standards has significantly simplified the integration of machine
vision cameras and image processing libraries, making vision
solutions for the industry more accessible by eliminating the
need for proprietary software solutions for camera interaction.
Existing open source image processing libraries especially suitable for development of vision applications, such as OpenCV
(Open Source Computer Vision Library, [4]), provide numerous algorithms that enable rapid development of visual quality
inspection systems and simplify software selection. However,
specific details of the image processing chain for any particular
visual inspection must be defined by an image processing
expert on a case-by-case basis, especially if a solution that is
both robust and flexible is required.
To eliminate the need for the image processing expert, state-of-the-art image processing algorithms are bundled together
into specialized end-user configurable measurement tools that
can be easily set-up. Such tools should be plugins or modules
of a larger application where image acquisition, image display,
user interface and process control parts are shared/reused. Usually a graphical user interface is used where flowcharts depict
image processing pipeline, e.g. NI Vision Builder [5]. Such
environments provide many state-of-the-art image processing
algorithms that can be chained together or are pre-assembled
into measurement tools, but there is no universally accepted
way of consistently addressing problems related to variations
in product position, orientation and size in the acquired image.
To further reduce the need for an image processing expert a
registration step that would remove variation due to product
position, orientation and size should be introduced. Included
registration step would enable simpler deployment of more
robust image processing solutions in discrete manufacturing1 .
In this paper we propose an acquire-register-analyze image
processing pattern that is especially suited to discrete manufacturing. The proposed acquire-register-analyze pattern aims to
increase the reproducibility of an image processing flowchart by
consistently addressing variations in product position, orientation and size through the registration step that is implemented
in a way transparent to the end-user.
1 Discrete manufacturing is production of any distinct items capable of being
easily counted.
For discrete manufacturing, once an image is acquired,
processing is usually done in two steps [6]: the first is object
localization, which is followed by object scrutiny and measurement. A more detailed structure of an image processing pipeline
is given in [1], where typical image processing steps are listed:
image acquisition,
image preprocessing,
feature or object localization,
feature extraction,
feature interpretation,
generation of results, and
handling interfaces.
A. Acquire-Analyze Pattern
The seven typical image processing steps can be decomposed
as follows. The image preprocessing step includes image correction techniques such as illumination equalization or distortion
correction. Today this is usually done by the acquisition
device itself, e.g. the GenICam standard [3] requires tools that
are sufficient for intensity or color correction. Feature or
object localization is the first step of the image processing
chain and is usually selected on a case-by-case basis to adjust
for variations in object positioning. Feature extraction utilizes
typical processing techniques such as edge detection, blob
extraction, pattern matching and texture analysis. Results of
the feature extraction step are interpreted in the next step to
make actual measurements that are used to generate results
and to handle interfaces of an assembly line. We call this
image processing pattern acquire-analyze as feature or object
localization must be repeated for (almost) every new feature
of interest in the image2 . A structure of this pattern is shown
in Fig. 1.
Fig. 1. Acquire-analyze image processing pattern.
2 This holds for discrete manufacturing. For other types of manufacturing
processes registration is not always necessary, e.g. in fabric production the
feature of interest is texture, making registration unnecessary.
Shortcomings of the acquire-analyze pattern in discrete manufacturing are: a) image processing elements must be chosen
and chained together so variation in product placement does
not affect the processing, b) processing can be sensitive to
camera/lens repositioning or replacement during the lifetime of
the assembly line, and c) for complex inspections the image processing chain requires an image processing expert. Processing
can be more robust and the image processing chain easier to
define if a registration step is introduced in a way that makes
results of feature localization less dependent on variations in
product placement.
When introducing a registration step several requirements
must be fulfilled. Firstly, registration must be introduced in
a way transparent to the end user, and, secondly, registration
must be introduced so unneeded duplication of the image data
is avoided.
B. Acquire-Register-Analyze Pattern
A common usage scenario for visual inspection software
in discrete manufacturing assumes the end-user who defines
all required measurements and their tolerances in a reference
or source image of the inspected product. Inspection software
then must automatically map the image processing chain and
required measurements from the reference or source image
to the image of the inspected product, the target image, by
registering the two images [7]. The end result of the registration step
is: a) removal of variation due to changes in position of the
product, b) no unnecessary image data is created as no image
transforms are performed; instead, image processing algorithms
are mapped to the target image, c) as only image processing
algorithms are mapped, the overall mapping can be significantly
faster than if the target image data is mapped to the source
image, and d) no image interpolation is needed which can lead
to overall better performance as any interpolation artifacts are
eliminated by the system design.
In discrete manufacturing all products are solid objects, so
a feature-based global linear rigid transform is usually sufficient
to map the defined image processing chain from the source
to the target image; that is, compensating translation, rotation
and scaling by using simple homography is sufficient. Required
mapping transform is defined by a 4 × 4 matrix T [8]. This
transform matrix must be transparently propagated through
the whole processing chain. If a graphical user interface for
defining the processing chain is used this means all processing
tools must accept a transform matrix T as an input parameter
that can be optionally hidden from the end user to achieve transparency.
We call this image processing pattern acquire-register-analyze as feature or object localization is done once at the
beginning of the pipeline to find the transform T. A structure
of this pattern is shown in Fig. 2.
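For illustration, propagating measurement anchor points through a global linear rigid transform can be sketched as below. The paper stores T as a 4 × 4 matrix; this sketch uses the equivalent 3 × 3 planar (homogeneous) form, and all names and values are illustrative:

```python
import numpy as np

def similarity_transform(scale, angle, tx, ty):
    """Homogeneous 3x3 matrix combining scaling, rotation and translation."""
    c, s = scale * np.cos(angle), scale * np.sin(angle)
    return np.array([[c, -s, tx],
                     [s,  c, ty],
                     [0.0, 0.0, 1.0]])

def map_points(T, pts):
    """Map (x, y) measurement anchors from the source to the target image."""
    pts = np.asarray(pts, dtype=float)
    homo = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = homo @ T.T                  # apply T to every point
    return mapped[:, :2] / mapped[:, 2:3]

# Recovered registration: a 5 degree rotation and a small shift.
T = similarity_transform(1.0, np.deg2rad(5.0), 12.0, -3.0)
corners = map_points(T, [(100, 50), (180, 50), (180, 120), (100, 120)])
```

Because only the anchor coordinates are mapped, no target pixels are resampled, which is exactly the point of the pattern.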
To test the viability and usefulness of the proposed acquire-register-analyze pattern, visual inspection software was constructed. The software was written in the C++ and C# programming
languages for the .NET platform and uses the Smartek GigEVision SDK [9] for image acquisition, and OpenCV [4] and
Emgu CV [10] for implementing the image processing tools.
Proceedings of the Croatian Computer Vision Workshop, Year 1
the end-user, e.g. edge extraction, ridge extraction, object segmentation, thresholding, blob extraction etc. This is achieved
by representing all measurements by rectangular blocks. Inputs
are always placed on the left side of the block while outputs are
always placed on the right side of the block. Name is always
shown above the block and any optional control parameters are
shown on the bottom side of the block. Several typical blocks
are shown in Fig. 4.
Image acquisition
Feature localization
regions of interest
Feature extraction
feature image
Feature interpretation
image processing chain
global linear rigid transform
Fig. 2.
September 19, 2013, Zagreb, Croatia
Acquire-register-analyze image processing pattern.
Graphical user interface (Fig. 3) and logic to define an image
processing chain was written from scratch using C#.
A. User Interface
The interface of the inspection software must be designed to make various runtime tasks simple. Typical runtime tasks in visual inspection are [1]: a) monitoring, b) changing the inspection pipeline, c) system calibration, d) adjustment and tweaking of parameters, e) troubleshooting, f) re-teaching, and g) optimizing. These various tasks require several different interfaces, so the test application was designed as a tabbed interface with four tabs: the first for immediate inspection monitoring and tweaking, the second for overall process monitoring, the third for system calibration, and the fourth for definition of the image processing chain (Fig. 3).
Fig. 4. Examples of four image processing tools. Inputs are always on the left side, outputs are always on the right side, and optional parameters are on the bottom of the block.
B. Registration
Two blocks that are present in every inspection flowchart are the input and output blocks. For the acquire-register-analyze pattern, the first block in the flowchart following the input block is a registration block that internally stores the reference or source image. For every target image the recovered transform is automatically propagated to all following blocks. Note that the actual implementation of the registration is not fixed; any technique that enables recovery of a simple homography is acceptable. Using OpenCV [4], either simple normalized cross-correlation or more complex registration based on key-point detectors such as FAST, BRISK or ORB can be used.
Figs. 5 and 6 show the source and the target images with the displayed regions of interest where lines are extracted for the angle measurement tool. The user can transparently switch between the source and the target, as the software automatically adjusts the processing tool-chain.
C. Annotations and ROIs
An additional problem when making registration transparent to the end-user are user-defined regions-of-interest (ROIs) and annotations that show measurement and inspection results on a non-destructive overlay.
Fig. 3.
User interface for defining the image processing chain by flowchart.
The user interface for defining an image processing tool-chain must be composed so that the end-user is able to intuitively draw any required measurement in the reference image. The interface should also hide most image processing steps that are required to compute the measurement and that are not the expertise of the end-user.
CCVW 2013
Poster Session
Every user-defined ROI is usually a rectangular part of the reference or source image that has its own local coordinate system and therefore requires an additional transform matrix, so the image processing chain can be mapped first from the source to the target, and then from the target to the ROI. Composition of transforms achieves the desired mapping. However, as all image processing tools are mapped to the input image or to defined ROIs, and as there can be many different mapping transforms, the inspection result cannot be displayed as a non-destructive overlay unless a transform from the local coordinate system to the target image is known³. This transform, needed to achieve a correct overlay display, can also be described by a 4 × 4 matrix D that must be computed locally and propagated every time a new ROI is introduced in the image processing chain.
Thus, for the acquire-register-analyze pattern, elements of the image processing chain can be represented as blocks that must accept T and D as inputs/outputs that are automatically connected by the software (Fig. 4); these connections can be auto-hidden depending on the experience of the end-user.
Every image processing chain thus begins with a registration block that registers the prerecorded source image to the target image and automatically propagates the registration results through the processing chain.
The developed prototype software was accepted by the engineers and staff in the plant, has successfully completed the testing stage, and is now used in regular production.
A. Contact Alignment Inspection
There are two automated rotary tables with 8 nests for welding, where two metal parts are welded to form the spring contact. There are variations in product placement between nests and variations in the input image due to differences in camera placement above the two rotary tables (see Fig. 8).
The image processing chain is composed of a registration step that automatically adjusts the position of subsequent measurement tools, which measure user-defined distances and angles and compare them to the product tolerances. The proposed design enables one product reference image, in which the required product dimensions are specified, to be effectively shared between production lines, thus reducing the engineering overhead as only one processing pipeline must be maintained.
Fig. 5. Tool’s ROIs for line extraction and angle measurement shown in the
source image.
The developed prototype software using the proposed acquire-register-analyze pattern is in production use at the Elektro-Kontakt d.d. Zagreb plant. The plant manufactures electrical switches and energy regulators for electrical stoves (discrete manufacture). The observed variations in acquired images that must be compensated using registration are caused by product movement due to positioning tolerances on a conveyor and by changes in camera position⁴.
Regardless of the cause, the variation in product placement with respect to the original position in the source image can be computed; thus every image processing chain used starts with a registration block.
³Note that this is NOT a transform back to the source image.
⁴After regular assembly line maintenance the camera position is never perfectly restored.
Fig. 6. Tool's ROIs for line extraction and angle measurement shown in the target image.
Fig. 7. Variations in product placement (first and second assembly lines).
Fig. 8. Rotary table for contact welding.
B. Energy Regulator Inspection
There are four assembly lines where an energy regulator for electrical stoves is assembled. At least seventeen different product subtypes (Fig. 9) are manufactured. There is a variation in product placement due to positioning tolerances on the conveyor belt and due to variations in the cameras' positions across all four assembly lines. Defined inspection tool-chains must be shareable across the lines to make the system maintenance simple.
[5] (2013, Aug.) NI Vision Builder for Automated Inspection (AI). [Online].
[6] E. R. Davies, Computer and Machine Vision: Theory, Algorithms, Practicalities, 4th ed. Academic Press, 2012.
[7] L. G. Brown, "A survey of image registration techniques," ACM Comput. Surv., vol. 24, no. 4, pp. 325–376, Dec. 1992. [Online].
[8] J. D. Foley, A. van Dam, S. K. Feiner, J. F. Hughes, and R. L. Phillips, Introduction to Computer Graphics, 1st ed. Addison-Wesley Professional, Aug. 1993.
[9] (2013, Aug.) GigEVisionSDK Camera Acquisition PC-Software. [Online].
[10] (2013, Aug.) Emgu CV .Net wrapper for OpenCV. [Online].
Fig. 9. Four different product variants (body and contact shape) of an energy
regulator recorded on the same assembly line.
The large number of product variants requires a user interface that enables flexible adaptation and tweaking of the processing pipeline. Here the graphical programming environment (Fig. 3) was invaluable, as it enables the engineer to easily redefine the measurements directly on the factory floor. Furthermore, due to the registration step only one reference image per product subtype is required, so only seventeen different image processing chains must be tested and maintained.
In this paper we have presented the acquire-register-analyze image processing pattern, which aims to increase the robustness of an image processing flowchart by consistently addressing variations in product position, orientation and size. The benefits of the proposed pattern are: a) no unnecessary image data is created from the input image by the application during image processing, b) the overall speed and throughput of the image processing pipeline are increased, and c) interpolation artifacts are avoided.
We have demonstrated the feasibility of the proposed pattern through case studies and real-world use at the Elektro-Kontakt d.d. Zagreb plant.
The authors would like to thank Elektro-Kontakt d.d. Zagreb and Mr. Ivan Tabaković for their invaluable support.
[1] A. Hornberg, Handbook of Machine Vision. Wiley-VCH, 2006.
[2] (2013, Aug.) GigE Vision standard. [Online].
[3] (2013, Aug.) GenICam standard. [Online].
[4] (2013, Aug.) OpenCV (Open Source Computer Vision Library). [Online].
Author Index
Banić, N. · Brkić, K. · Cupec, R. · Filko, D. · Gold, H. · Grgić, M. · Hrgetić, V. · Ivanjko, E. · Jurić, D. · Kalafatić, Z. · Kitanov, A. · Kovačić, K. · Krešo, I. · Lončarić, S. · Lopar, M. · Muštra, M. · Nyarko, E. K. · Petković, T. · Petrović, I. · Pinz, A. · Pribanić, T. · Rašić, S. · Ribarić, S. · Šegvić, S. · Ševrović, M. · Sikirić, I. · Zadrija, V.