Institutionen för systemteknik
Department of Electrical Engineering
Examensarbete
Image Segmentation and Target Tracking using
Computer Vision
Examensarbete utfört i Bildbehandling
vid Tekniska högskolan i Linköping
av
Sebastian Möller
LiTH-ISY-EX--11/4424--SE
Linköping 2011
Department of Electrical Engineering
Linköpings universitet
SE-581 83 Linköping, Sweden
Linköpings tekniska högskola
Linköpings universitet
581 83 Linköping
Image Segmentation and Target Tracking using
Computer Vision
Examensarbete utfört i Bildbehandling
vid Tekniska högskolan i Linköping
av
Sebastian Möller
LiTH-ISY-EX--11/4424--SE
Handledare:
Klas Nordberg
ISY, Linköpings universitet
Thomas Svensson
FOI, Totalförsvarets forskningsinstitut
Examinator:
Klas Nordberg
ISY, Linköpings universitet
Linköping, 9 May, 2011
Keywords: Tracking, IR, Computer Vision, Machine vision, Segmentation, Quadrature filters, Background model, Frame differences
Abstract
In this master thesis the possibility of detecting and tracking objects in multi-spectral infrared video sequences is investigated. The current method, which uses fixed-size rectangles, has significant disadvantages. These disadvantages are addressed by using image segmentation to estimate the shape of the object, and the result of the image segmentation is used to determine the infrared contrast of the object. Our results show that some objects give very good segmentation, tracking and shape detection. The objects that perform best are the flares and countermeasures, but helicopters seen from the side, with significant movement, are also detected better with our method. The motion of the object is very important, since movement is the main component in successful shape detection; this is because helicopters are much colder than flares and engines. Detecting the presence and position of moving objects is easier and can be done quite successfully even for helicopters, and using structure tensors we can also detect the presence and estimate the position of stationary objects.
Sammanfattning
In this master thesis the possibilities of detecting and tracking objects of interest in multi-spectral infrared video sequences are investigated. The current method, which uses rectangles of fixed size, has its disadvantages. These disadvantages are addressed by using image segmentation to estimate the shape of the desired targets.

In addition to detection and tracking, we also try to find the shape and contour of objects of interest, so that the more exact fit can be used in the contrast calculations. This segmented contour replaces the old fixed rectangles previously used to calculate the intensity contrast of objects in the infrared wavelengths.

The results presented show that for some objects, such as countermeasures and flares, it is easier to obtain a good contour and good tracking than it is for helicopters, which were another desired target type. The difficulties that arise with helicopters are largely due to them being much colder, which means that parts of the helicopter can be completely hidden in the noise from the image sensor. To compensate for this, methods are used that assume that the object moves a lot in the video, so that the motion can be used as a detection parameter. This gives good results for the video sequences where the target moves a lot in relation to its size.
Acknowledgments
I would like to thank my supervisors Thomas Svensson and David Bergström for assisting me with my work, and also my supervisor and examiner Klas Nordberg and my opponent David Sandberg. I also thank, of course, the authors of my sources for the ongoing research in this field, without whom this work would not have been possible. Thanks to Mikael Möller and Anneli Gottfridsson for help with proofreading this report and improving its readability.
Contents

1 Introduction
  1.1 Problem description
  1.2 Data
  1.3 Glossary
  1.4 Notation
  1.5 Tensors
      1.5.1 Eigenvalues for Tensors

2 Preprocessing
  2.1 Spectral mixer
  2.2 Median filtering

3 Segmentation
  3.1 Frame differences
      3.1.1 Algorithm
      3.1.2 Results
  3.2 Background model
      3.2.1 Algorithm
      3.2.2 Results
  3.3 Structure tensor with image gradients
      3.3.1 Algorithm
      3.3.2 Results
  3.4 Structure tensor with Quadrature filters
      3.4.1 Algorithm
      3.4.2 Results

4 Post processing
  4.1 Threshold
  4.2 Area filter
  4.3 Point representation
      4.3.1 Algorithm
  4.4 Mean Shift Clustering
      4.4.1 Algorithm
      4.4.2 Results
  4.5 Convex hull
  4.6 Tracking
      4.6.1 Tracking to reduce false positives
      4.6.2 Method
  4.7 Quality measures
      4.7.1 Image variance
      4.7.2 Object count
      4.7.3 Mean motion
      4.7.4 Area

5 Results
  5.1 Segmentation
  5.2 Evaluation
      5.2.1 Comparison of methods I, III and IV
      5.2.2 Comparison of methods II, III and IV
      5.2.3 Comparison of methods I and II
      5.2.4 Comparison of methods III and IV

6 Further work

A Dual tensor, Ñ

Bibliography
Chapter 1
Introduction
This master thesis covers methods for segmentation and tracking in multi-spectral infrared images. The methods used are structure tensors based on image gradients or quadrature filters; we also tested background modelling with either median estimations or frame differences. These methods were used to create a framework that, as robustly as possible, detects objects of interest with minimal operator intervention. Chapter 1 starts with a description of the data. Chapters 2 to 4 describe the methods mentioned above and their ability to solve the problems of object detection and tracking, along with some quality measures used to estimate the quality of the achieved results. In chapter 5 we report the combinations of methods that gave the best results for our data and their corresponding quality measures. The last chapter contains some topics for future research.
1.1 Problem description
The Swedish Defence Research Agency (FOI) wants to detect targets in multi-spectral infrared images. This is done to analyse military targets, and the primary interest is the infrared radiant intensity contrast, which is used to create an IR signature of the target. The IR signature in turn can be used to determine if countermeasures look similar to the vehicle they are protecting; similar IR signatures improve the probability of successfully evading a threat.
The infrared contrast is calculated by taking the difference between the average intensity of the found object and the immediate background intensity, and then multiplying this difference by the area of the object. To do this we need to be able to automatically track areas such as vehicles and flares. The current method, known as the peak filter at FOI, is to place a fixed-size rectangle so that the sum of all pixel intensities inside its area is as large as possible. This method has some disadvantages, of which the fixed-size rectangle is the major issue. Objects are generally not rectangular and also change in size due to rotation, changing temperature or speed. The rectangle method also has problems with false positives and poor tracking results, where the rectangle may stick to a spurious pixel or fail to follow the target well. These disadvantages are currently fixed by operator intervention, which is a time-consuming solution.
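To make the contrast quantity concrete, the following is a minimal Python/NumPy sketch of the calculation described above; it is only an illustration, not FOI's implementation, and the ring-shaped background region, the bg_width parameter and the function name are assumptions made for this example.

    import numpy as np
    from scipy.ndimage import binary_dilation

    def ir_contrast(frame, mask, bg_width=5):
        """Contrast = (mean object intensity - mean immediate background) * object area.

        frame : 2D array of pixel intensities, mask : 2D bool array from the segmentation.
        bg_width : assumed width (pixels) of the ring used as 'immediate background'.
        """
        ring = binary_dilation(mask, iterations=bg_width) & ~mask  # pixels just outside the object
        mean_object = frame[mask].mean()
        mean_background = frame[ring].mean()
        area = mask.sum()                                          # object area in pixels
        return (mean_object - mean_background) * area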
This report describes methods for tracking and segmentation that perform better than the older method. These methods solve the problem of the fixed-size rectangles by replacing them with a flexible contour of the same size and shape as the object. For known objects there are preconfigured combinations of methods, and for new objects it is possible to create new combinations.
The aim of this report is to present how to estimate a mask, from data, that can be provided to the operator. An example of an estimated mask is seen in figure 1.2b, where white marks the sections of the image in figure 1.2a that are to be kept and black marks the sections to be removed; the result is seen in figure 1.2c.
To summarise: our aim is to separate the object(s) in an image from the background and to estimate a mask (or masks) that can be used for further image processing.
The system design that completes the tasks set up so far in the introduction is summarised in figure 1.1. As seen there, the design is a pipes & filters design¹, which allows for parallel processing and also makes each unit independent of the others. The filters can also be combined in a multitude of ways, and some intriguing possibilities show up when we allow for feedback of image data.

¹ Pipes & filters is also known as a pipeline and is a system design where the data flow through pipes and encounter different filters along the path. This design pattern allows for concurrent processing and easy maintenance since the filters are well separated.

Figure 1.1. A system summary showing the flow of data and the pipes & filters structure of the system design.
1.2 Data
With a multi-spectral infrared camera, sensitive to wavelengths between 1.5 µm and 5.5 µm, it is possible to detect objects that are not visible to the human eye (the human eye is sensitive to wavelengths between 0.39 µm and 0.75 µm [1]). Our camera uses a number of different spectral bands (n), usually 3 or 4, to be able to detect different objects. This means that we can, with the help of the camera, see the thermal radiation of objects colder than 525 °C², which is the lowest temperature visible to the human eye (in optimal conditions). This applies to "black" objects that do not reflect any light, but in reality there are reflections in the infrared that can both hide and reveal objects.

² Also known as the Draper point [2][3].

Figure 1.2. Original image, estimated mask and result after applying the mask.
The acquired data is stored in a hypercube where each pixel is described by a function f(x, y, t, i). Here x and y are image coordinates, t is time and i ∈ {1, . . . , n} is one of the spectral bands, where n is the number of spectral bands available. The function f returns a brightness value for the specified (x, y, t, i); if i and t are constant while x and y vary, we retrieve an image at time t for spectral band i.
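As a small illustration of how such a hypercube can be indexed in practice, the sketch below assumes the data is held in a NumPy array ordered as f[x, y, t, i]; the array name, sizes and axis order are assumptions made for this example only.

    import numpy as np

    # Hypothetical hypercube: 256 x 256 pixels, 100 frames, n = 4 spectral bands.
    f = np.zeros((256, 256, 100, 4), dtype=np.float32)

    t, i = 17, 2                        # one frame index and one spectral band
    image_ti = f[:, :, t, i]            # fixing t and i gives an image (x and y vary)
    pixel_spectrum = f[128, 64, t, :]   # all n band values for one pixel at time t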
The datasets presented in figures 1.3 to 1.6 are of the types that are commonly analysed at FOI and where automation could potentially save many man-hours of analysis. In these images we can also see some typical properties of IR image sensors. Due to their very high sensitivity they suffer much more from "veils" that distort the background brightness level in the images. This can be seen as a slow gradient across the image with a darker side and a brighter side. Related to this is also the much lower signal-to-noise ratio, which makes the Gaussian noise in the image more visible than in most visible-light cameras. It is also more difficult to create error-free sensors in the low volumes they are produced in, which means that the chips are more likely to have dead or spurious pixels, seen in the image as salt & pepper noise.
Luckily we can see progress in image quality over time, since we have access to both new and old data. The older image sensor produces the data seen in figure 1.3, whereas the newer sensor produces all other images.
figure 1.3   This is the first dataset and is a test of how well any method can handle dead pixels and noisy data. As seen in the figure, the real signal is very similar to the noise from the image sensor.
figure 1.4   This dataset is more recent than that in figure 1.3 and, as seen, much of the noise has been removed with an improved image sensor.
figure 1.5   The helicopter datasets show how similar a helicopter hull is to the sky in the infrared wavelength regions. This dataset is a difficult test of how well the methods handle weak but spatially large signals.
figure 1.6   The last dataset type is the chaff data, where many small signals are scattered in a cloud that has to be accurately tracked without too many false positives and without missing real signals.

1.3 Glossary
Dead pixels are pixels that give values unrelated to the image data. They can usually be seen as salt & pepper noise in unfiltered example pictures. They originate from damaged sectors on the image sensor chips.
Spurious pixels are pixels that, like dead pixels, do not contain data from the image, with one distinction: spurious pixels vary over time.
Mask is a binary image that contains the shape of the detected object, i.e. a determination of whether a portion of the image is part of the object or not. An example of a mask can be seen in figure 1.2.
False positives are targets that the algorithms have found that do not actually exist. They usually originate from dead pixels or noisy data. Post processing and the quality measures are an important part of removing and identifying false positives in the processed data.
False negatives are targets that the algorithm did not detect, in this text simply referred to as missed targets.
IR signature is a parameter that describes how a target is registered in the IR wavelengths.
Chaff is a type of countermeasure consisting of many small pieces of conductive foil. An example can be seen in figure 1.6.
Flare is a hot burning piece of phosphorus with additives to make the IR signature as similar to the defended target as possible. An example of this data can be seen in figure 1.4.
Figure 1.3. A sample from a dataset using an older camera and a longer distance; the target is a flare.
Figure 1.4. A sample from a dataset from the better camera, taken of a flare dropped from an aircraft.
Figure 1.5. A sample from a dataset taken of a helicopter.
Figure 1.6. A sample from a dataset taken of chaff dropped from an aircraft.
1.4 Notation
Due to the many different algorithms presented, sourced from different publications, there are some notational difficulties; however, I try to be as consistent as possible. In this paper the following notation is used:
x, c        Scalars
x, u        Vectors (set in bold)
|u|, ‖x‖₂   Magnitude of a vector, |x| = √(Σᵢ xᵢ²)
‖x‖₁        1-norm of a vector, ‖x‖₁ = Σᵢ |xᵢ|
û           Directional vector of unit length
F, M        Scalar valued functions on R²
F̄           The complement of image F, i.e. 1 − F for normalized images
∗           Convolution
∂/∂x        Smooth partial differentiation in the x coordinate
∇           Smooth gradient (∂/∂x, ∂/∂y)ᵀ
T, N        Matrices and tensors
‖T‖         Norm of matrices and tensors
Ñ, B̃        Dual tensors
F, F⁻¹      Fourier transform and its inverse
I           Input image to the segmentation step
E           Segmentation output image
O           Post processing output image, mask

1.5 Tensors
In this paper the term tensor refers to a symmetric 2 × 2 matrix for each pixel, where the four elements represent the local region. There are many ways to compute the tensor for images; the most common example is the structure tensor, which is generated by calculating the gradient of the image and then taking the outer product of the gradient at each pixel, as seen in equation 1.1 and figure 1.8.

T_g = \left(\nabla I \, \nabla I^T\right) * g =
\begin{pmatrix}
\left(\frac{\partial I}{\partial x}\right)^2 & \frac{\partial I}{\partial x}\frac{\partial I}{\partial y} \\
\frac{\partial I}{\partial x}\frac{\partial I}{\partial y} & \left(\frac{\partial I}{\partial y}\right)^2
\end{pmatrix} * g
\qquad (1.1)

where g is a Gaussian window for local mean estimation.
When working with tensors, every pixel has four data values assigned to it, and in this thesis the values have been separated so that the values in position (1, 1) for all pixels are stored in the upper left quadrant, the values in position (1, 2) in the upper right quadrant, and so on. This can be seen in figures 1.8 and 1.9, where the structure tensor has been estimated for the image in figure 1.7.
1.5.1 Eigenvalues for Tensors
For tensors the eigenvalues λ1 and λ2 represent the intrinsic dimensionality of the
region around the pixel in question. We arrange the eigenvalues so that λ1 ≥ λ2 .
Figure 1.7. IR image of a helicopter seen from the side. This picture is used for the
tensor calculation examples in figures 1.8 to 1.11
Now if λ1 and λ2 are small, then the region around the pixel is a featureless flat region; this case is called I0D, or intrinsically zero dimensional. If however λ1 > λ2, then the region contains a one-dimensional feature, i.e. a line. If both λ1 and λ2 are large, then the region has a two-dimensional feature such as a corner or two lines crossing. To summarize:

I0D   λ1 = λ2 = 0     Planes, featureless areas
I1D   λ1 > λ2 = 0     Lines and edges
I2D   λ1 ≥ λ2 > 0     Crossing lines and corners

The orientation of these features is determined by the respective eigenvectors v1 and v2. It is however important to remember with which parameters the tensor was estimated, since they affect which feature sizes will be detected.
Algorithm
Recall:

T = \begin{pmatrix} a & b \\ b & d \end{pmatrix}

The eigenvalues of T are straightforward to compute by the following equation:

\lambda = \frac{1}{2}\left( a + d \pm \sqrt{a^2 - 2ad + 4b^2 + d^2} \right) \qquad (1.2)
Figure 1.8. Example of the structure tensor Tg from image gradients. Note that salt & pepper noise gives strong signals, as does the true target, the helicopter.
Figure 1.9. Example of the tensor Tq from quadrature filters. Note that this is a low-frequency filter and that only larger objects are seen.
There are however two different cases for the eigenvectors: the first when b ≠ 0 and the second when b = 0, see the table below.

       b ≠ 0            b = 0
v1     (λ1 − d, b)ᵀ     (1, 0)ᵀ
v2     (λ2 − d, b)ᵀ     (0, 1)ᵀ
However, we do not need to explicitly calculate the eigenvalues for the image to be able to estimate the intrinsic dimensionality of the neighbourhoods of all pixels. We can save the computation of a square root and still get good results by using the Harris operator [4],

R(T) = det(T) − κ trace²(T) = ad − b² − κ(a + d)² = λ1 λ2 − κ(λ1 + λ2)²

where κ is a tunable parameter, usually in the interval 0.04 to 0.15.
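A minimal NumPy sketch of equation 1.2 and the Harris response, applied per pixel to the tensor component images a, b and d; the function names are my own and the default κ = 0.06 is just one value inside the stated interval.

    import numpy as np

    def tensor_eigenvalues(a, b, d):
        """Per-pixel eigenvalues of the symmetric tensor [[a, b], [b, d]], as in eq. 1.2."""
        root = np.sqrt(a**2 - 2*a*d + 4*b**2 + d**2)
        lam1 = 0.5 * (a + d + root)
        lam2 = 0.5 * (a + d - root)
        return lam1, lam2

    def harris_response(a, b, d, kappa=0.06):
        """R(T) = det(T) - kappa * trace(T)^2, avoiding the square root."""
        return (a * d - b**2) - kappa * (a + d)**2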
Figure 1.10. |λ1 | and |λ2 | of Tg from the tensor examples
Figure 1.11. |λ1 | and |λ2 | of Tq from the tensor examples
Chapter 2
Preprocessing
In this first step we change the input, a multi-spectral image f, into a grey-scale image I. The most important preprocessing step is to reduce the amount of data by combining the brightness values bi = f(x, y, t, i) of the spectral bands 1, . . . , n, to simplify the calculations. That is, to combine the brightness values b1, . . . , bn so that the pixels (x, y, t, b1), . . . , (x, y, t, bn) are mapped to a pixel (x, y, t, g(b1, . . . , bn)) for all (x, y, t).
2.1 Spectral mixer
The reason we need the transformation g is that the image segmentation methods we will use must have a scalar brightness, not a vector-valued one. The transformations we compare are the following three:

\frac{\bar{b}_1 + \cdots + \bar{b}_n}{n}, \qquad \mathrm{median}(\bar{b}_1, \ldots, \bar{b}_n) \qquad \text{and} \qquad \bar{b}_{\hat{k}}

where k̂ is the index of the brightness value given by max(|b1|, . . . , |bn|). The bi's for spectral band i and time t are pre-transformed as

\bar{b}_i = b_i - \mu_i^t

where we define

\mu_i^t = \frac{\sum_x \sum_y f(x, y, t, i)}{\sum_x \sum_y 1} \qquad (2.1)

as the mean of bi over x and y.¹ This reduces the complexity of having several spectral bands with different brightnesses and allows the methods to work with the more relevant adjusted brightness b̄. The pictures in figure 2.1 show the adjustment visually; note that the left image is in a different scale than the right one.

¹ Medians have also been evaluated, with similar results but much longer calculation times.

Figure 2.1. The left image shows µᵢᵗ for different times and for three spectral bands. The right image shows the brightness b̄ᵢ after adjusting for the mean value.
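A minimal sketch of this mixing step, assuming one frame of band data is held in an array of shape (rows, cols, n); it combines the mean adjustment of equation 2.1 with the absolute-maximum mixer chosen below, and the function name is an assumption.

    import numpy as np

    def mix_bands_absmax(frame_bands):
        """Mean-adjust each band (eq. 2.1) and keep, per pixel, the adjusted value
        with the largest absolute deviation from zero (the absolute-maximum mixer)."""
        adjusted = frame_bands - frame_bands.mean(axis=(0, 1), keepdims=True)  # b_i - mu_i
        k_hat = np.abs(adjusted).argmax(axis=2)                                # per-pixel band index
        return np.take_along_axis(adjusted, k_hat[..., None], axis=2)[..., 0]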
In figure 2.4 the result of different g's is shown. The g settled for was the absolute maximum, since it selects the b_k̂ which has the greatest deviation from the mean, zero. If we assume that the brightness is roughly normally distributed (see figure 2.2), then values in the tails are more likely to be associated with an object.

Figure 2.2. Image histogram of a single frame showing roughly normally distributed brightness values. The usual property of normal distributions still holds: it is more common to be close to the mean than further away.
This method, however, is not wholly satisfactory, and results might be improved if the method incorporated the actual wavelength of each band and the bands were combined in a more photometrically correct way. That is, if we could utilize the information in each spectral band to assign a temperature value to each pixel, determined from a black-body radiation assumption; this would require some data about our camera filters and knowledge about black-body radiation physics. It is however not necessary, since this simple and quick method gives satisfactory results.
Figure 2.4. The resulting images after mixing the adjusted bands with mean, median and absolute maximum respectively.
Figure 2.3. A small cut-out of the original data with four spectral bands.
2.2 Median filtering
Further preprocessing using median filtering can be useful to reduce salt & pepper noise and spurious pixels, with variations in brightness, caused by camera and filter errors. Since these pixels can cause false positives it is advantageous to filter them out.
Disadvantages of median filtering are that it rounds sharp corners, as seen in figure 2.6, and may remove details of the detected shape. Furthermore, since median filtering removes spurious pixels, these cannot be detected later and may therefore falsely be included in detected object shapes. This can be seen in figure 2.8 if we compare with figure 2.7, where the dead pixels have been removed from the mask. This biases the contrast calculations and is therefore a very significant drawback.
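A minimal example of this preprocessing step using SciPy's median filter; the 3 × 3 window size is an assumption made for illustration.

    import numpy as np
    from scipy.ndimage import median_filter

    def remove_salt_and_pepper(frame, size=3):
        """Median-filter one frame; a small window removes isolated dead and spurious
        pixels but also rounds sharp corners, as discussed above."""
        return median_filter(frame, size=size)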
The following figures give an artificial example of this bias error in the mean brightness µ. Mean brightness calculation with a mask is very similar to equation 2.1:

\mu = \frac{\sum_x \sum_y F(x, y) \cdot M(x, y)}{\sum_x \sum_y M(x, y)} \qquad (2.2)

where F is the image and M is the mask.
A test image F, with salt & pepper noise and a square object in the middle, is created as seen in figure 2.5. To the right, in figure 2.6, is the same image F′ after median filtering, where most noise is removed and the sharp corners of the square are smoothed.

Figure 2.5. Original image, F
Figure 2.6. Filtered image, F′
In figure 2.7 is the optimal segmentation² result, M, and in figure 2.8 the median-filtering segmentation result, M′. Note the black spots in the optimal segmentation, which indicate pixels that we do not have data for since they were damaged.

² Created by hand to be the best possible result.

Figure 2.7. Optimal mask, M
Figure 2.8. Filtered mask, M′
In figure 2.9 the foreground F · M′ of the image F, using the median filtering and segmentation result, is seen. In figure 2.10 the background F · M̄′ is shown. The object's estimated mean brightness µf, which will be used in contrast calculations, is calculated as

\mu_f = \frac{\sum_x \sum_y F \cdot M'}{\sum_x \sum_y M'}

and is in these images 1795, compared to 1000 which is the ground truth.

Figure 2.9. Foreground, F · M′
Figure 2.10. Background, F · M̄′
In figures 2.11 and 2.12 the foreground and background of the image are calculated using the optimal segmentation. With this segmentation the mean brightness µo is calculated as

\mu_o = \frac{\sum_x \sum_y F \cdot M}{\sum_x \sum_y M}

and is 1043, which is much closer to the ground truth of 1000.

Figure 2.11. Foreground, F · M
Figure 2.12. Background, F · M̄
Chapter 3
Segmentation
In chapter 3 we will discuss two methods for background removal: background models and frame differences. We will also look at two methods for object detection: structure tensors from image gradients and structure tensors from quadrature filtering. Background removal is a way to split the image into two segments, background and foreground; the foreground is expected to be the desired target. Object detection may detect the presence and position of a target but not its shape. These methods all take a grey-scale image I as input and produce an interest estimation image E. In this interest estimation image, which is also a grey-scale image, a brighter area is more likely to be of interest than a dark one.
Each method gets a section where its implementation, strengths and drawbacks are discussed. We need at least four methods because one method is not enough to solve all the different problems of background removal and object detection.
The goal of these methods is to detect and separate objects such as, but not limited to, helicopters, flares and chaff. This results in a foreground estimate describing which pixels are occupied by an object and which pixels are just background. The foreground estimate will however need some cleaning with post processing to reduce false positives and improve the results.
3.1 Frame differences
Frame differences utilize the fact that objects in motion behave differently from stationary objects, which are less likely to be interesting, and from noise. This works because the difference between two frames only shows the places where there has been a change. The method has significant drawbacks since it cannot handle stationary objects or objects that move slowly in relation to their size. The algorithm is taken from [5].

Figure 3.1. Two IR images, originally multi-spectral but with the spectra mixed together; the objects are a helicopter and a flare.
Figure 3.2. Relevant parts of the output from figure 3.1 after frame difference calculations with k1 = 3; the flare is segmented very well but the helicopter shows significant shadows from older frames.
3.1.1 Algorithm
The algorithm is quite simple but there is room for extensions. We start with the following definitions:

I_k = \text{image numbered } k, \qquad F_k = I_k - I_{k-k_1}

where k1 is a tunable time offset and k belongs to the set of all frames {1, . . . , n}, where n is the total frame count in the dataset. The source [5] used a second frame-difference step to get a temporal second derivative, which however is not used in this report. To remove spurious pixels, the algorithm removes those pixels whose absolute value is larger than the user-supplied parameter max:

E_k = \begin{cases} F_k & \text{if } |F_k| < max \\ 0 & \text{otherwise} \end{cases}
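A minimal sketch of the frame-difference step described above, assuming the frames are stacked in an array; the parameter names follow the text.

    import numpy as np

    def frame_difference(frames, k, k1=3, max_val=np.inf):
        """E_k from the frame-difference method: F_k = I_k - I_{k-k1}, with pixels whose
        magnitude exceeds max_val set to zero (the spurious-pixel removal)."""
        F = frames[k].astype(np.float64) - frames[k - k1].astype(np.float64)
        return np.where(np.abs(F) < max_val, F, 0.0)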
3.1.2 Results
For objects moving fast relative to their size this method performs well and, as seen in the result image in figure 3.2, most of the noise from figure 3.1 is stationary and is filtered away quite nicely. The motion requirement makes this method, by itself, less useful for larger objects, because such objects will cast "shadows" from older frames. This disadvantage makes the segmentation results unreliable for large objects.
3.2 Background model
This method is taken from [6], with original source [7], but with an added trick, also suggested in the book [6] but sourced from [8], of incrementally updating a median estimate of the background. The foreground is then estimated by subtracting this background estimate from the current frame. The method works best for stationary cameras or when we have a homogeneous background with a moving target. For good data the method is also very easy to tune, and it can be quite fast if a small number of images is used for each estimate.
3.2.1 Algorithm
There are two tuning parameters: the first one, β, is the number of frames used for each estimate, and the second one, α, is how fast the background is updated. Let as before {I_k}_{k=1}^{n} denote our consecutive frames. The algorithm works as follows (a small sketch of this first variant is given after the lists):

1. Initialize k = β.
2. Estimate the background: B_k = median(I_{k−β+1}, I_{k−β+2}, . . . , I_k).
3. Increment k by one.
4. Estimate the foreground: E_k = |I_k − B_{k−1}|.
5. Update the background: B_k = (1 − α) B_{k−1} + α median(I_{k−β+1}, I_{k−β+2}, . . . , I_k).
6. If k < n, return to item 2.

Another method is to incorporate the standard deviation of each pixel in the background. This adds two steps to the algorithm above, similar to the method in [9] but with a single Gaussian distribution per pixel instead of several.

1. Initialize k = β.
2. Estimate the background: µ_k = median(I_{k−β+1}, I_{k−β+2}, . . . , I_k).
3. Estimate the standard deviation: σ_k = std(I_{k−β+1}, I_{k−β+2}, . . . , I_k).
4. Increment k by one.
5. Estimate the foreground: E_k = |I_k − µ_{k−1}| / σ_{k−1}.
6. Update the background: µ_k = (1 − α) µ_{k−1} + α median(I_{k−β+1}, I_{k−β+2}, . . . , I_k).
7. Update the standard deviation: σ_k = (1 − α) σ_{k−1} + α std(I_{k−β+1}, I_{k−β+2}, . . . , I_k).
8. If k < n, return to item 2.
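The sketch referred to above is a minimal Python rendering of the first variant, assuming the frames are stacked in an array of shape (n, rows, cols); it is an illustration, not a verified reimplementation of the thesis code.

    import numpy as np

    def background_model(frames, beta=10, alpha=0.05):
        """Incremental median background model (first variant); yields (k, E_k)."""
        frames = np.asarray(frames, dtype=np.float64)
        B = np.median(frames[:beta], axis=0)                            # initial background B_beta
        for k in range(beta, len(frames)):
            E = np.abs(frames[k] - B)                                   # foreground estimate E_k
            window = frames[k - beta + 1:k + 1]                         # the last beta frames
            B = (1.0 - alpha) * B + alpha * np.median(window, axis=0)   # background update
            yield k, E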
3.2.2 Results
Overall this background model outperforms the frame-difference model in detecting objects, and it gives fewer false positives. It is however slower, which, while not a concern for this implementation, could be of interest in future development. For sequences where the target is moving around in the frame the segmentation can be very good, as seen in figure 3.3. For sequences where the camera tracks the object well we get the side effect of swapping foreground and background, i.e. the helicopter is believed to be the background and the sky the foreground. In figure 3.4 there is an example of the resulting poor segmentation.
3.3 Structure tensor with image gradients
This is a method for estimating the structure tensor¹ of the neighbourhood at each pixel. The method can be tuned for object size but, to save calculation time, scaling down the image is recommended for large objects.

¹ For a short explanation of the concept of tensors, see the introduction, section 1.5.
Figure 3.3. Here are the original image, the background estimate, the difference and the estimated shape of the helicopter. As seen, the resulting shape is very close to the actual shape in the original image.
Figure 3.4. A video sequence where some pixels are poorly segmented because the helicopter occupies them very often. The artefacts that cause the problem are clearly seen in the background estimation.
Figure 3.5. On top, the two functions g and g_x that, when convolved, produce the bottom convolution kernel, gᵀ ∗ g_x, which is used when estimating a continuous derivative over an image.
3.3.1 Algorithm
We estimate the image derivatives by convolving the image with a 2D signed Gaussian kernel, created from two separable 1D kernels, which are cut off in the spatial domain at ±3 standard deviations (which is sufficiently large). The standard deviation is tunable, so that we can change how the filter reacts to objects of different sizes. A large variance is used to diminish the response from small objects and amplify the response from large objects, since a larger neighbourhood is taken into consideration with larger variance.
First construct the separable filter kernels given the standard deviation σ, keeping the values within 3 standard deviations to get a limited size (pictured on top in figure 3.5):

g(x) = \frac{e^{-0.5 x^2 / \sigma^2}}{\sigma\sqrt{2\pi}}, \qquad g_x(x) = -\frac{g(x)\, x}{\sigma^2}

The Gaussian kernels are used to estimate continuous derivatives on the discrete data by convolutions:

\frac{\partial I}{\partial x} = I * g^T * g_x, \qquad \frac{\partial I}{\partial y} = I * g * g_x^T = I * (g^T * g_x)^T

There is an example of these operations in figures 3.6 and 3.7. The image derivatives are used to estimate the structure tensor T_g:

T_g = \left(\nabla I \, \nabla I^T\right) * g =
\begin{pmatrix}
\left(\frac{\partial I}{\partial x}\right)^2 & \frac{\partial I}{\partial x}\frac{\partial I}{\partial y} \\
\frac{\partial I}{\partial x}\frac{\partial I}{\partial y} & \left(\frac{\partial I}{\partial y}\right)^2
\end{pmatrix} * g
Here we use the structure tensor to estimate the local intrinsic dimensionality of the image, i.e. the local complexity of the image. A more complex part of the image is deemed more likely to be interesting to track. This is done by calculating the eigenvalues of the tensor or, more simply, by using the Harris operator. Both of these concepts are explained further in the introduction, section 1.5.
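A minimal NumPy/SciPy sketch of this algorithm: separable Gaussian-derivative kernels cut at ±3σ, smooth image derivatives, and the three distinct components of T_g; the exact kernel construction details are assumptions of this illustration.

    import numpy as np
    from scipy.ndimage import convolve

    def gaussian_kernels(sigma):
        """1D Gaussian g and its signed derivative g_x, cut at +-3 standard deviations."""
        r = int(np.ceil(3 * sigma))
        x = np.arange(-r, r + 1, dtype=np.float64)
        g = np.exp(-0.5 * x**2 / sigma**2) / (sigma * np.sqrt(2 * np.pi))
        gx = -g * x / sigma**2
        return g, gx

    def structure_tensor(I, sigma=2.0):
        """Return the distinct components (a, b, d) of T_g for every pixel (cf. eq. 1.1)."""
        g, gx = gaussian_kernels(sigma)
        # Separable smooth derivatives: differentiate along one axis, smooth along the other.
        Ix = convolve(convolve(I, gx[None, :]), g[:, None])
        Iy = convolve(convolve(I, gx[:, None]), g[None, :])
        def smooth(A):  # local averaging with the Gaussian window g
            return convolve(convolve(A, g[None, :]), g[:, None])
        return smooth(Ix * Ix), smooth(Ix * Iy), smooth(Iy * Iy)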
3.3.2 Results
The method of structure tensors with image gradients gives good results in detecting objects that are spatially large, since the desired target size can be tuned to ignore the much smaller salt & pepper noise. However, it does not perform well if the desired target is of similar size as the noise. The method detects the position and presence of the object; it is especially useful if the proper shape cannot be reliably estimated. The reason the method cannot detect the shape of the object is that the edges around the object contain similar frequencies as the noise. This disadvantage makes the method a poor choice for image segmentation, but it is still useful when other image segmentation methods fail to acquire an estimate of position and presence. The conclusion is that the method is more robust concerning position and presence detection.
3.4 Structure tensor with Quadrature filters
Most of the quadrature filter implementation is sourced from [10], where many methods for estimating local orientation are described, but the original source is [11]. The result from this method is an estimate of the structure tensor T of the local neighbourhood for each pixel. From this structure tensor we can calculate eigenvalues, which describe the local intrinsic dimensionality, and the norm of the tensor. The quadrature filter has a tunable central frequency which is used to filter out objects of undesired spatial sizes, which can be very useful when we want to find objects that are large in a noisy image where the noisy pixels are small, such as salt & pepper noise. The method is very similar to the image gradients with structure tensor method, but with a different way of estimating the tensor.
Figure 3.6. Image used for the image derivatives example; note the multiple engines of the aircraft as well as some salt & pepper noise, which all give strong signals.
Figure 3.7. Image derivatives from the test image in figure 3.6, in the x and y directions respectively. Note that the signal from the engines is stronger than that of the salt & pepper noise because of the large variance σ in g. Unfortunately there are some edge effects visible that can cause false detections if they are not filtered away.
3.4.1 Algorithm
The theory of quadrature filtering of images is best explained in [10]; a full account of it is too much for this text. However, the algorithm can be used and implemented from the following description.
The algorithm begins by obtaining R_i, the central frequency, and B, the bandwidth, from the user, where R_i determines the size of the desired objects and B the variation within that size. These parameters are problem specific; with different values the filter has different properties, for example with large R_i salt & pepper noise gives very strong responses and needs to be removed in preprocessing. After getting the parameters from the user, the filters H_n are created in the digital Fourier domain with the same size as the frame to be filtered. The method uses a tunable number of filter orientations n², where each filter orientation represents a direction of interest and lines orthogonal to this direction give stronger responses. Since −n̂ and n̂ result in the same filter response magnitude, only directions in [0, π) are considered; the directions {0, π/4, π/2, 3π/4} are selected and filter direction vectors n̂ of unit length are generated.
The quadrature filter H_n is defined from two functions R and D, a radially symmetric function and a directional function respectively:

H_n(u) = R(|u|) \cdot D(\hat{u})

where u is the 2D frequency coordinate vector and R and D are defined as:

R(u) = e^{ -\frac{4}{B^2 \ln 2} \ln^2\!\left( \frac{|u|}{R_i} \right) }

D(u) = \begin{cases} (\hat{u} \cdot \hat{n})^2 & \text{if } \hat{u} \cdot \hat{n} > 0 \\ 0 & \text{if } \hat{u} \cdot \hat{n} < 0 \end{cases}
Then for all frames I we apply the following equation:

q_n = \left| \mathcal{F}^{-1}\!\left( \mathcal{F}(I) \cdot H_n(u) \right) \right| \qquad (3.1)

where q_n is the filter response for filter direction n. This is then used to estimate the tensor T_q:

T_q = \sum_n \tilde{N}_n \cdot q_n

Ñ is a set of dual tensors; [10] explains how they are determined in the general case, and a recipe is provided in appendix A. For the set of directions picked here, the Ñ_n are:
\tilde{N}_1 = \begin{pmatrix} 3/4 & 0 \\ 0 & -1/4 \end{pmatrix} \qquad
\tilde{N}_2 = \begin{pmatrix} 1/4 & 1/2 \\ 1/2 & 1/4 \end{pmatrix}

\tilde{N}_3 = \begin{pmatrix} -1/4 & 0 \\ 0 & 3/4 \end{pmatrix} \qquad
\tilde{N}_4 = \begin{pmatrix} 1/4 & -1/2 \\ -1/2 & 1/4 \end{pmatrix}
² The number of directions is tunable; more directions give slightly better results at the cost of longer calculation times.
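A rough NumPy sketch of the whole filtering step, using the four directions and dual tensors above; the frequency grid, its units for R_i and the handling of the zero frequency are assumptions made for this illustration, so the parameter values used in the thesis do not carry over directly.

    import numpy as np

    def quadrature_tensor(I, Ri, B):
        """Estimate T_q with four lognormal quadrature filters (cf. eq. 3.1)."""
        rows, cols = I.shape
        uy = 2 * np.pi * np.fft.fftfreq(rows)[:, None]       # frequency grid (radians/pixel)
        ux = 2 * np.pi * np.fft.fftfreq(cols)[None, :]
        rho = np.hypot(ux, uy)
        with np.errstate(divide='ignore'):
            R = np.exp(-4.0 / (B**2 * np.log(2)) * np.log(rho / Ri)**2)
        R[rho == 0] = 0.0                                     # no DC response

        duals = [np.array([[ 3/4,  0. ], [ 0.,  -1/4]]),      # N~_1 ... N~_4 from above
                 np.array([[ 1/4,  1/2], [ 1/2,  1/4]]),
                 np.array([[-1/4,  0. ], [ 0.,   3/4]]),
                 np.array([[ 1/4, -1/2], [-1/2,  1/4]])]
        angles = [0, np.pi / 4, np.pi / 2, 3 * np.pi / 4]

        F_I = np.fft.fft2(I)
        T = np.zeros((rows, cols, 2, 2))
        for theta, Nd in zip(angles, duals):
            nx, ny = np.cos(theta), np.sin(theta)
            proj = (ux * nx + uy * ny) / np.where(rho == 0, 1.0, rho)   # u_hat . n_hat
            D = np.where(proj > 0, proj**2, 0.0)                        # directional function
            q = np.abs(np.fft.ifft2(F_I * R * D))                       # filter magnitude q_n
            T += q[..., None, None] * Nd                                # T_q = sum_n q_n N~_n
        return T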
3.4.2 Results
Salt & pepper noise gives very strong responses when B and R_i are tuned for small objects, much like the results for the gradient method. As seen in the result image in figure 3.8, large objects can be tracked satisfactorily, but as with the image gradients the contour of the object must be extracted with another method. There is another problem that arises from the digital Fourier transforms: objects close to the edges give strong filter responses on the other side of the image, where there is no target, due to the cyclic nature of the transform. This problem can be solved by filtering in the spatial domain, i.e. replacing equation 3.1 with:

q_n = \left| I * \mathcal{F}^{-1}(H_n) \right|
Figure 3.8. After filtering the images on top with quadrature filters, we get the images below. The parameters were B = 2.2, R_i = 2.2 for the helicopter image and B = 5, R_i = 170 for the point-object image.
Chapter 4
Post processing
The segmentation algorithms presented in the previous chapter all end up with a grey-scale image E, where white signifies foreground and black signifies background. This is not good enough for obtaining good object shapes, since there is noise present and the desired output O is a binary mask with only two values; therefore further processing of the shapes is necessary. First a threshold is applied to the foreground estimate so that we get a boolean value that determines if a pixel is part of an object or not. Then we use different filtering methods, depending on the data at hand, to further process the image. Finally we estimate the quality of our results.
4.1 Threshold
The problem of setting a proper threshold is very important, since most algorithms in the end need a conclusive answer and the only way to get it is to determine what does and does not pass. Several methods are implemented that are usable in different scenarios. The simplest method is to give an interval for the brightness E_xy of pixel (x, y). Each pixel with a brightness inside the interval (t_l, t_h) is turned on and the others are turned off:

O_{xy} = \begin{cases} 1 & t_l < E_{xy} < t_h \\ 0 & \text{otherwise} \end{cases}

This method has the advantage that if the target moves out of the frame we will not detect any new targets. The downside is that the user must supply good values for t_l and t_h.
A more robust method for producing a binary image is the following threshold function:

O_{xy} = \begin{cases} 1 & t_l \left( \max_{x,y}(E) - \mu \right) \le E_{xy} - \mu \le t_h \left( \max_{x,y}(E) - \mu \right) \\ 0 & \text{otherwise} \end{cases} \qquad (4.1)

where the parameter µ = mean(E).
The choice of t_l and t_h is not as sensitive in this latter method as in the previous one. The downside of this method is that it will usually find new false targets if the real target moves out of the frame, since when the target is outside the frame at least the mean value drops.
Another method that has also been attempted is to use the standard deviation for selecting the brightest target. In this case we have

O_{xy} = \begin{cases} 1 & t_l \sigma + \mu < E_{xy} < t_h \sigma + \mu \\ 0 & \text{otherwise} \end{cases}

where σ = std(I). This method has no advantages over the previous one in equation 4.1.
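A minimal sketch of the first two thresholding rules; threshold_relative encodes my reading of equation 4.1 (deviation from the mean as a fraction of the largest deviation) and is not a verified reimplementation.

    import numpy as np

    def threshold_interval(E, t_low, t_high):
        """Simple interval threshold: keep pixels whose brightness lies inside (t_low, t_high)."""
        return (E > t_low) & (E < t_high)

    def threshold_relative(E, t_low, t_high):
        """Keep pixels whose deviation from the mean is between t_low and t_high times
        the largest deviation in the frame (cf. eq. 4.1)."""
        mu = E.mean()
        span = E.max() - mu
        dev = E - mu
        return (dev >= t_low * span) & (dev <= t_high * span)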
4.2 Area filter
After the threshold filtering we may use area filtering to remove objects that have too large or too small an area, since such objects are either noise or, for example, a cloud. We first label the 8-connected objects in the image using the algorithm from [12]. With the labeling complete we have a number of separate objects, and we count the number of pixels belonging to each of them. Then we remove any object which has too few or too many pixels.
This method is very efficient because it disposes of the most common false objects: those that are smaller than the desired target size as well as those that are larger.
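A minimal sketch of the area filter using SciPy's connected-component labeling with 8-connectivity; the function name and limits are placeholders.

    import numpy as np
    from scipy.ndimage import label

    def area_filter(mask, min_area, max_area):
        """Remove 8-connected components whose pixel count lies outside [min_area, max_area]."""
        labels, n = label(mask, structure=np.ones((3, 3), dtype=int))   # 8-connectivity
        if n == 0:
            return mask.copy()
        sizes = np.bincount(labels.ravel())     # sizes[0] counts the background
        keep = (sizes >= min_area) & (sizes <= max_area)
        keep[0] = False
        return keep[labels]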
4.3 Point representation
The detected objects should get a single position to assist tracking. Centre of mass is one method for this that works well and is fast both to implement and to calculate. Another method is binary shrinking¹ of the areas until only one single pixel of each object remains; see the example in the algorithm section.
The centre of mass method has the advantage of speed, whereas binary shrinking can separate two objects that might have been too close for the segmentation methods to separate. The prerequisite for successful separation is that the binary shape is concave, since convex shapes, when shrunk, end up with only one point.

¹ Binary shrinking is also known as ultimate erosion [6].
4.3.1 Algorithm
Both centre of mass and binary shrinking are quite straightforward to implement. For the centre of mass method we define the following variables for object O_k:

m_{xy} = \text{value at pixel } (x, y) \in O_k, \qquad r_{xy} = \text{position of pixel } (x, y) \in O_k

and then define the centre of object O_k as

r_{CM} = \frac{\sum m_{xy}\, r_{xy}}{\sum m_{xy}}

Observe that the object index k has been suppressed and that m_{xy} usually is, but does not have to be, one.
The binary shrinking method is the morphological operation "erode" applied over and over again until each object consists of only one single pixel. This algorithm also goes under the name ultimate erosion.
Below is a constructed example image with objects that have not been successfully separated. The left image shows the centre of mass for the detected objects and the right image the centre of mass after binary shrinking. The intermediary steps of the algorithm are shown below, with a few steps to the left and the final result in the image to the right.
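A minimal sketch of the two point representations; the centre of mass follows the definition above, while ultimate_erosion is only a rough rendering of binary shrinking (it keeps a component at the step just before erosion would remove it) rather than an optimised implementation.

    import numpy as np
    from scipy.ndimage import binary_erosion, label

    def centre_of_mass(mask, weights=None):
        """Centre of mass of one object; the weights m_xy default to one per pixel."""
        ys, xs = np.nonzero(mask)
        m = np.ones(len(xs)) if weights is None else weights[ys, xs]
        return np.array([(m * ys).sum(), (m * xs).sum()]) / m.sum()

    def ultimate_erosion(mask):
        """Binary shrinking: erode repeatedly, keeping each component just before it vanishes."""
        remaining = mask.copy()
        points = np.zeros_like(mask, dtype=bool)
        while remaining.any():
            eroded = binary_erosion(remaining)
            labels, n = label(remaining)
            for i in range(1, n + 1):
                component = labels == i
                if not (eroded & component).any():   # this component would disappear: keep it
                    points |= component
            remaining = eroded
        return points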
4.4 Mean Shift Clustering
Mean shift clustering uses the mean shift algorithm from [6], with primary sources [13][14], to determine which objects belong together. This can be very useful for clouds of objects that are detected individually but form a larger object that is more interesting when all the sub-objects are considered together, while at the same time removing objects that do not belong to the larger cloud². This is done by assigning each small object to a basin of attraction; the basin which collects the most detections is deemed the most interesting one, all detections that fall into it are kept, and any detection that falls into another basin is rejected.
To assign a basin of attraction to a detection we iteratively apply a kernel function K which determines the behaviour of the method. Here K(x) = exp(−0.5|x|²) is used.
4.4.1 Algorithm
The first step is to get point representations for all objects; x_i represents the coordinates of object i, and y_j^p is the origin of the kernel K for iteration j and object p. There are two user parameters, min and h, where min determines when the algorithm stops (usually a small number around 1 is sufficient) and h determines the kernel bandwidth, which is more difficult to tune well; for small values of h the objects must be closer together than for larger h to end up in the same basin. A small sketch of the procedure is given after the list.

1. Initialize the position of the kernel, y_j^p = x_p.
2. Calculate a better position for the kernel by determining the centre of mass of the objects x_i, where the mass is the kernel response to the distance between y_j^p and x_i, weighted with the kernel bandwidth h:

y_{j+1}^p = \frac{\sum_{i=1}^{n} x_i\, K\!\left( |y_j^p - x_i| / h \right)}{\sum_{i=1}^{n} K\!\left( |y_j^p - x_i| / h \right)}

3. Calculate the distance d = |y_{j+1}^p − y_j^p| that the kernel moved and use it as a stop criterion: if d > min, continue refining the kernel position by returning to item 2.
4. The kernel has now converged to the position y^p, and this becomes the first basin. We assign x_p to this basin, and any other points whose kernels converge to the same position belong to the same basin. Other convergence points create new basins to collect objects in.
5. Finally, select the basin with the most objects in it for the foreground estimation; all other basins are discarded.
² For examples of this dataset, see the figures in section 5.2.2.
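The sketch referred to above: a rough Python rendering of the clustering, where points is an array of object coordinates (one row per detection); the parameter defaults are placeholders.

    import numpy as np

    def mean_shift_largest_basin(points, h=10.0, eps=1.0, max_iter=100):
        """Shift a kernel from every detection until it converges, group detections whose
        kernels converge to (nearly) the same spot, and keep the largest basin."""
        points = np.asarray(points, dtype=float)
        K = lambda r: np.exp(-0.5 * r**2)
        basins = []                                   # list of (position, member indices)
        for p in range(len(points)):
            y = points[p].copy()
            for _ in range(max_iter):
                w = K(np.linalg.norm(points - y, axis=1) / h)
                y_new = (w[:, None] * points).sum(axis=0) / w.sum()
                done = np.linalg.norm(y_new - y) < eps
                y = y_new
                if done:
                    break
            for centre, members in basins:
                if np.linalg.norm(centre - y) < eps:  # same basin of attraction
                    members.append(p)
                    break
            else:
                basins.append((y, [p]))
        return max(basins, key=lambda b: len(b[1]))[1]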
4.4.2 Results
This method is very useful for data with many separate parts whose constituents are very small, as seen in figure 4.1. This situation makes it hard to use area filtering to remove false positives, since they are of the same spatial size as the desired targets. But mean shift works well in detecting which targets belong together and which targets are false positives.
4.5 Convex hull
For some sequences there are many targets, and detecting them individually may be less useful than determining a region that spans all detections.
The method attempted here is to determine the convex hull of the detections that are close together. This is done by getting a distance threshold from the user, sorting all detections in a list and removing those which are too far from their neighbours. Unfortunately the results vary too much to be very useful, as seen in the two example images in figure 4.2.
4.6 Tracking
Tracking is used as an aid both in determining whether what the segmentation has found is indeed an interesting object and in improving the segmentation results. This tracker-assisted segmentation could, on the first attempt, find a small part of a vehicle³. The tracker could then determine that this small part is of interest, and a second iteration of segmentation could then focus on this point.

³ On powered vehicles there are usually some exhaust and engine parts that are much hotter than the majority of the vehicle.
4.6.1 Tracking to reduce false positives
The general idea of tracking is to determine from the behaviour of a detection whether it actually is an object. False positives are expected to either move erratically, be stationary or be short lived, since sensor noise that is similar to the desired signal will not move across the frame in a smooth fashion. The short-lived targets and the stationary ones are both attributed to flickering pixels. Good signals are expected to move smoothly and exist for a longer duration. Tracking can also be used to determine the identity of new objects by asserting that the new object is actually an older one which has been occluded temporarily.
4.6.2 Method
The current method for tracking is quite crude and could use some improvements, but it performs well enough on the data at hand. The method determines which objects in a new frame are the same ones as in the previous frame by comparing their positions. I and J are the current and previous frames respectively, and i_n and j_n represent the positions of the objects in them. We then create the matrix M:

M = \begin{pmatrix}
\|i_1 - j_1\|_2 & \|i_1 - j_2\|_2 & \cdots \\
\|i_2 - j_1\|_2 & \|i_2 - j_2\|_2 & \cdots \\
\vdots & \vdots & \ddots
\end{pmatrix}

If the two frames are similar, M will be a square matrix with small values on the diagonal and larger values everywhere else. If there are different numbers of objects in J compared to I, M will not be square, and we will have to select the best matches first and create or remove objects that have no similar detection in the neighbouring frame.
If these steps are done on all frames in a dataset we can then follow any object in the video and calculate the infrared contrast for any object in each frame.

Figure 4.1. The result after filtering with the mean shift algorithm. False positives not situated close to the countermeasure cloud are removed, leaving a clearer picture as a result.

Figure 4.2. Two consecutive frames from a video where the convex hull method has been applied. As seen, the results are quite unreliable for this data: where the top frame has a good result, the next frame includes much area that is not related to the target. This happens too often for the method to be very useful.
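A minimal sketch of this frame-to-frame association based on the distance matrix M above; the greedy matching and the max_dist cut-off are assumptions made for this illustration.

    import numpy as np

    def match_objects(prev_positions, curr_positions, max_dist=20.0):
        """Return (previous index, current index) pairs; unmatched detections are left to
        the caller to treat as new or lost objects."""
        prev = np.asarray(prev_positions, dtype=float)
        curr = np.asarray(curr_positions, dtype=float)
        M = np.linalg.norm(curr[:, None, :] - prev[None, :, :], axis=2)   # ||i_m - j_n||_2
        pairs = []
        while M.size and M.min() < max_dist:
            m, n = np.unravel_index(np.argmin(M), M.shape)
            pairs.append((n, m))
            M[m, :] = np.inf     # each current object is matched at most once
            M[:, n] = np.inf     # each previous object is matched at most once
        return pairs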
4.7 Quality measures
It is useful to estimate how well the methods have detected the objects in the scene
and therefore a few quality estimation measures are usually applied to the video
sequences. There are two examples of quality measures with datasets from flares
and helicopters in figures 4.4 and 4.3 respectively. There are also many more in
the results section with images from the datasets to accompany them.
4.7.1 Image variance
A decent measure of how easy it is to obtain a good and clear mask of the foreground objects is to check the spatial standard deviation σ of the entire foreground estimation image:

\sigma = \sqrt{ \frac{1}{n-1} \sum_{j=1}^{n} (b_j - \mu)^2 }

where n is the number of pixels in the image, µ is the mean value of all pixels in the image and b_j is the intensity of pixel j. A large variation usually means that we have more leeway if we want to use a threshold and will still get a decent result. This estimation method has been used a lot and has given good results for estimating how difficult it will be for the operator to set a proper threshold, and consequently the quality of the segmented data. It is however, much like most methods, not foolproof and cannot easily be used alone to automatically determine if the segmentation is likely to succeed or not. This is because the magnitude of the variance differs between targets and video sequences, but some processing of this data might solve that problem.
4.7.2 Object count
A simple count of the number of distinct objects detected in each frame can tell whether the resulting segmentation has performed well or not. We often know in advance how many interesting objects there are in a video sequence, and can therefore determine which frames have bad data in them. However, there is always the risk that the detected number of objects is right but the actual objects detected are still wrong.
Algorithm
The method tested is to count the number of distinct 8-connected neighbourhoods in each frame and take this number as the number of objects. More advanced methods that merge close objects if they are similar, such as the smoke of a flare, which should be counted as a part of the flare and not as a separate object, might give a better estimate, but this method is still very useful.
4.7.3 Mean motion
This method gives a simple measure of how much motion there is in the frame, where too much or too little motion can hint at whether the detection is finding too much noise or whether there are dead pixels that have been confused for objects. There was no sequence where this was a significant problem, and as such the quality of this quality measure has not been determined.
Algorithm
For this method a very rough estimation is used:

• If the number of objects in frame one is the same as in frame two:
• Calculate the centre of mass for all objects.
• Sort the coordinates, for each frame, in decreasing magnitude, e.g. (11, 13) and (15, 20) into a single sorted array 20, 15, 13, 11.
• Sum the absolute differences between each frame and its successor and divide by the number of objects.

Example with two objects p1 and p2 with positions (11, 13) and (15, 20) in the first image, f1, and (13, 15) and (15, 24) in the second, f2:

f1   f2   |f1 − f2|
20   24   4
15   15   0
13   15   2
11   13   2

For this data the estimated mean motion will be ‖f1 − f2‖₁ / n = (4 + 0 + 2 + 2) / 2, or 4 pixels. Note, however, that this is not the Euclidean distance (2-norm) but the Manhattan distance (1-norm); it is still useful since large changes in position will give large values regardless.
4.7.4 Area
The number of on pixels in a mask can be used to determine if it is the same object
that is detected in neighbouring frames since a large change in area probably means
that there is a new object that is being tracked. The area measure can also be
used to determine how well the shape segmentation has performed since a very
small area for a helicopter would imply that only the engine has been detected.
Figure 4.3. The binary image quality measures for one of the helicopter passage video sequences. Some missing data can be seen in the movement measure, due to some separation in the segmentation of the helicopter propeller. The area measure tells us that the first few frames are probably bad and that in the middle of the video the helicopter passes partially out of frame.
Figure 4.4. All quality measures for one video sequence with a flare dropped from an aircraft. It may seem as if the measures indicate a bad segmentation, but much of the data comes from the four engines of the aircraft, which are strong heat sources clearly visible in the video sequence.
Chapter 5
Results
Previously we discussed image segmentation algorithms and found that they gave very different results. To summarise our results we introduce the following two concepts:
Good data is when the object is moving around against a stationary or homogeneous background, preferably with a large contrast (positive or negative makes no difference).
Bad data is when the brightness contrast is very low, such as when we have short wavelengths or large distances. At large distances much of the light is lost due to the inverse square law and atmospheric absorption. The segmentation is also problematic when objects are stationary or move slowly in the image.
The results for good data are very satisfactory and should help in the contrast calculations at FOI.
5.1 Segmentation
The segmentation results for flares are very good. The shape of the flare may be
followed as accurately as the operator decides (see figure 5.1). In figure 5.1 all
objects in the frame are detected and we can see clear signals both from the engines
of the aircraft that drops the flare and from the flare itself.
Helicopters can be quite difficult to track and, for bad data, good shape segmentation
was next to impossible. For good data it was quite easy to get a good shape segmentation
and the results were satisfactory (see figure 5.2a). Here our quality measures from
section 4.7 are useful in determining how good the results are.
For both helicopters and flares the same combination of methods was used, but with
different settings for the parameters. It is possible to get better results if we are
only interested in one type of object, but it is time consuming to find the specific
combination of methods that is better than the one presented here.
The first combination method, I, used for flares and helicopters is:
Figure 5.1. A standard flare video sequence result after segmentation. All flare data
give results similar to this if method I is used.
Figure 5.2. Pictures of both good and bad segmentations of helicopters. The bad
segmentation is from a helicopter heading towards the camera; the helicopter front has
a much lower temperature than the sides and is quite stationary in the image, which
makes the detection much more difficult for the motion sensitive methods.
Figure 5.3. Two frames from the helicopter sequence with only short wavelength image
data used. As we can see, the segmentation is significantly worse than when several
spectral bands are used.
Spectral mixing → Background models → Threshold → Area filter
The second combination method, II, used for chaff data is:
Spectral mixing → Background models → Threshold → Mean shift
The third combination method, III, is used for all data:
Spectral mixing → Threshold
The fourth combination method, IV, is used for all data:
Spectral mixing → Peak filter
The third method is introduced as a control, to evaluate how much better methods I and
II are compared to a very simple method. The fourth method is there to assess how the
new methods, I and II, perform compared to the rectangle method presented in the
introduction. The settings at each filtering step are quickly configured with the aid
of intermediary data. To evaluate the methods we use the quality measures introduced
in section 4.7.
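As an illustration of how such method chains can be expressed, the sketch below composes simplified stand-ins for two of the chapter 4 steps. The functions spectral_mixing and threshold are deliberately naive placeholders, not the implementations evaluated in this thesis; methods I, II and IV would chain their additional steps in the same way.

import numpy as np

def spectral_mixing(bands, weights=None):
    # Stand-in: weighted sum of the spectral bands into one intensity image.
    bands = np.asarray(bands, dtype=float)
    w = np.ones(len(bands)) / len(bands) if weights is None else np.asarray(weights, dtype=float)
    return np.tensordot(w, bands, axes=1)

def threshold(image, level=0.5):
    # Stand-in: binary mask of pixels above a fraction of the image maximum.
    return image > level * image.max()

def compose(*steps):
    # Chain the processing steps left to right, mirroring the chains above.
    def pipeline(data):
        for step in steps:
            data = step(data)
        return data
    return pipeline

# Combination method III: Spectral mixing -> Threshold.
method_III = compose(spectral_mixing, threshold)
mask = method_III(np.random.rand(3, 64, 64))  # three spectral bands of one 64x64 frame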
At FOI there is also some interest in how well the methods perform if the spectral bands
are limited to only the short wavelengths, which give much less contrast from the
intended target and are quite hard to track in. This was tested by removing all other
spectral bands in the spectral mixing step and continuing as usual after that. The
results with such data are worse than if all spectral bands are used and, for helicopter
data, only the helicopter engine can be tracked with any reasonable accuracy. When we
lower the threshold value the hull may be seen in some images (see figure 5.3), but the
images vary too much for the results to be reliable. However, the flare data is always
well segmented.
5.2 Evaluation
In the following we will compare the different methods for the three data types flares,
helicopters and chaff. The quality measure “image variance” is omitted in these
evaluations since it did not provide any more information beyond the quality measure
“area”. It is also important to remember that method IV has the disadvantage of the
fixed size rectangle that does not represent the detected objects very well.
5.2.1 Comparison of methods I, III and IV
Below we compare method I with the two control methods, III and IV, on three different
datasets.
Set 1
The first dataset is a flare dropped behind an aircraft where the camera follows the
flare. This means that the aircraft disappears from view and that the flare motion is
quite jittery, since a human operator adjusts the camera intermittently. This jittery
motion is seen in the quality estimate “Mean movement” for method I and intermittently
for method IV. Method III gives much lower values than methods I and IV; this is
because most of its detections are stationary noise, which reduces the mean movement in
the image. Method I is the only one which gives the correct object count during the
entire video, whereas method III grossly overestimates due to false positives and
method IV, since it must have a fixed number of objects, has the wrong count for most
of the sequence.
Set 2
The second dataset is from an older camera and further away from the object.
The flare, marked in the images with an ellipse, is much dimmer and smaller than in the
first dataset, which makes the segmentation more difficult. In the quality measures we
can see that method I mostly detects the correct number of objects, where the errors
are caused by hot smoke from the flares. Methods III and IV suffer from the more
difficult conditions and cannot give any usable results.
Set 3
The third dataset is a helicopter that moves out of frame for a while, taken with the
new camera. Here we can see that both methods I and IV estimate the motion of the
helicopter rather well, but with one distinction: method IV gives false motion and
false objects during the time when the helicopter is not seen, whereas method I
correctly determines that there is no object in the image at that time. For method III
this dataset is very difficult since the contrast between the helicopter and the
background is so low. This gives many false positives that impact the results.
[Result plots for Set 1: area, mean movement and object count over time for methods I, III and IV. Method IV has a constant area (always 651) and object count (always 1).]
[Result plots for Set 2: area, mean movement and object count for methods I, III and IV. Method IV: area always 121, object count always 1.]
[Result plots for Set 3: area, mean movement and object count for methods I, III and IV. Method IV: area always 2911, object count always 1.]
5.2.2 Comparison of methods II, III and IV
When comparing method II to methods III and IV we look at chaff data, which is the
dataset for which method II was designed. As seen in the object count quality measure,
methods II and III perform quite similarly, with the distinction that method II removes
some false positives. The similarity is to be expected since, as we can see in the
dataset images, the chaff is bright on a dark background, which is the ideal case for
method III.
Note that method IV detects the motion of the chaff well when it is bright and closely
clustered, but has problems as soon as the chaff cools down and disperses in the air.
[Result plots for the first chaff dataset: area, mean movement and object count for methods II, III and IV. Method IV: area always 3876, object count always 1.]
[Result plots for the second chaff dataset: area, mean movement and object count for methods II, III and IV. Method IV: area always 3876, object count always 1.]
5.2.3 Comparison of methods I and II
The difference between methods I and II is worth emphasizing since they use similar
algorithms and method II is significantly slower. Below are two datasets, each
highlighting the strengths and weaknesses of the respective methods, so that the
differences are as clear as possible. We can see that each method performs best on the
target it was designed for.
On the first set, method II keeps small detections and correctly estimates the area of
them all. On the second set, method I successfully removes all false detections.
[Result plots for the first dataset: area, mean movement and object count for methods I and II.]
[Result plots for the second dataset: area, mean movement and object count for methods I and II.]
5.2.4 Comparison of methods III and IV
The two control methods both give adequate results on some specific data in some
specific cases. Both methods, however, suffer greatly in terms of robustness since they
can only handle good data in specific cases. The following two datasets show where one
method performs well and the other does not.
[Result plots for the first dataset: area, mean movement and object count for methods III and IV. Method IV: area always 2911, object count always 1.]
[Result plots for the second dataset: area, mean movement and object count for methods III and IV. Method IV: area always 651, object count always 1.]
Chapter 6
Further work
There are quite a few angles that would benefit greatly from further investigation
and testing:
Edge based segmentation An edge based segmentation method that does not rely on
motion as heavily as the current best methods would be very useful. This could make it
possible to better track objects with little motion relative to their size, or when
there is a very cluttered or moving background. [6]
Improved and more robust tracking The current tracker is very rudimentary and
sensitive to multiple or missing detections. There is a strong multi target tracker in
development at FOI that was tried, but it should be investigated further since it could
improve the results significantly.
Multi scale operations Performing some calculations at several scales might improve
both the results and the speed of the calculations. The structure tensors in particular
have little need for full scale image data when they are looking for large objects.
Automated runs FOI is very interested in the possibility of automating the procedure
further so that they could segment, detect objects and perform background contrast
calculations in large batches without operator intervention. This area could use the
quality measures described in chapter 4 to give feedback on the results and determine
whether any corrections are needed for some video.
Active contours segmentation Getting a good tracking point, using structure tensors
for example, is easier than getting the entire contour. Using active contours, or
snakes, starting at estimated interest points to retrieve the object shape could
therefore give better and more robust data. [6]
Multi spectral data An attempt at combining the multi spectral data in a more
systematic manner, studying whether there are any significant improvements in detection
rates. A better mixing of the spectral bands might achieve a better signal to noise
ratio and detect more parts of objects that are currently hidden in noise.
Real time tracking Most methods are, where possible, currently written with this in
mind and do not use “future” data in the calculations, so that they would be easy to
implement in a faster language and possibly used to view the results instantly at image
capture. This could remove noise and give operators a better view of a desired target,
and could also be used to automatically keep a target in frame, automating another step
of manual labour. This, however, requires very high robustness for different
environments and targets to be really useful.
Appendix A
Dual tensor, Ñ

This section will describe the algorithm for creating the dual tensors Ñ_k used in
quadrature filtering. Since the theory is quite heavy we will only present the
algorithm for the ordinary 2D case and any number of filter directions. Recall the
filter direction vector n̂ from the section on quadrature filtering on page 27. We will
use k filters, and as such the filter directions n̂_k have angles in [0, π).
Begin by taking the outer product of each direction vector to create the matrices N_k:

N_k = \hat{n}_k \hat{n}_k^T

Then reshape the N_k matrices into one larger matrix with k rows and 3 columns¹:

N = \begin{pmatrix} N_1(1,1) & N_1(1,2) & N_1(2,2) \\ N_2(1,1) & N_2(1,2) & N_2(2,2) \\ \vdots & \vdots & \vdots \end{pmatrix}
Then we must account for the removed data by multiplying column 2 with √2; this
preserves the norm:

N = \begin{pmatrix} N_1(1,1) & \sqrt{2}\,N_1(1,2) & N_1(2,2) \\ N_2(1,1) & \sqrt{2}\,N_2(1,2) & N_2(2,2) \\ \vdots & \vdots & \vdots \end{pmatrix}
Now we create F̃:

\tilde{F} = N (N^T N)^{-1}
F̃ is related to Ñ by the same reshape and rescaling steps performed above, in reverse,
i.e.

\tilde{F} = \begin{pmatrix} \tilde{N}_1(1,1) & \sqrt{2}\,\tilde{N}_1(1,2) & \tilde{N}_1(2,2) \\ \tilde{N}_2(1,1) & \sqrt{2}\,\tilde{N}_2(1,2) & \tilde{N}_2(2,2) \\ \vdots & \vdots & \vdots \end{pmatrix}
¹ Since matrices produced by outer products are necessarily symmetric and the direction
vectors have 2 elements, the outer product has only 3 unique elements.
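A minimal NumPy sketch of the algorithm above, assuming the filter directions are given as angles in [0, π); the function name and the choice of four evenly spread directions are illustrative, not taken from the thesis implementation.

import numpy as np

def dual_tensors(angles):
    # Direction vectors n_k and their outer products N_k = n_k n_k^T.
    n_hat = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # k x 2
    Nk = np.einsum('ki,kj->kij', n_hat, n_hat)                   # k x 2 x 2
    # Reshape each symmetric N_k into the row [N(1,1), sqrt(2)*N(1,2), N(2,2)];
    # the sqrt(2) on the off-diagonal element preserves the norm.
    N = np.stack([Nk[:, 0, 0], np.sqrt(2) * Nk[:, 0, 1], Nk[:, 1, 1]], axis=1)
    # F~ = N (N^T N)^(-1); each row holds the elements of one dual tensor.
    F = N @ np.linalg.inv(N.T @ N)
    # Undo the reshape and rescaling to recover the 2x2 dual tensors.
    duals = []
    for row in F:
        a, b, c = row[0], row[1] / np.sqrt(2), row[2]
        duals.append(np.array([[a, b], [b, c]]))
    return duals

# Example: four filter directions evenly spread over [0, pi).
for k, Nt in enumerate(dual_tensors(np.arange(4) * np.pi / 4)):
    print(k + 1, Nt)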
Bibliography
[1] Cecie Starr. Biology: Concepts and Applications. Thomson Brooks/Cole,
2005. ISBN 053446226X.
[2] James Robert Mahan. Radiation heat transfer: a statistical approach. John
Wiley & Sons, New York, 2001.
[3] J. Murray. The Academy, 1878.
[4] Chris Harris & Mike Stephens. A combined corner and edge detector, 1988.
[5] Vibha L, Chetana Hegde, P Deepa Shenoy, Venugopal K R, and L M Patnaik. Dynamic object detection, tracking and counting in video streams for
multimedia mining. Technical report, Bangalore University, 2008.
[6] Milan Sonka, Vaclav Hlavac, and Roger Boyle. Image Processing, Analysis,
and Machine Vision. Thomson, 1 edition, 2007. ISBN 0-495-24438-4.
[7] A. M. Baumberg & D.C. Hogg. An efficient method for contour tracking using
active shape models. In IEEE Computer Society, 1994. In Proceeding of the
Workshop on Motion of Nonrigid and Articulated Objects.
[8] N. J. B. McFarlane & C. P. Schofield. Segmentation and tracking of piglets
in images, Machine Vision and Applications, 1995.
[9] Chris Stauffer & W.E.L. Grimson. Adaptive background mixture models for real-time tracking, 1999. IEEE Conf. on Computer Vision and Pattern Recognition.
[10] Hans Knutsson. EDUPACK LiU.CVL.ORIENTATION. http://www.cvl.isy.liu.se/education/tutorials/edupackorientation/orientation.pdf, 2009.
[11] Gösta H Granlund & Hans Knutsson. Signal Processing in Computer Vision.
Kluwer Academic Publishers, 1995.
[12] Hanan Samet & Markku Tamminen. Efficient component labeling of images
of arbitrary dimension represented by linear bintrees, 1988.
[13] Yizong Cheng. Mean shift, mode seeking, and clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 603–619, 1995.
[14] K. Fukunaga & L. Hostetler. The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Transactions on Information Theory, pages 32–40, 1975.
Copyright
The publishers will keep this document online on the Internet — or its possible replacement — for a period of 25 years from the date of publication barring
exceptional circumstances.
The online availability of the document implies a permanent permission for
anyone to read, to download, to print out single copies for his/her own use and
to use it unchanged for any non-commercial research and educational purpose.
Subsequent transfers of copyright cannot revoke this permission. All other uses of
the document are conditional on the consent of the copyright owner. The publisher
has taken technical and administrative measures to assure authenticity, security
and accessibility.
According to intellectual property law the author has the right to be mentioned
when his/her work is accessed as described above and to be protected against
infringement.
For additional information about the Linköping University Electronic Press
and its procedures for publication and for assurance of document integrity, please
refer to its www home page: http://www.ep.liu.se/
© Sebastian Möller