Institutionen för systemteknik
Department of Electrical Engineering

Examensarbete

Image Segmentation and Target Tracking using Computer Vision
(Bildsegmentering samt målföljning med hjälp av datorseende)

Master thesis carried out in Image Processing at Linköping Institute of Technology
by Sebastian Möller

LiTH-ISY-EX--11/4424--SE
Linköping, 9 May, 2011

Computer Vision Laboratory, Department of Electrical Engineering, Linköpings universitet, SE-581 83 Linköping, Sweden

Supervisors: Klas Nordberg, ISY, Linköpings universitet; Thomas Svensson, FOI, Totalförsvarets forskningsinstitut
Examiner: Klas Nordberg, ISY, Linköpings universitet
URL för elektronisk version: http://www.ep.liu.se

Keywords: Tracking, IR, Computer Vision, Machine vision, Segmentation, Quadrature filters, Background model, Frame differences

Abstract

In this master thesis the possibility of detecting and tracking objects in multispectral infrared video sequences is investigated. The current method, which uses fixed-size rectangles, has significant disadvantages. These disadvantages are addressed by using image segmentation to estimate the shape of the object, and the result of the image segmentation is used to determine the infrared contrast of the object. Our results show that some objects give very good segmentation, tracking and shape detection. The objects that perform best are the flares and countermeasures, but helicopters seen from the side, with significant movement, are also detected better with our method. The motion of the object is very important, since movement is the main component in successful shape detection.
This is because helicopters are much colder than flares and engines. Detecting the presence and position of moving objects is easier and can be done quite successfully even with helicopters, but using structure tensors we can also detect the presence and estimate the position of stationary objects.

Sammanfattning

In this master thesis the possibilities of detecting and tracking objects of interest in multispectral infrared video sequences are investigated. The current method, which uses rectangles of fixed size, has its disadvantages. These disadvantages are addressed with image segmentation to estimate the shape of the desired targets. Beyond detection and tracking we also try to find the shape and contour of objects of interest, so that the more exact fit can be used in the contrast calculations. This segmented contour replaces the old fixed rectangles previously used to calculate the intensity contrast of objects in the infrared wavelengths. The results presented show that for some objects, such as countermeasures and flares, it is easier to obtain a good contour and good tracking than it is for helicopters, which were another desired target type. The difficulties that arise with helicopters are largely because they are much cooler, so that parts of the helicopter can be completely hidden in the noise from the image sensor. To compensate for this, methods are used that assume the object moves considerably in the video, so that the motion can be used as a detection parameter. This gives good results for the video sequences where the target moves a lot relative to its size.

Acknowledgments

I would like to thank my supervisors Thomas Svensson and David Bergström for assisting me with my work, as well as my supervisor and examiner Klas Nordberg and my opponent David Sandberg. I would also like to thank my sources, for helping with the ongoing research in this field, without whom this work would not have been possible. Thanks to Mikael Möller and Anneli Gottfridsson for help with proofreading this report and improving its readability.

Contents

1 Introduction
  1.1 Problem description
  1.2 Data
  1.3 Glossary
  1.4 Notation
  1.5 Tensors
    1.5.1 Eigenvalues for Tensors
2 Preprocessing
  2.1 Spectral mixer
  2.2 Median filtering
3 Segmentation
  3.1 Frame differences
    3.1.1 Algorithm
    3.1.2 Results
  3.2 Background model
    3.2.1 Algorithm
    3.2.2 Results
  3.3 Structure tensor with image gradients
    3.3.1 Algorithm
    3.3.2 Results
  3.4 Structure tensor with Quadrature filters
    3.4.1 Algorithm
    3.4.2 Results
4 Post processing
  4.1 Threshold
  4.2 Area filter
  4.3 Point representation
    4.3.1 Algorithm
  4.4 Mean Shift Clustering
    4.4.1 Algorithm
    4.4.2 Results
  4.5 Convex hull
  4.6 Tracking
    4.6.1 Tracking to reduce false positives
    4.6.2 Method
  4.7 Quality measures
    4.7.1 Image variance
    4.7.2 Object count
    4.7.3 Mean motion
    4.7.4 Area
5 Results
  5.1 Segmentation
  5.2 Evaluation
    5.2.1 Comparison of methods I, III and IV
    5.2.2 Comparison of methods II, III and IV
    5.2.3 Comparison of methods I and II
    5.2.4 Comparison of methods III and IV
6 Further work
A Dual tensor, Ñ
Bibliography
Chapter 1
Introduction

This master thesis covers methods for segmentation and tracking in multispectral infrared images. The methods used are structure tensors, based on either image gradients or quadrature filters, together with background modelling using median estimation and frame differences. These methods were used to create a framework that, as robustly as possible, detects objects of interest with minimal operator intervention.
Chapter 1 starts with a description of the data. Chapters 2 to 4 describe the methods mentioned above and their ability to solve the problems of object detection and tracking, along with some quality measures used to estimate the quality of the achieved results. Chapter 5 reports the combinations of methods that gave the best results for our data, together with their corresponding quality measures. The last chapter contains some topics for future research.

1.1 Problem description

The Swedish Defence Research Agency (FOI) wants to detect targets in multispectral infrared images in order to analyse military targets. The primary interest is the infrared radiant intensity contrast, which is used to create an IR signature of the target. The IR signature in turn can be used to determine whether countermeasures look similar to the vehicle they are protecting; similar IR signatures improve the probability of successfully evading a threat. The infrared contrast is calculated by taking the difference between the average intensity of the found object and the immediate background intensity, and then multiplying this difference by the area of the object.
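The contrast calculation described above is simple enough to state compactly. The following is a minimal sketch in Python, assuming the frame and the two regions are given as numpy arrays; the function name and the boolean-mask representation are illustrative assumptions, not taken from FOI's implementation:

```python
import numpy as np

def radiant_intensity_contrast(frame, object_mask, background_mask):
    """Contrast = (mean object intensity - mean immediate background intensity) * object area."""
    object_mean = frame[object_mask].mean()          # average intensity inside the object
    background_mean = frame[background_mask].mean()  # average intensity of the immediate background
    area = object_mask.sum()                         # object area, here counted in pixels
    return (object_mean - background_mean) * area
```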
To do this we need to be able to automatically track areas such as vehicles and flares. The current method, known at FOI as the peak filter, is to place a rectangle of fixed size so that the sum of all pixel intensities inside its area is as large as possible. This method has some disadvantages, of which the fixed-size rectangle is the major issue. Objects are generally not rectangular, and they also change in size due to rotation, changing temperature or speed. The rectangle method also suffers from false positives and poor tracking, where the rectangle may stick to a spurious pixel or fail to follow the target. These disadvantages are currently fixed by operator intervention, which is a time-consuming solution.
This report describes methods for tracking and segmentation that perform better than the older method. They solve the problem of the fixed-size rectangles by replacing them with a flexible contour of the same size and shape as the object. For known objects there are preconfigured combinations of methods, and for new objects it is possible to create new combinations.
The aim of this report is to present how to estimate a mask, from data, that can be provided to the operator. An example of an estimated mask is seen in figure 1.2b, where white marks the sections of the image in figure 1.2a that are to be kept and black the sections to be removed; the result is seen in figure 1.2c. To summarise: our aim is to separate the object(s) in an image from the background and to estimate mask(s) that can be used for further image processing.
The system design used to complete the tasks set up so far in the introduction is summarised in figure 1.1. As seen there, the design is a pipes & filters design¹, chosen to allow for parallel processing and to make each unit independent of the others. The filters can also be combined in a multitude of ways, and some intriguing possibilities show up when we allow for feedback of image data.

¹ Pipes & filters, also known as a pipeline, is a system design where the data flow through pipes and encounter different filters on the path. This design pattern allows for concurrent processing and easy maintenance, since the filters are well separated.

Figure 1.1. A system summary showing the flow of data and the pipes & filters structure of the system design.

1.2 Data

With a multispectral infrared camera, sensitive to wavelengths between 1.5 µm and 5.5 µm, it is possible to detect objects that are not visible to the human eye (the human eye is sensitive to wavelengths between 0.39 µm and 0.75 µm [1]). Our camera uses a number of different spectral bands (n), usually 3 or 4, to be able to detect different objects. This means that we can, with the help of the camera, see the thermal radiation of objects colder than 525 °C², which is the lowest temperature visible to the human eye (in optimal conditions). This applies to "black" objects that do not reflect any light, but in reality there are reflections in the infrared that can both hide and reveal objects.

² Also known as the Draper point [2][3].

Figure 1.2. Original image, estimated mask and result after applying the mask.

The acquired data is stored in a hypercube, where each pixel is described by a function f(x, y, t, i). Here x and y are image coordinates, t is time and i ∈ {1, ..., n} is one of the spectral bands, where n is the number of spectral bands available. The function f returns a brightness value for the specified (x, y, t, i); if i and t are held constant while x and y vary, we retrieve an image at time t for spectral band i.
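As an illustration, the hypercube can be held as a four-dimensional array indexed in the same order as f(x, y, t, i). The axis order and dimensions below are assumptions made for the example, not a description of the actual file format:

```python
import numpy as np

# Hypothetical dimensions: a 256x256 sensor, 100 frames and n = 4 spectral bands.
f = np.zeros((256, 256, 100, 4))

# Holding t and i constant while x and y vary retrieves one image:
t, i = 42, 2
frame = f[:, :, t, i]   # the image at time t for spectral band i
```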
The datasets presented in figures 1.3 to 1.6 are of the types that are commonly analysed at FOI and where automation could potentially save many man-hours of analysis. In these images we can also see some typical properties of IR image sensors. Due to their very high sensitivity they have a much bigger problem with "veils" that distort the background brightness level in the images; this can be seen as a slow gradient across the image, with a darker side and a brighter side. Related to this is the much lower signal-to-noise ratio, which makes the Gaussian noise in the image more visible than in most visible-light cameras. It is also more difficult to create error-free sensors in the lower volumes they are produced in, which means that the chips are more likely to have dead or spurious pixels, seen in the image as salt & pepper noise. Luckily we can see progress in image quality over time, since we have access to both new and old data. The older image sensor produced the data seen in figure 1.3, whereas the newer sensor produced all other images.

figure 1.3: This is the first dataset and is a test of how well any method can handle dead pixels and noisy data. As seen in the figure, the real signal is very similar to the noise from the image sensor.
figure 1.4: This dataset is more recent than that in figure 1.3 and, as seen, much of the noise has been removed with an improved image sensor.
figure 1.5: The helicopter datasets show how similar a helicopter hull is to the sky in the infrared wavelength regions. This dataset is a difficult test of how well the methods handle weak but spatially large signals.
figure 1.6: The last dataset type is the chaff data, where many small signals are scattered in a cloud that has to be accurately tracked without too many false positives and without missing real signals.

1.3 Glossary

Dead pixels are pixels that give values unrelated to the image data. They can usually be seen as salt & pepper noise in unfiltered example pictures. They originate from damaged sectors on the image sensor chips.
Spurious pixels are pixels that, like dead pixels, do not contain data from the image, with one distinction: spurious pixels vary over time.
Mask is a binary image that contains the shape of the detected object, determining for each portion of the image whether it is part of the object or not. An example of a mask can be seen in figure 1.2.
False positives are targets that the algorithms have found but that do not actually exist. These usually originate from dead pixels or noisy data. Post processing and the quality measures are an important part of identifying and removing false positives in the processed data.
False negatives are targets that the algorithm did not detect, in this text simply referred to as missed targets.
IR signature is a parameter that describes how a target is registered in the IR wavelengths.
Chaff is a type of countermeasure consisting of many small pieces of conductive foil. An example can be seen in figure 1.6.
Flare is a hot, burning piece of phosphorus with additives to make the IR signature as similar to the defended target as possible. An example of this data can be seen in figure 1.4.

Figure 1.3. A sample from a dataset using an older camera at a longer distance; the target is a flare.
Figure 1.4. A sample from a dataset taken with the better camera, of a flare dropped from an aircraft.
Figure 1.5. A sample from a dataset taken of a helicopter.
Figure 1.6. A sample from a dataset taken of chaff dropped from an aircraft.

1.4 Notation

Due to the many different algorithms presented, sourced from different publications, there are some notational difficulties; however, I try to be as consistent as possible. In this paper the following notation is used:

x, c : scalars
x, u : vectors
|u|, ‖x‖₂ : magnitude of a vector, |x| = sqrt(Σ_i x_i²)
‖x‖₁ : 1-norm of a vector, ‖x‖₁ = Σ_i |x_i|
û : directional vector of unit length
F, M : scalar-valued functions on R²
F̄ : the complement of the image F, i.e. 1 − F for normalised images
∗ : convolution
∂/∂x : smooth partial differentiation in the x coordinate
∇ : smooth gradient (∂/∂x, ∂/∂y)^T
T, N : matrices and tensors
‖T‖ : norm of matrices and tensors
B̃, Ñ : dual tensors
F, F⁻¹ : Fourier transform and its inverse
I : input image to the segmentation step
E : segmentation output image
O : post-processing output image, mask

1.5 Tensors

In this paper the term tensor refers to a symmetric 2 × 2 matrix for each pixel, whose four elements represent the local region. There are many ways to compute the tensor for images; the most common example is the structure tensor, which is generated by calculating the gradient of the image and then taking the outer product of the gradient at each pixel, as seen in equation 1.1 and figure 1.8:

T_g = (∇I ∇I^T) ∗ g = [ (∂I/∂x)²          (∂I/∂x)(∂I/∂y) ;
                        (∂I/∂x)(∂I/∂y)    (∂I/∂y)²        ] ∗ g    (1.1)

where g is a Gauss window for local mean estimation. When working with tensors, every pixel has four data values assigned to it; in this thesis the values have been separated so that the (1, 1) values of all pixels are stored in the upper left quadrant, the (1, 2) values in the upper right quadrant, and so on. This can be seen in figures 1.8 and 1.9, where the structure tensor has been estimated for the image in figure 1.7.

1.5.1 Eigenvalues for Tensors

For tensors the eigenvalues λ1 and λ2 represent the intrinsic dimensionality of the region around the pixel in question. We arrange the eigenvalues so that λ1 ≥ λ2. If λ1 and λ2 are both small, the region around the pixel is a featureless flat region; this case is called I0D, intrinsically zero dimensional. If λ1 > λ2 the region contains a one-dimensional feature, i.e. a line. If both λ1 and λ2 are large, the region has a two-dimensional feature such as a corner or two crossing lines. To summarise:

I0D (λ1 = λ2 = λ0): planes, featureless areas
I1D (λ1 > λ2 = 0): lines and edges
I2D (λ1 ≥ λ2 > 0): crossing lines and corners

The orientation of these features is determined by the respective eigenvectors v1 and v2. It is, however, important to remember with which parameters the tensor was estimated, since this affects which feature sizes are detected.

Figure 1.7. IR image of a helicopter seen from the side. This picture is used for the tensor calculation examples in figures 1.8 to 1.11.
Figure 1.8. Example of the structure tensor T_g from image gradients. Note that salt & pepper noise gives strong signals as well as the true target, the helicopter.
Figure 1.9. Example of the tensor T_q from quadrature filters. Note that this is a low frequency filter and that only larger objects are seen.

Algorithm

Recall:

T = [ a  b ;
      b  d ]

The eigenvalues of T are straightforward to compute with the following equation:

λ = (1/2) (a + d ± sqrt(a² − 2ad + 4b² + d²))    (1.2)
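Equation 1.2 vectorises directly over a whole image. A minimal sketch, assuming the three distinct tensor elements a, b and d are stored as one array each, as in the quadrant layout described above:

```python
import numpy as np

def tensor_eigenvalues(a, b, d):
    """Eigenvalues of the symmetric tensor [[a, b], [b, d]] per pixel (equation 1.2)."""
    root = np.sqrt(a**2 - 2*a*d + 4*b**2 + d**2)  # equals sqrt((a - d)^2 + 4 b^2)
    lam1 = 0.5 * (a + d + root)
    lam2 = 0.5 * (a + d - root)
    return lam1, lam2                             # ordered so that lam1 >= lam2
```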
There are, however, two different cases for the eigenvectors, b ≠ 0 and b = 0:

b ≠ 0:  v1 = (λ1 − d, b)^T,  v2 = (b, d − λ1)^T
b = 0:  v1 = (1, 0)^T,  v2 = (0, 1)^T

However, we do not need to explicitly calculate the eigenvalues to be able to estimate the intrinsic dimensionality of the neighbourhood of every pixel. We can save the computation of a square root and still get good results by using the Harris operator [4]:

R(T) = det(T) − κ trace²(T) = ad − b² − κ(a + d)² = λ1 λ2 − κ(λ1 + λ2)²

where κ is a tunable parameter, usually in the interval 0.04 to 0.15.

Figure 1.10. |λ1| and |λ2| of T_g from the tensor examples.
Figure 1.11. |λ1| and |λ2| of T_q from the tensor examples.

Chapter 2
Preprocessing

In this first step we change the input, a multispectral image f, into a grey-scale image I. The most important preprocessing step is to reduce the amount of data by combining the brightnesses b_i = f(x, y, t, i) of the spectral bands 1, ..., n, to simplify the calculations. That is, we combine the brightness values b_1, ..., b_n so that the pixels (x, y, t, b_1), ..., (x, y, t, b_n) are mapped to a pixel (x, y, t, g(b_1, ..., b_n)) for all (x, y, t).

2.1 Spectral mixer

The reason we need the transformation g is that the image segmentation methods we will use require a scalar brightness, not a vector one. The transformations we compare are the following three:

(b̄_1 + ... + b̄_n)/n,    median(b̄_1, ..., b̄_n)    and    b̄_k̂

where k̂ is the index of the brightness value given by max(|b̄_1|, ..., |b̄_n|). The b_i's for spectral band i and time t are pre-transformed as b̄_i = b_i − µ_i^t, where we define

µ_i^t = ( Σ_x Σ_y f(x, y, t, i) ) / ( Σ_x Σ_y 1 )    (2.1)

as the mean of b_i over x and y.¹ This reduces the complexity of having several spectral bands with different brightnesses and allows the methods to work with the more relevant adjusted brightness b̄. The pictures in figure 2.1 show the adjustment visually; note that the left image is in a different scale than the right one.

¹ Medians have also been evaluated, with similar results but much longer calculation times.

Figure 2.1. The left image shows µ_i^t for different times and for three spectral bands. The right image shows the brightness b̄_i after adjusting for the mean value.

In figure 2.4 the result of the different g's is shown. The g settled for was the absolute maximum, since it selects the b̄_k̂ which has the greatest deviation from the mean, zero. If we assume that brightness is roughly normally distributed (see figure 2.2), then values in the tails are more likely to be associated with an object; the usual property of normal distributions still holds, namely that it is more common to be close to the mean than far away.

Figure 2.2. Image histogram of a single frame showing roughly normally distributed brightness values.

This method, however, is not wholly satisfactory, and results might improve if the method incorporated the actual wavelength of each band and the bands were combined in a more photometrically correct way. That is, if we could utilise the information in each spectral band to assign a temperature value to each pixel, determined from the black body radiation assumption; this would require some data about our camera filters and knowledge about black body radiation physics. It is, however, not necessary, since this simple and quick method gives satisfactory results.
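A minimal sketch of the chosen mixer, the absolute maximum, assuming one frame is held as a (rows, cols, n) numpy array; the layout and the function name are assumptions for illustration:

```python
import numpy as np

def mix_bands(f_t):
    """Combine the n spectral bands of one frame into a grey-scale image
    using the absolute-maximum rule. f_t has shape (rows, cols, n)."""
    b_bar = f_t - f_t.mean(axis=(0, 1))    # adjusted brightness: subtract each band's spatial mean
    idx = np.abs(b_bar).argmax(axis=2)     # per pixel: the band with the greatest deviation from zero
    rows, cols = np.ogrid[:f_t.shape[0], :f_t.shape[1]]
    return b_bar[rows, cols, idx]          # adjusted brightness of the selected band
```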
Figure 2.3. A small cut-out of the original data with four spectral bands.
Figure 2.4. The resulting images after mixing the adjusted bands with mean, median and absolute maximum respectively.

2.2 Median filtering

Further preprocessing using median filtering can be useful to reduce salt & pepper noise and spurious pixels, with variations in brightness, caused by camera and filter errors. Since these pixels can cause false positives, it is advantageous to filter them out. A disadvantage of median filtering is that it rounds sharp corners, as seen in figure 2.6, and may remove details of the detected shape. Furthermore, since median filtering removes spurious pixels, these cannot be detected later and may therefore falsely be included in detected object shapes. This can be seen in figure 2.8 when compared to figure 2.7, where the dead pixels have been removed from the mask. This biases the contrast calculations and is therefore a very significant drawback. The following figures give an artificial example of this bias error in the mean brightness µ. Mean brightness calculation with a mask is very similar to equation 2.1:

µ = ( Σ_x Σ_y F(x, y) · M(x, y) ) / ( Σ_x Σ_y M(x, y) )    (2.2)

where F is the image and M is the mask (a code sketch of this computation is given at the end of this section).

A test image F with salt & pepper noise and a square object in the middle is created, as seen in figure 2.5. To the right, in figure 2.6, is the same image F′ after median filtering, where most noise is removed and the sharp corners of the square are smoothed.

Figure 2.5. Original image, F.
Figure 2.6. Filtered image, F′.

Figure 2.7 shows the optimal segmentation² result M, and figure 2.8 the median filtering segmentation result M′. Note the black spots in the optimal segmentation, which indicate pixels that we have no data for since they were damaged.

² Created by hand to be the best possible result.

Figure 2.7. Optimal mask, M.
Figure 2.8. Filtered mask, M′.

In figure 2.9 the foreground F · M′ of the image F, using the median filtering segmentation result, is seen. In figure 2.10 the background F · M̄′ is shown. The object's estimated mean brightness µ_f, which will be used in contrast calculations, is calculated as

µ_f = ( Σ_x Σ_y F · M′ ) / ( Σ_x Σ_y M′ )

and is in these images 1795, compared to 1000 which is the ground truth.

Figure 2.9. Foreground, F · M′.
Figure 2.10. Background, F · M̄′.

In figures 2.11 and 2.12 the foreground and background of the image are calculated using the optimal segmentation. With this segmentation the mean brightness µ_o is calculated as

µ_o = ( Σ_x Σ_y F · M ) / ( Σ_x Σ_y M )

and is 1043, which is much closer to the ground truth of 1000.

Figure 2.11. Foreground, F · M.
Figure 2.12. Background, F · M̄.
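The masked mean of equation 2.2 is a one-liner. A small sketch, with the complement mask used for the background as in the figures above:

```python
import numpy as np

def masked_mean(F, M):
    """Mean brightness of F over the region where the binary mask M is one (equation 2.2)."""
    return (F * M).sum() / M.sum()

# The background mean uses the complement of the mask:
# mu_background = masked_mean(F, 1 - M)
```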
Chapter 3
Segmentation

In chapter 3 we discuss two methods for background removal, background models and frame differences, and two methods for object detection, structure tensors from image gradients and structure tensors from quadrature filtering. Background removal is a way to split the image into two segments, background and foreground, where the foreground is expected to be the desired target. Object detection may detect the presence and position of a target, but not its shape. These methods all take a grey-scale image I as input and produce an interest estimation image E. In this interest estimation image, which is also a grey-scale image, a brighter area is more likely to be of interest than a dark one.
Each method gets a section where its implementation, strengths and drawbacks are discussed. We need at least these four methods because no single method is enough to solve all the different problems of background removal and object detection. The goal of these methods is to detect and separate objects such as, but not limited to, helicopters, flares and chaff. This results in a foreground estimate describing which pixels are occupied by an object and which pixels are just background. The foreground estimate will, however, need some cleaning with post processing to reduce false positives and improve the results.

3.1 Frame differences

Frame differences utilise that the properties of objects in motion differ from those of stationary objects, which are less likely to be interesting, and noise. This works since the difference between two frames only shows the places where there has been a change. The method has significant drawbacks, since it cannot handle stationary objects or objects that move slowly in relation to their size. The algorithm is taken from [5].

Figure 3.1. Two IR images, originally multispectral but with the spectra mixed together; the objects are a helicopter and a flare.
Figure 3.2. Relevant parts of the output from figure 3.1 after frame difference calculations with k1 = 3. The flare is segmented very well, but the helicopter shows significant shadows from older frames.

3.1.1 Algorithm

The algorithm is quite simple, but there is room for extensions. We start with the following definitions:

I_k = image numbered k
F_k = I_k − I_(k−k1)

where k1 is a tunable time offset and k belongs to the set of all frames {1, ..., n}, where n is the total frame count in the dataset. The source [5] used a second frame differences step to get a temporal second derivative, which however is not used in this report. To remove spurious pixels, the algorithm removes those pixels whose absolute value is larger than the user-supplied parameter max:

E_k = F_k if |F_k| < max;  0 if |F_k| > max.

3.1.2 Results

For objects moving fast relative to their size this method performs well; as seen in the result image in figure 3.2, most of the noise from figure 3.1 is stationary and is filtered away quite nicely. The motion requirement makes this method, by itself, less useful for larger objects, because such objects will cast "shadows" from their positions in older frames. This disadvantage makes the segmentation results unreliable for large objects.

3.2 Background model

This method is taken from [6] (original source [7]), but with an added trick also suggested in [6] and sourced from [8]: incrementally updating a median estimation. This median estimate of the background is subtracted from the current frame to estimate the foreground. The method works best for stationary cameras or when we have a homogeneous background with a moving target. For good data the method is also very easy to tune, and it can be quite fast if a small number of images is used per estimate.

3.2.1 Algorithm

There are two tuning parameters: β, the number of frames used for each estimate, and α, how fast the background is updated. Let as before {I_k}, k = 1, ..., n, denote our consecutive frames. The algorithm works as follows (a code sketch follows below):

1. initialise k = β
2. estimate the background: B_k = median(I_(k−β+1), I_(k−β+2), ..., I_k)
3. increment k by one
4. estimate our foreground: E_k = |I_k − B_(k−1)|
5. update the background: B_k = (1 − α) B_(k−1) + α median(I_(k−β+1), I_(k−β+2), ..., I_k)
6. if k < n, return to item 3.
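A minimal sketch of this first algorithm, assuming the frames are a list of 2-D numpy arrays; the default parameter values are arbitrary assumptions, not the tuned values used in the thesis:

```python
import numpy as np

def background_model(frames, beta=10, alpha=0.05):
    """Median background model with incremental update (steps 1 to 6 above).
    Yields one foreground estimate E_k per processed frame."""
    B = np.median(frames[:beta], axis=0)                  # step 2: initial background estimate
    for k in range(beta, len(frames)):
        E = np.abs(frames[k] - B)                         # step 4: foreground from previous background
        window = frames[k - beta + 1:k + 1]               # the last beta frames
        B = (1 - alpha) * B + alpha * np.median(window, axis=0)   # step 5: incremental update
        yield E
```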
Another method is to also incorporate the standard deviation of each pixel in the background. This adds two steps to the algorithm above, similar to the method in [9] but with a single Gaussian distribution per pixel instead of several:

1. initialise k = β
2. estimate the background: µ_k = median(I_(k−β+1), I_(k−β+2), ..., I_k)
3. estimate the standard deviation: σ_k = std(I_(k−β+1), I_(k−β+2), ..., I_k)
4. increment k by one
5. estimate our foreground: E_k = (I_k − µ_(k−1)) / σ_(k−1)
6. update the background: µ_k = (1 − α) µ_(k−1) + α median(I_(k−β+1), ..., I_k)
7. update the standard deviation: σ_k = (1 − α) σ_(k−1) + α std(I_(k−β+1), ..., I_k)
8. if k < n, return to item 4.

3.2.2 Results

Overall this background model supersedes the frame difference model in detecting objects, and it also gives fewer false positives. It is however slower, which, while not a concern for this implementation, could be of interest in future development. For sequences where the target moves around in the frame the segmentation can be very good, as seen in figure 3.3. For sequences where the camera tracks the object well, we get the side effect of swapping foreground and background: the helicopter is believed to be the background and the sky the foreground. Figure 3.4 shows an example of the poor segmentation that results.

Figure 3.3. The original image, the background estimate, the difference and the estimated shape of the helicopter. As seen, the resulting shape is very close to the actual shape in the original image.
Figure 3.4. A video sequence where some pixels are poorly segmented because the helicopter occupies them very often. The artefacts that cause the problem are clearly seen in the background estimation.

3.3 Structure tensor with image gradients

This is a method for estimating the structure tensor¹ of the neighbourhood of each pixel. The method can be tuned for object size but, to save calculation time, scaling down the image is recommended for large objects.

¹ For a short explanation of the concept of tensors, see section 1.5.

3.3.1 Algorithm

We estimate the image derivatives by convolving the image with a 2D signed Gaussian kernel, created from two separable 1D kernels, which are cut off in the spatial domain at ±3 standard deviations (sufficiently large). The standard deviation σ is tunable so that we can change how the filter reacts to objects of different sizes: a large variance diminishes the response from small objects and amplifies the response from large objects, since a larger neighbourhood is taken into consideration. First construct the separable filter kernels given σ, keeping values within 3 standard deviations to get a limited size (pictured on top in figure 3.5):

g(x) = e^(−0.5 x²/σ²) / (σ sqrt(2π))
g_x(x) = −g(x) · x/σ²

Figure 3.5. On top, the two functions g and g_x respectively; convolving them produces the bottom convolution kernel, g^T ∗ g_x, which is used when estimating a continuous derivative over an image.

The Gaussian kernels are used to estimate continuous derivatives on the discrete data by convolutions:

∂I/∂x = I ∗ g^T ∗ g_x
∂I/∂y = I ∗ g ∗ g_x^T = I ∗ (g^T ∗ g_x)^T

There are examples of these operations in figures 3.6 and 3.7.
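A sketch of the separable derivative estimation, assuming x indexes image columns and using scipy for the convolutions; details such as kernel normalisation may differ from the thesis implementation:

```python
import numpy as np
from scipy.ndimage import convolve

def derivative_kernels(sigma):
    """1-D Gaussian g and its signed derivative kernel gx, cut off at +-3 standard deviations."""
    x = np.arange(-int(np.ceil(3 * sigma)), int(np.ceil(3 * sigma)) + 1)
    g = np.exp(-0.5 * x**2 / sigma**2) / (sigma * np.sqrt(2 * np.pi))
    return g, -g * x / sigma**2

def image_gradient(I, sigma=2.0):
    """Smooth derivatives dI/dx and dI/dy by separable convolution."""
    g, gx = derivative_kernels(sigma)
    Ix = convolve(convolve(I, g[:, None]), gx[None, :])  # smooth vertically, differentiate horizontally
    Iy = convolve(convolve(I, gx[:, None]), g[None, :])  # differentiate vertically, smooth horizontally
    return Ix, Iy
```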
The image derivatives are used to estimate the structure tensor T_g:

T_g = (∇I ∇I^T) ∗ g = [ (∂I/∂x)²          (∂I/∂x)(∂I/∂y) ;
                        (∂I/∂x)(∂I/∂y)    (∂I/∂y)²        ] ∗ g

Here we use the structure tensor to estimate the local intrinsic dimensionality of the image, i.e. the local complexity of the image. A more complex part of the image is deemed more likely to be interesting to track. This is done by calculating the eigenvalues of the tensor or, more simply, by using the Harris operator. Both of these concepts are explained further in section 1.5.

3.3.2 Results

The method of structure tensors with image gradients gives good results in detecting objects that are spatially large, since the desired target size can be tuned to ignore the much smaller salt & pepper noise. It will, however, not perform well if the desired target is of similar size as the noise. The method detects the position and presence of the object, which is especially useful when the proper shape cannot be reliably estimated. The reason the method cannot detect the shape of the object is that the edges around the object are of similar frequencies as the noise. This disadvantage makes the method a poor choice for image segmentation, but it is still useful when other image segmentation methods fail to acquire an estimate of position and presence. The conclusion is that the method is more robust concerning position and presence detection.

Figure 3.6. Image used for the image derivative examples. Note the multiple engines of the aircraft as well as some salt & pepper noise, which all give strong signals.
Figure 3.7. Image derivatives I_x′ and I_y′ from the test image in figure 3.6. Note that the signal from the engines is stronger than that of the salt & pepper noise because of the large σ in g. Unfortunately some edge effects are visible that can cause false detections if they are not filtered away.

3.4 Structure tensor with Quadrature filters

Most of the quadrature filter implementation is sourced from [10], where many methods for estimating local orientation are described; the original source is [11]. The result of this method is an estimate of the structure tensor T_q of the local neighbourhood of each pixel. From this structure tensor we can calculate the eigenvalues, which describe the local intrinsic dimensionality, and the norm of the tensor. The quadrature filter has a tunable central frequency which is used to filter out objects of undesired spatial sizes; this can be very useful when we want to find objects that are large in a noisy image where the noisy pixels are small, as with salt & pepper noise. The method is very similar to the structure tensor from image gradients, but with a different way of estimating the tensor.
3.4.1 Algorithm

The theory of quadrature filtering of images is best explained in [10]; a full account is too much for this text, but the algorithm can be implemented from the following description. The algorithm begins by obtaining the central frequency R_i and the bandwidth B from the user, where R_i determines the size of the desired objects and B the variance within that size. These parameters are problem specific; with different values the filter has different properties. For example, with large R_i salt & pepper noise gives very strong responses and needs to be removed in preprocessing.
After getting the parameters from the user, the filters H_n are created in the digital Fourier domain with the same size as the frame to be filtered. The method uses a tunable number of filter orientations², where each filter orientation represents a direction of interest; lines orthogonal to this direction give stronger responses. Since −n̂ and n̂ result in the same filter response magnitude, only directions in [0, π) are considered; as such, the directions {0, π/4, π/2, 3π/4} are selected and filter direction vectors n̂ of unit length are generated. The quadrature filter H_n is defined from two functions, a radially symmetric function R and a directional function D:

H_n(u) = R(|u|) · D(û)

where u is the 2D frequency coordinate vector and R and D are defined as:

R(|u|) = exp( −(4 / (B² ln 2)) · ln²(|u| / R_i) )
D(û) = (û · n̂)² if û · n̂ > 0;  0 if û · n̂ < 0.

² The number of directions is tunable; more directions give slightly better results at the cost of longer calculation times.

Then for all frames I we apply the following equation:

q_n = |F⁻¹(F(I) · H_n(u))|    (3.1)

where q_n is the filter response for filter direction n. This is then used to estimate the tensor T_q:

T_q = Σ_n Ñ_n · q_n

where the Ñ_n form a set of dual tensors; [10] explains how they are determined in the general case, and a recipe is provided in appendix A. For the set of directions picked here, the Ñ_n are:

Ñ1 = [ 3/4   0  ;  0   −1/4 ]    Ñ2 = [ 1/4   1/2 ;  1/2   1/4 ]
Ñ3 = [ −1/4  0  ;  0    3/4 ]    Ñ4 = [ 1/4  −1/2 ; −1/2   1/4 ]

3.4.2 Results

Salt & pepper noise gives very strong responses when B and R_i are tuned for small objects, much like the results for the gradient method. As seen in the result image in figure 3.8, large objects can be tracked satisfactorily, but as with the image gradients the contour of the object must be extracted with another method. Another problem arises from the digital Fourier transforms: objects close to the edges give strong filter responses on the other side of the image, where there is no target, due to the cyclic nature of the transform. This problem can be solved by filtering in the spatial domain, i.e. replacing equation 3.1 with:

q_n = |I ∗ F⁻¹(H_n)|

Figure 3.8. After filtering the images on top with quadrature filters, we get the images below. The parameters were B = 2.2, R_i = 2.2 for the helicopter image and B = 5, R_i = 170 for the point object image.
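To summarise the algorithm, a sketch of the tensor estimation is given below. The radian frequency grid and the unit of R_i are assumptions, and the dual tensors are the four matrices listed in section 3.4.1; this is an illustration of the technique, not the thesis code:

```python
import numpy as np

# Dual tensors for the directions {0, pi/4, pi/2, 3*pi/4}, as listed above.
DUALS = [np.array([[ 0.75,  0.0], [ 0.0, -0.25]]),
         np.array([[ 0.25,  0.5], [ 0.5,  0.25]]),
         np.array([[-0.25,  0.0], [ 0.0,  0.75]]),
         np.array([[ 0.25, -0.5], [-0.5,  0.25]])]

def quadrature_tensor(I, Ri, B):
    """Estimate T_q per pixel from four quadrature filters built in the Fourier domain."""
    u = 2 * np.pi * np.fft.fftfreq(I.shape[0])[:, None]   # 2-D radian frequency grid
    v = 2 * np.pi * np.fft.fftfreq(I.shape[1])[None, :]
    r = np.hypot(u, v)
    r[0, 0] = 1.0                                         # dummy value to avoid log(0) at DC
    R = np.exp(-(4.0 / (B**2 * np.log(2))) * np.log(r / Ri)**2)
    R[0, 0] = 0.0                                         # no DC response
    FI = np.fft.fft2(I)
    T = np.zeros(I.shape + (2, 2))
    for angle, N in zip([0, np.pi/4, np.pi/2, 3*np.pi/4], DUALS):
        n_hat = np.array([np.cos(angle), np.sin(angle)])
        proj = (u * n_hat[0] + v * n_hat[1]) / r          # u_hat . n_hat
        D = np.where(proj > 0, proj**2, 0.0)              # directional part
        q = np.abs(np.fft.ifft2(FI * R * D))              # filter response magnitude q_n
        T += q[..., None, None] * N                       # accumulate dual tensor contributions
    return T
```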
Chapter 4
Post processing

The segmentation algorithms presented in the previous chapter all end up with a grey-scale image E, where white signifies foreground and black signifies background. This is not good enough for obtaining good object shapes, since there is noise present and the desired output O is a binary mask with only two values; therefore further processing of the shapes is necessary. First a threshold is applied to the foreground estimate, so that we get a boolean value determining whether a pixel is part of an object or not. Then we use different filtering methods, depending on the data at hand, to process the image further. Finally we estimate the quality of our results.

4.1 Threshold

Setting a proper threshold is very important, since most algorithms in the end need a conclusive answer, and the only way to get one is to determine what does and does not pass. Several methods are implemented, usable in different scenarios. The simplest method is to give an interval for the brightness E_xy of pixel (x, y). Each pixel with a brightness inside the interval (t_l, t_h) is turned on and the others are turned off:

O_xy = 1 if t_l < E_xy < t_h;  0 otherwise.

This method has the advantage that if the target moves out of the frame we will not detect any new targets. The downside is that the user must supply good values for t_l and t_h. A more robust method for producing a binary image is the following threshold function:

O_xy = 1 if t_l (max_{x,y}(E) − µ) ≤ E_xy + µ ≤ t_h (max_{x,y}(E) − µ);  0 otherwise    (4.1)

where the parameter µ = mean(E). The choice of t_l and t_h is not as sensitive in this method as in the previous one. The downside is that it will usually find new, false targets if the real target moves out of the frame, since the mean value drops when the target is outside the frame. Another method that has been attempted is to use the standard deviation for selecting the brightest target:

O_xy = 1 if t_l σ + µ < E_xy < t_h σ + µ;  0 otherwise

where σ = std(E). This method has no advantages over the previous one in equation 4.1.

4.2 Area filter

After the threshold filtering we may use area filtering to remove objects whose area is too large or too small, since such objects are either noise or, for example, a cloud. We first label the 8-connected objects in the image using the algorithm from [12]. With the labelling complete we have a number of separate objects, for which we count the number of pixels belonging to each. We then remove any object with too few or too many pixels. This method is very efficient because it disposes of the most common false objects: those that are smaller than the desired target size as well as those that are larger.

4.3 Point representation

The detected objects should get a single position to assist tracking. Centre of mass is one method for this that works well and is fast both to implement and to calculate. Another method is binary shrinking¹ of areas until only one single pixel of each object remains; see the example in the algorithm section. The centre of mass method has the advantage of speed, whereas binary shrinking can separate two objects that were too close for the segmentation methods to separate. The prerequisite for successful separation is that the binary shape is concave, since convex shapes, when shrunk, end up as a single point.

¹ Binary shrinking is also known as ultimate erosion [6].

4.3.1 Algorithm

Both centre of mass and binary shrinking are quite straightforward to implement. For the centre of mass method we define the following variables for object O_k:

m_xy = value at pixel (x, y) ∈ O_k
r_xy = position of pixel (x, y) ∈ O_k

and then define the centre of object O_k as

r_CM = ( Σ m_xy r_xy ) / ( Σ m_xy )

Observe that the object index k has been suppressed and that m_xy usually is, but does not have to be, one.
The binary shrinking method is the morphological operation "erode" applied over and over again until each object consists of only one single pixel; this algorithm also goes under the name ultimate erosion. In the constructed example below, two objects have not been successfully separated by the segmentation. The left image shows the centre of mass for the detected objects and the right image the centre of mass after binary shrinking, with the intermediary erosion steps shown in between.
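Small sketches of both point representations, using scipy's labelling and erosion routines; note that the erosion residue of an object can be a small plateau rather than exactly one pixel, so the final centre of mass is still taken afterwards:

```python
import numpy as np
from scipy.ndimage import label, center_of_mass, binary_erosion

EIGHT = np.ones((3, 3))   # 8-connectivity

def centres_of_mass(mask):
    """One point per detected object: the centre of mass of each 8-connected component."""
    labels, count = label(mask, structure=EIGHT)
    return center_of_mass(mask, labels, range(1, count + 1))

def ultimate_erosion(mask):
    """Binary shrinking: erode repeatedly, keeping each object's last non-empty residue."""
    current = mask.astype(bool)
    result = np.zeros_like(current)
    while current.any():
        eroded = binary_erosion(current)
        labels, count = label(current, structure=EIGHT)
        for i in range(1, count + 1):
            component = labels == i
            if not (eroded & component).any():   # this object vanishes in the next step
                result |= component
        current = eroded
    return result
```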
4.4 Mean Shift Clustering

Mean shift clustering uses the mean shift algorithm from [6], with primary sources [13][14], to determine which objects belong together. This can be very useful for clouds of objects that are detected individually but form a larger, more interesting object when all the sub-objects are taken into consideration together, while at the same time removing objects that do not belong to the larger cloud². This is done by assigning each small object to a basin of attraction; the basin which collects the most detections is deemed the most interesting one, all detections that fall into it are kept, and any detection that falls into another basin is rejected. To assign a basin of attraction to a detection we iteratively apply a kernel function K, which determines the behaviour of the method. Here K(x) = exp(−0.5 |x|²) is used.

² For examples of this dataset, see section 5.2.2.

4.4.1 Algorithm

The first step is to get point representations for all objects; x_i represents the coordinates of object i, and y_j^p is the origin of the kernel K at iteration j for object p. There are two user parameters, min and h: min determines when the algorithm stops (usually a small number around 1 is sufficient) and h is the kernel bandwidth, which is more difficult to tune well. For small values of h the objects must be closer together than for larger h to end up in the same basin.

1. Initialise the position of the kernel: y_j^p = x_p.
2. Calculate a better position for the kernel by determining the centre of mass of the objects x_i, where the mass is the kernel response to the distance between y_j^p and x_i, weighted with the kernel bandwidth h:

   y_(j+1)^p = ( Σ_i x_i K(|y_j^p − x_i| / h) ) / ( Σ_i K(|y_j^p − x_i| / h) )

3. Calculate the distance d = |y_(j+1)^p − y_j^p| the kernel has moved and use it as a stop criterion: if d > min, continue refining the kernel position by returning to item 2.
4. The kernel has now converged to a position y^p, which becomes the first basin. We assign x_p to this basin, and any other points that converge to the same position will belong to the same basin; other convergence points create new basins to collect objects in.
5. Finally, select the basin with the most objects in it for the foreground estimation; all other basins are discarded.

4.4.2 Results

This method is very useful for data with a lot of separate parts whose constituents are very small, as seen in figure 4.1. This situation makes it hard to use area filtering to remove false positives, since they are of the same spatial size as the desired targets. But mean shift works well in detecting which targets belong together and which targets are false positives.
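A sketch of the basin-of-attraction iteration (steps 1 to 3), assuming the point representations are given as a list of 2-D coordinates; the parameter names h and eps correspond to the bandwidth and the stop parameter min above, and the defaults are arbitrary:

```python
import numpy as np

def mean_shift_basin(points, p, h=5.0, eps=1.0):
    """Iterate the kernel position for object p until it converges to its basin of attraction."""
    x = np.asarray(points, dtype=float)      # point representations, shape (n, 2)
    y = x[p].copy()                          # step 1: kernel starts on object p
    while True:
        w = np.exp(-0.5 * (np.linalg.norm(y - x, axis=1) / h)**2)   # K(|y - x_i| / h)
        y_next = (w[:, None] * x).sum(axis=0) / w.sum()             # step 2: weighted centre of mass
        if np.linalg.norm(y_next - y) <= eps:                       # step 3: stop criterion
            return y_next
        y = y_next
```

Basins can then be formed by grouping converged positions that coincide within the tolerance, and only the largest basin is kept, as in steps 4 and 5.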
Figure 4.1. The result after filtering with the mean shift algorithm. False positives not situated close to the countermeasure cloud are removed, leaving a clearer picture as a result.

4.5 Convex hull

For some sequences there are many targets, and detecting them individually may be less useful than determining a region that spans all detections. The method attempted here is to determine the convex hull of the detections that are close together. This is done by getting a distance threshold from the user, sorting all detections in a list and removing those which are too far from their neighbours. Unfortunately the results vary too much to be very useful, as seen in the two example images in figure 4.2.

Figure 4.2. Two consecutive frames from a video where the convex hull method has been applied. As seen, the results are quite unreliable for this data: the top frame has a good result, while the next frame includes much area that is not related to the target. This happens too often for the method to be very useful.

4.6 Tracking

Tracking is used as an aid both in determining whether what the segmentation has found is indeed an interesting object and in improving the segmentation results. Such tracker-assisted segmentation could, on a first attempt, find a small part of a vehicle³; the tracker could then determine that this small part is of interest, and a second iteration of segmentation could focus on this point.

³ On powered vehicles there are usually some exhaust and engine parts that are much hotter than the majority of the vehicle.

4.6.1 Tracking to reduce false positives

The general idea of tracking is to determine from the behaviour of a detection whether it actually is an object. False positives are expected to move erratically, be stationary or be short-lived, since sensor noise that is similar to the desired signal will not move across the frame in a smooth fashion; the short-lived and stationary targets are both attributed to flickering pixels. Good signals are expected to move smoothly and exist for a longer duration. Tracking can also be used to determine the identity of new objects, by asserting that a new object is actually an older one that has been temporarily occluded.

4.6.2 Method

The current method for tracking is quite crude and could use some improvements, but it performs well enough on the data at hand. The method determines which objects in a new frame are the same as in the previous frame by comparing their positions. I and J are the current and previous frames respectively, and i_n and j_n represent the positions of the objects in them. We then create the matrix M:

M = [ ‖i1 − j1‖₂   ‖i1 − j2‖₂   ... ;
      ‖i2 − j1‖₂   ‖i2 − j2‖₂   ... ;
      ...          ...          ... ]

If the two frames are similar, M will be a square matrix with small values on the diagonal and larger values everywhere else. If the number of objects in J differs from that in I, M will not be square, and we have to select the best matches first and then create or remove objects that have no similar detection in the neighbouring frame. If these steps are done on all frames in a dataset, we can follow any object in the video and calculate the infrared contrast of any object in each frame.
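A sketch of the distance matrix M and a naive best-match selection; the handling of unmatched objects described above is omitted, and the function name is illustrative:

```python
import numpy as np

def match_objects(i_pts, j_pts):
    """Distance matrix between current detections i and previous detections j,
    plus the index of the best previous match for each current object."""
    I = np.asarray(i_pts, dtype=float)   # shape (m, 2)
    J = np.asarray(j_pts, dtype=float)   # shape (k, 2)
    M = np.linalg.norm(I[:, None, :] - J[None, :, :], axis=2)   # M[a, b] = ||i_a - j_b||_2
    return M, M.argmin(axis=1)
```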
This estimation method has been used a lot and has given good results for estimating how difficult it will be for the operator to set a proper threshold and subsequently the quality of the segmented data. It is however, much like most methods, not foolproof and cannot easily be used alone to automatically determine if the segmentation is likely to succeed or not. This is because the magnitude of the variance is different for different targets and video sequences but some processing of this data might solve this problem. 4.7.2 Object count A simple count of the number of distinct objects detected at each frame can tell if the resulting segmentation has performed well or not. We often know in advance 4.7 Quality measures 39 how many interesting objects there are in a video sequence and therefore can determine which frames that have bad data in them. However there is always the risk that the detected number of objects is good but the actual objects detected may still be wrong. Algorithm The method tested are to count the number of distinct 8-connected neighbourhoods there are in each frame and suggest that this number is the same as the number of objects. More advance methods that merge close objects if they are similar, such as the smoke of a flare which should be counted as a part of the flare and not as a separate object, might get a better estimation but this method is still very useful. 4.7.3 Mean motion This method gives a simple measure of how much motion there is in the frame where too much or too little motion can give hints whether the detection is finding too much noise or that there are dead pixels that have been confused for objects. There was no sequence where this was a significant problem and as such the quality of the quality measure has not been determined. Algorithm For this method a very rough estimation method is used: • If the number of objects in frame one is the same as in frame two. • Calculate the centre of mass for all objects. • sort the coordinates, for each frame, in decreasing magnitude. i.e. (11, 13) and (15, 20) into a single, sorted, array 20, 15, 13, 11. • sum the absolute difference of each frame with its successor and divide with the number of objects. Example with two objects p1 and p2 with positions (11, 13) and (15, 20) in the first image, f1 , and (13, 15) and (15, 24) in the second, f2 : f1 f2 |f1 − f2 | 20 24 4 15 15 0 13 15 2 11 13 2 2 k1 for this data the estimated mean motion will be kf1 −f = 4+0+2+2 or 4 pixels. n 2 However note that this is not Euclidean distance, 2-norm, but Manhattan distances, 1-norm, but it is still useful since large changes in position will give large values regardless. 40 4.7.4 Post processing Area The number of on pixels in a mask can be used to determine if it is the same object that is detected in neighbouring frames since a large change in area probably means that there is a new object that is being tracked. The area measure can also be used to determine how well the shape segmentation has performed since a very small area for a helicopter would imply that only the engine has been detected. 4.7 Quality measures 41 Figure 4.3. The binary image quality measures for one of the helicopter passage video sequences. Some missing data can be seen in the movement measure due to some separation in the segmentation of the helicopter propeller. The area measure tells us that the first few frames are probably bad and at the middle of the video the helicopter passes partially out of frame. Figure 4.4. 
4.7.4 Area

The number of on pixels in a mask can be used to determine whether it is the same object that is detected in neighbouring frames, since a large change in area probably means that a new object is being tracked. The area measure can also be used to determine how well the shape segmentation has performed: a very small area for a helicopter would imply that only the engine has been detected.

Figure 4.3. The binary image quality measures for one of the helicopter passage video sequences. Some missing data can be seen in the movement measure, due to some separation in the segmentation of the helicopter propeller. The area measure tells us that the first few frames are probably bad and that at the middle of the video the helicopter passes partially out of frame.
Figure 4.4. All quality measures for one video sequence with a flare dropped from an aeroplane. It may seem as if the measures indicate a bad segmentation, but much of the data comes from the four engines of the aeroplane, which are strong heat sources clearly visible in the video sequence.

Chapter 5
Results

Previously we discussed image segmentation algorithms and found that they gave very different results. To summarise our results we introduce the following two concepts:

Good data is when the object is moving around against a stationary or homogeneous background, preferably with a large contrast (positive or negative makes no difference).
Bad data is when the brightness contrast is very low, such as when we have short wavelengths or large distances. At large distances much of the light is lost due to the inverse square law and atmospheric absorption. The segmentation is also problematic when objects are stationary or move slowly in the image.

The results for good data are very satisfactory and should help in the contrast calculations at FOI.

5.1 Segmentation

The segmentation results for flares are very good. The shape of the flare may be followed as accurately as the operator decides (see figure 5.1). In figure 5.1 all objects in the frame are detected, and we can see clear signals both from the engines of the aircraft that drops the flare and from the flare itself.
Helicopters can be quite difficult to track and, for bad data, good shape segmentation was next to impossible. For good data it was quite easy to get a good shape segmentation and the results were satisfactory (see figure 5.2a). Here our quality measures from section 4.7 are useful in determining how good the results are. For both helicopters and flares the same combination of methods was used, but with different settings for the parameters. It is possible to get better results if we are only interested in one type of object, but it is time consuming to find the specific combination of methods that is better than the one presented here.

Figure 5.1. A standard flare video sequence result after segmentation. All flare data have results similar to this if method I is used.
Figure 5.2. Pictures of both good and bad segmentations of helicopters. The bad segmentation is of a helicopter heading towards the camera: the helicopter front has a much lower temperature than the sides, and the helicopter is quite stationary in the image, which makes detection much more difficult for the motion sensitive methods.

The first combination method, I, used for flares and helicopters, is:
Spectral mixing → Background models → Threshold → Area filter
The second combination method, II, used for chaff data, is:
Spectral mixing → Background models → Threshold → Mean shift
The third combination method, III, is used for all data:
Spectral mixing → Threshold
The fourth combination method, IV, is used for all data:
Spectral mixing → Peak filter

The third method is introduced as a control, to evaluate how much better methods I and II are than a very simple method. The fourth method is there to assess how the new methods, I and II, perform compared to the rectangle method presented in the introduction. The settings at each filtering step are quickly configured with the aid of intermediary data; a sketch of such a pipeline composition is given below.
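A sketch of how such a combination can be composed in the pipes & filters style of figure 1.1; the step functions named in the comment are placeholders for the filters described in chapters 2 to 4, not actual function names from the thesis code:

```python
def run_pipeline(data, filters):
    """Pipes & filters: pass the data through each processing step in turn."""
    for f in filters:
        data = f(data)
    return data

# Hypothetical composition of combination method I:
# masks = run_pipeline(hypercube, [spectral_mixing, background_models, threshold, area_filter])
```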
To evaluate the methods we use the quality measures introduced in section 4.7. At FOI there is also some interest in how well the methods perform if the spectral bands are limited to only the short wavelengths, which gives much less contrast from the intended target and is quite a hard spectrum to track in. This was tested by removing all other spectral bands in the spectral mixing step and continuing as usual after that. The results with such data are worse than when all spectra are used, and for helicopter data only the helicopter engine can be tracked with any reasonable accuracy. When we lower the threshold value the hull can be seen in some images (see figure 5.3), but the images vary too much for the results to be reliable. The flare data, however, is always well segmented.

5.2 Evaluation

In the following we compare the different methods on the three data types flares, helicopters and chaff. The quality measure "image variance" is omitted in these evaluations since it did not provide any information beyond the quality measure "area". It is also important to remember that method IV has the disadvantage of the fixed-size rectangle, which does not represent the detected objects very well.

5.2.1 Comparison of methods I, III and IV

Below we compare method I with the two control methods, III and IV, on three different datasets.

Set 1

The first dataset is a flare dropped behind an aircraft where the camera follows the flare. This means that the aircraft disappears from view and that the flare motion is quite jittery, since the human operator adjusts the camera intermittently. This jittery motion is seen in the quality estimate "mean movement" for method I and intermittently for method IV. Method III gives much lower values than methods I and IV, because most of its detections are stationary noise, which reduces the mean movement in the image. Method I is the only one that gives the correct object count throughout the video, whereas method III grossly overestimates it due to false positives and method IV, since it must have a fixed number of objects, has the wrong number for most of the sequence.

Set 2

The second dataset is from an older camera and further away from the object. The flare, marked in the images with an ellipse, is much dimmer and smaller than in the first dataset, which makes the segmentation more difficult. In the quality measures we can see that method I mostly detects the correct number of objects, where the errors are caused by hot smoke from the flares. Methods III and IV suffer from the more difficult conditions and cannot give any usable results.

Set 3

The third dataset is a helicopter that moves out of frame for a while, taken with the new camera. Here we can see that both methods I and IV estimate the motion of the helicopter rather well, but with a distinction: method IV reports false motion and false objects during the time when the helicopter is not visible, whereas method I correctly determines that there is no object in the image at that time. For method III this dataset is very difficult, since the contrast between the helicopter and the background is so low. This gives many false positives that impact the results.
[Figure: area, mean movement and object count for methods I, III and IV on set 1. Method IV has a fixed area of 651 and an object count of always 1.]

[Figure: area, mean movement and object count for methods I, III and IV on set 2. Method IV has a fixed area of 121 and an object count of always 1.]

[Figure: area, mean movement and object count for methods I, III and IV on set 3. Method IV has a fixed area of 2911 and an object count of always 1.]

5.2.2 Comparison of methods II, III and IV

When comparing method II to methods III and IV we look at chaff data, the dataset for which method II was designed. As seen in the object count quality measure, methods II and III perform quite similarly, with the distinction that method II removes some false positives. The similarity is to be expected since, as we can see in the dataset images, the chaff is bright against a dark background, which is the ideal case for method III. Note that method IV detects the motion of the chaff well when it is bright and tightly clustered, but has problems as soon as the chaff cools down and disperses in the air.

[Figure: area, mean movement and object count for methods II, III and IV on the first chaff dataset. Method IV has a fixed area of 3876 and an object count of always 1.]

[Figure: area, mean movement and object count for methods II, III and IV on the second chaff dataset. Method IV has a fixed area of 3876 and an object count of always 1.]

5.2.3 Comparison of methods I and II

The difference between methods I and II is worth emphasising, since they use similar algorithms and method II is significantly slower. Below are two datasets, each exercising the strengths and weaknesses of the respective method so that the differences are as clear as possible. We can see that each method performs best on the target type it was designed for: on the first set, method II keeps the small detections and correctly estimates the area of them all; on the second set, method I successfully removes all false detections.

[Figure: area, mean movement and object count for methods I and II on the first dataset.]

[Figure: area, mean movement and object count for methods I and II on the second dataset.]

5.2.4 Comparison of methods III and IV

The two control methods both give adequate results on some specific data in some specific cases. Both methods, however, suffer greatly on the robustness criterion, since they cannot handle anything beyond good data in specific cases. Below are two datasets where one method performs well and the other does not.

[Figure: area, mean movement and object count for methods III and IV on the first dataset. Method IV has a fixed area of 2911 and an object count of always 1.]

[Figure: area, mean movement and object count for methods III and IV on the second dataset. Method IV has a fixed area of 651 and an object count of always 1.]

Chapter 6 Further work

There are quite a few angles that would benefit greatly from further investigation and testing:

Edge based segmentation

An edge based segmentation method that does not rely on motion as heavily as the current best methods would be very useful. It could make it possible to track objects with little motion relative to their size, or objects in front of a very cluttered or moving background. [6]

Improved and more robust tracking

The current tracker is very rudimentary and sensitive to multiple or missing detections. A strong multi-target tracker is in development at FOI; it was tried briefly but should be investigated further, since it could improve the results significantly.

Multi scale operations

Performing some calculations at several scales might improve both the results and the speed of the calculations. The structure tensors in particular have little need for full scale image data when they are looking for large objects.

Automated runs

FOI is very interested in the possibility of automating the procedure further, so that segmentation, object detection and background contrast calculations could be run in large batches without operator intervention.
This area could use the quality measures described in chapter 4 to give feedback on the results and determine whether any corrections are needed for a given video.

Active contours segmentation

Getting a good tracking point, for example using structure tensors, is easier than getting the entire contour. Using active contours, or snakes, started at guessed interest points to retrieve the object shape could therefore give better and more robust data. [6]

Multi spectral data

An attempt at combining the multi-spectral data in a more scientific manner, studying whether this gives any significant improvement in detection rates. A better mixing of the spectral bands might achieve a better signal-to-noise ratio and reveal more parts of objects that are currently hidden in noise.

Real time tracking

Most methods are, where possible, already written with this in mind and do not use "future" data in their calculations, so they would be easy to implement in a faster language and could be used to view the results instantly at image capture. This could remove noise and give operators a better view of a desired target, and could also be used to automatically keep a target in frame, automating another step of manual labour. It would, however, require very high robustness across different environments and targets to be truly useful.

Appendix A Dual tensor, Ñ

This section describes the algorithm for creating the dual tensors $\tilde{N}_k$ used in quadrature filtering. Since the theory is quite heavy, we only present the algorithm for the ordinary 2D case with an arbitrary number of filter directions.

Recall the filter direction vectors $\hat{n}$ from the section on quadrature filtering (section 3.4). We use $k$ filters, whose direction angles lie in $[0, \pi)$.

Begin by taking the outer product of each direction vector to create the matrices $N_k$:

\[ N_k = \hat{n}_k \hat{n}_k^T \]

Then reshape the $N_k$ matrices into one larger matrix with $k$ rows and 3 columns (since matrices produced by outer products are necessarily symmetric, and the direction vectors have 2 elements, each outer product has only 3 unique elements):

\[ N = \begin{bmatrix} N_1(1,1) & N_1(1,2) & N_1(2,2) \\ N_2(1,1) & N_2(1,2) & N_2(2,2) \\ \vdots & \vdots & \vdots \end{bmatrix} \]

We must then account for the removed duplicate element by multiplying column 2 by $\sqrt{2}$; this preserves the norm:

\[ N = \begin{bmatrix} N_1(1,1) & \sqrt{2}\,N_1(1,2) & N_1(2,2) \\ N_2(1,1) & \sqrt{2}\,N_2(1,2) & N_2(2,2) \\ \vdots & \vdots & \vdots \end{bmatrix} \]

Now we create $\tilde{F}$:

\[ \tilde{F} = N (N^T N)^{-1} \]

$\tilde{F}$ is related to $\tilde{N}$ by the same reshape and rescaling steps performed above, in reverse, i.e.

\[ \tilde{F} = \begin{bmatrix} \tilde{N}_1(1,1) & \sqrt{2}\,\tilde{N}_1(1,2) & \tilde{N}_1(2,2) \\ \tilde{N}_2(1,1) & \sqrt{2}\,\tilde{N}_2(1,2) & \tilde{N}_2(2,2) \\ \vdots & \vdots & \vdots \end{bmatrix} \]

A NumPy sketch of this construction is given below.
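The construction above is just a reshape followed by a pseudo-inverse, so it translates directly into a few lines of NumPy. The sketch below is our own illustration, not code from the thesis; it computes the dual tensors $\tilde{N}_k$ from a set of filter direction angles in $[0, \pi)$:

```python
import numpy as np

def dual_tensors(angles):
    """Compute the 2D dual tensors N~_k for filter directions given
    as angles in [0, pi), following the algorithm in appendix A."""
    # Direction vectors n_k and their outer products N_k = n_k n_k^T
    n = np.stack([np.cos(angles), np.sin(angles)], axis=1)      # (k, 2)
    N_full = np.einsum('ki,kj->kij', n, n)                      # (k, 2, 2)
    # Keep only the 3 unique elements (1,1), (1,2), (2,2) per tensor
    N = np.stack([N_full[:, 0, 0], N_full[:, 0, 1], N_full[:, 1, 1]], axis=1)
    # Multiply the off-diagonal column by sqrt(2) to preserve the norm
    N[:, 1] *= np.sqrt(2.0)
    # F~ = N (N^T N)^(-1)
    F = N @ np.linalg.inv(N.T @ N)
    # Reverse the rescaling and rebuild the symmetric 2x2 dual tensors
    F[:, 1] /= np.sqrt(2.0)
    duals = np.empty((len(angles), 2, 2))
    duals[:, 0, 0] = F[:, 0]
    duals[:, 0, 1] = duals[:, 1, 0] = F[:, 1]
    duals[:, 1, 1] = F[:, 2]
    return duals

# Example: four evenly spread filter directions
print(dual_tensors(np.arange(4) * np.pi / 4))
```

Note that $N^T N$ is only invertible when at least three distinct filter directions are used, which is the usual case in practice.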
Bibliography

[1] Cecie Starr. Biology: Concepts and Applications. Thomson Brooks/Cole, 2005. ISBN 053446226X.
[2] James Robert Mahan. Radiation Heat Transfer: A Statistical Approach. John Wiley & Sons, New York, 2001.
[3] J. Murray. The Academy, 1878.
[4] Chris Harris and Mike Stephens. A combined corner and edge detector. 1988.
[5] Vibha L, Chetana Hegde, P Deepa Shenoy, Venugopal K R, and L M Patnaik. Dynamic object detection, tracking and counting in video streams for multimedia mining. Technical report, Bangalore University, 2008.
[6] Milan Sonka, Vaclav Hlavac, and Roger Boyle. Image Processing, Analysis, and Machine Vision. Thomson, 1st edition, 2007. ISBN 0-495-24438-4.
[7] A. M. Baumberg and D. C. Hogg. An efficient method for contour tracking using active shape models. In Proceedings of the IEEE Workshop on Motion of Non-rigid and Articulated Objects, 1994.
[8] N. J. B. McFarlane and C. P. Schofield. Segmentation and tracking of piglets in images. Machine Vision and Applications, 1995.
[9] Chris Stauffer and W. E. L. Grimson. Adaptive background mixture models for real-time tracking. In IEEE Conference on Computer Vision and Pattern Recognition, 1999.
[10] Hans Knutsson. EDUPACK LiU.CVL.ORIENTATION. http://www.cvl.isy.liu.se/education/tutorials/edupackorientation/orientation.pdf, 2009.
[11] Gösta H. Granlund and Hans Knutsson. Signal Processing for Computer Vision. Kluwer Academic Publishers, 1995.
[12] Hanan Samet and Markku Tamminen. Efficient component labeling of images of arbitrary dimension represented by linear bintrees. 1988.
[13] Y. Cheng. Mean shift, mode seeking, and clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 603–619, 1995.
[14] K. Fukunaga and L. Hostetler. The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Transactions on Information Theory, pages 32–40, 1975.

Copyright

The publishers will keep this document online on the Internet (or its possible replacement) for a period of 25 years from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, to print out single copies for his/her own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its home page: http://www.ep.liu.se/

© Sebastian Möller