Using Colour Templates for Target Identification and Tracking

Simon Brock-Gunn*
Machine Vision Group, Computer Science Department, City University, London
e-mail: simon@cs.city.ac.uk

Tim Ellis
Machine Vision Group, Centre for Information Engineering, City University, London
e-mail: t.j.ellis@city.ac.uk

Abstract

A surveillance system is presented which uses colour cues to track people moving in sparse crowd scenes. The erratic motion of the targets, together with their changeable outline, means that they are conventionally difficult to model. However, by examining the colours present in an object in a given frame of a sequence, and looking for these in a later frame, identification and tracking are achieved. The colours are transformed into a template space in which it is easy to match objects to those held in a database, so that even when a person is occluded or disappears totally from view they may be re-located as soon as they can be clearly seen again. A hierarchical approach to template storage and searching reduces the effort required to search through the database and ensures that the system is efficient even with a database containing the templates of hundreds of people. Since work is carried out on a moving sequence, the problems of changing object shape and maintaining colour constancy are minimised by allowing for small changes in these parameters between frames and by continually updating each object's template.

1 Introduction

When examining the problem of following people in sparse crowd scenes, such as in a shopping centre or entering and leaving an aircraft, a number of possible solutions arise. At first, it might seem sensible to model the shape and motion of a person, in order to be able to distinguish them from the rest of the scene.
Unfortunately, this task is made difficult by the changing shape of a person, as the camera perceives it, when they move around or change direction, and because their movements are influenced by many external and personal factors which are unavailable to a system. Even if we suppose that it is possible to track without modelling the shape, any system that follows the target from frame to frame will fail as soon as the targets become occluded, when they pass in front of each other or behind stationary objects such as an advertising hoarding or the aircraft, if it is unable to relocate them when they re-appear.

* Supported by a SERC award.
BMVC 1992, doi:10.5244/C.6.22

One further consideration, which can be overriding even in current systems, is the storage space and processing power required to maintain large numbers of object representations and to match them between frames in a sequence. It is clear that if a more general approach is taken to analysing the problem, any practical implementation that arises will be more feasible, but the question remains as to which properties of the application should be examined. Colour, in machine vision in general and object analysis in particular, is a property which would seem obvious, yet it is apparently under-exploited, since almost all current machine vision systems involve the use of only monochrome data. When even black and white images take up so much storage space and necessitate so much processing themselves, it hardly seems an attractive option to increase this by at least three-fold in order to utilise the extra information provided by colour data. However, with advances over recent years in computer technology this is no longer such a concern, and now one of the more salient reasons may be the fact that colour is not an absolute property; rather, it depends as much on the frequency content of the illumination as on the surface reflectance.
The perceived colours of a target outdoors will change as a cloud passes in front of the sun, or as the target moves into the shadow of a building. The human visual system overcomes some of these problems by using a poorly-understood mechanism referred to as colour constancy, whereby the perception of constant colours is maintained under varying illumination levels. However, as yet no widely-accepted algorithms exist in machine vision which will reliably mimic colour constancy and, as a result, colour is seen as an unreliable property. In this research we are analysing dynamic scenes, and so we are able to minimise this problem by allowing the small changes in illumination, and thus in the reflected colours, which will occur from frame to frame to fall within a small tolerance level, and by continuously updating the object templates to take these changes into account. The way we use colour properties enables us to claim that our method is not sensitive to the small changes in object shape, viewing angle and scale which may also occur between frames. The use of template definitions naturally avoids the need for large amounts of storage, since a template takes up a fraction of the amount of space that a colour representation of the object requires, and quick object matching within the database is achieved by the use of a hierarchical system of data representation. This ensures that time is not wasted in matching pairs of templates which clearly represent different objects (as will be the case on the majority of occasions); instead, such matches may be dismissed quickly. Colour templates allow the representation of objects without the need for any type of model. This quality means they can be used in identifying objects of unknown shape and size from a list of possible choices, but the principle can be extended so that objects are "learned" by a system, based on the quality of match within frame sequences.
2 Target Identification and Tracking

The underlying technique of the system presented is to look at the objects seen in each frame of a sequence and learn the combination of colours present in each object. This combination then becomes associated with the object and is remembered during the rest of the sequence, enabling identification and tracking of targets. The processing cycle, shown in Figure 1, is carried out on each frame and consists of three parts, which are detailed below.

Figure 1: The Processing Cycle (Object Extraction, Template Calculation, Template Tracking), explained in sections 2.1 - 2.3

Firstly, Object Extraction is carried out to locate any moving objects within the field of view and extract them from the image frame. This part of the process is not central to the ideas being presented here and, in our system, a simple and crude technique is used. When the objects have been found, the Template Calculation process examines the colours present within each of them and quantifies these colours by means of a template which can easily be compared to all those previously encountered in the Template Tracking process. The object will be located in a previous frame because of its similar colour "signature", providing a link between the object in two different frames and, therefore, determining its motion.

2.1 Object Extraction

The application uses a static camera, which allows us to assume that there is a one-to-one correspondence between movement in the scene and pixel changes in a frame sequence. This ensures that a simple operation is all that is required in order to extract the objects from a given frame image. Before the processing begins, a colour reference image, R, is captured and stored. This image represents the background scene, which is composed only of static objects which will remain stationary during processing. Each time around the cycle, this reference image is compared to the incoming colour image frame, F.
A resultant binary difference image, D, is then calculated by taking the difference between the intensity images of these two frames and subjecting it to a threshold, t_d, to reduce noise errors:

    D_{x,y} = 1   if F_{x,y} - R_{x,y} > t_d
            = 0   otherwise

F_{x,y} and R_{x,y} may be defined in a number of ways, the simplest being the sum of the red, green and blue point values at position (x, y) in each respective frame. Then the object frame, O, is determined by taking an empty image and copying in all the areas from the frame image where there is an entry in the difference image:

    O_{x,y} = F_{x,y}   if D_{x,y} = 1
            = 0         otherwise

This object frame now contains all the objects which have moved between the acquisitions of R and F, presented on a blank background. These objects are then separated using a simple exploration algorithm which takes all the objects from O over a threshold size, t_o (again, to reduce the effects of noise errors), and relocates them in single-object sub-images, O_1, ..., O_n.

2.2 Template Calculation

There are many different ways of choosing a template, T, such that it represents in some way the colours present in one of these objects, O. Given that the raw data of an object sub-image is generally composed of planes of the three primary colours, red, green and blue, the most obvious and straightforward to calculate is a three-dimensional colour frequency histogram, T_H, which may be created of suitable size n x n x n (so n will be the number of bins on each axis), where each T_H(r, g, b) represents the amount of the object which has the colour (r, g, b). However, we can employ a model that more closely resembles human physiological processes by examining opponent colours. This still gives us three linearly-independent axes, but whereas (r, g, b) represent dimensions of black to red, green and blue respectively, the opponent colour axes of (rg, by, wb) represent dimensions of red to green, blue to yellow and white to black respectively.
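The Object Extraction step of section 2.1 (difference image, thresholding and size-filtered separation) can be sketched in a few lines. The following is a minimal illustrative sketch, not the authors' implementation: it uses SciPy's connected-component labelling in place of the paper's unspecified "exploration algorithm", takes an absolute intensity difference, and the threshold values t_d and t_o are arbitrary.

```python
import numpy as np
from scipy import ndimage

def extract_objects(F, R, t_d=30, t_o=200):
    """Sketch of the Object Extraction step (section 2.1).

    F, R : (H, W, 3) colour frame and reference (background) image.
    t_d  : difference threshold; t_o : minimum object size in pixels.
    Returns a list of (mask, bounding_slices) pairs, one per moving object.
    """
    # Simplest point intensity: the sum of the red, green and blue values.
    f = F.astype(np.int32).sum(axis=2)
    r = R.astype(np.int32).sum(axis=2)

    # Binary difference image D (an absolute difference is used here).
    D = np.abs(f - r) > t_d

    # Separate the moving regions and discard those smaller than t_o,
    # standing in for the paper's "simple exploration algorithm".
    labels, n = ndimage.label(D)
    objects = []
    for i, sl in enumerate(ndimage.find_objects(labels), start=1):
        mask = labels == i
        if mask.sum() >= t_o:
            objects.append((mask, sl))
    return objects
```

Each returned mask plays the role of one of the single-object sub-images O_1, ..., O_n.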
The transformation from the input data is therefore

    (rg, by, wb) = (r - g, 2b - r - g, r + g + b)

This type of template has been used to provide object detection cues for the recognition of static objects, and Figure 2 shows how a template in the rg and by dimensions might look for a person wearing a red shirt and blue trousers. For this template, n = 16, so there are sixteen bins on each axis, and regions of intensity (represented by darker areas) are more prevalent in the red and blue parts of the template.

Figure 2: rg and by plane of a sample template

The use of these axes ensures that the template is invariant to rotation, translation and reflection of the object in the image plane, since it is only the amount of each colour which is being measured, not its location. Further analysis of the application shows that the objects under study are subject neither to two-dimensional rotation in the image plane nor to reflection. This means that the discriminatory power of the colour may be made greater by weighting the information with respect to its position within the object. When we look at this in the context of our application, it can be seen that the person wearing a red shirt and blue trousers will cause there to be highlights in the appropriate areas of the template. However, someone else wearing a blue shirt and red trousers might be associated with a very similar template, yet it is clear to the observer that these are very different people. By making the spatial distribution of the colours an important factor, it is possible to ensure that such confusion does not arise, without compromising the generality of the system. To this end, we add two further axes to the template which implement this spatial weighting: r and θ.
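The opponent transformation above is a simple linear map of the RGB point values; a brief sketch (the widening to a signed integer type is only to avoid 8-bit overflow):

```python
import numpy as np

def opponent_axes(rgb):
    """Map (r, g, b) point values to the opponent axes of the text:
    rg = r - g        (red   <-> green)
    by = 2b - r - g   (blue  <-> yellow)
    wb = r + g + b    (white <-> black)
    `rgb` is an (..., 3) array; returns an (..., 3) array of (rg, by, wb).
    """
    r = rgb[..., 0].astype(np.int32)
    g = rgb[..., 1].astype(np.int32)
    b = rgb[..., 2].astype(np.int32)
    return np.stack([r - g, 2 * b - r - g, r + g + b], axis=-1)
```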
These are analogous to the polar co-ordinates of a given point, o, in an object, O, such that, as shown in Figure 3, if c is the "centre of gravity" of the object then

    θ = ∠vco;   r = l/R

where v is the point of intersection between the circle, centre c and radius R, which circumscribes the object, and a vertical line through c, and l is the distance between o and c.

Figure 3: calculation of the r and θ values

Finally, it can be recognised that wb is simply a measure of monochrome point intensity, a property which is more susceptible to shadows and changes in lighting than the others being measured. Furthermore, since there are already many methods which make use of monochrome techniques to track objects, and it is our aim to develop the use of a colour technique which may be complementary to these methods, this axis is discarded. Thus we are left with a template of four dimensions: rg, by, r and θ.

2.3 Template Tracking

The template, T, is compared in turn with each of those already stored in the database, T_1, ..., T_m, where m is the number of templates stored. These T_i are the templates representing all the objects so far encountered. The match coefficient, v, is calculated as

    v = (1 / S_T) Σ_{rg=0}^{n-1} Σ_{by=0}^{n-1} Σ_{r=0}^{R} Σ_{θ=0}^{2π} | T_{rg,by,r,θ} - T_{i;rg,by,r,θ} |

where

    S_T = Σ_{rg=0}^{n-1} Σ_{by=0}^{n-1} Σ_{r=0}^{R} Σ_{θ=0}^{2π} T_{rg,by,r,θ}

such that 0 ≤ v ≤ 1, where v = 0 indicates a perfect match (templates identical) and v = 1 indicates a perfect mismatch (templates completely different). The scaling down by the sum of values in the template, which is the same as the number of image points in the corresponding object, ensures that the coefficient is invariant to scale. This coefficient is then subject to previously determined minimum and maximum thresholds, t_min and t_max respectively:

• If v < t_min then the two templates are similar enough to say that they represent the same object.
In this case, T_i is updated by replacing it with the template definition of T, and the object represented by these templates is said to have moved from the position associated with T_i to the position associated with T.

• If v > t_max then the two templates are different enough to say that they represent different objects. If v > t_max for all T_i, then T is defined as representing an object not previously encountered. A new template, T_{m+1}, is created in the database, consisting of the template definition of T, and the object is said to be at the position associated with T.

• If t_min ≤ v ≤ t_max, the templates are neither sufficiently similar nor sufficiently dissimilar to make a firm decision and the match is said to be inconclusive. This may occur if, for example, T represents a partially-occluded object or two objects occluding each other: situations in which we do not wish to make any judgements about the object associated with T. In this case, T is discarded and there is therefore no information with which to make any decision about the position of any object.

The values of t_min and t_max are determined by experiment, and will be implementation-independent. Ideally, t_min → 0 and t_max → 1, and it is only the frame-to-frame change in parameters, as well as the usual noise errors, which means that more realistic values need to be chosen. These values will depend on the distribution of unique colours across targets, so that if the objects are all similarly coloured, t_min and t_max will need to be close together in value, whereas when the objects are all distinctly coloured they may be further apart.

3 A Hierarchical Approach

The use of a four-dimensional template means that matching can take a considerable time. A modestly-sized template with sixteen bins on each axis necessitates the comparison of over 65,000 pairs of values.
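Putting sections 2.2 and 2.3 together, the construction of a four-dimensional template and the match coefficient might be sketched as follows. This is an assumption-laden illustration, not the paper's code: the bin counts on the spatial axes, the θ origin (taken from arctan2 rather than the vertical point v of Figure 3) and the normalisation of each template by its own sum, which bounds v by 1, are choices made here for concreteness.

```python
import numpy as np

def make_template(pixels, xy, n=16, n_r=8, n_theta=8):
    """Hypothetical sketch of a 4-D colour template over (rg, by, r, theta).

    pixels : (N, 3) RGB values of the object's points.
    xy     : (N, 2) image coordinates of those points.
    n      : bins per opponent-colour axis; n_r, n_theta : spatial bins.
    """
    r_, g_, b_ = (pixels[:, i].astype(np.float64) for i in range(3))
    rg = r_ - g_                      # red <-> green, range [-255, 255]
    by = 2 * b_ - r_ - g_             # blue <-> yellow, range [-510, 510]
    # (the wb = r + g + b axis is discarded, as in the paper)

    c = xy.mean(axis=0)               # "centre of gravity" of the object
    d = xy - c
    l = np.hypot(d[:, 0], d[:, 1])    # distance of each point o from c
    R = l.max() + 1e-9                # radius of the circumscribing circle
    r_norm = l / R                    # r = l / R, in [0, 1)
    theta = np.arctan2(d[:, 1], d[:, 0]) % (2 * np.pi)

    sample = np.stack([rg, by, r_norm, theta], axis=1)
    T, _ = np.histogramdd(
        sample, bins=(n, n, n_r, n_theta),
        range=[(-255, 255), (-510, 510), (0, 1), (0, 2 * np.pi)])
    return T

def match_coefficient(T, Ti):
    """v in [0, 1]; v = 0 means identical templates. Dividing each template
    by its own sum (the number of object points) makes v scale-invariant."""
    return 0.5 * np.abs(T / T.sum() - Ti / Ti.sum()).sum()
```

With thresholds t_min and t_max, the three-way decision of section 2.3 is then a pair of comparisons on the returned v.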
In a small application, with 30 different people in the database, millions of calculations would have to be performed in the Template Matching stage for each single frame. Even if each calculation took only one microsecond, Template Matching would take several seconds per frame, which is clearly unacceptable when the requirement is for a real-time system. The fact that so much time needs to be spent on these comparisons seems particularly absurd, since the most cursory inspection by an observer reveals that almost all of the template pairs are clearly different.

The approach taken to alleviate this problem is to manufacture a "pyramid" of templates for each object, instead of just one. This pyramid consists of progressively coarser-resolution templates so, for example, a sixteen-bin template would be accompanied by eight-bin, four-bin and two-bin templates. Although this provides a lot of useful new information, the overheads of this extra work are very low. Using four-dimensional, sixteen-bin templates, the highest-resolution template in a pyramid holds 65,536 values (16^4), whereas the next-highest holds only 4,096 (8^4), the next 256 (4^4) and the lowest-resolution just 16 (2^4): a combined total of less than ten percent extra. Furthermore, since each successively lower-resolution template is simply interpolated from the next higher-resolution template by averaging the appropriate surrounding values, minimal further processing time is required.

In this hierarchical system, each comparison begins with the lowest-resolution pair of templates.
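The pyramid construction and the coarse-to-fine dismissal just described can be sketched as follows. This is again an illustrative sketch: the match coefficient is the scale-normalised difference of section 2.3, and the value of t_max is arbitrary.

```python
import numpy as np

def build_pyramid(T, levels=4):
    """From a fine template produce progressively coarser versions, each
    obtained by averaging 2x2x2x2 blocks of the next-finer level."""
    pyramid = [T]
    for _ in range(levels - 1):
        t = pyramid[-1]
        # average adjacent pairs of bins along each of the four axes
        for axis in range(4):
            t = 0.5 * (t.take(np.arange(0, t.shape[axis], 2), axis=axis) +
                       t.take(np.arange(1, t.shape[axis], 2), axis=axis))
        pyramid.append(t)
    return pyramid  # [finest, ..., coarsest]

def hierarchical_match(pyr_a, pyr_b, t_max=0.8):
    """Coarse-to-fine comparison: dismiss the pair as soon as some level
    gives v > t_max; otherwise return the highest-resolution v."""
    v = None
    for Ta, Tb in zip(reversed(pyr_a), reversed(pyr_b)):  # coarsest first
        v = 0.5 * np.abs(Ta / Ta.sum() - Tb / Tb.sum()).sum()
        if v > t_max:
            return v  # definitely different; no finer comparison needed
    return v
```

Because averaging only rescales the bin values and v is normalised by each template's sum, a pair that matches at full resolution will not be spuriously dismissed at a coarse level unless its coarse v genuinely exceeds t_max.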
Of course, in almost all cases the templates will actually be representing different objects, and in the majority of these cases the difference between the templates is sufficient at this low level to determine that they are not representing the same object. Only if v < t_max, such that the objects are not definitely different, need the process continue with the next-higher-resolution pair, and so on, until the highest-resolution pair confirms that the templates are representing the same object (or otherwise). In practice, when trying to match an object's template with 30 others stored in a database, on about 25 occasions the lowest-resolution, two-bin template will provide sufficient information to dismiss the match, and only five will require examination of some of the higher-resolution templates, with only one (the correct object match) necessitating a look at the highest-resolution, sixteen-bin templates. As Table 1 shows, the advantages of the pyramid system become abundantly clear as the size of the database increases.

    Database    Single-template system            Hierarchical system
    size        Values (M)    Calculations (M)    Values (M)    Calculations (M)
    30          2.0           2.0                 2.1           0.1
    100         6.6           6.6                 7.0           0.1
    1000        65.5          65.5                69.9          0.5

Table 1: Storage requirements for a template database and processing requirements for comparing a new template with those currently in the database, exemplifying the advantage of a hierarchical system over a single-template system

4 Experimental Results

A simple sequence of frames was chosen in order to demonstrate the system in practice. This sequence depicted two people walking towards each other, passing (occluding) and walking away, and was augmented with several still frames of other people in order to provide a database of seven people. A hierarchical arrangement of four-dimensional templates was used, with sixteen bins on each axis of the highest-resolution template.
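The storage figures in Table 1 follow directly from the bin counts and can be checked with a few lines of arithmetic:

```python
# A single 16-bin, four-dimensional template holds 16**4 values; a full
# pyramid adds the 8-, 4- and 2-bin levels on top of it.
single = 16 ** 4                             # 65,536 values per template
pyramid = sum(b ** 4 for b in (16, 8, 4, 2))  # 69,904 values per template

overhead = pyramid / single - 1              # about 6.7% extra storage

for m in (30, 100, 1000):                    # database sizes from Table 1
    print(m, round(m * single / 1e6, 1), round(m * pyramid / 1e6, 1))
# 30   -> 2.0 vs 2.1 million values
# 100  -> 6.6 vs 7.0 million values
# 1000 -> 65.5 vs 69.9 million values
```

This confirms the "less than ten percent extra" figure quoted in section 3.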
The system successfully tracked the two moving people and suspended tracking for the period of occlusion, as required. Within the hierarchical implementation, all of the matches into the database were determined to be false at the two-bin or four-bin template level (except, of course, the correct match). This was to be expected, since the subjects chosen for the database were all clearly different to the observer, and the choice of more similar-looking subjects would undoubtedly require comparison of the higher-resolution templates. Figure 4 shows six sample frames from the sequence. Alongside these is shown one of the targets which has been selected to be tracked. During the third frame this target is lost, as it is occluded. However, the target is matched and found again for the fourth and subsequent frames.

Figure 4: Six frames from the test sequence and a target which has been tracked

5 Conclusion

Since the system proposed is very general in nature, it clearly has many potential applications in addition to our own. However, because of its particular advantages, there are four areas of tracking to which it is particularly suited:

• Where the target objects are of similar shape and/or size. There is no need to model any object-based parameters such as shape or size so, for example, the system would be able to follow people in a street or cars on a road, whereas a system based on distinction by shape and size alone would fail.

• Where the target objects have irregular motion. Since there is no predictive element involved, targets may be tracked no matter how they move through the scene, since at any given time they may be expected to appear in any part of the image or, indeed, not at all.

• Where object occlusion occurs. If an object disappears from view or is occluded for a while, it will be re-located when it can be sufficiently seen again, as long as the change in colour reflectance over the intervening period falls within the allowable small tolerance level; there is therefore no need to attempt to track it through every single frame.

• Where a large database of objects is necessary. Many database systems suffer because the time taken to search through the database for an object match is proportional to the number of objects held (or even proportional to some higher power of the number of objects held). By using a hierarchical storage and processing system, the time taken to search through the database increases only slowly as the database grows.

It is not suggested that the system presented here is capable of performing in all situations, rather that combination with existing techniques may be the best way of providing a solution to any given application problem.

6 Further Work

The Object Extraction process does not form an explicit part of this work, and the simple method chosen has proved to be somewhat unreliable under certain circumstances, where areas of the background sometimes show up as "moving" due to a change in lighting conditions between acquisition of the reference image, R, and the current image, F. These are currently removed by a filter based on allowable object sizes, but this is not a satisfactory solution since, although generally successful when using the system indoors, the fluctuations sometimes cause processing of outside scenes to break down. As one of the aims of this work is to reduce problems associated with colour constancy, a different method of generating R needs to be employed which will ensure the background is constantly updated [7, 1]. Furthermore, there are potential applications where the camera will not be stationary, and so a more sophisticated method of Object Extraction must be used which will also take into account ego-motion.
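One common way of keeping R constantly updated, in the spirit of what is suggested here (though not necessarily the method of [7] or [1]), is an exponential running average applied only to pixels currently flagged as static; the blending rate alpha is an arbitrary choice.

```python
import numpy as np

def update_reference(R, F, D, alpha=0.05):
    """Hypothetical sketch of background maintenance: blend each stationary
    pixel of the new frame F into the reference image R with an exponential
    running average, leaving pixels flagged as moving (D == 1) untouched.

    R : (H, W, 3) reference image; F : (H, W, 3) current frame;
    D : (H, W) binary difference image from the Object Extraction step.
    """
    R = R.astype(np.float64)
    static = ~D.astype(bool)                 # pixels not currently moving
    R[static] = (1 - alpha) * R[static] + alpha * F[static].astype(np.float64)
    return R
```

Applied every cycle, this lets slow illumination drift be absorbed into R while moving objects are kept out of the background estimate.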
The usefulness of this system depends on the ability to carry out processing in real time and, although the use of hierarchical storage and processing helps enormously in the Template Tracking stage (especially with a large database), the time involved in producing a template from an input frame is still of the order of several seconds (on a Sun SPARCstation). Although the ability to process at frame-rate is not a requirement for real-time processing, some rationalisation of program and storage structures must be undertaken to reduce the program cycle time towards the sub-one-second scale. Specialist hardware would probably be required to achieve this, but the repetitive nature of the tasks involved makes this an ideal candidate for implementation in a parallel environment.

References

[1] A. T. Ali and E. L. Dagless. Computer vision for security surveillance and movement control. In IEE Colloquium on Electronic Images and Image Processing in Security and Forensic Science, pages 6/1-6/7. IEE, May 1990.
[2] Dana H. Ballard and Christopher M. Brown. Computer Vision. Prentice-Hall, Inc., Englewood Cliffs, NJ, 1982.
[3] Peter J. Burt. Smart sensing within a pyramid vision machine. In Proceedings of the IEEE, volume 76, pages 1006-1015. IEEE, August 1988.
[4] David Hogg. Model-based vision: a program to see a walking person. Image and Vision Computing, 1(1):5-20, February 1983.
[5] Leo M. Hurvich and Dorothea Jameson. An opponent-process theory of color vision. Psychological Review, 64(6):384-404, 1957.
[6] Peter Lennie and Michael D'Zmura. Mechanisms of color vision. CRC Critical Reviews in Neurobiology, 3:333-400, 1988.
[7] Simon W. Lu. A multiple target tracking system. In Proceedings of the SPIE, volume 1388, pages 299-305, 1991.
[8] Michael J. Swain. Color indexing. Ph.D. thesis, University of Rochester, New York, November 1990.