IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 16, NO. 5, MAY 1994 449

Motion Tracking with an Active Camera

Don Murray and Anup Basu, Member, IEEE

Abstract—This work describes a method for real-time motion detection using an active camera mounted on a pan/tilt platform. Image mapping is used to align images of different viewpoints so that static camera motion detection can be applied. In the presence of camera position noise, the image mapping is inexact and compensation techniques fail. The use of morphological filtering of motion images is explored to desensitize the detection algorithm to inaccuracies in background compensation. Two motion detection techniques are examined, and experiments to verify the methods are presented. The system successfully extracts moving edges from dynamic images even when the pan/tilt angles between successive frames are as large as 3°.

I. INTRODUCTION

Motion detection and tracking are becoming increasingly recognized as important capabilities in any vision system designed to operate in an uncontrolled environment. Animals excel in these areas because motion is inherently interesting and important. In any scene, motion represents the dynamic aspects. For animals, motion can mean either food or danger, matters of life and death. For a mobile robot platform, motion can imply a chance of collision, dangers to navigation, or alterations in previously mapped regions. Tracking in computer vision, however, is still in the developmental stages and has had few applications in industry. It is hoped that tracking combined with other technologies can produce effective visual servoing for robotics in a changing work cell. For example, recognizing and tracking parts on a moving conveyor belt in a factory would enable robots to pick up the correct parts in an unconstrained work atmosphere.

In this work, we consider tracking with an active camera. Active vision [4], [5], [2] implies computer vision implemented with a movable camera, which can intelligently alter the viewpoint so as to improve the performance of the system. An active camera tracking system could operate as an automatic cameraman for applications such as home video systems, surveillance and security, video-telephone systems, or other tasks that are repetitive and tiring for a human. Recently, it has been shown that tracking facilitates motion estimation.

Section II contains a brief overview of some of the previous work in topics related to tracking. A description of the tracking system used in this work is given in Section III. The relationship between camera frame positions and pixel locations at different pan/tilt orientations is investigated in Section IV. Section V explains the methods of motion detection that were explored and developed. In Section VI, we show the results of our motion detection algorithms for a sequence of real images. Section VII discusses some of the limitations imposed on the system by synchronization error and noise filtering.

II. PREVIOUS WORK

In general, there are two approaches to tracking that are fundamentally different: recognition-based tracking and motion-based tracking. Recognition-based tracking is really a modification of object recognition. The object is recognized in successive images and its position is extracted. The advantage of this method of tracking is that it can be achieved in three dimensions. Also, the translation and rotation of the object can be estimated. The obvious disadvantage is that only a recognizable object can be tracked. Object recognition is a high-level operation that can be costly to perform. Thus, the performance of the tracking system is limited by the efficiency of the recognition method, as well as the types of objects recognizable.
Examples of recognition-based systems can be found in the work of Aloimonos and Tsakiris [3], Bray [8], Gennery [12], Lowe [16], Wilco et al. [26], Schalkoff and McVey [22], and others [24], [9], [21].

Motion-based tracking systems are significantly different from recognition-based systems in that they rely entirely on motion detection to detect the moving object. They have the advantage of being able to track any moving object regardless of size or shape, and so are more suited for our system. Motion-based techniques can be further subdivided into optic-flow tracking methods and motion-energy methods, as described in Sections II-A and II-B.

A. Optic Flow Tracking

The field of retinal velocity is known as optic flow [15], [20], [25]. The difficulty with optic-flow tracking is the extraction of the velocity field. By assuming that the image intensity can be represented by a continuous function f(x, y, t), we can use a Taylor series expansion to show that [11]

0 = (∂f/∂x) u + (∂f/∂y) v + ∂f/∂t,  (1)

where u = dx/dt and v = dy/dt are the components of the instantaneous 2-D velocity at (x, y). This is a convenient equation, since ∂f/∂t, ∂f/∂x, and ∂f/∂y can all be approximated locally. The difficulty in applying (1) is that we have two unknowns and only one equation.

Manuscript received October 13, 1992; revised May 27, 1993. This work was supported in part by the Canadian Natural Sciences and Engineering Research Council under Grant OGPO10539, and by a fellowship from the Province of Alberta. Recommended for acceptance by Associate Editor Y. Aloimonos. D. Murray is with MacDonald Dettwiler, 13800 Commerce Parkway, Richmond, BC V6V 2J3, Canada. A. Basu is with the Computing Science Department, University of Alberta, Edmonton, AB T6G 2H1, Canada, and with Telecommunications Research Labs, Edmonton, AB T5K 2P7, Canada. IEEE Log Number 9215856.
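The brightness-constancy constraint (1) can be checked numerically. The sketch below is not from the paper: it assumes an illustrative synthetic intensity pattern (a translating Gaussian blob) and evaluates the constraint with finite differences; for the true image velocity (u, v), the residual of (1) should vanish up to discretization error.

```python
import math

def f(x, y, t, u0=2.0, v0=1.0):
    """Synthetic image intensity: a Gaussian blob translating at velocity (u0, v0)."""
    return math.exp(-((x - u0 * t) ** 2 + (y - v0 * t) ** 2) / 20.0)

def residual(x, y, t, u, v, h=1e-3):
    # Central-difference estimates of the partial derivatives of f.
    fx = (f(x + h, y, t) - f(x - h, y, t)) / (2 * h)
    fy = (f(x, y + h, t) - f(x, y - h, t)) / (2 * h)
    ft = (f(x, y, t + h) - f(x, y, t - h)) / (2 * h)
    # Equation (1): 0 = fx*u + fy*v + ft at the true velocity (u, v).
    return fx * u + fy * v + ft

print(abs(residual(1.0, 0.5, 0.3, u=2.0, v=1.0)) < 1e-6)  # True
```

Note that any (u, v) on the constraint line gives a zero residual as well, which is exactly the aperture ambiguity discussed next.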
Thus, this equation describes a line on which (u, v) must lie, but we cannot solve for (u, v) uniquely without imposing additional constraints, such as smoothness [10], [13], [23].

Some interesting work has been done with partial optic-flow information from an active camera image sequence to detect motion [18]. The significance of this work is that motion detection is achieved from a fully active camera. The algorithm can be evaluated quickly and hence is a computationally efficient method for detecting motion from an unconstrained platform. The drawback is that the evaluation is qualitative, and hence it is less discriminatory than a quantitative approach. In addition, for the approximation in (1) to be valid, image points should not move more than a few pixels between successive images. This implies that the pan/tilt angles between consecutive frames must be very small. Our system is not constrained by this restriction. Further discussion on this topic can be found in Sections VI and VII.

Since determining a complete optic-flow field quantitatively is both expensive and ill-posed, solving the problem for a few discrete points has been a popular alternative for practical systems [6]. This method relies on identifying points of interest (also known as features) in a series of images and tracking their motion [7]. The disadvantage of this technique is that the points of interest in each scene must be matched to those of the previous image, which is generally an intractable problem. The difficulties increase in the case of an active camera. Since the scene viewed is dynamic, certain points will pass beyond the field of view while new ones will enter (drop-outs and drop-ins), which increases the difficulty of searching for matching points. The complexity of this problem makes it unsuitable for real-time applications.

B. Motion Energy Tracking

Another method of motion tracking is motion-energy detection. By calculating the temporal derivative of an image and thresholding at a suitable level to filter out noise, we can segment an image into regions of motion and inactivity. Although the temporal derivative can be estimated by a more exact method, usually it is estimated by simple image subtraction:

∂f(x, y, t)/∂t ≈ [f(x, y, t) − f(x, y, t − δt)] / δt.

This method of motion detection is subject to noise and yields imprecise values. In general, techniques for improving image subtraction include spatial edge information to allow the extraction of moving edges. Picton [19] utilized edge strength as a multiplier to the temporal derivative prior to thresholding. Allen et al. [1] used zero crossings of the second derivative of the Gaussian filter as an edge locator and combined this information with the local temporal and spatial derivatives to estimate optic flow.

For practical, real-time implementations of motion detection, image subtraction combined with spatial information is the most widely used and successful method. In addition to computational simplicity, motion-energy detection is suitable for pipeline architectures, which allows it to be readily implemented on most high-speed vision hardware. One disadvantage of this method is that pixel motion is detected but not quantified. Therefore, one cannot determine additional information, such as the focus of expansion. Another disadvantage is that the techniques discussed are not suitable for application on active camera systems without modification. Since active camera systems can induce apparent motion on the scenes they view, compensation for this apparent motion must be made before motion-energy detection techniques can be used.

III. NOTATION AND MODELING

A. Notation

Since we often discuss point locations in both two and three dimensions, it is important to differentiate between them. A location in 3-D is written symbolically in capital letters as (X, Y, Z) or is presented as a column vector P = (X, Y, Z)^T. Two-dimensional points are written in lower case, such as (x, y). Arbitrary homogeneous transformations are formulated as a 4 × 4 matrix T.

For the pan/tilt camera parameters,
f is the focal length of the camera,
θ is the tilt angle from the level position,
α is a small angle of rotation about the pan axis, and
γ is a small angle of rotation about the tilt axis.

B. Pin-Hole Camera Model

Throughout this work, the pin-hole camera model is used. As shown in Fig. 1, let OXYZ be the camera coordinate system. The image plane is perpendicular to the Z-axis and intersects it at the point (0, 0, f), where f is the focal length. Using this model, the relationships between points in the image plane and points in the camera coordinate system are

x = f X / Z,  y = f Y / Z.

Fig. 1. Pin-hole camera model.

C. Pan/Tilt Model

The active camera considered in this work is mounted on a pan/tilt device that allows rotation about two axes. Fig. 2 shows the schematics of such a device; the Cohu-MPC system is shown in Fig. 3. The reference frame for each camera position is formed by the intersecting axes of rotation (pan = Y-axis, tilt = X-axis). The origin of the camera coordinate system is located at the lens centre, which is related to the reference frame by a homogeneous transformation T_c, such that P_r = T_c P_c, where P_r and P_c are 3-D points in the reference and camera frames, respectively. For an arbitrary tilt angle θ, the camera transformation T_c is

T_c(θ) = R_x(θ) T_c|θ=0,

where T_c|θ=0 is a pure translation by (p_x, p_y, p_z), the position of the lens centre at the level position, so that the translation column of T_c(θ) is (p_x, cθ p_y − sθ p_z, sθ p_y + cθ p_z, 1)^T, and c and s are used to abbreviate cos and sin, respectively.

Fig. 2. Schematics of pan/tilt device.
Fig. 3. Cohu-MPC pan/tilt device.

IV. BACKGROUND COMPENSATION

To be able to apply the motion detection techniques to be introduced in Section V, we must compensate for the apparent motion of the background of a scene caused by camera motion.
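The pin-hole projection and the tilted-camera transform of Section III can be sketched as follows; this is a minimal illustration, and the numeric values in the usage lines are illustrative assumptions, not from the paper.

```python
import math

def project(P, f):
    """Pin-hole projection: (X, Y, Z) -> (x, y) with x = f*X/Z, y = f*Y/Z."""
    X, Y, Z = P
    return f * X / Z, f * Y / Z

def tilt_transform(P, theta, p=(0.0, 0.0, 0.0)):
    """Apply T_c(theta) = R_x(theta) * T_c|theta=0 to a camera-frame point:
    translate by the lens-centre offset p, then rotate about the X (tilt) axis."""
    X, Y, Z = P
    px, py, pz = p
    c, s = math.cos(theta), math.sin(theta)
    X, Y, Z = X + px, Y + py, Z + pz
    return (X, c * Y - s * Z, s * Y + c * Z)

print(project((1.0, 2.0, 10.0), 890.0))   # (89.0, 178.0)
print(tilt_transform((1.0, 2.0, 3.0), 0.0))  # identity at theta = 0
```

With p = 0 the transform reduces to a pure rotation about the tilt axis, which is the idealization the compensation derivation below relies on.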
Our camera is mounted on a pan/tilt device and hence is constrained to rotate only. This is ideal for background compensation, since visual information is invariant to camera rotation [14]. Our objective in background compensation is to find a relationship between pixels representing the same 3-D point in images taken at different camera orientations.

The projection of a 3-D point on the image plane is formed by a ray originating from the 3-D point and passing through the lens center. The pixel representing this 3-D point is given by the intersection of this ray with the image plane (see Fig. 1). If the camera rotates about the lens center, this ray remains the same, since neither endpoint (the 3-D point and the lens center) moves due to this rotation. Consequently, no previously viewable points will be occluded by other static points within the scene. This is important, since it implies that there is no fundamental change in information about a scene at different camera orientations. It should be noted that, for theoretical considerations, the effect of the image boundary is ignored here. Obviously, regions that pass outside the image due to camera motion cannot be recovered.

For camera rotation, the only components of the system that move are the camera coordinate system and the image plane. An example of this motion is shown in Fig. 4. The relationship between every pixel position in two images, taken from different positions of rotation about the lens center, has been derived by Kanatani [14]. In the Appendix, we show that for small displacements of the lens center from the center of rotation, Kanatani's relationship remains valid.

Fig. 4. 3-D point projected on two image planes with the same lens center.
Fig. 5. Static camera sequence.
Fig. 6. Image subtraction and edge images.
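A practical consequence of this rotation invariance is that the pixel-to-pixel mapping between the two orientations depends only on image coordinates and the rotation, never on scene depth. The sketch below implements the small-rotation mapping stated explicitly in the next paragraph; the focal length and angle values in the usage line are illustrative assumptions.

```python
import math

def map_pixel(x_t, y_t, f, theta, alpha, gamma):
    """Map a pixel in the current frame to its location in the previous frame,
    for small pan (alpha) and tilt (gamma) rotations at tilt angle theta.
    Note that no depth value appears: this is why pure rotation about the
    lens center can be compensated from the images alone."""
    d = -alpha * math.cos(theta) * x_t + gamma * y_t + f
    x_prev = f * (x_t + alpha * math.sin(theta) * y_t + f * alpha * math.cos(theta)) / d
    y_prev = f * (-alpha * math.sin(theta) * x_t + y_t - f * gamma) / d
    return x_prev, y_prev

# With no rotation the mapping is the identity.
print(map_pixel(100.0, 50.0, f=890.0, theta=0.0, alpha=0.0, gamma=0.0))  # (100.0, 50.0)
```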
For an initial inclination θ of the camera system and pan and tilt rotations of α and γ, respectively, this relationship is

x_{t−1} = f (x_t + α sinθ y_t + f α cosθ) / (−α cosθ x_t + γ y_t + f)
y_{t−1} = f (−α sinθ x_t + y_t − f γ) / (−α cosθ x_t + γ y_t + f)

where f is the focal length. With knowledge of f, θ, γ, and α, for every pixel position (x_t, y_t) in the current image we can calculate the position (x_{t−1}, y_{t−1}) of the corresponding pixel in the previous image.

V. INDEPENDENT MOTION DETECTION

As mentioned in Section II, motion-energy detection has been demonstrated to be a successful and practical motion detection approach for real-time tracking. Our implementation is therefore based primarily on this technique. Yet, because of the potential error incurred during camera motion compensation, modifications must be made to the motion detection methods. In this section we first discuss, in some detail, motion-energy detection. Then we describe what measures are taken to adapt these techniques to active camera systems.

A. Motion Detection with a Static Camera

In practice, motion-energy detection is implemented through spatio-temporal filtering. The simplest implementation of motion-energy detection is image subtraction. In this method, each image has the previous image in the image sequence subtracted from it, pixel by pixel. This is an approximation of the temporal derivative of the sequence. The absolute value of this approximation is taken and thresholded at a suitable level to segment the image into static and dynamic regions. Fig. 5 shows two frames of an image sequence taken with a static camera. Fig. 6 (left) shows the result of image subtraction.

The drawback of this technique is that motion is detected in regions where the moving object was present at times t and t − δt. This means that the center of the regions of motion is close to the midpoint between the actual positions of the object at t and t − δt. For systems with a fast sampling rate (small δt)
compared to the speed of the moving object, the difference in position of the object between frames will be small, and hence the midpoint between them may be adequate for rough position estimates. For objects with high speeds relative to the sampling rate, we must improve this method.

Our aim is to estimate the position of the moving object at time t. This can be achieved by the following steps. First, we obtain a binary edge image of the current frame by applying a threshold to the output of an edge detector. (An example of the resultant edge image is shown in Fig. 6(b).) Then this information is incorporated into the subtracted image by performing a logical AND operation between the two binary images, i.e., the edge image and the subtracted image. This highlights the edges within the moving region to obtain the moving edges within the latest frame. Fig. 7(a) shows the result of this operation. As can be seen, edges are also highlighted in the area previously occluded by the moving object. Since these edges have been viewed for only one sample instant, it is unreasonable to expect the system to be able to detect whether or not they are moving until the next image is taken and processed.

A modified approach was suggested by Picton [19]. He argued that thresholds are empirically tuned parameters and, in order to keep the system as simple and robust as possible, the number of tuned parameters should be minimized. Hence, he proposed reducing thresholding to a single step by multiplying the prethreshold values of the edge strength and image subtraction to obtain a value indicative of both edge strength and temporal change. This product is then thresholded, and thus the tuned parameters are reduced to a single threshold value. The result of this multiplication method is shown in Fig. 7(b). As we can see, the boundary of the moving object is the same as with the logical AND method. However, with Picton's method more interior edges are emphasized.

Fig. 7. Moving edges detected in frame 2 with the two techniques.

B. Motion Detection by an Active Camera

For a stationary camera, the pixel-by-pixel subtraction described in Section V-A is possible since, with a static scene, a given 3-D point will continuously project to the same position in the image plane. For a moving camera this is not the case. To apply pixel-by-pixel comparison to an active-camera image sequence, we must map pixels that correspond to the same 3-D point to the same image plane position. Section IV outlined the geometry behind invariance to rotation and derived the mapping function between images. For each pair of images processed in the image sequence, the image at time t − δt is mapped so as to correspond pixel by pixel with the image at time t. Regions with no match between the two images are ignored. The image subtraction, edge extraction, and subsequent moving edge detection are performed as in the static case.

Fig. 8 shows two images taken from different camera orientations. Fig. 9(a) shows the results of image subtraction without compensation. Clearly the apparent motion caused by camera rotation has made the static camera methods unsuitable; the object of interest appears to be moving less than the background. Fig. 9(b) shows the results after background compensation. Notice that the background has not been entirely eliminated. This is due to inaccuracies in the inputs to the compensation algorithm and approximations made in the algorithm derivation. The background compensation algorithm was derived under the assumption that rotation occurs about the lens center. In reality this is not the case for our system, and the small amount of camera translation corrupts the compensation method. Also, errors in the pan/tilt position sensors and camera calibration contribute to the compensation inaccuracy. The following section shows how we overcome this compensation noise and improve the robustness of our method.

Fig. 8. Moving camera sequence.
Fig. 9. Image subtraction with and without compensation.

C. Robust Motion Detection with an Active Camera

If we could achieve exact background compensation, the methods described so far would be sufficient. In the presence of position inaccuracies, however, the results deteriorate rapidly. We use edge information in our techniques to detect moving objects. Ironically, regions with good edge characteristics are the most sensitive to compensation errors during image subtraction. That is, false motion caused by inaccurate compensation will be greatest in strong edge regions, yet these are the very regions that are considered as candidates for moving edge pixels. This problem makes the method discussed unreliable. Since errors in angle information are inevitably present, it is desirable to develop methods of motion detection that can robustly eliminate the false motion they cause.

Errors in pan/tilt angles can be due to sensor error. For a real-time system with a continuously moving camera, there is the additional problem of synchronization. If the instants of grabbing an image and reading the position sensors are not perfectly synchronized, the finite difference in time between these events can be considered as error in position sensing. This error is

θ_e = ω × Δt,

where θ_e is the error in angular position, ω is the angular velocity of rotation, and Δt is the synchronization error. Since few vision systems are designed with this consideration in mind, the problem is a common one in active vision applications.

Fig. 9(b) shows an example of the results of image subtraction after inaccurate background compensation. Notice that the region of the moving object contains a broad area where the true motion was detected, whereas false motion is characterized by narrow bands bordering the strong edges representing the background.
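The synchronization-error model θ_e = ω Δt is easy to quantify. The sketch below uses the small-angle, image-center approximation x_e ≈ f θ_e from Section VII; the rotation rate and timing error are illustrative assumptions, not measurements from the paper.

```python
import math

def angle_error(omega_deg_per_s, dt_s):
    """Angular position error caused by unsynchronized grab/read: theta_e = omega * dt."""
    return omega_deg_per_s * dt_s

def pixel_error(f, theta_e_deg):
    """Small-angle estimate of the resulting pixel error near the image center,
    x_e ~ f * theta_e (theta_e in radians); f = 890 pixels as in Section VII."""
    return f * math.radians(theta_e_deg)

theta_e = angle_error(20.0, 0.033)   # assumed 20 deg/s rotation, 33 ms (one-frame) lag
print(round(theta_e, 3))             # 0.66 degrees
print(round(pixel_error(890.0, theta_e), 1))  # about 10.3 pixels of false motion
```

Even a one-frame lag thus produces false motion many pixels wide at strong background edges, which motivates the morphological filtering described next.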
Our approach to removing the false motion utilizes the expectation that a wide region of true motion will be present. Using morphological erosion and dilation (morphological opening), we eliminate narrow regions of detected motion while preserving the original size and shape of the wide regions. The method of robust motion detection is shown as a block diagram in Fig. 10.

Fig. 10. Block diagram of the method for robust motion detection (subtract, absolute value, morphological opening, threshold).

Morphological Filtering: Morphological filters applied to digital images have been used for several applications, including edge detection, noise suppression, region filling, and skeletonizing [17]. Morphological filtering is essentially an application of set theory to digital signals. It is implemented with a mask M overlaying an image region I. For morphological filtering, the image pixel values are selected by the values of M as members of a set for analysis with set theory methods. Usually, the values of elements in a morphological filter mask are either 0 or 1. If the value of an element of the mask is 0, the corresponding pixel value is not a member of the set. If the value is 1, the pixel value is included in the set. We can express the elements of the image region selected by the morphological mask as a set A such that

A = { I(x, y) : M(x, y) = 1 }.

The morphological operations we will consider are erosion and dilation. Erosion of A is E_A = min(A). Dilation of A is D_A = max(A). By applying erosion to the subtracted image, narrow regions can be eliminated. If the regions to be preserved are wider than the filter mask, they will only be thinned, not completely eliminated. After dilation by a mask of the same size, these regions will be roughly restored to their original shape and size.
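The opening just described (a mask-wide minimum, then a mask-wide maximum) can be sketched in one dimension; the signal and mask width below are illustrative assumptions.

```python
def erode(signal, width):
    """Erosion: each sample becomes the minimum over a width-wide window."""
    r = width // 2
    return [min(signal[max(0, i - r): i + r + 1]) for i in range(len(signal))]

def dilate(signal, width):
    """Dilation: each sample becomes the maximum over a width-wide window."""
    r = width // 2
    return [max(signal[max(0, i - r): i + r + 1]) for i in range(len(signal))]

def opening(signal, width):
    """Morphological opening: erosion followed by dilation with the same mask."""
    return dilate(erode(signal, width), width)

# A 1-sample-wide noise spike next to a 5-sample-wide motion region.
signal = [0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0]
print(opening(signal, 3))  # [0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0]
```

The spike (narrower than the mask) is removed outright, while the wide region survives with its original extent, which is the all-or-nothing behavior analyzed in Section VII.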
If the erosion mask is wider than a given region, that region will be eliminated completely and will not reappear after dilation. Fig. 11 shows the subtracted image of Fig. 9 (right) eroded by masks of different sizes. For this particular image sequence, we can see that to completely eliminate the noise due to position inaccuracies we must use a mask size of 11 × 11.

Fig. 11. Subtracted image with various sizes of erosion masks applied.

VI. EXPERIMENTAL RESULTS

The experimental results presented here use a sequence of images taken with the pan/tilt device. The image sequence is processed off-line; however, all modules except robust background compensation have already been implemented in real time. Two methods of motion detection were tested: thresholding the temporal and spatial derivatives independently, and multiplication of the derivatives prior to thresholding. The image sequences and the processed results are presented in this section.

The camera is mounted on the Cohu-MPC, a pan/tilt device. Instructions for this device are sent from a Sun3 over a serial interface. The Sun3 is mounted on a VMEbus with a real-time frame digitizer board. The camera is a CCD device with standard video output. The Cohu-MPC allows control of rotation about two axes (pan and tilt) as well as adjustment of zoom and focus settings. The position sensing of the pan/tilt axes is done by a potentiometer coupled to each driving motor shaft.

Fig. 12 shows an image sequence taken with the pan/tilt device. Figs. 13 and 14 show the results of the motion detection methods applied to the image sequence with an 11 × 11 morphological mask. The results shown have been generated using the two motion detection techniques presented in Section V. The two approaches are summarized as follows.

Approach 1: Binary images of the spatial and temporal derivative peaks are formed by thresholding the subtracted and edge strength images. These two binary images are then ANDed together to extract the moving edges in the scene.
Approach 2: The values of the spatial and temporal derivatives are multiplied, and the product is thresholded to extract the moving edges.

Both approaches use two 3 × 3 Sobel edge detection kernels to find the edge strength in the vertical and horizontal directions.

Fig. 12. Pan/tilt image sequence.
Fig. 13. Moving edges in pan/tilt image sequence (Approach 1).
Fig. 14. Moving edges in pan/tilt image sequence (Approach 2).

Although both approaches work reasonably well, they exhibit significantly different characteristics. Approach 1 detects primarily the boundary of the moving object, since the independent thresholding of the edge image tends to eliminate edges within the contour of the object. Approach 2 does not remove the contribution of the fainter edges until after multiplication with the temporal derivative. Thus we find that many interior edges of the moving object are revealed.

By considering the centers of motion produced by the two motion detection techniques, we see that the actual position estimates are nearly identical. However, the methods do yield different moving edge images, and there is potential for significantly different results in certain conditions. In the image sequence, the moving object has rich texture due to the folds in clothing, while the background is characterized by homogeneous blocks of similar intensity broken by strong edges. The texture provides a dense area of weak edges that can be brought out by Approach 2, which subjectively seems to be the preferred approach for this image sequence. Yet in the results of both approaches, background edges that were occluded by the moving object in the previous frame are detected as moving edges. Since our system has viewed these regions for only a single frame, it is unable to determine whether these edges are static or dynamic until the next frame is processed.
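The two approaches can be sketched on a pair of tiny synthetic 1-D frames. This is an illustrative simplification: the frames, thresholds, and the central-difference edge measure below are assumptions for demonstration (the paper uses 3 × 3 Sobel kernels on real 2-D images).

```python
def edge_strength(img):
    """Spatial derivative magnitude by central difference (stand-in for Sobel)."""
    return [abs(img[min(i + 1, len(img) - 1)] - img[max(i - 1, 0)]) / 2
            for i in range(len(img))]

def temporal_derivative(curr, prev):
    """Absolute frame difference, the motion-energy signal."""
    return [abs(c - p) for c, p in zip(curr, prev)]

def approach1(curr, prev, t_edge, t_diff):
    """Threshold edge and difference images independently, then AND them."""
    e, d = edge_strength(curr), temporal_derivative(curr, prev)
    return [int(ei > t_edge and di > t_diff) for ei, di in zip(e, d)]

def approach2(curr, prev, t_prod):
    """Picton's method: multiply the derivatives, then apply one threshold."""
    e, d = edge_strength(curr), temporal_derivative(curr, prev)
    return [int(ei * di > t_prod) for ei, di in zip(e, d)]

# A bright block that moves one pixel to the right between frames.
prev = [0, 0, 9, 9, 9, 0, 0, 0]
curr = [0, 0, 0, 9, 9, 9, 0, 0]
print(approach1(curr, prev, t_edge=2, t_diff=2))  # [0, 0, 1, 0, 0, 1, 0, 0]
print(approach2(curr, prev, t_prod=8))            # [0, 0, 1, 0, 0, 1, 0, 0]
```

On this toy signal both approaches mark the trailing and leading edges of the moving block; the difference in interior-edge behavior described above only emerges on textured objects, where weak interior edges fall below Approach 1's independent edge threshold.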
If the background had more varied texture, such as a wheat field or a chain-link fence, regions previously blocked by the moving object would have the same weak edges brought out by Approach 2. This would corrupt the moving edge signal and tend to produce a centroid of the moving object that lags the objects’ true position. In the results presented here, an 11 x 11 mask was used. It was found empirically that for a 9 x 9 mask false motion increases marginally, while for masks of size 7 x 7 and smaller, for this image sequence and position data, the motion detection degrades severely. The false motion detected in 456 IEEETRANSACITONS ON PATTERN ANALYSIS AND MACHINEINTELLIGENCE,VOL. both image sequences is due to either previous occlusion (as discussed above) or inaccurate background compensation. To eliminate false motion caused by inaccurate background compensation, it is important to select the appropriate sized filter for noise removal. However, an increase in filter size places additional computational burden on the filtering stage, as well as possibly eliminating the true motion signal. The relationship between position noise and filtering requirements is presented in Section VII. VII. ANALYSISOF COMPENSATION xt + asinByt Yt-1 =f =f -acosBxt + facosB ++yyt + f -asin8xt +yt - fr -acos~xt+yyt+ f’ (2) Rotation xt + fa: Q-1 =ffeaxt’ Yt-1 = f f “‘,,t . To evaluate zt-l(a+A,) and yt-l(a+A,), we will approximate the function with a first-order Taylor series expansion, yt--l((l:+ A,) (5) where x, and ye are the errors in mapped pixel position in the x and y directions, and A, and A, are inaccuracies in measurement of the rotations (Y and y, respectively. For evaluating the error in pixel mapping, we consider several cases depending on the location of (xt, yt). In general, the error in the mapped pixel position is greater as we move farther from the center of the image. 
VII. COMPENSATION ERROR DUE TO POSITION INACCURACY

As discussed in Section V, noisy position information corrupts the background compensation algorithm and necessitates additional noise removal. Morphological filtering has been presented as a technique to remove narrow regions of false motion from subtracted images. For effective noise removal to occur, the morphological erosion mask must be at least as wide as the regions of false motion. If the mask is not wide enough, some noise will remain after erosion and will be expanded back to its original size during dilation, so that no noise is removed. This method of noise removal is therefore an all-or-nothing approach. Its advantage is that, for acceptable noise levels, false motion is completely removed; however, if the noise exceeds the filter capacity, no noise removal takes place. Because of this behavior, it is important to use filters large enough to completely remove the expected noise. At the same time, for computational reasons, it is desirable to limit filtering to the minimum required. This motivates us to investigate the relationship between the noise characteristics and the filtering requirements.

Recall the mapping algorithm from Section IV:

    x_{t-1} = f (x_t + α sθ y_t + f α cθ) / (−α cθ x_t + γ y_t + f)    (2)
    y_{t-1} = f (−α sθ x_t + y_t − f γ) / (−α cθ x_t + γ y_t + f),    (3)

where α and γ are the pan and tilt angles between frames, θ is the current tilt from the level position, sθ = sin θ, cθ = cos θ, and f is the focal length in pixels. We use pixel positions at the image center to simplify the error equations when determining the general error characteristics, and border pixels to determine the worst-case behavior. To simplify our discussion, θ is constrained to 0; i.e., the camera is assumed to be at the level position.

A. Compensation Error for Pan-Only Rotations

For the pan-only case, the tilt rotation γ is 0. Hence, (2) and (3) are reduced to

    x_{t-1} = f (x_t + f α) / (f − α x_t)
    y_{t-1} = f y_t / (f − α x_t).

The error between the correct pixel position and that found with inaccurate angle information in the mapping algorithm can be expressed as

    x_e = x_{t-1}(α, γ) − x_{t-1}(α + Δα, γ + Δγ)    (4)
    y_e = y_{t-1}(α, γ) − y_{t-1}(α + Δα, γ + Δγ).    (5)

Using a first-order Taylor series expansion,

    x_{t-1}(α + Δα) ≈ x_{t-1}(α) + (∂x_{t-1}/∂α) Δα    (6)
    y_{t-1}(α + Δα) ≈ y_{t-1}(α) + (∂y_{t-1}/∂α) Δα.    (7)

Substituting (6) and (7) into our equations for the error in the compensated pixel position ((4) and (5)), we obtain

    x_e = (∂x_{t-1}/∂α) Δα = f (x_t^2 + f^2) / (f − α x_t)^2 · Δα    (8)
    y_e = (∂y_{t-1}/∂α) Δα = f x_t y_t / (f − α x_t)^2 · Δα.    (9)

The magnitude e of the error in pixel position is shown in Fig. 15 and can be expressed as

    e^2 = x_e^2 + y_e^2.    (10)

[Fig. 15. Magnitude of the pixel mapping error: x_e is the error in the x direction, y_e the error in the y direction, and e the magnitude of the error.]

For pan-only rotation, the error is predominantly in the x direction, since the change in the y component for pixels at different viewpoints is affected by changes in perspective only. Therefore, e^2 ≈ x_e^2, so e ≈ x_e. To determine the pan-angle error Δα for a given pixel mapping error, from (8) we obtain

    Δα = x_e (f − α x_t)^2 / (f (x_t^2 + f^2)).

Once Δα is determined, we can solve for y_e using (9) to verify our initial assumption that y_e is negligible. For our system, where f = 890 and the maximum x_t = 255, we can tabulate, for given errors x_e, the corresponding Δα, the y error y_e at this position, and the magnitude e of (x_e, y_e). The value used for α was 5°, since this is a good cut-off point for the approximation sin α ≈ α and hence is the upper bound for which our system is designed. The results are shown in Table I. Note that the relationship between Δα and x_e is linear.

TABLE I
PAN-ONLY COMPENSATION ERROR

    x_e (pixels)   Δα (degrees)   y_e (pixels)   e (pixels)
         1            0.057          0.076          1.003
         2            0.113          0.152          2.006
         3            0.170          0.228          3.009
         4            0.226          0.303          4.011
         5            0.283          0.379          5.014
         6            0.340          0.455          6.017
         7            0.396          0.531          7.020
         8            0.452          0.607          8.023
         9            0.509          0.683          9.026
        10            0.566          0.759         10.029
        11            0.622          0.835         11.032

B. Compensation Error for Pan and Tilt Rotations

The error in pixel mapping is more difficult to obtain if both pan and tilt rotations are made. However, to gain insight into the general characteristics of the error, we consider the special case (x_t, y_t) = (0, 0), the pixel that lies directly along the Z-axis of the camera coordinate system. For this case, the pixel mapping functions are

    x_{t-1} = f α,    y_{t-1} = −f γ,

and our error equations become x_e = f Δα and y_e = −f Δγ. Thus, the magnitude of the error can be expressed by

    e^2 = x_e^2 + y_e^2 = f^2 Δα^2 + f^2 Δγ^2.    (11)

Plotting lines of constant error in terms of Δα and Δγ yields a series of concentric circles with radius e/f for any constant error e.

Unfortunately, the error for the center pixel is not the worst case. To estimate the worst-case error, we use the worst-case x_e with no tilt error and the worst-case y_e with no pan error to determine the Δα and Δγ intercepts of the constant-error curves. For each case, we again use a first-order Taylor series expansion. For x_e this is

    x_e = (∂x_{t-1}/∂α) Δα + (∂x_{t-1}/∂γ) Δγ.

Since Δγ = 0 at the Δα-axis intercept, this simplifies to

    x_e = (∂x_{t-1}/∂α) Δα = f (x_t^2 + γ f y_t + f^2) / (−α x_t + γ y_t + f)^2 · Δα.    (12)

Similarly,

    y_e = (∂y_{t-1}/∂γ) Δγ = f (α f x_t − y_t^2 − f^2) / (−α x_t + γ y_t + f)^2 · Δγ.    (13)

For the worst-case error, (x_t, y_t) = (255, −255), and the angles of rotation were set to α = γ = 5°. Solving for Δα and Δγ in (12) and (13), we generated Table II. The magnitudes of the angle errors for given x_e and y_e are the same, which implies that we again have a circle, but with slightly smaller radii. This signifies less angle error for a given error in pixel mapping. For an n × n morphological filter to remove noise caused by this compensation error e, the filter must be at least as wide as the error; that is, n ≥ e.

TABLE II
WORST-CASE COMPENSATION ERROR

    x_e = y_e (pixels)   Δα = Δγ (degrees)
          1                  0.055
          2                  0.110
          3                  0.165
          4                  0.220
          5                  0.275
          6                  0.330
          7                  0.385
          8                  0.440
          9                  0.495
         10                  0.550
         11                  0.605

C. Significance of Error Analysis

As shown in Sections VII-A and VII-B, the pixel mapping error is linearly dependent on the magnitude of the error in the angle information, √(Δα^2 + Δγ^2).
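This linearity is easy to check numerically. The sketch below is ours, not the paper's; it only reuses the stated values f = 890, border pixel x_t = 255, and α = γ = 5°, and recomputes the per-pixel entries behind Tables I and II from (8), (9), (12), and (13):

```python
import math

f = 890.0                      # focal length in pixels
alpha = math.radians(5.0)      # pan angle, 5 degrees
gamma = math.radians(5.0)      # tilt angle, 5 degrees

# --- Pan-only case (Table I), border pixel (x_t, y_t) = (255, 255) ---
xt, yt = 255.0, 255.0
D = f - alpha * xt
dx_dalpha = f * (xt**2 + f**2) / D**2        # from (8)
dy_dalpha = f * (xt * yt) / D**2             # from (9)
dalpha = 1.0 / dx_dalpha                     # pan error giving x_e = 1 pixel
ye = dy_dalpha * dalpha                      # corresponding y_e
e = math.hypot(1.0, ye)                      # error magnitude, from (10)
print(round(math.degrees(dalpha), 3), round(ye, 3), round(e, 3))
# 0.057 0.076 1.003  -- the first row of Table I

# --- Worst case with pan and tilt (Table II), (x_t, y_t) = (255, -255) ---
xt, yt = 255.0, -255.0
D = -alpha * xt + gamma * yt + f
dx_dalpha = f * (xt**2 + gamma * f * yt + f**2) / D**2   # from (12)
dy_dgamma = f * (alpha * f * xt - yt**2 - f**2) / D**2   # from (13)
print(round(math.degrees(1.0 / dx_dalpha), 3))           # 0.055
print(round(abs(math.degrees(1.0 / dy_dgamma)), 3))      # 0.055
```

Because the error is linear in the angle error, the remaining rows of each table are integer multiples of these per-pixel values.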
In this section, we investigate the consequences of this error and the constraints it places on our system.

Maximum Speed of Tracking: We assume that the primary source of angle error in a real-time implementation is synchronization. For a fixed filtering strategy, we can determine the upper bound on the speed of rotation for our system, and thus the maximum angular velocity of a target that can be successfully tracked. For rotation with an angular velocity ω_max, the angular error caused by poor synchronization is

    θ_e = ω_max Δt,    (14)

where Δt is the error in timing. Since the compensation error is linearly dependent on the angular position error, the compensation error is e = K θ_e, where K is a constant determined by the system parameters. In the example given in Section VII-B, K = 1/0.054961 pixels per degree. For a morphological mask of size n × n, the error tolerance will be n. Thus, the boundary condition is

    n = K θ_e.    (15)

Substituting (14) into (15) and solving for ω_max, we obtain

    ω_max = n / (K Δt).

We can see that as the synchronization error Δt increases, the maximum possible angular velocity decreases. On the other hand, as the size n of the morphological filter increases, so does ω_max.
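As a concrete illustration of this bound, the short sketch below is our own, not the paper's code; the 3 × 3 mask and the 10-ms timing error are hypothetical values, while K is the constant quoted above:

```python
# System constant from Section VII-B: K converts angle error in degrees
# into compensation error in pixels (e = K * theta_e).
K = 1.0 / 0.054961            # about 18.2 pixels per degree

def max_tracking_speed(n, dt):
    """Upper bound on rotation speed in deg/s from (14) and (15):
    an n x n morphological mask tolerates n pixels of error, so
    n = K * omega_max * dt."""
    return n / (K * dt)

# Hypothetical operating point: a 3 x 3 mask and 10 ms of timing error.
print(round(max_tracking_speed(3, 0.010), 1))   # 16.5 deg/s
```

Doubling the mask size doubles the tolerable rotation speed, at the cost of the minimum-speed limit derived next.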
Minimum Speed of Tracking: For a moving target with a very slow angular velocity relative to the camera, it is possible that the target will not be detected, since any motion it causes will be removed by the morphological filter. If we consider a target moving at the slowest detectable speed, ω_min, the angle covered by this target in each sample interval t_s will be

    θ_min = ω_min t_s.    (16)

The distance on the image plane that this angle covers can be calculated by

    d = f tan θ_min.

If we are using an n × n filter mask, the object must move a minimum of n + 1 pixels to be identified. This implies

    θ_min = arctan((n + 1)/f).    (17)

Substituting (16) into (17) and solving for ω_min yields

    ω_min = (1/t_s) arctan((n + 1)/f).

Thus, to detect a slowly moving object, it is desirable to either decrease the filter size n or increase the sample time t_s.

Filtering and Sampling Strategy: From the above analysis, it follows that the desirable filter size is not the same for different moving objects. Ideally, we would like to set the filter size according to the camera motion and the estimated motion of the target. Initially, before any target is acquired, the camera may remain stationary, and no filtering is required. As an object is tracked and its angular velocity is estimated and predicted, the optimal filtering solution could be determined. The difficulty with this adaptive filtering strategy is that implementing filters of different sizes on a constantly changing basis is not realistic on most pipeline image-processing architectures.
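Taken together, the two bounds indicate how such an adaptive strategy might choose a mask size. The sketch below is our illustration only, not the paper's implementation; f = 890 and the per-pixel angle error are taken from above, while the sample interval, timing error, and speeds are hypothetical:

```python
import math

f = 890.0                     # focal length in pixels
DEG_PER_PIXEL = 0.054961      # worst-case angle error per pixel (Table II)

def min_detectable_speed(n, ts):
    """Slowest detectable target speed in deg/s from (16) and (17):
    the target must cover n + 1 pixels during one sample interval ts."""
    return math.degrees(math.atan((n + 1) / f)) / ts

def mask_width_for(omega, dt):
    """Smallest n for an n x n erosion mask that tolerates the
    compensation error produced by rotating at omega deg/s with a
    timing error of dt seconds (n >= K * omega * dt)."""
    return math.ceil(omega * dt / DEG_PER_PIXEL)

# Hypothetical values: a 3 x 3 mask with a 100-ms sample interval, and a
# predicted camera speed of 16 deg/s with 10 ms of timing error.
print(round(min_detectable_speed(3, 0.1), 1))   # 2.6 deg/s
print(mask_width_for(16.0, 0.010))              # 3
```

The trade-off is explicit: a wider mask raises both the tolerable camera speed and the slowest detectable target speed.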
VIII. CONCLUSION

This work describes methods of tracking a moving object in real time with a pan/tilt camera. With images compensated for camera rotations, static motion detection techniques were applied to active image sequences. Since the compensation is susceptible to errors caused by poor camera position information, morphological filters were applied to remove erroneously detected motion. A relationship between the level of noise and the size of the morphological filter was also derived. Our technique reliably detected an independently moving object even when the pan/tilt angles between two consecutive frames were as large as 3°.

We have already implemented motion detection, position estimation, and camera movement in real time with a MaxVideo system and the pan/tilt camera. Currently, we are working on implementing the robust compensation algorithm on the MaxVideo to improve continuous tracking results. In the future, we plan to investigate the application of an active zoom lens in our system. For instance, a wide angle is preferable for erratically moving objects, whereas zooming in improves position estimates for a relatively stable object. The use of color images will also be considered to enhance the motion detection algorithm.

APPENDIX
CAMERA ROTATION COMPENSATION ALGORITHM

In Section IV, the need for a mapping function between pixels of images at different camera orientations was explained. As stated, the equations that achieve this are

    x_{t-1} = f (x_t + α sθ y_t + f α cθ) / (−α cθ x_t + γ y_t + f)    (18)
    y_{t-1} = f (−α sθ x_t + y_t − f γ) / (−α cθ x_t + γ y_t + f).    (19)

Proof: Any point in the reference frame can be expressed as a vector P_r and is related to its position in the camera frame by the transformation

    P_r = T_c P_c,

where P_c is the point in the camera frame and T_c is the 4 × 4 transform relating the camera frame to the reference frame. The current camera frame, T_c(t), is the result of a pan/tilt rotation of the previous camera position, T_c(t − 1). Any point in the reference frame can be represented in terms of the camera frames before and after the motion:

    P_r = T_c(t) P_c(t) = T_c(t − 1) P_c(t − 1).

It follows that the position of a point in one camera frame can be related to its position in the other by

    P_c(t − 1) = T_c(t − 1)^{−1} T_c(t) P_c(t).    (20)

Since T_c(t) is the result of applying pan and tilt rotations to T_c(t − 1),

    T_c(t) = Rot_y(α) Rot_x(γ) T_c(t − 1)

and

    T_c(t − 1)^{−1} = T_c(t)^{−1} Rot_y(α) Rot_x(γ).    (21)

By substituting (21) into (20), we obtain

    P_c(t − 1) = T_c(t)^{−1} Rot_y(α) Rot_x(γ) T_c(t) P_c(t).    (22)

Since the sampling is assumed to be fast, the angles α and γ are assumed to be small. For small angles, we can use the approximations cos α ≈ cos γ ≈ 1, sin α ≈ α, sin γ ≈ γ, and sin α sin γ ≈ 0. Since we know the current tilt from the level position, θ, we can determine the current camera transformation and its inverse as

    T_c(t) = [ 1   0    0    p_x ]
             [ 0   cθ  −sθ   p_y ]
             [ 0   sθ   cθ   p_z ]    (23)
             [ 0   0    0    1   ]

    T_c(t)^{−1} = [ 1    0    0   −p_x             ]
                  [ 0    cθ   sθ  −cθ p_y − sθ p_z ]
                  [ 0   −sθ   cθ   sθ p_y − cθ p_z ]    (24)
                  [ 0    0    0    1               ]
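One can sanity-check numerically that (24) really is the inverse of (23). The short script below is our own check, not part of the paper; the tilt θ = 20° and the translation (p_x, p_y, p_z) = (1, 2, 3) are arbitrary test values:

```python
import math

theta = math.radians(20.0)     # example tilt from level (hypothetical)
px, py, pz = 1.0, 2.0, 3.0     # camera position (hypothetical)
c, s = math.cos(theta), math.sin(theta)

# T_c(t) as in (23): tilt theta about the x-axis plus a translation.
T = [[1, 0,  0, px],
     [0, c, -s, py],
     [0, s,  c, pz],
     [0, 0,  0, 1]]

# Closed-form inverse as in (24).
T_inv = [[1,  0, 0, -px],
         [0,  c, s, -c * py - s * pz],
         [0, -s, c,  s * py - c * pz],
         [0,  0, 0,  1]]

def matmul(A, B):
    """4 x 4 matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

I = matmul(T, T_inv)
ok = all(abs(I[i][j] - (1 if i == j else 0)) < 1e-12
         for i in range(4) for j in range(4))
print(ok)   # True
```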
Substituting (23) and (24) into (22) and simplifying the resulting matrix using trigonometric identities, we obtain

    X_{t-1} = X_t + α sθ Y_t + α cθ Z_t + α p_z
    Y_{t-1} = −α sθ X_t + Y_t − γ Z_t − α sθ p_x + γ sθ p_y − γ cθ p_z
    Z_{t-1} = −α cθ X_t + γ Y_t + Z_t − α cθ p_x + γ cθ p_y + γ sθ p_z.

Using the perspective projection equation, we have

    x_{t-1} = f (X_t + α sθ Y_t + α cθ Z_t + α p_z) / (−α cθ X_t + γ Y_t + Z_t − α cθ p_x + γ cθ p_y + γ sθ p_z).    (25)

Now, dividing the top and bottom of the right-hand side of (25) by Z_t and simplifying,

    x_{t-1} = f (x_t + α sθ y_t + f α cθ + f α p_z/Z_t) / (−α cθ x_t + γ y_t + f + f(−α cθ p_x + γ cθ p_y + γ sθ p_z)/Z_t).

Similarly, the expression for y_{t-1} can be obtained. Assuming that the depth Z_t is large compared with the other parameters, we can neglect the last terms of both the numerator and the denominator and obtain (18) and (19).

REFERENCES

[1] P. Allen, B. Yoshimi, and A. Timcenko, "Real-time visual servoing," Tech. Rep. CUCS-035-90, Columbia Univ., New York, NY, Sept. 1990.
[2] Y. Aloimonos, "Purposive and qualitative active vision," in Proc. DARPA Image Understanding Workshop, Pittsburgh, PA, 1990, pp. 816-828.
[3] Y. Aloimonos and D. Tsakiris, "On the mathematics of visual tracking," Image Vision Computing, vol. 9, no. 4, pp. 235-251, 1991.
[4] Y. Aloimonos, I. Weiss, and A. Bandyopadhyay, "Active vision," Int. J. Comput. Vision, vol. 2, pp. 333-356, 1988.
[5] R. Bajcsy, "Active perception," Proc. IEEE, vol. 76, no. 8, pp. 996-1005, 1988.
[6] D. H. Ballard and C. M. Brown, Computer Vision. Englewood Cliffs, NJ: Prentice-Hall, 1982.
[7] B. Bhanu and W. Burger, "Qualitative understanding of scene dynamics for moving robots," Int. J. Robotics Res., vol. 9, no. 6, pp. 74-90, 1990.
[8] A. J. Bray, "Tracking objects using image disparities," Image Vision Computing, vol. 8, no. 1, pp. 4-9, Feb. 1990.
[9] R. A. Brooks, Model-Based Computer Vision. Ann Arbor, MI: UMI Research Press, 1984.
[10] J. H. Duncan and T. Chou, "Temporal edges: The detection of motion and the computation of optical flow," in Proc. Second Int. Conf. Comput. Vision, Tampa, FL, Dec. 1988, pp. 374-382.
[11] C.
Fermüller and Y. Aloimonos, "Tracking facilitates 3-D motion estimation," Biological Cybern., vol. 67, pp. 259-268, 1992.
[12] D. B. Gennery, "Tracking known three-dimensional objects," in Proc. AAAI 2nd Nat. Conf. Artificial Intell., Pittsburgh, PA, 1982, pp. 13-17.
[13] B. K. P. Horn and B. G. Schunck, "Determining optical flow," Artificial Intell., vol. 17, pp. 185-203, 1981.
[14] K. Kanatani, "Camera rotation invariance of image characteristics," Comput. Vision, Graphics, Image Processing, vol. 39, no. 3, pp. 328-354, Sept. 1987.
[15] H. C. Longuet-Higgins and K. Prazdny, "The interpretation of a moving retinal image," Proc. Roy. Soc. London B, vol. 208, pp. 385-397, 1980.
[16] D. G. Lowe, "Three-dimensional object recognition from single two-dimensional images," Artificial Intell., vol. 31, no. 3, pp. 355-395, 1987.
[17] P. Maragos, "Tutorial on advances in morphological image processing and analysis," Opt. Eng., vol. 26, no. 7, pp. 623-632, July 1987.
[18] R. C. Nelson, "Qualitative detection of motion by a moving observer," in Proc. IEEE Comput. Soc. Conf. Computer Vision and Pattern Recognit., Maui, HI, June 1991, pp. 173-178.
[19] P. D. Picton, "Tracking and segmentation of moving objects in a scene," in Proc. 3rd Int. Conf. Image Processing and its Applicat., Coventry, England, 1989, pp. 389-393.
[20] K. Prazdny, "Determining the instantaneous direction of motion from optical flow generated by a curvilinearly moving observer," Comput. Graphics Image Processing, vol. 17, pp. 238-248, 1981.
[21] J. W. Roach and J. K. Aggarwal, "Computer tracking of objects moving in space," IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-1, no. 2, pp. 127-135, 1979.
[22] R. J. Schalkoff and E. S. McVey, "A model and tracking algorithm for a class of video targets," IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-4, no. 1, pp. 2-10, 1982.
[23] B. G. Schunck, "The image flow constraint equation," Comput. Vision, Graphics, Image Processing, vol. 35, pp. 20-40, 1986.
[24] R. S. Stephens, "Real-time 3-D object tracking," Image Vision Computing, vol. 8, no. 1, pp. 91-96, 1990.
[25] S. Ullman, "Analysis of visual motion by biological and computer systems," IEEE Computer, vol. 14, no. 8, pp. 57-69, 1981.
[26] B. Wilcox, D. B. Gennery, B. Bon, and T. Litwin, "A real-time model-based vision system for object acquisition and tracking," Proc. SPIE, vol. 754, 1987.

Don Murray was born in Edmonton, Alberta, Canada, on March 3, 1963. He received the B.Sc. degree in 1986 and the M.Sc. degree in 1992, both in electrical engineering, from the University of Alberta, Edmonton, AB, Canada. He is currently in the DEA Program for Robotics and Computer Vision at the University of Nice, Nice, France. His research interests include computer vision, statistical image processing, and active vision.

Anup Basu (S'89-M'90) received the B.S. degree in mathematics and statistics and the M.S. degree in computer science, both from the Indian Statistical Institute, and the Ph.D. degree in computer science from the University of Maryland, College Park. He has been with Tata Consultancy Services, India, and the Biostatistics Division, Strong Memorial Hospital, Rochester, NY. He is currently an Assistant Professor in the Computing Science Department, University of Alberta, Edmonton, AB, Canada, and an Adjunct Professor at Telecommunications Research Labs, Edmonton, AB, Canada. His research interests include computer vision, robotics, and teleconferencing.
