Linköping Studies in Science and Technology. Dissertations No. 379

Focus of Attention and Gaze Control for Robot Vision

Carl-Johan Westelius
Department of Electrical Engineering
Linköping University, S-581 83 Linköping, Sweden
Linköping 1995

Abstract

This thesis deals with focus of attention control in active vision systems. A framework for hierarchical gaze control in a robot vision system is presented, and an implementation for a simulated robot is described. The robot is equipped with a heterogeneously sampled imaging system, a fovea, resembling the spatially varying resolution of a human retina. The relation between foveas and multiresolution image processing, as well as implications for image operations, is discussed.

A stereo algorithm based on local phase differences is presented, both as a stand-alone algorithm and as part of a robot vergence control system. The algorithm is fast and can handle large disparities while maintaining subpixel accuracy. The method produces robust and accurate estimates of displacement on synthetic as well as real-life stereo images. Disparity filter design is discussed and a number of filters are tested, e.g. Gabor filters and lognorm quadrature filters. A design method for disparity filters having precisely one phase cycle is also presented.

A theory for sequentially defined, data modified focus of attention is presented. The theory is applied to a preattentive gaze control system consisting of three cooperating control strategies. The first is an object finder that uses circular symmetries as indications of possible objects and directs the fixation point accordingly. The second is an edge tracker that makes the fixation point follow structures in the scene. The third is a camera vergence control system which assures that both eyes are fixating on the same point. The coordination between the strategies is handled using potential fields in the robot parameter space.
Finally, a new focus of attention method for disregarding filter responses from already modelled structures is presented. The method is based on a filtering method, normalized convolution, originally developed for filtering incomplete and uncertain data. By setting the certainty of the input data to zero in areas of known or predicted signals, a purposive removal of operator responses can be obtained. On succeeding levels, image features from these areas become 'invisible' and consequently do not attract the attention of the system. This technique also allows the system to effectively explore new events: by cancelling known, or modelled, signals the attention of the system is shifted to new events not yet described.

PREFACE

This thesis is based on the following material:

C-J Westelius, H. Knutsson, and G. H. Granlund. Focus of attention control. In Proceedings of the 7th Scandinavian Conference on Image Analysis, pages 667-674, Aalborg, Denmark, August 1991. Pattern Recognition Society of Denmark.

C-J Westelius, H. Knutsson, and G. H. Granlund. Preattentive gaze control for robot vision. In Proceedings of Third International Conference on Visual Search. Taylor and Francis, 1992.

J. Wiklund, C-J Westelius, and H. Knutsson. Hierarchical phase based disparity estimation. In Proceedings of 2nd Singapore International Conference on Image Processing. IEEE Singapore Section, September 1992.

H. Knutsson, C-F Westin, and C-J Westelius. Filtering of uncertain irregularly sampled multidimensional data. In Twenty-seventh Asilomar Conf. on Signals, Systems & Computers, Pacific Grove, California, USA, November 1993. IEEE.

G. H. Granlund, H. Knutsson, C-J Westelius, and J. Wiklund. Issues in robot vision. Image and Vision Computing, 12(3):131-148, April 1994.

C-J Westelius and H. Knutsson. Hierarchical disparity estimation using quadrature filter phase. International Journal on Computer Vision, 1995. Special issue on stereo (submitted).

C-J Westelius, C-F Westin, and H. Knutsson.
Focus of attention mechanisms using normalized convolution. IEEE Trans. on Robotics and Automation, 1996. Special section on robot vision (submitted).

Material related to this work but not explicitly reviewed in this thesis:

C-J Westelius and C-F Westin. Representation of colour in image processing. In Proceedings of the SSAB Conference on Image Analysis, Gothenburg, Sweden, March 1989. SSAB.

C-J Westelius and C-F Westin. A colour representation for scale-spaces. In The 6th Scandinavian Conference on Image Analysis, pages 890-893, Oulu, Finland, June 1989.

C-J Westelius, G. H. Granlund, and H. Knutsson. Model projection in a feature hierarchy. In Proceedings of the SSAB Symposium on Image Analysis, pages 244-247, Linköping, Sweden, March 1990. SSAB. Report LiTH-ISY-I-1090, Linköping University, Sweden, 1990.

M. Gökstorp and C-J Westelius. Multiresolution disparity estimation. In Proceedings of the 9th Scandinavian Conference on Image Analysis, Uppsala, Sweden, June 1995. SCIA.

J. Karlholm, C-J Westelius, C-F Westin, and H. Knutsson. Object tracking based on the orientation tensor concept. In Proceedings of the 9th Scandinavian Conference on Image Analysis, Uppsala, Sweden, June 1995. SCIA.

Contributions in books and collections:

C-J Westelius, H. Knutsson, J. Wiklund, and C-F Westin. Phase-based disparity estimation. In J.L. Crowley and H. I. Christensen, editors, Vision as Process, pages 179-192. Springer-Verlag, 1994. ISBN 3-540-58143-X.

C-J Westelius, H. Knutsson, and G. Granlund. Low level focus of attention. In J.L. Crowley and H. I. Christensen, editors, Vision as Process, pages 157-178. Springer-Verlag, 1994. ISBN 3-540-58143-X.

C-J Westelius, J. Wiklund, and C-F Westin. Prototyping, visualization and simulation using the Application Visualization System. In H. I. Christensen and J.L. Crowley, editors, Experimental Environments for Computer Vision and Image Processing, volume 11 of Series on Machine Perception and Artificial Intelligence, pages 33-62.
World Scientific Publisher, 1994. ISBN 981-02-1510-X.

C-J Westelius. Local Phase Estimation. In G. H. Granlund and H. Knutsson, principal authors, Signal Processing for Computer Vision, pages 259-278. Kluwer Academic Publishers, 1995. ISBN 0-7923-9530-1.

Acknowledgements

Although my name alone is printed on the cover of this thesis, there are a number of people who, in one way or another, have had a part in its realization. First of all, I would like to thank all the members of the Computer Vision Laboratory for being jolly good fellows. I will miss the weekly chats in the sauna (and the beer too).

I thank my supervisor, Dr. Hans Knutsson, for his enthusiastic help, without which this thesis would have been ready much sooner, but with much poorer quality. His intuition never ceases to astonish me.

I thank Prof. Gösta Granlund for giving me the opportunity to work in his group and for sharing ideas and visions about vision.

I would like to give Catharina Holmgren a distinguished services medal for proof-reading this thesis over and over again. It must be extremely boring to read something you are not interested in and correct the same kind of mistakes all the time.

I thank Dr. Klas Nordberg for taking the time to read and comment on this thesis. What Catharina did for the language, Klas did for the technical content.

I would also like to express my gratitude to Dr. Carl-Fredrik Westin, my friend and colleague, for all his support, both scientific and moral.

My special thanks to everybody in the "Vision as Process" consortium. It has been very stimulating to work with VAP. Many of the activities related to VAP have made the PhD studies worthwhile (including the yearly pre-demo panics).

Finally, there is someone who eventually accepted that "soon" means somewhere between now and eternity. Thank you, Brita, for being, for caring, for loving. I promise: No more PhD theses for me!
Contents

1 INTRODUCTION AND OVERVIEW
  1.1 Background
  1.2 Overview
2 LOCAL PHASE ESTIMATION
  2.1 What is local phase?
  2.2 Singular points in phase scale-space
  2.3 Choice of filters
    2.3.1 Creating a phase scale-space
    2.3.2 Gabor filters
    2.3.3 Quadrature filters
    2.3.4 Other even-odd pairs
    2.3.5 Discussion on filter choice
3 PHASE-BASED DISPARITY ESTIMATION
  3.1 Introduction
  3.2 Disparity estimation
    3.2.1 Computation structure
    3.2.2 Edge extraction
    3.2.3 Local image shifts
    3.2.4 Disparity estimation
    3.2.5 Edge and grey level image consistency
    3.2.6 Disparity accumulation
    3.2.7 Spatial consistency
  3.3 Experimental results
    3.3.1 Generating stereo image pairs
    3.3.2 Statistics
    3.3.3 Increasing number of resolution levels
    3.3.4 Increasing maximum disparity
    3.3.5 Combining line and grey level results
    3.3.6 Results on natural images
  3.4 Conclusion
  3.5 Further research
4 HIERARCHICAL DATA-DRIVEN FOCUS OF ATTENTION
  4.1 Introduction
    4.1.1 Human focus of attention
    4.1.2 Machine focus of attention
  4.2 Space-variant sampled image sensors: Foveas
    4.2.1 What is a fovea?
    4.2.2 Creating a log-Cartesian fovea
    4.2.3 Image operations in a fovea
  4.3 Sequentially defined, data modified focus of attention
    4.3.1 Control mechanism components
    4.3.2 The concept of nested regions of interest
  4.4 Gaze control
    4.4.1 System description
    4.4.2 Control hierarchy
    4.4.3 Disparity estimation and camera vergence
    4.4.4 The edge tracker
    4.4.5 The object finder
    4.4.6 Model acquisition and memory
    4.4.7 System states and state transitions
    4.4.8 Calculating camera orientation parameters
  4.5 Experimental results
5 ATTENTION CONTROL USING NORMALIZED CONVOLUTION
  5.1 Introduction
  5.2 Normalized convolution
  5.3 Quadrature filters for normalized convolution
    5.3.1 Quadrature filters for NC using real basis functions
    5.3.2 Quadrature filters for NC using complex basis functions
    5.3.3 Real or complex basis functions?
  5.4 Model-based habituation/inhibition
    5.4.1 Saccade compensation
    5.4.2 Inhibition of the robot arm influence on low level image processing
    5.4.3 Inhibition of modeled objects
    5.4.4 Combining certainty masks
  5.5 Discussion
6 ROBOT AND ENVIRONMENT SIMULATOR
  6.1 General description of the AVS software
    6.1.1 Module Libraries
  6.2 Robot vision simulator modules
  6.3 Example of an experiment
    6.3.1 Macro modules
  6.4 Simulation versus reality
  6.5 Summary
A AVS PROBLEMS AND PITFALLS
  A.1 Module scheduling problems
  A.2 Texture mapping problems

1 INTRODUCTION AND OVERVIEW

1.1 Background

A traditional view of a computer vision system has been that it consists of an analyzing part at one end and a responding part at the other, as illustrated in Figure 1.1. The analyzing part supplies a model of the three-dimensional world derived from two-dimensional images. The world model is then used by the responding part for action planning. Vision is considered to be a pre-action stage. The vision algorithms have to furnish a world model in fine detail; every feature that the action planning system might need has to be calculated. The close relationship between analysis and response is not utilized.

Figure 1.1: The classical pipelined structure of a robot vision system (image analysis followed by response generation).
As an answer to this, the active vision paradigm has been developed over some ten years [6, 5, 50, 3, 4]. In short, active vision is based on the ability of the perceiving system to purposively change both external and internal image formation parameters, e.g. fixation point, focal length, etc. Instead of squeezing every bit of information out of every image, the active vision system picks the bits that are easy to estimate in the continuous flow of images. The system adapts its behavior in order to get the bits of information that are important at the moment. This possibility to solve otherwise ill-posed problems, in combination with an appealing similarity to biological systems, has thrilled the imagination of many researchers (including the author).

The problem that arises is how to control the perception. How are the purposive actions generated? Clearly, the structure in Figure 1.1 is not appropriate for an active vision system. The work at the Computer Vision Laboratory at Linköping University is aimed at an integrated analysis-response system where general responses are modified by data, and data is actively sought using proper responses [33]. One important property is that sufficiently complex and data-driven responses are built up by letting a general response command, invoked from higher levels, be modified by processed data entering from lower levels, to produce a specific command for the specific situation in which the system currently operates. Action commands also have an impact on input feature extraction and interpretation; e.g. the interpretation of optical flow is different when the head is moving from when it is still. The computing structure can be thought of as a pyramid with sensor inputs and actuator outputs at the base (Figure 1.2). The input information enters the system, and features of increasing abstraction are estimated as the information flows upwards.
The particular advantage of this structure is that the output produced by the system leaves the pyramid at the same lowest level as the input enters. This arrangement enables the interaction between input data analysis and output response synthesis. This thesis discusses focus of attention and gaze control for active vision systems in this context. It should be emphasized that the algorithms described here are biologically inspired but are not an attempt to model biological systems.

Figure 1.2: An integrated analysis-response structure. To the left, the signals from the sensor inputs are processed into descriptions of increasing abstraction. To the right, general response commands are gradually refined into situation-specific actions.

1.2 Overview

Chapters 2 and 3 deal with estimation of disparity using local phase and are based on [85, 35, 73]. The local phase is explained, its invariances and equivariances are described, and its behavior in a scale-space is elaborated on in Section 2.1. A number of different types of phase estimating filters are described and tested with respect to scale-space behavior in Section 2.3. In Chapter 3, a hierarchical algorithm for phase-based disparity estimation that can handle large disparities and still give subpixel accuracy is described. In Section 3.3, the filter dependence of the disparity algorithm behavior is tested and evaluated.

In Chapter 4, a framework for a hierarchical approach to gaze control of a robot vision system is presented, and an implementation on a simulated robot is also described. The robot has a three-layer hierarchical gaze control system based on rotation symmetries, linear structures and disparity. It is equipped with heterogeneously sampled imaging systems, foveas, resembling the space varying resolution of a human retina.
The relation between the fovea and multiresolution image processing is discussed, together with implications for image operations. The chapter is based on [75, 76, 35].

Chapter 5 deals with how to implement a habituation function in order to reduce the impact of known or modeled image structures on data driven focus of attention. Using a technique termed 'normalized convolution' when extracting the image features allows for marking areas of the input image as unimportant. The image features from these areas then become 'invisible' and consequently do not attract the attention of the system, which is the desired behavior of a habituation function. Chapter 5 is published in [80, 49].

Finally, Chapter 6 describes the robot simulator that is used in the experiments throughout this thesis.

2 LOCAL PHASE ESTIMATION

2.1 What is local phase?

Most people are familiar with the global Fourier phase. The shift theorem, describing how the Fourier phase is affected by moving the signal, is common knowledge. But the phase in signal representations based on local operations, e.g. lognormal filters [44], is not so well known. The local phase has a number of interesting invariance and equivariance properties that make it an important feature in image processing.

Local phase estimates are invariant to signal energy. The phase varies in the same manner regardless of whether there are small or large signal variations. This feature makes phase estimates suitable for matching, since it reduces the need for camera exposure calibration and illumination control.

Local phase estimates and spatial position are equivariant. The local phase generally varies smoothly and monotonically with the position of the signal, except for the modulo 2π wrap-around. Section 2.2 discusses cases where the local phase behaves differently.
Furthermore, the local phase is a continuous variable that can measure changes much smaller than the spatial quantization, enabling subpixel accuracy without a subpixel representation of image features.

Phase is stable against scaling. It has been shown that phase is stable against scaling up to 20 percent [27].

The spatial derivative of local phase estimates is equivariant with spatial frequency. In high frequency areas the phase changes faster than in low frequency areas. The slope of the phase curve is therefore steep for high frequencies. The phase derivative is called local or instantaneous frequency [10].

There are many ways to approach the concept of local phase. One way is to start from the analytic function of a signal and design filters that locally estimate the instantaneous phase of the analytic function [10]. An alternative approach, used in this chapter, is to relate local phase to the detection of lines and edges in images. This chapter discusses one-dimensional signals. The extension of the concept of phase into two or more dimensions is discussed in Section 4.4.

Figure 2.1 shows the intensity profile over a number of lines and edges. The lines and edges are called events in the rest of this chapter. For illustration purposes, the ideal step and Dirac functions have been blurred more than what corresponds to the normal fuzziness of a naturalistic image. The low pass filter used is a Gaussian with σ = 1.8 pixels.

When designing the filters for line and edge detection, it is important that they are insensitive to the DC component in the image, since flat surfaces are of no interest for edge and line detection. A simple line detector is:

    h_line(ξ) = −δ(ξ + 1) + 2δ(ξ) − δ(ξ − 1)    (2.1)

However, this filter has a frequency too high to fit the frequency spectrum of the signal in Figure 2.1. Convolving the filter with a Gaussian, σ = 2.8, tunes the filter to the appropriate frequency band (left of Figure 2.2).
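As a small illustration, the line detector of Equation (2.1) and its Gaussian tuning can be written in a few lines of NumPy (a sketch; the kernel sizes and helper names are my own choices, not the thesis's implementation):

```python
import numpy as np

def gaussian(radius, sigma):
    """Sampled, normalized Gaussian kernel of length 2*radius + 1."""
    x = np.arange(-radius, radius + 1)
    g = np.exp(-x**2 / (2.0 * sigma**2))
    return g / g.sum()

# Equation (2.1): h_line(xi) = -delta(xi+1) + 2*delta(xi) - delta(xi-1)
h_line_raw = np.array([-1.0, 2.0, -1.0])

# Tune the detector to a lower frequency band by convolving
# with a Gaussian, sigma = 2.8
h_line = np.convolve(h_line_raw, gaussian(10, 2.8))
```

Since the raw kernel sums to zero and the Gaussian sums to one, the smoothed detector keeps a zero DC component, as required.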
The problem is to design an edge filter that "matches" the line filter. There are two requirements on an edge/line filter pair:

1. Detection of both lines and edges with equal localization acuity.
2. Discrimination between the types of events.

Figure 2.1: Intensity profiles for a bright line on dark background at position ξ = 20, an edge from dark to bright at position ξ = 60, a dark line on bright background at position ξ = 100, and an edge from bright to dark at position ξ = 140. All lines and edges are ideal functions blurred with a Gaussian (σ = 1.8).

Is there a formal way to define a line/edge filter pair such that these requirements are met? The answer is yes. In order to see how to generate such a filter pair, study the properties of lines and edges centered in a window. Setting the origin to the center of the window reveals that lines are even functions, i.e. f(−ξ) = f(ξ). Thus, lines have an even real Fourier transform. Edges are odd functions plus a DC term. The DC term can be neglected without loss of generality, since neither the line nor the edge filter should be sensitive to it. Thus, consider edges simply as odd functions, i.e. f(−ξ) = −f(ξ), having an odd and imaginary transform.
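These symmetry claims are easy to verify numerically. The sketch below (the signal shapes and names are my own choices) checks that an even real signal has a real DFT while an odd real signal has a purely imaginary DFT:

```python
import numpy as np

N = 256
x = np.arange(-N // 2, N // 2)

line = np.exp(-x**2 / (2 * 1.8**2))   # even: f(-xi) = f(xi), a blurred line
edge = np.tanh(x / 1.8)               # odd:  f(-xi) = -f(xi), an edge with the DC term removed
edge[0] = 0.0                         # zero the unmatched sample at x = -N/2 so the
                                      # sequence is odd under the DFT's circular symmetry

# ifftshift moves x = 0 to index 0, i.e. centers the window before the DFT
F_line = np.fft.fft(np.fft.ifftshift(line))
F_edge = np.fft.fft(np.fft.ifftshift(edge))
```

F_line comes out (numerically) real and F_edge purely imaginary, matching the argument above.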
Now, take a line, f_line(ξ), and an edge, f_edge(ξ), with exactly the same magnitude function in the Fourier domain,

    ‖F_edge(u)‖ = ‖F_line(u)‖.    (2.2)

For such signals the line and edge filters should give identical outputs when applied to their respective target events,

    H_edge(u)F_edge(u) = H_line(u)F_line(u).    (2.3)

Combining Equations (2.2) and (2.3) gives:

    ‖H_edge(u)‖ = ‖H_line(u)‖.    (2.4)

Equation (2.4), in combination with the fact that the line filter is an even function with an even real Fourier transform, while the edge filter is an odd function having an odd and imaginary Fourier transform, shows that an edge filter can be generated from a line filter using the Hilbert transform:

    H_edge(u) = −iH_line(u)  if u < 0
    H_edge(u) =  iH_line(u)  if u ≥ 0    (2.5)

Figure 2.2: The line detector (left) and its Hilbert transform as edge detector (right).

The line and edge detectors in Equation (2.5) are both real-valued, which makes it possible to combine them into a complex filter with the line filter as the real part and the edge filter as the imaginary part:

    h(ξ) = h_line(ξ) − ih_edge(ξ)    (2.6)

The phase is then represented as a complex value where the magnitude reflects the signal energy and the argument reflects the relationship between the evenness and oddness of the signal (Figure 2.3).

Figure 2.3: A representation of local phase as a complex vector where the magnitude reflects the signal energy and the argument θ reflects the evenness and oddness relationship.

A filter fulfilling Equations (2.5) and (2.6) is called a quadrature filter and is, in fact, the analytic signal of the line filter. Figure 2.4 shows that the output magnitude from a quadrature filter depends only on the signal energy and on how well the signal matches the filter pass band, and not on whether the signal is even, odd or a mixture thereof. The filter phase, on the other hand, depends on the relation between evenness and oddness
of the signal relative to the filter center. The phase is not affected by signal energy. The polar plots at the bottom of Figure 2.4 show the trajectory of the phase vector of Figure 2.3 when traversing the neighborhood around each event. Note how the phase value points out the type of event when the magnitude has a peak value.

How is the phase from the quadrature filter related to the instantaneous phase of the analytic function of the signal? It is easy to show that convolving a signal with a quadrature filter is the same as convolving the analytic function of the signal with the real part of the filter:

    h(ξ) * f(ξ) = (h_line(ξ) − ih_edge(ξ)) * f(ξ)
                = h_line(ξ) * f(ξ) − i Hi{h_line(ξ)} * f(ξ)
                = h_line(ξ) * f(ξ) − i h_line(ξ) * Hi{f(ξ)}
                = h_line(ξ) * (f(ξ) − i Hi{f(ξ)})
                = h_line(ξ) * f_A(ξ)    (2.7)

Since h_line is a real filter sensitive to changes in a signal, and f_A is a signal with continuously changing phase, the phase of the filter output is an estimate of the instantaneous phase. Narrow band filters generally estimate phase better than broadband filters. If the signal is a sine function with a constant magnitude, the instantaneous phase will be estimated exactly.

Figure 2.4: Line and edge detection using the quadrature filter in Figure 2.2. Top: The input image. Second: The magnitude of the quadrature filter output has one peak for each event, and the peak value depends only on the signal energy and how well the signal fits the filter pass band. Third: The phase of the quadrature filter output indicates the kind of event. Bright lines have θ = 0, dark lines have θ = π, dark to bright edges have θ = π/2, and bright to dark edges have θ = −π/2. Bottom: Polar plots showing the phase vector in a neighborhood around the lines and edges.

2.2 Singular points in phase scale-space

The phase is generally stable in scale-space.
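As a concrete check of this phase coding, the sketch below builds a quadrature filter from the smoothed line detector via a discrete Hilbert transform and applies it to a signal containing one line and one edge (all sizes, positions and names are my own choices, not the thesis's implementation):

```python
import numpy as np

def quadrature_filter(sigma=2.8, radius=10):
    """h = h_line - i*h_edge (Equation 2.6), with h_edge obtained from
    h_line through the Hilbert relation of Equation (2.5)."""
    x = np.arange(-radius, radius + 1)
    g = np.exp(-x**2 / (2 * sigma**2))
    h_line = np.convolve([-1.0, 2.0, -1.0], g / g.sum())
    u = np.fft.fftfreq(len(h_line))
    h_edge = np.fft.ifft(np.where(u < 0, -1j, 1j) * np.fft.fft(h_line)).real
    return h_line - 1j * h_edge

# A bright line at position 40 and a dark-to-bright edge at position 120
x = np.arange(160)
f = np.exp(-(x - 40.0)**2 / (2 * 1.8**2))
f += 0.5 * (1 + np.tanh((x - 120.0) / 1.8))

q = np.convolve(f, quadrature_filter(), mode='same')

# The phase at the event positions codes the event type: about 0 for the
# bright line, magnitude pi/2 for the edge (the sign of the edge phase
# depends on the Hilbert sign convention used in this sketch)
print(np.angle(q[40]), np.angle(q[120]))
```

This clean behavior holds only away from the singular points discussed below.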
There are, however, points around which the phase has an unwanted behavior. At these points the analytic signal goes through the origin of the complex plane, i.e. they are singular points of the analytic function. The invariances and equivariances described in Section 2.1 are generally not valid if the analytic signal is close to a singular point. Figure 2.5 shows a stylized example of a phase resolution pyramid. Assume we have an analytic signal consisting of two positive frequencies, u0 and 3u0:

    F_A(u) = δ(u − u0) + 2δ(u − 3u0)    (2.8)

    f_A(ξ) = e^(iu0ξ) + 2e^(i3u0ξ)    (2.9)

Now suppose that we create the resolution pyramid using a filter that attenuates the high frequency part of the signal by a factor 1/√2 but leaves the low frequency part unaffected. Since the signal is periodic, the behavior of the signal can be studied using a polar plot of the signal vector. Figure 2.5a shows a polar plot of the original signal vector. Its magnitude is fairly large and the vector runs counter-clockwise all the time. The phase is therefore monotonous and increasing, except for the wrap-around caused by the modulo 2π representation (Figure 2.5b). The local frequency, i.e. the slope, is positive and almost constant. In Figure 2.5c the amplitude of the high frequency part of the signal is reduced and the signal vector now comes close to the origin at certain points. The local frequency is much higher when the signal vector passes close to the origin, causing the phase curve to bend, but it is still monotonous and increasing. Further LP-filtering causes the signal vector to go through the origin (Figure 2.5e). In these points, the phase jumps discontinuously and the local frequency becomes impulsive. In Figure 2.5g, the signal vector moves clockwise when going through the small loops. This means that the phase decreases and that the local frequency is negative (Figure 2.5h).
Figure 2.5: The periodic signal in Equation (2.9) is LP filtered in four steps, attenuating the high frequency part of the signal. The left column shows polar plots of the signal vector and the right column shows the phase for the same signal. a) The signal vector circles the origin at a distance. b) The slope of the phase plot, i.e. the local frequency, is positive and almost constant. c) The signal vector rounds the origin closely. d) The phase curve bends, which means that the frequency is locally very high. e) The signal vector goes through the origin, i.e. singular points. f) The local frequency is impulsive. g) The signal vector goes through small loops without rounding the origin. h) The phase curve bends downward, which means that the frequency is locally negative.

This behavior of the phase in scale-space is due to the fact that the high-frequency part of the signal disappears at lower resolution. In Figure 2.5b the phase has three cycles for each period of the signal, since the high-frequency part of the signal dominates. In Figure 2.5h, on the other hand, there is only one phase cycle per signal period, since the high-frequency part is attenuated.

To avoid singular points we could avoid considering points with very low magnitude, since the magnitude is zero at the singular points. Unfortunately, it is not that simple. The impact of a singular point is spread in scale: negative frequencies at coarser resolution and very high frequencies at finer resolution. At these points the magnitude cannot be neglected, and a high enough threshold also cuts out many useful phase estimates. Fleet describes how singular points can be detected and how their influence can be reduced [26].
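The scale-space behavior in Figure 2.5 can be reproduced numerically. The sketch below (parameter values are my own choices) computes the local frequency, i.e. the discrete phase derivative, of the signal in Equation (2.9) before and after the high-frequency component has been attenuated:

```python
import numpy as np

def local_frequency(a, u0=2 * np.pi / 64, n=256):
    """Discrete phase derivative of f_A(xi) = exp(i*u0*xi) + a*exp(i*3*u0*xi),
    i.e. Equation (2.9) with the high-frequency amplitude set to a."""
    xi = np.arange(n)
    f = np.exp(1j * u0 * xi) + a * np.exp(1j * 3 * u0 * xi)
    return np.diff(np.unwrap(np.angle(f)))

# Original signal (a = 2): the phase is monotonically increasing (Figure 2.5a-b)
print(local_frequency(2.0).min() > 0)

# After attenuating the high-frequency part (a = 0.5): the local frequency
# goes negative in the small loops around the origin (Figure 2.5g-h)
print(local_frequency(0.5).min() < 0)
```

The transition happens when the high-frequency amplitude drops below the point where the small loops of the signal vector no longer enclose the origin.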
A method that uses line images in combination with the original images to reduce the influence of singular points is described in Chapter 3.

2.3 Choice of filters

When designing filters to be used as disparity estimators, there are a number of requirements, some of which are mutually exclusive, to be considered. Different filter types have different characteristics, and which one to use depends on the application. There are a number of different filters that can be used when measuring phase disparities. Gabor filters are, by far, the most commonly used in phase-based disparity measurement. They have linear phase, i.e. constant local frequency, and are therefore intuitively appealing to use. Quadrature filters have neither negative frequency components nor a DC component. Differences of Gaussians approximating the first and second derivatives of a Gaussian can also be used to estimate phase.

Figure 2.6: The signal that is used to test the scale-space behavior of the filters.

Below, a number of filters are evaluated with regard to the following requirements:

No DC component. The filters must not have a DC component. Figure 2.7 shows how a DC component makes the signal vector wag back and forth instead of going round.

No wrap-around. It is desirable, though not necessary, that the phase of the impulse response runs from −π to π without any wrap-around. This maximizes the maximal measurable disparity for a given size of the filter.

Monotonous phase. The phase has to be monotonous, otherwise the phase difference between left and right images is not a one-to-one function of the disparity. Below, the phase is called monotonous even though it might wrap around, since the wrap-around is caused by the modulo 2π representation.

Only one half-plane of the frequency domain. It is also a requisite that the filter only picks up frequencies in one half-plane of the frequency domain.
This is a quadrature requirement, which means that the phase must rotate in the same direction for all frequencies. If this does not hold, the phase differences might change sign depending on the frequency content of the signal.

Figure 2.7: Above, the phase from a Gabor filter with no DC component (u_0 = 0.76, bandwidth 0.5, DC \approx 1.3 \cdot 10^{-8}). Below, the phase from a Gabor filter with broader bandwidth (u_0 = 0.76, bandwidth 1.2, DC \approx 0.032) and thus a DC component, applied to the signal in Figure 2.6. Note that the phase goes back and forth instead of wrapping around when the signal fluctuation is small compared to the DC level.

Insensitive to singular points. The area affected by the singular points has to be as small as possible, both spatially and in scale. As a rule of thumb, the sensitivity to singular points decreases with decreasing bandwidth. This requirement contradicts the requirement of small spatial support.

Small spatial support. The computational cost of the convolution is proportional to the spatial support of the filter function, i.e. the size of the filter, which therefore should be small.

2.3.1 Creating a phase scale-space

The behavior of the phase in scale-space has been tested using the signal shown in Figure 2.6 on page 14 as input. All filtering has been done in the Fourier domain. For each filter type, the DFT of the filter with the highest center frequency has been generated from its definition. The frequency function has then been multiplied by an LP Gaussian function:

    LP(u) = e^{-u^2 / (2\sigma_u^2)}    (2.10)

where \sigma_u = \pi / (2\sqrt{2}). This emulates the LP filtering in a subsampled resolution pyramid. It can be argued that the LP filtering should not be used at the highest resolution level, but it can be motivated by taking the smoothing effects of the imaging system into account.
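One smoothing step of this procedure can be sketched in a few lines of numpy; the value \sigma_u = \pi/(2\sqrt{2}) is a reconstruction from the garbled Equation (2.10) and should be read as an assumption of this sketch.

```python
import numpy as np

N = 256
u = 2 * np.pi * np.fft.fftfreq(N)        # angular frequency axis in [-pi, pi)
sigma_u = np.pi / (2 * np.sqrt(2))       # assumed value of sigma_u in Eq. (2.10)
LP = np.exp(-u**2 / (2 * sigma_u**2))    # Gaussian low-pass, Eq. (2.10)

s = np.random.default_rng(0).standard_normal(N)
s_lp = np.real(np.fft.ifft(np.fft.fft(s) * LP))  # one smoothing step of the pyramid
```

Multiplying in the Fourier domain rather than convolving spatially matches the testing procedure described above, where all filtering is done on the DFTs.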
The filter function for each level is calculated by scaling the frequency function appropriately:

    F_{u_1}(u) = F_{u_0}((u_0 / u_1) u)    (2.11)

Linear interpolation between nearest neighbors enables non-integer scaling. Again, the method is chosen to resemble a subsampled resolution pyramid. Generating new filters for each scale would give better, but larger, filters, and it would not correspond to a subsampled resolution pyramid.

2.3.2 Gabor filters

In the literature, Gabor filters are chosen for their minimum space-frequency uncertainty and for the separability of center frequency and bandwidth. A Gabor filter tuned to a frequency u_0 is created, spatially, by multiplying an envelope function by a complex exponential function with angular frequency u_0, Equation (2.12). Gabor showed that a Gaussian envelope minimizes the space-frequency uncertainty product [29]; a review is found in [53]. This means that the Gabor filters are well localized in both domains simultaneously.

Figure 2.8: The magnitude of three Gabor filters in the frequency domain. u_0 = {\pi/8, \pi/4, \pi/2} and \beta = 0.8.

When designing a Gabor filter, the parameters are the standard deviation, \sigma_\xi, and the center frequency, u_0. These also affect the size and the bandwidth of the filter. The definition of a Gabor filter in the spatial domain is:

    g_{u_0}(\xi) = e^{i u_0 \xi} (1 / (\sqrt{2\pi} \sigma_\xi)) e^{-\xi^2 / (2\sigma_\xi^2)}    (2.12)

and the definition in the frequency domain is:

    G_{u_0}(u) = e^{-(u - u_0)^2 / (2\sigma_u^2)}    (2.13)

where \sigma_u = 1/\sigma_\xi. The Gabor filters have linear, and thus monotonous, phase by definition. Since the Gaussian has infinite support in both domains, it is impossible to keep the filter in the right half-plane. It is therefore theoretically impossible to avoid negative frequencies and a DC component. For practical purposes, the Gaussian can be considered to be zero below some sufficiently low threshold.
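The DC problem of Equations (2.12)-(2.13) is easy to observe numerically. The sketch below samples and truncates a Gabor filter and measures the DC component directly; the parameter values (u_0 = \pi/4 with \sigma_u = \pi/8) are illustrative choices, not taken from the thesis.

```python
import numpy as np

def gabor(xi, u0, sigma_xi):
    """Spatial Gabor filter, Equation (2.12): Gaussian envelope times a
    complex exponential with angular frequency u0."""
    env = np.exp(-xi**2 / (2 * sigma_xi**2)) / (np.sqrt(2 * np.pi) * sigma_xi)
    return env * np.exp(1j * u0 * xi)

# illustrative design: u0 = pi/4 and sigma_u = pi/8, i.e. sigma_xi = 8/pi
xi = np.arange(-32, 33)
f = gabor(xi, u0=np.pi / 4, sigma_xi=8 / np.pi)
dc = np.abs(f.sum())   # ~= G(0) = exp(-u0^2 / (2 * sigma_u^2)) = e^-2
```

With this rather broad-band design the DC component comes out at about 14 percent of the peak frequency response, illustrating why the bandwidth must be limited when a small DC component is required.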
The center frequency u_0 determines the number of pixels per cycle of the phase, and the frequency standard deviation \sigma_u determines the spatial and frequency support. By adjusting them, it is possible to get any number of phase cycles over the spatial support. But not all combinations of u_0 and \sigma_u yield a useful filter.

To see this, suppose that a certain center frequency, u_0, is wanted. The radius of the frequency support must then be smaller than u_0 so that the frequency function is sufficiently low at u = 0, i.e. a negligible DC component. This gives an upper limit on the bandwidth, or rather the frequency standard deviation, of the filter. Say we allow the ratio between the DC component and the top value to be at most P_{DC}:

    G_{u_0}(0) \le P_{DC} G_{u_0}(u_0)    (2.14)

Using Equation (2.13) and solving for \sigma_u gives the upper limit on the bandwidth:

    \sigma_u \le u_0 / \sqrt{-2 \ln(P_{DC})}    (2.15)

See for instance Figure 2.7 on page 15, where the DC component is a few percent of the maximum value. Using the dual relationship between the frequency and spatial domains, \sigma_\xi = 1/\sigma_u, inequality (2.15) can be turned into a lower limit on the spatial standard deviation:

    \sigma_\xi \ge 1/\sigma_u = \sqrt{-2 \ln(P_{DC})} / u_0    (2.16)

The spatial support of a Gabor filter is infinite, just like the frequency support. A threshold, P_{cut}, must therefore be set in order to get a finite spatial size. The spatial radius, R, of the filter then satisfies:

    ||g_{u_0}(R)|| \le P_{cut} ||g_{u_0}(0)||    (2.17)

Using Equation (2.12) and solving for R gives the lower limit on the filter radius:

    R \ge \sigma_\xi \sqrt{-2 \ln(P_{cut})}    (2.18)

Setting the standard deviation to its lower limit gives the filter radius as a function of the design parameters P_{cut} and P_{DC}:

    R = (2 / u_0) \sqrt{\ln(P_{cut}) \ln(P_{DC})}    (2.19)

The phase difference between the end points of the filter can now be calculated:

    \Delta\theta = u_0 R - u_0 (-R) = 2 u_0 R = 4 \sqrt{\ln(P_{cut}) \ln(P_{DC})}    (2.20)

It should be pointed out that the truncation threshold, P_{cut}, affects the DC component of the filter.
The DC component should therefore be checked after truncation of the filter to see if it is still less than P_{DC}. This was not done in Table 2.1. Table 2.1 shows some values of the phase difference between the end points of the filter. If the phase difference is less than 2\pi the phase does not wrap around.

              P_{DC}
    P_{cut}   0.05     0.1      0.2      0.25
    0.05      11.982   10.505   8.783    8.151
    0.1       10.505   9.210    7.700    7.146
    0.2       8.783    7.700    6.437    5.974
    0.25      8.151    7.146    5.974    5.545

Table 2.1: Phase difference (in radians) between filter end points for different values of the DC component and the truncation threshold. Both P_{DC} and P_{cut} should be small. The DC value is not adjusted after truncation of the filter.

Both P_{DC} and P_{cut} should be small in order to minimize the DC component and keep the Gaussian envelope. The conclusion is that having all the support, or most of it, in the right half-plane of the frequency domain, i.e. a small P_{DC}, requires a center frequency that generates wrap-around of the phase. Similarly, it can be shown that starting from a phase that does not wrap around yields a center frequency that is much smaller than the frequency support of the filter. The resulting filter will then have a substantial DC component.

The upper limit of the relative bandwidth, \beta, of the Gabor filters used has heuristically been set to approximately 0.8 octaves (Figure 2.8 on page 17). This is also the bandwidth used by Fleet et al. [27, 39]. Langley suggests that the mean DC level should be subtracted from the input images in order to enhance the results [51]. The reason is that the DC component of the filter is then less critical. The best would be to calculate the weighted average in every image point using the Gaussian envelope of the Gabor filter and subtract it from the original image, which is the same as constructing a new filter without a DC component. The behavior of the Gabor filters around the singular points has been thoroughly investigated by Fleet et al. [27].
They used a Gabor scale-space function defined as

    g(\xi; \lambda) = g_{\sigma_\xi(\lambda)}(\xi) e^{i u_0(\lambda) \xi}    (2.21)

where \lambda is the scale parameter. The center frequency decreases when the scale parameter increases, i.e.

    u_0(\lambda) = 2\pi / \lambda    (2.22)

In theory it would be possible to keep the absolute bandwidth constant, i.e. to fix \sigma_u at the standard deviation used at the lowest u_0 and then vary u_0. But by doing so, the number of phase cycles over the filter varies with the scale. If the relative bandwidth is kept constant, increasing \lambda can be seen as stretching out the same filter to cover larger areas [32]. Approximating the upper and lower half-height cutoff frequencies as one standard deviation above and one below the center frequency, i.e.

    \beta = \log_2( (u_0(\lambda) + \sigma_u) / (u_0(\lambda) - \sigma_u) )    (2.23)

gives the expression for the spatial standard deviation of the filter:

    \sigma_\xi(\lambda) = (1 / u_0(\lambda)) (2^\beta + 1) / (2^\beta - 1)    (2.24)

The isophase curves in Figure 2.9 on page 22 show the phase on a number of scales. The dark broad lines are due to phase wrap-around. A feature that is stable in scale-space keeps its spatial position in all scales. If the phase were completely stable in scale, the isophase pattern would consist only of vertical lines. The existence of singular points is easily observed in the phase diagram. The positions where the isophase curves converge are singular points. Just above them, the isophase curves turn downwards, indicating areas with decreasing phase, i.e. negative local frequency; compare with Figure 2.5g. The high density of isophase curves just below the singular points shows that the local frequency is very high (Figure 2.5d).
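The scale-space parameterization can be sketched as follows; the forms u_0(\lambda) = 2\pi/\lambda and \sigma_\xi(\lambda) = (2^\beta + 1) / ((2^\beta - 1) u_0(\lambda)) are reconstructions of the garbled Equations (2.22)-(2.24) and should be read as assumptions of this sketch.

```python
import numpy as np

def gabor_scale_params(lam, beta=0.8):
    """Center frequency and spatial standard deviation of the Gabor
    scale-space, as reconstructed from Eqs. (2.22)-(2.24)."""
    u0 = 2 * np.pi / lam                              # Eq. (2.22): u0 falls as 1/lambda
    sigma_xi = (2**beta + 1) / ((2**beta - 1) * u0)   # Eq. (2.24)
    return u0, sigma_xi
```

With these forms the half-height bandwidth check of Equation (2.23) comes out exactly: \sigma_u = 1/\sigma_\xi gives \log_2((u_0 + \sigma_u)/(u_0 - \sigma_u)) = \beta at every scale, and doubling \lambda halves u_0.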
Figure 2.9: Above: Isophase plot of the Gabor phase scale-space. The positions where the isophase curves converge are singular points. Below: Magnitude of the Gabor filter output and isomagnitude plots thresholded at 20%, 10% and 5% of the maximum. In the dark areas the magnitude is below the threshold. u_0 = \pi/4 and \beta = 0.8.

2.3.3 Quadrature filters

Quadrature filters can be defined as having no support in the left half-plane of the frequency domain, and no DC component. This definition makes them very easy to generate (Equations (2.5) and (2.6)). There are a number of different types of quadrature filters, of which two will be investigated here.

Lognorm filter

Figure 2.10: Three lognorm filters in the frequency domain. u_0 = {\pi/8, \pi/4, \pi/2} and \beta = 0.8.

Lognorm filters are a class of quadrature filters used for orientation, phase and frequency estimation [44]. The design parameters are the center frequency u_0 and the relative bandwidth in octaves, \beta. The lognorm filters are defined in the frequency domain:

    F(u) = e^{-(4 / (\beta^2 \log 2)) \log^2(u / u_0)}  if u > 0,  0 otherwise    (2.25)

There is, by definition, no DC component and no support in the left half-plane of the frequency domain. Although an analytic expression for the spatial definition of a lognorm filter is unavailable, it is possible to use some of the results from the Gabor filter case. For a certain relative bandwidth the phase goes through a certain number of cycles, independent of the size of the filter support.
Recalling from the Gabor case that the center frequency is related to the number of pixels per phase cycle and the bandwidth is related to the size of the spatial support, it is evident that using a wide relative bandwidth is the way to ensure no wrap-around. The long tail of the lognorm frequency function makes this possible only for relatively low center frequencies (Figure 2.10 on the page before). If too much of the tail is cut, the filter can no longer be considered a lognorm filter.

The isophase curves in Figure 2.11 on the facing page show the phase on a number of scales. The plot is generated using the same parameters as in the Gabor case above, i.e. u_0 = \pi/4 and \beta = 0.8. The similarity makes it easy to identify the singular points and compare the behavior of the phase around them. Studying the behavior of the phase around the singular points indicates that the disturbance region is smaller than for Gabor filters, i.e. the areas with negative frequencies are smaller. On the other hand, the size of a lognorm filter is approximately 50 percent larger than that of a Gabor filter with the same center frequency and bandwidth when truncating at one percent of the maximal value.

Powexp filter

There is a type of quadrature filter where the number of cycles of the phase is directly controllable. A family of filters with center frequency u_0 and a bandwidth controlled by \alpha can be constructed from the following standard Fourier pair [10]:

    \tilde{F}(u) = u^\alpha e^{-u}  if u > 0,  0 if u \le 0    (2.26)

    \tilde{f}(\xi) \propto 1 / (1 + i\xi)^{\alpha + 1}    (2.27)

Scaling with u_0 gives a filter with a center frequency depending on \alpha.
This dependence is of course unwanted and can be avoided by scaling with u_0/\alpha instead:

    \hat{F}_{u_0}(u) = (\alpha u / u_0)^\alpha e^{-\alpha u / u_0}    (2.28)

    \hat{f}_{u_0}(\xi) \propto 1 / (1 + i \xi u_0 / \alpha)^{\alpha + 1}    (2.29)

Figure 2.11: Above: Isophase plot of the lognorm phase scale-space. The positions where the isophase curves converge are singular points. Below: Magnitude of the lognorm filter output and isomagnitude plots thresholded at 20%, 10% and 5% of the maximum. In the dark areas the magnitude is below the threshold. u_0 = \pi/4 and \beta = 0.8.

Finally, normalizing the frequency function gives the wanted Fourier pair:

    F_{u_0}(u) = \hat{F}_{u_0}(u) / \hat{F}_{u_0}(u_0) = (u / u_0)^\alpha e^{-\alpha(u / u_0 - 1)}    (2.30)

    f_{u_0}(\xi) \propto e^\alpha / (1 + i \xi u_0 / \alpha)^{\alpha + 1}    (2.31)

Noting that F_{\alpha, u_0}(u) = (F_{1, u_0}(u))^\alpha, it is easy to see that the center frequency of the filter is independent of \alpha and that the bandwidth decreases with increasing \alpha. It is equally easy to see that the number of phase cycles of these filters is a function of \alpha.

For \alpha = 1, the relative bandwidth is approximately two octaves and the phase cycles once. However, both the frequency and spatial support of the filter are very large, which reduces the usefulness of this filter type. As an example, the spatial support is approximately 60 percent larger than that of a lognorm filter with the same center frequency and bandwidth when truncating at one percent of the maximum value.

2.3.4 Other even-odd pairs

The filters described above all consist of an even real part and an odd imaginary part, and the phase is calculated from the ratio between these parts.
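The key properties of the powexp frequency function can be checked numerically. The sketch below verifies, for illustrative parameter values, that the peak of Equation (2.30) stays at u_0 for every \alpha and that F_{\alpha} = (F_1)^\alpha, so \alpha only narrows the filter.

```python
import numpy as np

def powexp(u, u0, alpha):
    """Normalized powexp frequency function, Equation (2.30)."""
    F = np.zeros_like(u)
    pos = u > 0
    r = u[pos] / u0
    F[pos] = r**alpha * np.exp(-alpha * (r - 1.0))
    return F

u = np.linspace(0.0, np.pi, 513)
F1 = powexp(u, np.pi / 4, alpha=1.0)
F4 = powexp(u, np.pi / 4, alpha=4.0)
# the peak stays at u0 for every alpha, and F_alpha = (F_1)**alpha
```

The identity F_{\alpha} = (F_1)^\alpha also makes the normalization F_{u_0}(u_0) = 1 immediate for all \alpha.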
There are a few other types of filters that are neither Gabor nor quadrature filters, but which can be interpreted as even-odd pairs.

Non-ringing filters

Figure 2.12: Three non-ringing filters. u_0 \approx {\pi/8, \pi/4, \pi/2} and \beta \approx 2.2. The filters are generated in the spatial domain using Equation (2.35). The radii of spatial support are R = 14, 6, 3.

A filter type that has exactly one phase cycle over the spatial support, and that can be designed to have almost quadrature features, has been suggested by Knutsson (personal communication). Using a monotonous antisymmetric phase function, a filter having a phase span of 2\pi n and no DC component can be defined as:

    f(\xi) = g'(\xi) e^{i C_0 g(\xi)}    (2.32)

where

    g'(\xi) = dg/d\xi > 0  if -R \le \xi \le R,  0 otherwise    (2.33)

and

    C_0 = n\pi / g(R),  n = 0, 1, 2, ...    (2.34)

The function g(\xi) can be any monotonous antisymmetric function, but since the derivative controls the envelope, it is advisable to use a function with a smooth and unimodal derivative. How well such a filter approximates a quadrature filter depends on the size of the filter and how smooth the filter function is. It is easily shown that the DC component is zero:

    F(0) = \int_{-\infty}^{\infty} f(\xi) d\xi = \int_{-R}^{R} g'(\xi) e^{i\pi n g(\xi)/g(R)} d\xi = (g(R)/(i\pi n)) (e^{i\pi n} - e^{-i\pi n}) = 0

Choosing n = 1 yields a filter with no DC component and no wrap-around. The filter used for the isophase plots is built from the primitive function of a squared cosine as argument function and, thus, has the squared cosine as envelope:

    f(\xi) = \cos^2(\pi\xi/(2R)) e^{-i(\pi\xi/R + \sin(\pi\xi/R))}  if |\xi| < R,  0 otherwise    (2.35)

The center frequency is approximately 3\pi/(2R) and the relative bandwidth is approximately 2.2 octaves. The phase of the filter is monotonous, but the filter has considerable support for negative frequencies.
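The vanishing DC component of Equation (2.35) also holds, to good approximation, for the sampled and truncated filter; the sketch below constructs the filter for an illustrative radius and measures the DC component directly.

```python
import numpy as np

def nonring(R):
    """Non-ringing filter of Equation (2.35): squared-cosine envelope and
    exactly one phase cycle over the support |xi| < R."""
    xi = np.arange(-R + 1, R)
    env = np.cos(np.pi * xi / (2 * R)) ** 2
    arg = -(np.pi * xi / R + np.sin(np.pi * xi / R))
    return xi, env * np.exp(1j * arg)

xi, f = nonring(R=6)
dc = np.abs(f.sum())    # close to zero, as shown analytically above
```

Since the filter has finite support by construction, no truncation threshold is needed, in contrast to the Gabor case.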
Figure 2.13: Above: Isophase plot of the non-ringing phase scale-space. The positions where the isophase curves converge are singular points. Below: Magnitude of the non-ringing filter output and isomagnitude plots thresholded at 20%, 10% and 5% of the maximum. In the dark areas the magnitude is below the threshold. u_0 = \pi/4.

Windowed Fourier Transform

Figure 2.14: Three windowed Fourier transform filters. u_0 = {2\pi/15, 2\pi/7, 2\pi/4} and \beta \approx 2.0. The filters were generated in the spatial domain using Equation (2.36). The radii of spatial support were R = 7, 3, 2.

The windowed Fourier transform can be used for estimating local phase. The window can be chosen arbitrarily, e.g. a rectangular function. Weng advocates the rectangular window [71], which is actually a special case of the non-ringing filters. The spatial magnitude function is a rectangular function and the argument is a ramp:

    f(\xi) = e^{-i\pi\xi/R}  if |\xi| < R,  0 otherwise    (2.36)

Although the term windowed Fourier transform is not tied to any particular window function, filters defined according to Equation (2.36) are called WFT filters in this thesis. Figure 2.14 shows three WFT filters in the Fourier domain. The long tails of ripples make the filter sensitive to high-frequency noise. Weng suggests prefiltering the signal with a Gaussian, which is the same as smoothing the filter. The resulting filter is then very similar to the non-ringing filter above. In this test the signal is prefiltered with a smoothing function as described in Subsection 2.3.1, so no further smoothing is necessary.
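The ripple tails of the rectangular window can be made visible numerically. The sketch below builds a WFT filter per Equation (2.36) for an illustrative radius (the sign of the exponent is flipped here so the center frequency comes out positive, a convention choice of this sketch) and measures how much spectral energy leaks into the negative half-plane.

```python
import numpy as np

R = 7
xi = np.arange(-R + 1, R)            # rectangular window, |xi| < R
f = np.exp(1j * np.pi * xi / R)      # linear phase ramp (sign chosen so the
                                     # center frequency comes out positive)
N = 1024
F = np.fft.fft(f, N)
u = 2 * np.pi * np.fft.fftfreq(N)
E = np.abs(F) ** 2
u_peak = u[np.argmax(E)]                  # near pi/R, the nominal center frequency
neg_fraction = E[u < 0].sum() / E.sum()   # ripple energy at negative frequencies
```

A few percent of the energy sits at negative frequencies, consistent with the remark above that the unsmoothed WFT filter is the most sensitive of the tested filters.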
Figure 2.15: Above: Isophase plot of the WFT phase scale-space. The positions where the isophase curves converge are singular points. Below: Magnitude of the WFT filter output and isomagnitude plots thresholded at 20%, 10% and 5% of the maximum. In the dark areas the magnitude is below the threshold. u_0 = \pi/4.

Gaussian differences

Gaussian filters and their derivatives, or rather differences, can be efficiently implemented using binomial filter kernels [18]. The basic kernels are an LP kernel and a difference kernel (Figure 2.16). These can be implemented using shifts and summations. The first and second differences of the binomial Gaussians can be used as a phase estimator (Figure 2.16). The spatial support is only 5 pixels.

Figure 2.16: Left, the LP kernel. Middle, the first difference kernel, f_1. Right, the second difference kernel, f_2.

Figure 2.17: The DFT magnitude of the binomial phase filter for \alpha = 0.5 (dot-dashed), \alpha = 0.3333 (dashed) and \alpha = 0.3660 (solid).

From the design, it is evident that there is no DC component and that the phase does not wrap around. The first difference filter, f_1, is the odd kernel, and changing the sign of the second difference, f_2, gives the even kernel. Instead of just using the kernels as they are, it is possible to give them different relative weights, producing a range of filters:

    f(\xi) = -\alpha f_2(\xi) + i(1 - \alpha) f_1(\xi),  where 0 \le \alpha \le 1    (2.37)

The energy in the left half-plane of the frequency domain is minimized by setting \alpha = 0.3660 (Figure 2.17 on the preceding page). This design method is a special case of a method for producing quadrature filters called prolate spheroidals.
In the general case, there are an arbitrary number of basis filters that are weighted together using a multi-variable optimization technique. The method produces the best possible quadrature filter of a given size, in the sense that it has minimum energy in the left half-plane. If it is essential to the implementation to use only summations and shifts, the weights can be chosen as 1 for the even kernel and 2 for the odd kernel, corresponding to \alpha = 0.3333. The relative bandwidth is approximately two octaves and only slightly dependent on \alpha.

Figure 2.18: Above: Isophase plot of the Gaussian differences phase scale-space. The positions where the isophase curves converge are singular points. Below: Magnitude of the Gaussian differences output and isomagnitude plots thresholded at 20%, 10% and 5% of the maximum. In the dark areas the magnitude is below the threshold. u_0 = \pi/2.

2.3.5 Discussion on filter choice

The choice of filter is not evident from these investigations. Different characteristics may have different priorities in different applications. The size of the kernel may be less important if special-purpose hardware is used, making scale-space behavior the critical issue. On the other hand, if the convolution time depends directly on the kernel size, a less robust but smaller kernel might be accepted. The most relevant test is to use the filters in the intended application and measure the overall performance. For convenience, the filter characteristics are summarized below.

Gabor filters. The Gabor filters may have a DC component if not designed carefully. The phase is monotonous but wraps around.
The frequency support is localized in the right half-plane, and the sensitivity to singular points is small.

Lognorm filters. The lognorm quadrature filters have neither a DC component nor any frequency support in the left half-plane of the frequency domain. The phase generally wraps around, but it is monotonous. The sensitivity to singular points is small for narrow-band filters and increases with bandwidth.

Non-ringing filters. The non-ringing filter investigated here has no DC component, monotonous phase and no phase wrap-around. The filter has a slight sensitivity to negative frequencies, depending on the center frequency. The spatial support is small. The sensitivity to singular points is larger than for Gabor and lognorm filters.

(Rectangular) Windowed Fourier Transform. Being a special case of the non-ringing filters, the WFT filters share the properties described above. Their sensitivity to singular points is the largest of the tested filters. The smoothing of the filter that is necessary to reduce the noise influence makes the filter very similar to the non-ringing filter based on the squared-cosine magnitude.

Differences of Gaussians. Gaussian difference filters implemented with binomial coefficients have no DC component. The phase is monotonous and does not wrap around. The sensitivity to negative frequencies can be adjusted by weighting the even and odd kernels appropriately; it cannot, however, be reduced to zero. The sensitivity to singular points is slightly larger than for non-ringing filters. The spatial support is small.

3 PHASE-BASED DISPARITY ESTIMATION

3.1 Introduction

The problem of estimating depth information from two or more images of a scene has received considerable attention over the years, and a wide variety of methods have been proposed to solve it [8, 24]. Methods based on correlation and methods using some form of feature matching between the images have found the most widespread use.
Of these, the latter have attracted increasing attention since the work of Marr [54], in which the features are zero-crossings on varying scales. These methods share an underlying basis of spatial domain operations. In recent years, however, increasing interest has been shown in computational models of vision based primarily on a localized frequency domain representation, the Gabor representation [29, 2], first suggested in the context of computer vision by Granlund [32]. In [63, 87, 40, 27, 51] it is shown that such a representation can also be adapted to the solution of the stereopsis problem. The basis for the success of these methods is the robustness of the local Gabor-phase differences. The algorithm presented here is an extension of the work presented in [87].

Figure 3.1: Left: A superimposed stereo image pair of a line. In the left image the line is located at \xi_1 (solid) and in the right image it is located at \xi_2. Right: The phase curves corresponding to the line in the two images. The displacement can be estimated by calculating the phase difference, \Delta\theta, and the slope of the phase curve, i.e. the local frequency d\theta/d\xi.

3.2 Disparity estimation

The fact that phase is locally equivariant with position can be used to estimate local displacement between two images [63, 87, 25, 72]. In a stereo image pair the local displacement is a measure of depth, and in an image sequence the local displacement is an estimate of velocity. One of the advantages of using phase for displacement estimation is that subpixel accuracy can be obtained without having to change the sampling density. Figure 3.1 shows an example where the displacement of a line in a stereo image pair is estimated using phase differences. Traditional displacement estimation would calculate the position of a significant feature, e.g. the local maximum of the intensity, and then calculate the difference.
If subpixel accuracy is needed, the feature locations would have to be stored using some sort of subpixel representation. The local phase, on the other hand, is a continuous variable sensitive to changes much smaller than the spatial quantization. Sampling the phase function with a certain density does not restrict the phase differences to the same accuracy. Thus, a subpixel displacement generates a phase shift, giving phase differences with subpixel accuracy without a subpixel representation of image features. In Figure 3.1 the displacement estimate is:

    \Delta\xi = \Delta\theta / \theta'    (3.1)

where \theta' is the slope of the phase curve, i.e. the local frequency.

3.2.1 Computation structure

Figure 3.2: Computation structure for the hierarchical stereo algorithm.

A hierarchical stereo algorithm that uses a phase-based disparity estimator has been developed [84]. To optimize the computational performance, a multiresolution representation of the left and right images is used. An edge detector, tuned to vertical structures, is used to produce a pair of images containing edge information. The edge images reduce the influence of singular points, since the singular points in the original images and in the edge images generally do not coincide. The impact of a DC component in the disparity filter is also reduced by means of the edge images. The edge images, together with the corresponding original image pair, are used to build the resolution pyramids. There is one octave between the levels. The number of levels needed depends on the maximum disparity in the stereo image pair. The algorithm starts at the coarsest resolution. The disparity accumulator holds and updates disparity estimates and confidence measures for each pixel. The four input images are shifted locally according to the current disparity estimates.
After the shift, a new disparity estimate is calculated using the phase differences, the local frequency, and their confidence values. The disparity estimate from the edge image pair has high confidence close to edges, while the confidence is low in between them. The estimates from the original image pair resolve possible problems of matching incompatible edges; that is, only edges with the same sign of the gradient should be matched. Both these disparity estimates are weighted together by a consistency function to form the disparity measure between the shifted images. The new disparity measure updates the current estimate in the disparity accumulator. For each resolution level, a refinement of the disparity estimate can be obtained by iterating these steps. The remaining disparity should get closer and closer to zero during the iterations. Between levels, the disparity image is resampled to the new resolution and a local spatial consistency check is performed. The steps above are repeated until the finest resolution is reached. The accumulator image then contains the final disparity estimates and certainties.

3.2.2 Edge extraction

The edge images can be created using any edge extraction algorithm. Here the edge extraction is performed using the same filter as for the disparity estimation. The magnitude of the filter response is stored in the edge image, creating a sort of line drawing. The disparity filters are sensitive only to more or less vertically oriented structures, but this is no limitation, since horizontal lines do not contain any disparity information. The produced edge image is used as input to create a resolution pyramid in the same way as described above. In total, four pyramids are generated before the disparity estimation starts.

3.2.3 Local image shifts

The images from the current level in the resolution pyramid are shifted according to the disparity accumulator, which is initialized to zero.
The left and right images are shifted half the distance each. The shift procedure decreases the disparity, since the left and right images are shifted towards each other. It also reduces differences due to foreshortening [61]. This means that if a disparity is estimated fairly well at a coarse resolution, the reduction of the disparity will enable the next level to further refine the result. The shift is implemented as a "picking at a distance" procedure:

    x_L^s(\xi_1, \xi_2) = x_L(\xi_1 + 0.5\Delta\xi, \xi_2)    (3.2)
    x_R^s(\xi_1, \xi_2) = x_R(\xi_1 - 0.5\Delta\xi, \xi_2)    (3.3)

which means that a value is picked from the old image into the new image at a distance determined by the disparity, \Delta\xi. This ensures that there will be no points without a value. Linear interpolation between neighboring pixels allows non-integer shifts.

3.2.4 Disparity estimation

The disparity is measured on both the grey level images and the edge images. The phase can be estimated using any of the filters described in Section 2.3. The result will of course vary with the filter characteristics, but a number of consistency checks reduce the variation between filter types. The disparity is estimated in the grey level images and the edge images separately, and the results are weighted together. The filter response in a point can be represented by a complex number, whose real and imaginary parts represent the even and odd filter responses respectively. The magnitude is a measure of how strong the signal is and how well it fits the filter. The magnitude will therefore be used as a confidence measure of the filter response. The argument of the complex number is the phase of the signal. Let the responses from the phase estimating filter be represented by the complex numbers z_L and z_R for the left and right image respectively. The filters are normalized so that 0 \le ||z_{L,R}|| \le 1.
Calculating

    d = z_L z_R*    (3.4)

where * denotes complex conjugation, yields a phase difference measure and a confidence value:

    ‖d‖ = ‖z_L‖ ‖z_R‖,  0 ≤ ‖d‖ ≤ 1    (3.5)
    arg(d) = arg(z_L) − arg(z_R)    (3.6)

The magnitude, ‖d‖, is large only if both filter magnitudes are large. It consequently indicates how reliable the phase difference is. If a filter sees a completely homogeneous neighborhood, its magnitude is zero and its argument is undefined. Calculating the phase difference without any confidence values then produces an arbitrary result. If the images are captured under similar conditions and cover approximately the same area, it is reasonable to expect that the magnitudes of the filter responses are approximately the same for both images. This can be used to check the validity of the disparity estimate. A substantial difference in magnitude can be due to noise or to a too large disparity, i.e. the image neighborhoods do not depict the same part of reality. It can also be due to a singular point in one of the signals, since the magnitude is reduced considerably in such neighborhoods. In any of these cases the confidence value of the estimate should be reduced, so that the consistency checks later on can weigh the estimate accordingly. Sanger used the ratio between the smaller and the larger of the magnitudes as a confidence value [63]. Such a confidence value does not differentiate between strong and weak signals. The confidence function below depends both on the relation between the filter magnitudes and on their absolute values. The confidence value therefore reflects both the similarity and the signal strength:

    C1 = √‖z_L z_R‖ ( 2 ‖z_L‖ ‖z_R‖ / (‖z_L‖² + ‖z_R‖²) )^γ    (3.7)

The square root of ‖z_L z_R‖ is the geometric average of the filter magnitudes, i.e. a measure of the combined signal strength. The exponent γ controls how much a magnitude difference should be punished.

Figure 3.3: The magnitude difference penalty function. The plots show the function for γ = 0 … 10 from left to right. The abscissa is the ratio between the smaller and the larger magnitude.

The expression within the parentheses is equal to one if ‖z_L‖ = ‖z_R‖ and decays with increasing magnitude difference. Setting

    M² = ‖z_L z_R‖    (3.8)
    α = ‖z_R‖ / ‖z_L‖    (3.9)

transforms Equation (3.7) into a more intuitively understandable form:

    C1 = M ( 2α / (1 + α²) )^γ    (3.10)

If ‖z_L‖ = ‖z_R‖ = M, i.e. α = 1, then C1 = M. This means that if the magnitudes are almost the same, the confidence value equals the common magnitude. If the magnitudes differ, the confidence decreases at a rate controlled by γ. Figure 3.3 shows how the confidence depends on the filter magnitude ratio, α, and the exponent γ. Throughout the testing of the algorithm the exponent has heuristically been set to γ = 4.

If the phase difference is very large it might wrap around and indicate a disparity with the wrong sign. Very large phase differences should therefore be given a lower confidence value [87]:

    C2 = C1 cos²( arg(d) / 2 )    (3.11)

In Chapter 2 it was shown that the phase derivative varies with the frequency content of the signal. In order to correctly interpret the phase difference, arg(d), as disparity it is necessary to estimate the phase derivative, i.e. the local frequency [51, 27]. Let z(ξ) be a phase estimate at position ξ. The phase differences between position ξ and its two neighbors are a measure of how fast the phase varies in the neighborhood, i.e. the local frequency. The local frequency can be approximated using the phase differences to the left and to the right of the current position:

    f_{L−} = z_L*(ξ − 1) z_L(ξ)    (3.12)
    f_{L+} = z_L*(ξ) z_L(ξ + 1)    (3.13)
    f_{R−} = z_R*(ξ − 1) z_R(ξ)    (3.14)
    f_{R+} = z_R*(ξ) z_R(ξ + 1)    (3.15)

The arguments of f_i, i ∈ {L−, L+, R−, R+}, are estimates of the local frequency, and they are combined using

    φ' = arg( f_{L−} + f_{L+} + f_{R−} + f_{R+} )    (3.16)
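The chain from filter responses to a pixel disparity estimate can be condensed into a short 1-D numpy sketch. This is an illustrative compression of Equations (3.4), (3.7), (3.11) and (3.12)-(3.17), not the thesis implementation; it uses γ = 4 as in the text and omits the accumulator and consistency machinery.

```python
import numpy as np

def disparity_from_phase(zL, zR, gamma=4):
    """1-D sketch of phase-based disparity estimation.
    zL, zR: complex quadrature filter responses along a scan line.
    Returns disparity (pixels) and confidence for interior points."""
    zL, zR = np.asarray(zL), np.asarray(zR)
    d = zL * np.conj(zR)                       # Eq. 3.4: phase difference
    mL, mR = np.abs(zL), np.abs(zR)
    mag2 = mL**2 + mR**2
    # Eq. 3.7: geometric-mean signal strength times a penalty that
    # decays when the two filter magnitudes differ.
    C1 = np.sqrt(mL * mR) * np.where(
        mag2 > 0, 2 * mL * mR / np.where(mag2 > 0, mag2, 1), 0) ** gamma
    # Eq. 3.11: punish phase differences close to wrap-around.
    C2 = C1 * np.cos(np.angle(d) / 2) ** 2
    # Eqs. 3.12-3.16: local frequency from neighbor phase differences
    # in both filter-response rows.
    f = (np.conj(zL[:-2]) * zL[1:-1] + np.conj(zL[1:-1]) * zL[2:] +
         np.conj(zR[:-2]) * zR[1:-1] + np.conj(zR[1:-1]) * zR[2:])
    freq = np.angle(f)
    # Eq. 3.17: disparity in pixels; zero confidence where the local
    # frequency is zero or negative and the phase difference unreliable.
    valid = freq > 0
    disp = np.where(valid, np.angle(d)[1:-1] / np.where(valid, freq, 1), 0)
    conf = np.where(valid, C2[1:-1], 0)
    return disp, conf
```

Because the phase difference is divided by the local frequency, the disparity does not have to be an integer, which is where the subpixel accuracy of the method comes from.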
Knowing the local frequency, i.e. the slope of the phase curve, calculating the disparity in pixels is straightforward:

    δ = arg(d) / φ'    (3.17)

Note that δ does not have to be an integer; using phase differences allows subpixel accuracy. The confidence value is updated by a factor depending only on the similarity between the local frequency estimates and not on their magnitudes. If the local frequency is zero or negative, the confidence value is set to zero, since the phase difference then is completely unreliable:

    C3 = C2 ( ‖Σ_i f_i‖ / Σ_i ‖f_i‖ )⁴   if φ' > 0
    C3 = 0                               if φ' ≤ 0    (3.18)

where i ∈ {L−, L+, R−, R+}.

3.2.5 Edge and grey level image consistency

Let subscripts g and e denote grey level and edge image values respectively. The disparity and confidence values are calculated for the grey level image and the edge image separately using Equations (3.17) and (3.18). These estimates are then combined to give the total disparity estimate and its confidence value:

    Δ = ( C_{g3} Δ_g + C_{e3} Δ_e ) / ( C_{g3} + C_{e3} )    (3.19)

The confidence value for the combined disparity estimate depends on C_{g3}, C_{e3} and the similarity between the phase differences arg(d_g) and arg(d_e). This is accomplished by adding the confidence values as vectors with the phase differences as arguments:

    C_tot = ‖ C_{g3} e^{i arg(d_g)/2} + C_{e3} e^{i arg(d_e)/2} ‖    (3.20)

The phase differences, arg(d_{e,g}), are divided by two in order to ensure that C_tot is large only for arg(d_g) ≈ arg(d_e), and not for arg(d_g) ≈ arg(d_e) ± 2π as well.

3.2.6 Disparity accumulation

The disparity accumulator is updated using the disparity estimate and its confidence value. The accumulator holds the cumulative sum of disparity estimates. Since the images are shifted according to the current accumulator value, the value to be added is just a correction towards the true disparity. Thus, the disparity value is simply added to the accumulator:

    Δ_new = Δ_old + Δ    (3.21)

When updating the confidence value of the accumulator, high confidence values are emphasized and low values are attenuated:

    C_new = ( ( √C_old + √C_tot ) / 2 )²    (3.22)

3.2.7 Spatial consistency

In most images there are areas where the phase estimates are weak or contradictory. In these areas the disparity estimates are not reliable. This results in tearing the image apart when making the shift before disparity refinement, creating unnecessary distortion of the image. It is therefore desirable to spread the estimates from nearby areas with higher confidence values. On the other hand, it is not desirable to average between areas with different disparity and high confidence. A filter function fulfilling these requirements has a spatial function with a large peak in the middle that decays rapidly towards the periphery, such as a Gaussian with a small σ:

    h(ξ) = (1 / (√(2π) σ)) e^{−ξ² / (2σ²)},  −R ≤ ξ ≤ R    (3.23)

A kernel with R = 7 and σ = 1.0 has been used when testing the algorithm. The filter is applied in the vertical and horizontal directions separately. It is convolved both with the confidence values alone and with the disparity estimates weighted by the confidence values:

    m = h * C    (3.24)
    v = h * (C Δ)    (3.25)

If the filter is positioned on a point with a high confidence value, the disparity estimate will be left virtually untouched, but if the confidence value is weak it changes towards the average of the neighborhood. The new disparity estimate and its confidence value are

    C_new = m    (3.26)
    Δ_new = v / m    (3.27)

After the spatial consistency operation, the accumulator is used to shift the input images either on the same level once more or on the next finer level, depending on how many iterations are used on each level.

3.3 Experimental results

    Filter Name      Peak Freq.   Bandwidth   Filter Size
    Non-ringing 7    π/2          2.2          7
    Non-ringing 11   3π/10        2.2         11
    Non-ringing 15   3π/14        2.2         15
    Gabor 1          π/2          0.8         15
    Gabor 2          π/(2√2)      0.8         17
    Gabor 3          π/4          0.8         19
    Lognorm 1        π/2          1.0         13
    Lognorm 2a       π/(2√2)      1.0         15
    Lognorm 2b       π/(2√2)      2.0         15
    Lognorm 3a       π/4          1.0         19
    Lognorm 3b       π/4          2.0         19
    Lognorm 3c       π/4          4.0         19
    WFT 5            2π/5         2.0          5
    WFT 7            2π/7         2.0          7
    WFT 11           2π/11        2.0         11
    Gaussian diff.   π/2          2.0          5

Table 3.1: The filters used for testing the phase-based stereo algorithm.

The algorithm has been tested both on synthetic test images and on real images, using a wide variety of filters (Table 3.1). All types of filters discussed in Section 2.3 are represented, and a number of design parameter combinations have been tested. The filter based on differences of binomial Gaussians was designed with the design parameter set to 1/3 (Equation (2.37)). The non-ringing filters were designed using Equation (2.32). The spatial sizes of these two filter types are given by their definitions. Strictly speaking, the Gabor and lognorm filters have infinite support and must be truncated to get a finite size. The size could be set large enough to make the truncation negligible, but this often gives very large filters. The criteria for setting the size of the Gabor and lognorm filters have been that the DC level must not be more than one percent of the value at the center frequency, and that the envelope must have decreased to less than ten percent of the peak value.

Figure 3.4: The synthetic stereo pairs are generated using a shift image describing the local shifts. The image to be shifted is LP filtered in order to avoid aliasing. The shift image is also used as ground truth for the stereo estimates.

3.3.1 Generating stereo image pairs

The quantitative results have been obtained by using synthetically generated images and comparing the estimated disparities with ground truth. The often used method of taking an image and simply shifting it a few pixels in order to create a known disparity does not show the advantage of the local phase estimation. For such image pairs the difference in global Fourier phase would do just as well.
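The generation scheme of Figure 3.4 can be sketched as follows: a minimal numpy version that samples each half-image from the original at plus/minus half the local shift, so the shift image doubles as ground truth. It assumes the input image has already been lowpass filtered, uses per-row linear interpolation, and the function name is illustrative.

```python
import numpy as np

def make_stereo_pair(image, shift_image, max_disp):
    """Generate a synthetic stereo pair with locally varying disparity.
    shift_image holds values in [-1, 1]; max_disp scales them to the
    actual disparity, which is returned as ground truth."""
    disp = shift_image * max_disp
    rows, cols = image.shape
    xs = np.arange(cols, dtype=float)
    left = np.empty_like(image, dtype=float)
    right = np.empty_like(image, dtype=float)
    for r in range(rows):
        # Sample the scene half a disparity to each side.
        left[r] = np.interp(xs - 0.5 * disp[r], xs, image[r])
        right[r] = np.interp(xs + 0.5 * disp[r], xs, image[r])
    return left, right, disp
```

Locally varying shifts stretch the image in some areas and compress it in others, which is why the original must be lowpass filtered first: compression raises the local spatial frequency and can otherwise push it past the Nyquist limit.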
A method for evaluating a disparity estimator must be based on locally varying shifts in order to resemble real life situations. A scheme for generating locally varying disparity in a controlled fashion is shown in Figure 3.4. The test image is shifted locally according to the values in a synthetically generated disparity image, which is also used as ground truth when evaluating the results. A global shift creates an image with the same properties as the original image at a certain distance. Local shifts deform the original image, stretching it in some areas and compressing it in others. The stretching does not create any problems, but the compression might, since the spatial frequency increases when the image is compressed and may exceed the Nyquist frequency. The image to be shifted is therefore LP filtered using a Gaussian (σ = 1.273).

Figure 3.5: A noise image with an energy function inversely proportional to the radius in the frequency domain.

In real life images, the structures in the image belong to real objects and so do the disparities. In synthetic test images, generated as shown in Figure 3.4, the disparities are not necessarily related to the image structure. A random image, e.g. white noise, random dots etc., is therefore best suited for testing purposes, since all parts of the image then exhibit a similar structure.

Figure 3.6: Left: "Twin Peaks", one positive peak and one negative peak, zero around the edges. The magnitude changes linearly between the peaks and the border, both horizontally and vertically. Right: "Flip Flop", alternating positive and negative peaks, zero around the edges. The magnitude changes linearly from the edges towards the peaks, but has discontinuities between stripes with positive and negative values.

The tests below use a noise image with
a spatial frequency spectrum that is inversely proportional to the radius in the frequency domain (Figure 3.5):

    ‖F(u)‖ ∝ 1 / ‖u‖    (3.28)

This is justified by the fact that such an image resembles a natural image more than, for instance, a white noise image does [44]. The two different shift images that have been used in the tests are shown in Figure 3.6. "Twin Peaks" consists of one positive and one negative peak. The disparity is zero along the image edges and the magnitude changes linearly between the peaks and the border, both vertically and horizontally. There are no discontinuities in the image. The other shift image, "Flip Flop", consists of alternating positive and negative peaks with exponentially increasing frequency upwards. It is also zero along the image edges. Horizontally, the magnitude changes linearly from the edges towards the peaks, but it has discontinuities vertically between stripes with positive and negative values. The shift values in the shift images are normalized to the interval [−1, 1], and the maximum disparity is then controlled by a parameter to the shift module (Figure 3.4).

3.3.2 Statistics

The result of the stereo algorithm is evaluated by measuring the mean and the standard deviation of the error between the shift image used to create the stereo pair and the disparity image. Since the algorithm also provides a confidence value, the mean and standard deviation weighted with the confidence values are also calculated. Let Δ_i denote the true disparity and let Δ̃_i denote the estimated disparity. The statistics are then calculated as:

    m = (1/n) Σ_{i=1}^{n} (Δ_i − Δ̃_i)    (3.29)
    s² = (1/(n−1)) Σ_{i=1}^{n} (Δ_i − Δ̃_i − m)²    (3.30)
    m_w = Σ_{i=1}^{n} C_i (Δ_i − Δ̃_i) / Σ_{i=1}^{n} C_i    (3.31)
    s_w² = Σ_{i=1}^{n} C_i (Δ_i − Δ̃_i − m_w)² / Σ_{i=1}^{n} C_i    (3.32)

The unweighted values furnish a measure of how well the algorithm performs over the whole image. The weighted values, on the other hand, indicate how well the confidence value reflects the reliability of the measurements.
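Equations (3.29)-(3.32) translate directly into a short numpy routine; a sketch with an illustrative function name:

```python
import numpy as np

def disparity_statistics(true_d, est_d, C):
    """Unweighted and confidence-weighted error mean and standard
    deviation (Eqs. 3.29-3.32). true_d, est_d: disparities;
    C: confidence values, assumed to have a positive sum."""
    err = true_d - est_d
    n = err.size
    m = err.mean()                                    # Eq. 3.29
    s = np.sqrt(((err - m) ** 2).sum() / (n - 1))     # Eq. 3.30
    mw = (C * err).sum() / C.sum()                    # Eq. 3.31
    sw = np.sqrt((C * (err - mw) ** 2).sum() / C.sum())   # Eq. 3.32
    return m, s, mw, sw
```

If the confidence values are low exactly where the estimates are wrong, the weighted mean and standard deviation come out better than the unweighted ones, which is how the statistics separate estimator accuracy from confidence quality.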
If the confidence value always is low when the disparity estimate is wrong, the weighted statistics show better values than the unweighted ones. If the disparity is captured at the coarsest level, the finer levels will refine the estimate to very high precision, while if the algorithm fails at a coarse resolution there is no way to recover. In areas with too large disparity the estimates are arbitrary; the algorithm is then likely to mismatch structures that accidentally coincide due to insufficient shifting of the images. As a consequence, it is hard to compare the statistics from different shift images, since the statistics depend on the ratio between the image area with measurable disparity and the area with too large disparity. The statistics diagrams should therefore be compared quantitatively only if they belong to the same shift image. The qualitative behavior is, however, comparable between different shift images. The zero estimator is used for comparison. It always estimates zero disparity with full confidence, i.e. the statistics for the zero estimator measure the mean and standard deviation of the shift image.

3.3.3 Increasing number of resolution levels

A test has been carried out where the maximum disparity is 10 pixels and one iteration is used on each level, while the number of resolution levels varies from one through six. An example of typical estimates and confidence values is found in Figure 3.7. Note how the confidence values decrease when the disparity estimation fails. Figures 3.8, 3.9 and 3.10 show plots of the error standard deviation versus the number of resolution levels. Plots from tests using "Twin Peaks" and "Flip Flop" as disparity images are presented in the left and right columns respectively. Starting with the results from "Twin Peaks", one can see that most of the filters estimate the disparity accurately for the whole image if the number of levels is more than three.
Some of the wide filters manage with only two levels, while a few other filters need four levels to reach the minimum error level. The plots for Gabor 2, Gabor 3 and WFT 11 take off towards high values when the number of levels increases. This is due to large errors in a few regions of the image, and not to any general degradation of the estimates. The errors in these regions are probably caused by singular points on the top level.

Figure 3.7: Disparity estimates, left, and their confidence values, right, for one resolution level, above, and three resolution levels, below. The maximum disparity is 10 pixels, the number of iterations is one per level, and the filter is lognorm2b. Note how the confidence decreases in neighborhoods where the disparity estimation fails.

An interesting observation is that the short filters, non-ring 7 and binomial Gaussian differences, have less error than the wider filters when the number of levels is high enough. Unfortunately, this is not true for all types of images, as shown below. In the right columns of Figures 3.8, 3.9 and 3.10, none of the filters has successfully estimated the disparity in the whole image. This is due to the coarse-to-fine approach. In "Twin Peaks" there are two well separated areas, which do not interfere significantly with each other as the image is LP filtered and subsampled. In "Flip Flop" the areas with different disparity are smaller and closer to each other. An LP filtering over sufficiently large regions will always collect information from areas with opposite disparities averaging to zero. Consequently, the interference within the filter cancels out the advantage of reaching over a greater distance. It is therefore natural that the wide filters generally perform better than the short ones. The generally poor performance of the windowed Fourier transform filters is separately discussed in Section 3.4.
Figure 3.8: Log of error standard deviation versus number of resolution levels. "Twin Peaks" (left) and "Flip Flop" (right) are used as shift images with 10 pixels maximum disparity. The zero estimator performance is indicated with dash-dotted lines. The solid curves correspond to the unweighted values, while the dashed curves correspond to the weighted values.

Figure 3.9: Log of error standard deviation versus number of resolution levels. "Twin Peaks" (left) and "Flip Flop" (right) are used as shift images with 10 pixels maximum disparity. The zero estimator performance is indicated with dash-dotted lines. The solid curves correspond to the unweighted values, while the dashed curves correspond to the weighted values.

Figure 3.10: Log of error standard deviation versus number of resolution levels. "Twin Peaks" (left) and "Flip Flop" (right) are used as shift images with 10 pixels maximum disparity. The zero estimator performance is indicated with dash-dotted lines. The solid curves correspond to the unweighted values, while the dashed curves correspond to the weighted values.

Figure 3.11: Disparity estimates, left, and their confidence values, right, for five pixels maximum disparity, above, and 20 pixels maximum disparity, below. The number of resolution levels and iterations were fixed to three and one respectively, and the filter used was lognorm2b. Note how the confidence decreases in neighborhoods where the disparity estimation fails.

3.3.4 Increasing maximum disparity

A test where the number of resolution levels and iterations on each level is fixed while the maximum disparity increases has been carried out using both "Flip Flop" and "Twin Peaks". The maximum disparity increases from 1 to 85 pixels, and the number of levels and iterations is five and one respectively. An example of typical estimates and confidence values is found in Figure 3.11. Note how the confidence values decrease when the disparity estimation fails.
Figures 3.12, 3.13 and 3.14 show plots of the log of error standard deviation versus maximum disparity. Again, the results corresponding to "Twin Peaks" are presented in the left columns, while the right columns show results corresponding to "Flip Flop". The "Flip Flop" results are of marginal value since they approach the zero estimator rapidly, but they are included for completeness. Most of the curves have a "knee" where the error standard deviation rapidly increases by a factor of ten. The disparity where this occurs is the maximum reachable disparity for each filter. When the disparity is larger than this, the highest level fails and no recovery is possible. Naturally, the wide filters with low peak frequency reach further and therefore show the best results in this test.

Figure 3.12: Log of error standard deviation versus maximum disparity using "Twin Peaks" (left) and "Flip Flop" (right) as shift images. The zero estimator performance is indicated with dash-dotted lines. The solid curves correspond to the unweighted values, while the dashed curves correspond to the weighted values.

Figure 3.13: Log of error standard deviation versus maximum disparity using "Twin Peaks" (left) and "Flip Flop" (right) as shift images. The zero estimator performance is indicated with dash-dotted lines. The solid curves correspond to the unweighted values, while the dashed curves correspond to the weighted values.

Figure 3.14: Log of error standard deviation versus maximum disparity using "Twin Peaks" (left) and "Flip Flop" (right) as shift images. The zero estimator performance is indicated with dash-dotted lines. The solid curves correspond to the unweighted values, while the dashed curves correspond to the weighted values.

Figure 3.15: A natural image used for testing the benefits of combining results from line and grey level images. (Courtesy of CVAP, Royal Institute of Technology, Stockholm.)

3.3.5 Combining line and grey level results

In order to see the benefits of using both grey level images and line images, the experiments in Subsection 3.3.3 were repeated using grey level images only and line images only. However, the structure of the noise image is not ideal for showing how the line images contribute to the overall performance, since there are no extended lines or edges. Instead, a natural image, shown in Figure 3.15, is used as the original image in the scheme in Figure 3.4 on page 48. Figure 3.16 shows the results from the lognorm3c filter when the natural image is used. Each row shows the output certainty, the disparity estimate and the error image. The first row corresponds to using both grey level and line images, the second to grey level images only, and the third to line images only. Note how the combined result benefits from both the other results: where the line images fail the grey level images succeed, and vice versa. Figures 3.17, 3.18 and 3.19 show the curves corresponding to the test in Subsection 3.3.3 with the image in Figure 3.15 instead of the noise image in Figure 3.5 on page 49.
A general observation is that the minimum error levels are slightly higher using the natural image than when using the noise image. This is due to the spatial consistency operation, which spreads estimates from edges with high certainty into areas with weak image structure.

Figure 3.16: Top: the output certainty, the disparity estimate and the error image using both grey level images and line images. Middle: the same using grey level images only. Bottom: the same using line images only. Note how the combined result benefits from both the other results: where the line images fail the grey level images succeed, and vice versa.

Figures 3.20, 3.21 and 3.22 show the error plots corresponding to using grey level images only. The results from using line images only are shown in Figures 3.23, 3.24 and 3.25. The differences might at a first glance look minimal, but it should be kept in mind that the curves depict statistics for the full image and that the error can be large locally, cf. Figure 3.16. To point out a few interesting results, compare the non-ring 7 curves in the left columns of Figures 3.17, 3.20 and 3.23. When using three levels of resolution, the error from the combined result is lower than for either of the other two. On the other hand, when the number of resolution levels is high enough, the difference is very small. This might seem disappointing, but is due to the fact that "Twin Peaks" is a nice field without discontinuities. The results corresponding to the "Flip Flop" image show that for most of the filters, combining grey level image and line image estimates is preferable. In particular, the filters with a low center frequency, e.g. non-ring 15 and lognorm3c, benefit from combining the estimates.

Figure 3.17: Log of error standard deviation versus number of resolution levels using both grey level and line images. "Twin Peaks" (left) and "Flip Flop" (right) are used as shift images with 10 pixels maximum disparity. The zero estimator performance is indicated with dash-dotted lines. The solid curves correspond to the unweighted values, while the dashed curves correspond to the weighted values.

Figure 3.18: Log of error standard deviation versus number of resolution levels using both grey level and line images. "Twin Peaks" (left) and "Flip Flop" (right) are used as shift images with 10 pixels maximum disparity. The zero estimator performance is indicated with dash-dotted lines. The solid curves correspond to the unweighted values, while the dashed curves correspond to the weighted values.
Figure 3.19: Log of error standard deviation versus number of resolution levels using both grey level and line images. "Twin Peaks" (left) and "Flip Flop" (right) are used as shift images with 10 pixels maximum disparity. The zero estimator performance is indicated with dash-dotted lines. The solid curves correspond to the unweighted values, while the dashed curves correspond to the weighted values.

Figure 3.20: Log of error standard deviation versus number of resolution levels using only grey level images. "Twin Peaks" (left) and "Flip Flop" (right) are used as shift images with 10 pixels maximum disparity. The zero estimator performance is indicated with dash-dotted lines. The solid curves correspond to the unweighted values, while the dashed curves correspond to the weighted values.

Figure 3.21: Log of error standard deviation versus number of resolution levels using only grey level images. "Twin Peaks" (left) and "Flip Flop" (right) are used as shift images with 10 pixels maximum disparity. The zero estimator performance is indicated with dash-dotted lines. The solid curves correspond to the unweighted values, while the dashed curves correspond to the weighted values.

Figure 3.22: Log of error standard deviation versus number of resolution levels using only grey level images. "Twin Peaks" (left) and "Flip Flop" (right) are used as shift images with 10 pixels maximum disparity. The zero estimator performance is indicated with dash-dotted lines. The solid curves correspond to the unweighted values, while the dashed curves correspond to the weighted values.

Figure 3.23: Log of error standard deviation versus number of resolution levels using only line images. "Twin Peaks" (left) and "Flip Flop" (right) are used as shift images with 10 pixels maximum disparity. The zero estimator performance is indicated with dash-dotted lines. The solid curves correspond to the unweighted values, while the dashed curves correspond to the weighted values.
73 2 4 6 1.0 1 4 Figure 3.24: Log of error standard deviation versus number of resolution levels using only line images. \Twin Peaks"(left) and \Flip Flop" (right) are used as shift images with 10 pixels maximum disparity. The zero estimator performance is indicated with dash-dotted lines. The solid curves correspond to the unweighted values, while the dashed curves correspond to the weighted values. PHASE-BASED DISPARITY ESTIMATION 74 6 2 3 4 5 Windowed Fourier Transform 11 6 2 3 4 5 Binomial Gaussian differences 6 2 6 0.10 10.0 1.0 1 10.0 1.00 0.10 0.01 1 1.0 1 10.0 1.00 0.01 1 2 3 4 5 Windowed Fourier Transform 7 6 2 3 4 5 Windowed Fourier Transform 11 6 2 3 4 5 Binomial Gaussian differences 6 2 6 Error std. dev. 10.0 0.10 10.0 1.0 1 Error std. dev. Error std. dev. 2 3 4 5 Windowed Fourier Transform 7 1.00 0.01 1 Windowed Fourier Transform 5 Error std. dev. 0.10 10.0 Error std. dev. 10.0 1.00 0.01 1 Error std. dev. Windowed Fourier Transform 5 Error std. dev. Error std. dev. 10.0 3 4 5 1.0 1 3 4 5 Figure 3.25: Log of error standard deviation versus number of resolution levels using only line images. \Twin Peaks"(left) and \Flip Flop" (right) are used as shift images with 10 pixels maximum disparity. The zero estimator performance is indicated with dash-dotted lines. The solid curves correspond to the unweighted values, while the dashed curves correspond to the weighted values. 3.3 EXPERIMENTAL RESULTS 75 Figure 3.26: Above: Left and right image, (captured using the 'Getax' robot head at Department of Electronic and Electrical Engineering at University of Surrey, UK). Lower left: Disparity estimates threshold using the condence values. Lower right: Condence values. Note that the condence values are strong on image structures and weak on at surfaces. The result is obtained with ve resolution levels and two iterations on each level. The lter used is nonring7. 
3.3.6 Results on natural images

Tests on real life images give similar results, but the performance is harder to quantify since the true disparity is almost always unknown. Three examples are shown in Figures 3.26, 3.27 and 3.28. Note that the confidence values are strong on image structures and weak on flat surfaces.

[Figure 3.27: Above: Left and right image (captured using the 'Getafix' robot head at the Department of Electronic and Electrical Engineering, University of Surrey, UK). Lower left: Disparity estimates thresholded using the confidence values. Lower right: Confidence values. The result is obtained with five resolution levels and two iterations on each level. The filter used is nonring7.]

[Figure 3.28: Two images from the Sarnoff tree sequence. Above: Left and right image. Lower left: Disparity estimates thresholded using the confidence values. Lower right: Confidence values. The result is obtained with four resolution levels and two iterations on each level. The filter used is nonring7.]

3.4 Conclusion

The test results show that the overall performance of the stereo algorithm is not critically dependent on the type of disparity filter used. The consistency checks and edge image filtering, applied in order to enhance the performance of the algorithm, do indeed reduce the impact of the actual filter shape. There are, however, some conclusions to be drawn. The results of the phase-based stereo algorithm in Section 3.3 are somewhat contradictory to the results of the investigation of the singular points in Section 2.3. The stereo tests indicate that filters with a wide bandwidth are preferable, while a narrow bandwidth is more advantageous from a phase scale-space point of view.
The reason is that broad-band filters have fewer phase cycles and can therefore handle greater disparities for a given filter size. The computational efficiency that follows from this implies that the best choice is the filters without wrap-around: difference-of-Gaussians filters are very small and thus require more resolution levels, while the non-ringing filters are larger and manage with a smaller number of levels. As pointed out earlier, it is not always possible to compensate for small filter size by increasing the number of levels.

The poor performance of the WFT filters is due to the rectangular window. Such a window gives image structures in the filter periphery the same importance as structures close to the filter center. The frequency domain representations of the filters show that they are sensitive to high-frequency noise due to the long tails of ripples (Figure 2.14 on page 30). As mentioned before, Weng suggests a low-pass prefiltering of the input image [71]. The resulting filter is very close to the non-ringing filters, and so are the corresponding results. The reason for using the non-prefiltered version of the WFT here is to show that the prefiltering is not a minor adjustment to suppress a little noise; it is crucial for the method to work.

3.5 Further research

The problem with filter interference mentioned in Subsection 3.3.3 can be reduced by introducing a correlation stage in the computational structure. A correlation stage means that instead of computing phase differences pointwise only, they are computed over a neighborhood, and the difference value with the highest confidence is used as the disparity estimate. On a given level, disparities larger than the filter support can then be captured, so larger disparities can be estimated for a given filter and a given number of resolution levels. The computational cost is of course higher, especially if the correlation neighborhood is large.
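To make the correlation stage concrete, the following Python sketch estimates 1D disparity from quadrature-filter phase. It is a hypothetical illustration, not the thesis implementation: the Gabor parameters, the search range and the confidence measure (the real part of the complex correlation, which is high when the magnitudes are high and the phases agree) are all assumptions. Each pixel keeps the integer pre-shift with the highest confidence, refined by the subpixel phase difference at that shift.

```python
import numpy as np

def gabor_quadrature(signal, freq=0.5, sigma=4.0):
    # Complex Gabor filtering: the argument of the response is the local
    # phase and the magnitude the local signal energy.
    x = np.arange(-16, 17)
    kernel = np.exp(-x**2 / (2 * sigma**2)) * np.exp(1j * freq * x)
    return np.convolve(signal, kernel, mode='same')

def disparity_with_correlation(left, right, freq=0.5, search=range(-8, 9)):
    # Correlation stage: instead of a single pointwise phase difference,
    # try a set of integer pre-shifts of the right signal and keep, per
    # pixel, the candidate with the highest confidence.
    qL = gabor_quadrature(left, freq)
    n = len(left)
    disp = np.zeros(n)
    conf = np.zeros(n)
    for d in search:
        qR = gabor_quadrature(np.roll(right, -d), freq)  # pre-shifted right signal
        c = np.real(qL * np.conj(qR))                    # illustrative confidence
        sub = np.angle(qL * np.conj(qR)) / freq          # subpixel phase refinement
        better = c > conf
        disp[better] = d + sub[better]
        conf[better] = c[better]
    return disp, conf

# Two-component test signal shifted by 5 pixels.
t = np.arange(256)
left = np.sin(0.45 * t) + 0.5 * np.sin(0.6 * t)
right = np.roll(left, 5)
disp, conf = disparity_with_correlation(left, right)
# The median interior estimate is close to the true 5-pixel disparity.
```

The interior median is used because `np.roll` introduces wrap-around artifacts at the signal ends; a real implementation would handle borders explicitly.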
A compromise is to use correlation at the coarsest resolution only, since correct estimates there reduce the disparities at the lower levels [30, 31].

All real stereo pairs have more or less vertical disparity as well as horizontal, due to camera geometry etc. To be able to handle general cases, further work will include extending the algorithm to two dimensions, estimating vertical disparities as well. This can be implemented either by interleaving two one-dimensional algorithms, one horizontal and one vertical, or by using two-dimensional filters in the disparity estimator. A method for two-dimensional disparities, based on the phase in the Multiresolution Fourier Transform (MFT), has been developed by Calway, Knutsson and Wilson [17, 16]. Being able to handle multiple disparities in the same image point is also a potentially useful extension; see for instance [70] for a multiple-motion approach to motion estimation.

4 HIERARCHICAL DATA-DRIVEN FOCUS OF ATTENTION

4.1 Introduction

A fundamental problem yet to be solved, in computer vision in general and active vision in particular, is how to focus the attention of a system. The issues involved in focus of attention (FOA) incorporate not only where to look next, but also more abstract mechanisms such as how to concentrate on certain features in the continuous flow of input data. There are several reasons for narrowing the channel of input information by focusing on specific parts. The most obvious reason for FOA is to reduce the vast amount of input data to match the available computational resources. However, the ability to decompose a complex problem into simpler subproblems has also been put forward as a major motivation for using focus of attention mechanisms.

4.1.1 Human focus of attention

Humans can shift attention either by moving the fixation point or by concentrating on a part of the field of view. The two types are called overt and covert attention, respectively.
Covert attention shifts are about four times as fast as overt shifts. This speed difference can be used to check a potential fixation point, to see if it is worthwhile moving the gaze to that position.

A number of paradigms describing human focus of attention have been developed over the years [57]. In the zoom-lens metaphor, computational resources can either be spread over the whole field of view, 'wide angle lens', or concentrated on a portion of it, 'telephoto lens'. This metaphor is founded on the assumption that limited computational resources are the main reason for having to focus the attention on one thing at a time. The problem is thus to allocate the available resources properly.

The work presented below relates to the search light metaphor [41]. A basic assumption in this metaphor is the division between preattentive and attentive perception. The idea is that the preattentive part of the system makes a crude analysis of the field of view. The attentive part then analyzes more closely the areas indicated as being particularly interesting. The two systems should not be seen as taking turns in a time-multiplexed manner, but rather as a pipeline where the attentive part makes selective use of the continuous stream of results from the preattentive part. The term 'search light' reflects how the attentive system analyzes parts of the available information by illuminating them with an attentional search light. The reason for having to focus the attention in this metaphor is that some tasks are inherently sequential in nature.

What features or properties are then important for positioning the fixation point? For attentional shifts the criterion is closely connected to the task at hand. Yarbus pioneered the work on studying how humans move the fixation point in images depending on the information wanted [88].
For pre-attentive shifts, gradients in space and time, i.e. high-contrast areas or motion, are considered to be the important features. Abbott and Ahuja present a list of criteria for the choice of the next fixation point [1]. Many of the items in the list relate to computational considerations concerning the surface reconstruction algorithm presented, but a few clues from human visual behavior were also included:

Absolute distance and direction. If multiple candidates for fixation points are present, the ones closer to the center of the viewing field are more likely to be chosen. Upward movement is generally preferred to downward movement.

2D image characteristics. If polygonal objects are presented, points close to corners are likely to be chosen as fixation points. When symmetries are present, the fixation point tends to be chosen along symmetry lines.

Temporal changes. When peripheral stimuli suddenly appear, a strong temporal cue often leads to a movement of the fixation point towards the stimuli.

Since fixation point selection is a highly task-dependent action, it is probably easy to construct situations that contradict the list above. The reader is urged to consult the appropriate references for a full description of how the results were obtained.

4.1.2 Machine focus of attention

A number of research groups are currently working on incorporating focus of attention mechanisms into computer vision algorithms. This section is by no means a comprehensive overview, but rather presents a few interesting examples. The Vision as Process consortium, ESPRIT Basic Research Action 3038 and 7108, was united by the scientific hypothesis that vision should be studied as a continuous process. The project aimed at bringing together know-how from a wide variety of research fields, ranging from low level feature extraction and ocular reflexes through object recognition and task planning [69, 22].
Ballard and Brown have produced a series of experiments with ocular reflexes and visual skills [6, 11, 13, 12, 7]. The basic idea is to use simple and fast image processing algorithms in combination with a flexible, actively perceiving system.

A focus of attention system based on salient features has been developed by Milanese [58]. A number of features are extracted from the input image and represented in a set of feature maps. Features differing from their surroundings are moved to a corresponding set of conspicuity maps, which consist of the interesting regions of each feature. The conspicuity maps are then merged into a central saliency map, where the attention system generates a sequence of attention shifts based on the activity in the map.

Brunnström, Eklundh and Lindeberg have presented an active vision approach to classifying corner points in order to examine the structure of the scene. Interesting areas are detected and potential corner points scrutinized by zooming in on them [15, 14]. The possibility of actively choosing the imaging parameters, e.g. point of view and focal length, allows the classification algorithm to be much simpler than for static images or prerecorded sequences.

A variation of the search light metaphor, called the attentional beam, has been developed by Tsotsos and Culhane [23, 66, 67]. It is based on a hierarchical information representation where a search light at the top is passed downwards in the hierarchy to all processing units that contribute to the attended unit, while neighboring units are inhibited. The information in the 'beamed' part of the hierarchy is reprocessed without interference from the neighbors; the beam is then used to inhibit the processing elements, and a new beam is chosen.

4.2 Space-variant sampled image sensors: Foveas

4.2.1 What is a fovea?
Space-variant image sampling is not strictly necessary for studying focus of attention and gaze control, but the need for positioning the fixation point is more evident for such a sensor. There are, moreover, compelling biological and technical reasons for exploring the use of space-variant sampled sensors.

The human eye has its highest resolution in the center, and the resolution decays toward the periphery. For the first 15° it decays linearly with the angle to the optic axis; beyond that the reduction is even faster. The center part is called the fovea and covers a visual field of about 1.75°, which corresponds to about 45 millimeters at a distance of 1.5 meters. As a comparison, the total visual field using both eyes is about 180° [38]. There are a number of advantages in such an arrangement, for example:

Data reduction compared to having the whole field of view in full resolution.

High resolution combined with a broad field of view.

Reduced effects of image warping due to wide-angle lens distortion: the distortion increases with the angle from the optic axis, but the resolution decreases.

The peripheral vision gathers information about other possible points of interest, and about contextual influence. The processing in this area therefore does not have to be as comprehensive.

These advantages can be utilized in a robot vision system as well. There are a number of research projects developing both hardware and algorithms for space-variant sampled image arrays, all exemplifying implementations of the fovea concept [65, 64, 68]. In human vision, fovea refers to the central part of the retina, but in robot vision the term is often used to indicate that the system treats the central and peripheral parts of the field of view differently.

4.2.2 Creating a log-Cartesian fovea

The log-Cartesian fovea representation of the field of view is a central part of the system presented in this thesis. It can be seen as a special case of a subsampled resolution pyramid.
The difference is that only the center part is represented at all resolutions, or scales (Figure 4.1 on the following page).

[Figure 4.1: Upper left: Original image. The skewed appearance is due to the broad field of view, 90°. Upper right: Low-pass pyramid. Lower left: Center part of each level in the pyramid. Lower right: Interpolated center part images stacked on top of each other. The borders between the levels are marked for clarity.]

[Figure 4.2: Topmost left: The original 512 × 512 image. The following five 32 × 32 images are the log-Cartesian fovea representation. The levels correspond to the visual angles 90°, 53°, 28°, 14° and 7° respectively. The levels are numbered from 0 to 4, where 0 corresponds to the finest resolution.]

In the system presented here, the log-Cartesian fovea is generated by repeated LP-filtering, subsampling and cropping in octaves. The input image is LP-filtered with a Gaussian [36], which allows subsampling by a factor of two. The new image then has the same field of view but half the number of pixels in each direction. The subsampling procedure is repeated until the desired number of levels is reached. This generates a subsampled resolution pyramid with the same field of view but a different size on each level. All levels are then cropped to the same size by keeping the center parts and discarding the rest. The result is a number of levels with the same number of pixels but with different resolution, i.e. covering different fields of view. In a real system it would be preferable to have an optic system that generates the fovea as the input images are digitized.

Figure 4.1 shows an example starting with a 512 × 512 image and using 5 levels. LP-filtering and subsampling four times reduces the size of the final level to 32 × 32 pixels. Cutting out the center 32 × 32 pixels at each level reduces the data by a factor 512 × 512 / (5 × 32 × 32) = 51.2. Figure 4.2 on the page before shows the individual images in the fovea representation. The lower right image in Figure 4.1 shows the total field of view, visualized by interpolating and combining the images. Note that although the fovea representation is often visualized as one image with varying resolution, it actually consists of N separate images with different resolution. The levels are numbered from 0 to N−1, where 0 corresponds to the highest resolution.

4.2.3 Image operations in a fovea

The image operations used are applied on all levels of resolution. Figure 4.3 on the following page illustrates two different ways of handling the part of a filter that reaches outside the image. The image is the finest resolution part of the fovea representation in Figure 4.2 on page 87. When convolving an image with a filter, it is common to pad the image by repeating the border pixels as far as the filter reaches outside the image, as in the upper left image of Figure 4.3. Note how the texture is expanded into linear structures to the right of and below the image, and how the lower right corner pixel turns into a large square. On images much larger than the filter, this distortion is often acceptable since it affects only a small portion of the total image. In a fovea representation, however, the images are only a few times larger than the filters. Especially when using successive filtering this becomes a problem, since the border effects spread and eventually dominate the results.

When filtering a particular level in a fovea representation, that level can be padded with information from the nearest coarser level, which covers a larger neighborhood but with lower resolution (lower left in Figure 4.3). This means that a better border can be obtained by interpolating the corresponding area in the nearest coarser level.
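A minimal Python sketch of this padding scheme follows, assuming square levels of equal pixel size where each coarser level covers twice the field of view at half the resolution. Nearest-neighbour upsampling stands in for proper interpolation, and the function name and sizes are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

def pad_from_coarser(fine, coarse, pad):
    # Pad `fine` (level k) with `pad` pixels on each side, taking the
    # border values from the nearest coarser fovea level (level k+1),
    # which covers twice the field of view at half the resolution.
    S = fine.shape[0]
    assert coarse.shape[0] == S and pad % 2 == 0
    half = S // 4 + pad // 2           # half-width of the padded area, in coarse pixels
    c0 = S // 2 - half
    block = coarse[c0:c0 + 2 * half, c0:c0 + 2 * half]
    up = np.kron(block, np.ones((2, 2)))   # nearest-neighbour upsampling by two
    up[pad:pad + S, pad:pad + S] = fine    # keep the true fine-level data in the centre
    return up

# Toy levels: the border of the padded result comes from the coarse level.
fine = np.ones((32, 32))
coarse = 2 * np.ones((32, 32))
padded = pad_from_coarser(fine, coarse, pad=4)
# padded is 40 x 40: central 32 x 32 is the fine level, the 4-pixel
# frame around it is upsampled coarse-level data
```

The padded array can then be convolved as usual, and the central part cropped back out afterwards.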
The unwanted border effects that might otherwise disturb the algorithms are then reduced. Note, for instance, that the cube appears to be much larger when border extension is used. This does not happen if the information is instead picked from the nearest coarser level (lower right in Figure 4.3).

[Figure 4.3: Upper left: One level in a fovea. Upper right: Padding by copying border pixels. Note how the texture is expanded into linear structures to the right of and below the image, and how the lower right corner pixel turns into a big square. Lower left: The next coarser level gives information about a larger neighborhood around the same point, but with lower resolution. Lower right: Using interpolated information from the coarser level to pad the image. The interpolated information is more coherent with the real neighborhood than the extended border.]

4.3 Sequentially defined, data modified focus of attention

4.3.1 Control mechanism components

Having full resolution only in the center part of the visual field makes it obvious that a good algorithm for positioning the fixation point is necessary. A number of focus-of-attention control mechanisms must be active simultaneously in order both to handle unexpected events and to perform an effective search. The different components can roughly be divided into the following groups:

1. Preattentive, data driven control. Non-predicted structured image information and events attract the focus of attention in order to get the information analyzed.

2. Attentive, model driven control. The focus of attention is directed toward an interesting region according to predictions based on already acquired image information or a priori knowledge.

3. Habituation. As image structures are analyzed and modeled, their impact on preattentive gaze control is reduced.

The distinction between the preattentive and attentive parts is floating.
It is more a spectrum, from pure reflexes to purely attentional movements of the fixation point.

4.3.2 The concept of nested regions of interest

In the hierarchical system shown in Figure 1.2, all levels might have an idea about how to position the camera in order to solve their own current problems. One way of resolving this is to let the different levels take turns in controlling the camera. Another way is to recognize that the interesting area for a level is often a sub-area of the interesting area of the level above. Moreover, the task of a level is often directly related to the task of the level above. Positioning the fixation point can therefore be a refinement from coarse to fine, where higher levels give the major region of interest and lower levels adjust to interesting areas within that region.

Consider the following stylized example. Assume there are the following four major levels, from top to bottom, in the information processing hierarchy on the left hand side of the pyramid (the names are borrowed from the ESPRIT project BRA 3038, Vision as Process):

1. System Supervisor

2. Symbolic Scene Interpreter

3. 3D Geometric Interpreter

4. Low Level Feature Extractor

[Figure 4.4: Regions of interest for the different subsystems in the hierarchy.]

Further assume that we have a table with a few objects, as in the top left of Figure 4.4, and that the task is simply "watch the table". Assuming that the system supervisor knows where the table is, it determines a region of interest (top right). Note that the circles refer only to the positioning of the fixation point, not to any other limitations of the viewing field. The fixation point is in the middle of the table. The symbolic scene interpreter has found indications of an object or group of objects in the upper left corner of the table and sets its region of interest accordingly (lower left).
The 3D Geometric Interpreter starts its task by a further refinement of the region of interest, marking the bright object (lower right). Finally, on the lowest level, the Low Level Feature Extractor selects an area in which to start modeling the structures in the image. The fixation point is now moved towards the interesting object by the nested regions of interest. The general response command "watch the table" has been transformed into "focus on the set of objects in the upper left corner of the table".

The borders of the regions of interest are not absolute, rigid boundaries. They are rather recommendations, which can be neglected if there are good reasons. How good a reason must be is controlled by an importance value. This can be illustrated by viewing the region of interest as a basin, or potential well, within which the lower levels can move around freely. The slope and height of the wall are proportional to how important it is to keep the fixation point there.

There are, however, situations when the lower levels are supposed to violate the directives from superior levels. One such occasion is event detection. Here, an event is an unpredicted gradient in space and/or time. When an interesting or important stimulus is detected, the fixation point should be moved in that direction on a reflex basis. Suppose the system is watching the table and an object enters the scene and is detected in the 'corner of the eye'. The resolution in the periphery is probably not high enough to see what it is, only where it is. The low level feature extractor is the first level to detect the event and reacts by pulling the fixation point towards it. By the time the higher levels react to the event, low level features with high resolution have already been extracted. The higher levels will now either move their regions of interest to analyze the event further or force the fixation point back to the original position.
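The basin, or potential-well, picture above can be sketched numerically. The quadratic well shape, the gains and the two-level example below are assumptions chosen for illustration only, not the system's actual fields: each region of interest contributes zero energy inside its radius and walls whose steepness grows with the level's importance, and the fixation point descends the combined landscape.

```python
import numpy as np

def roi_well(pos, center, radius, importance):
    # Quadratic potential well: zero inside the region of interest,
    # walls whose steepness is proportional to the importance value.
    d = np.linalg.norm(pos - center)
    return importance * max(0.0, d - radius) ** 2

def total_potential(pos, wells):
    return sum(roi_well(pos, c, r, w) for c, r, w in wells)

def descend(pos, wells, step=0.05, iters=500, eps=1e-4):
    # Move the fixation point downhill in the combined energy landscape
    # by numerical gradient descent over the (pan, tilt) parameters.
    pos = pos.astype(float)
    for _ in range(iters):
        g = np.zeros(2)
        for i in range(2):
            e = np.zeros(2); e[i] = eps
            g[i] = (total_potential(pos + e, wells)
                    - total_potential(pos - e, wells)) / (2 * eps)
        pos -= step * g
    return pos

# Nested ROIs: a broad, weak well (e.g. a supervisor level) and a
# narrower, more important well inside it (a lower level).
wells = [(np.array([0.0, 0.0]), 3.0, 1.0),
         (np.array([1.0, 1.0]), 0.5, 4.0)]
fix = descend(np.array([4.0, -3.0]), wells)
# fix ends up on or inside the inner region of interest,
# which itself lies within the outer region
```

A strong event detector would simply add its own, temporarily dominant, well to the list, pulling the fixation point away on a reflex basis.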
Thus, the different regions of interest do not have to be determined in the order indicated in Figure 4.4. It all depends on whether the movement of the fixation point is attentive or preattentive.

[Figure 4.5: The robot configuration. The robot is a Puma 560 arm with six degrees of freedom, equipped with a head carrying two movable cameras with one degree of freedom each: the camera pan angle.]

4.4 Gaze control

4.4.1 System description

In the experiments below the robot consists of an arm with a camera head (Figure 4.5). The robot is a Puma 560 arm with six degrees of freedom. In the experiments presented here only two degrees of freedom are used; they implement the head pan and head tilt angles. The head has two movable cameras with one degree of freedom each: the camera pan angle. The purpose of such a system might be automatic identification, inspection or even surveillance of the objects in the scene in front of it. This type of robotic vision system is widespread, see for instance [1, 11, 21]. The system, both image generation and analysis, is implemented in the Application Visualization System, AVS, which is described in Chapter 6.

The cameras are equipped with log-Cartesian sensors. The total field of view is 90° and the fovea consists of 5 levels. The individual fields of view in the experiments are 7°, 14°, 28°, 53° and 90° respectively. The outward response of this particular system is designed to enable information gathering. Interesting events in the field of view attract the fixation point in order to get them within the high resolution part of the fovea. The robot does not move objects; in these experiments it is only permitted to change the point of view using the head pan and tilt, and the camera vergence.

The robot has a gaze control system with three levels (Figure 4.6). The top level is an object detector, or rather a symmetry detector [9, 37, 82], drawing the attention towards regions of high complexity.
The second level is a line tracker, drawing the fixation point towards, and along, linear structures based on local orientation and phase [44, 75, 76]. The lowest level verges the cameras to make them share the same fixation point, using the disparity estimates from the stereo algorithm described in Chapter 3.

4.4.2 Control hierarchy

The left hand side of Figure 4.6 shows the feature hierarchy, with increasing abstraction going upwards. More abstract features are stepwise composed from simpler ones. The features are used both as a basis for more complex features and as modifiers for response outputs. The right hand side of the same figure shows the response hierarchy, with increasing specificity going downwards [33, 72].

The refinement of the positioning of the fixation point is handled with potential fields in the robot's parameter space [52]. It can be visualized as an 'energy landscape', as in Figure 4.10 on page 103, where the fixation point trajectory is the path a little ball rolling around freely would take. The fixation point can be moved to a certain position by forming a potential well around that position in the parameter space, causing the robot to look in that direction.

[Figure 4.6: The preattentive focus of attention system. Feature blocks (DIST/MASS, ROTCONS, 2D PHASE, ORIENT, FOVEA, camera inputs) feed the "object finder", the "edge tracker" and the stereo vergence process, which in turn drive the head and camera controls.]

4.4.3 Disparity estimation and camera vergence

The lowest level in the control system is the camera vergence process. The cameras are verged symmetrically to get zero disparity in the center of the image, regardless of the state of the system. If the head is moving, the vergence is calculated using the disparity from the part of the field of view that is predicted to be centered in the next time step. The disparity is measured using the multi-scale method, based on the phase of quadrature filters, described in Chapter 3.
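The symmetric vergence process can be sketched as a simple closed loop. Everything below is an illustrative assumption rather than the thesis controller: the gain value, the toy 'plant' relating the vergence error to the centre disparity, and all names. Both cameras turn by equal and opposite amounts to drive the centre disparity toward zero.

```python
def vergence_step(theta_l, theta_r, center_disparity, gain=0.4):
    # Symmetric vergence: both camera pan angles change by the same
    # amount, in opposite directions, nulling the centre disparity.
    delta = gain * center_disparity
    return theta_l - delta / 2, theta_r + delta / 2

def plant(theta_l, theta_r, target_vergence=0.2, k=1.0):
    # Toy model: the measured centre disparity is proportional to the
    # remaining vergence error for the currently fixated depth.
    return k * (target_vergence - (theta_r - theta_l))

theta_l = theta_r = 0.0
for _ in range(30):
    d = plant(theta_l, theta_r)
    theta_l, theta_r = vergence_step(theta_l, theta_r, d)
# The vergence angle theta_r - theta_l converges geometrically toward
# the target (each step multiplies the error by 1 - gain * k)
```

With gain · k between 0 and 2 the loop is stable; in the real system the "measurement" is of course the phase-based disparity estimate, not a known plant.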
The fovea version has two major differences compared to the computation structure in Figure 3.2 on page 39. First, only the center of the field of view is represented at all resolutions. The accuracy of the disparity estimates therefore decays towards the periphery of the field of view. Second, the edge extractor is used on every level of the input pyramid, instead of creating a pyramid from the edge representation of the finest resolution. This difference is motivated by a possible future fovea sensor array. In such a system it would not be possible to have the computational structure in Figure 3.2, since a high resolution image of the total field of view does not exist.

Figure 4.7: Vector representation of local orientation.

4.4.4 The edge tracker

Apart from estimating disparity, the phase from quadrature filters is also used to generate a potential field drawing the attention towards and along lines and edges in the image [75]. This is the second level in the control structure in Figure 4.6.

Local orientation

An algorithm for phase-invariant orientation estimation in two dimensions is presented in [44]. Phase-invariant means that the orientation of a locally one-dimensional signal can be estimated regardless of whether it is an edge or a line, i.e. regardless of the phase. The orientation is represented as a complex number, where the argument represents the local orientation estimate and the magnitude indicates the certainty of the estimate [32]. Figure 4.7 shows the correspondence between the complex number and the local orientation. Note that the argument varies twice as fast as the orientation of the local structure:

z = M e^{i\alpha} = M e^{i 2\varphi}    (4.1)

where \varphi is the angle between the gradient and the horizontal axis. Rotating a line \pi radians makes it look the same as the initial line, which means that the representation has to be the same.
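The double-angle mapping of Equation (4.1) can be sketched in a few lines; this is an illustrative example (function names are not from the thesis) showing the two properties just stated: a line rotated π radians gets the same representation, and perpendicular orientations cancel when averaged.

```python
import cmath
import math

def orientation_to_z(phi, magnitude=1.0):
    """Double-angle representation of Equation (4.1): the gradient
    angle phi maps to z = M * exp(i * 2 * phi)."""
    return magnitude * cmath.exp(2j * phi)

def average_orientation(zs):
    """Average the complex representations; half the argument of the
    mean is the mean orientation, its magnitude the certainty."""
    z_mean = sum(zs) / len(zs)
    return cmath.phase(z_mean) / 2.0, abs(z_mean)

# A line rotated pi radians maps to the same representation:
z1 = orientation_to_z(0.3)
z2 = orientation_to_z(0.3 + math.pi)

# Maximally incompatible orientations (pi/2 apart) map to complex
# numbers with opposite signs and cancel on averaging:
_, certainty = average_orientation([orientation_to_z(0.0),
                                    orientation_to_z(math.pi / 2)])
```

The vanishing certainty for incompatible orientations is exactly what makes averaging meaningful in a resolution pyramid, as discussed next.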
The key feature of this representation is that maximally incompatible orientations are mapped on complex numbers with opposite signs. This, in turn, makes averaging a meaningful operation. When working with resolution pyramids, information is low-pass filtered, subsampled, interpolated etc. The information representation has to be continuous in order for these operations to give meaningful results. If the average of the representations of the structure in two points represents a structure that is completely different, then the representation is not useful in a resolution pyramid.

Representation of phase in higher dimensions

In Chapter 2 only one-dimensional signals are discussed. The extension of the phase concept into two or more dimensions is not trivial [36]. In Section 2.1 it is shown that the local phase is connected to the analytic function and hence to the Hilbert transform. A direction of reference has to be introduced in order to make a multi-dimensional definition of the Hilbert transform possible. Thus, local phase needs a direction of reference as well. If a continuous representation is desired, the phase cannot be represented with only a single value, although it is a scalar. The phase representation has to include both the phase value and the reference direction. This means in the general case that if the dimensionality of the signal space is N, the dimensionality of the phase representation is N+1 [34].

Figure 4.8: A dark disc on a bright background, divided into regions A-F. The \hat{e} vectors, marked with arrows, are used as the phase reference direction. Note the opposite signs on the phase in regions A and F, and in regions C and D. The table in the figure contains the phase reference direction as an angle to the horizontal axis, \varphi, and the phase value, \theta; regions A, B and C have \theta = \pi/2 while regions D, E and F have \theta = -\pi/2.

Figure 4.8 shows an example in 2D where the neighboring regions A and F, and regions C and D, have phase estimates, \theta, with opposite signs.
This makes a meaningful averaging impossible. For instance, if f_C and f_D denote the phase filter outputs, the average between regions C and D is:

\theta_{aver} = \arg\left( \frac{1}{2}\left( \|f_C\| e^{i\theta_C} + \|f_D\| e^{i\theta_D} \right) \right)    (4.2)
            = \arg\left( \frac{1}{2}\left( \|f_C\| e^{i\pi/2} + \|f_D\| e^{-i\pi/2} \right) \right)    (4.3)
            = \arg\left( \frac{i}{2}\left( \|f_C\| - \|f_D\| \right) \right)    (4.4)

Thus, the average phase can be \pi/2, -\pi/2, or even undefined, depending on the relationship between the filter magnitudes in the regions. The reason for the shifting sign on the phase value is the definition of the reference direction, marked with arrows in Figure 4.8. The reference direction is extracted from the orientation estimate by halving the argument:

\hat{e} = \begin{pmatrix} e_1 \\ e_2 \end{pmatrix} = \begin{pmatrix} \cos(\arg(z)/2) \\ \sin(\arg(z)/2) \end{pmatrix}    (4.5)

Since the phase is measured along \hat{e}, it will change sign if \hat{e} changes to the opposite direction. Two neighboring points may have \hat{e} in opposite directions and thus phase values with opposite signs, although they belong to the same image structure. Averaging of such a neighborhood would therefore be meaningless. It can be argued that choosing the phase reference directions such that they all point out from the object solves the problem, but it is impossible to locally determine what is the inside or the outside of an object. With only local information available, region A could for instance be region F on a white disc on a dark background. A 2D-phase representation, suggested by Knutsson, that includes the reference direction in a two-dimensional space is:

x = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} M \cos(\varphi) \sin(\theta) \\ M \sin(\varphi) \sin(\theta) \\ M \cos(\theta) \end{pmatrix}    (4.6)

where M \in [0, 1] is the signal energy, \varphi \in [0, \pi] is the reference direction, and \theta \in [-\pi, \pi] is the phase value. Resolving the phase angle, \theta, gives:

\theta = \arctan\left( \sqrt{x_1^2 + x_2^2} \; ; \; x_3 \right)    (4.7)

Figure 4.9 shows the representation, which can be interpreted as a 3D vector of length M rotated an angle \theta in a plane defined by \hat{e} and \hat{x}_3.
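A minimal numerical sketch of this representation (Equation (4.6)) and of the recovery of θ (Equation (4.7)), using regions C and D of Figure 4.8 as test data; the function names are illustrative, the two-argument arctangent is written with `atan2`.

```python
import math

def phase_vec(M, phi, theta):
    """Knutsson's 3D representation of 2D phase, Equation (4.6)."""
    return (M * math.cos(phi) * math.sin(theta),
            M * math.sin(phi) * math.sin(theta),
            M * math.cos(theta))

def phase_angle(x):
    """Recover theta with the two-argument arctangent, Equation (4.7)."""
    return math.atan2(math.hypot(x[0], x[1]), x[2])

# Regions C and D of Figure 4.8: opposite reference directions
# (phi = pi and phi = 0) combined with opposite scalar phase signs
# (theta = +-pi/2) give the SAME vector direction:
x_C = phase_vec(0.8, math.pi, math.pi / 2)
x_D = phase_vec(0.5, 0.0, -math.pi / 2)

# Componentwise averaging is therefore well behaved:
x_avg = tuple(0.5 * (c + d) for c, d in zip(x_C, x_D))
theta_avg = phase_angle(x_avg)   # pi/2, independent of the magnitudes
```

The average direction depends only on the common structure, not on the relative magnitudes M_C and M_D, in contrast to the scalar averaging of Equation (4.2).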
The shaded circle corresponds to the phase representation in 1D shown in Figure 2.3 on page 9. An intuitive feeling for how this representation solves the problem in the example above can be obtained by some mental imaging. Turn \hat{e} around x_3 until it points in the opposite direction. The phase value, \theta, is then defined in the opposite direction. In other words, when the reference direction changes sign, the phase angle definition also changes sign.

Figure 4.9: A 3D representation of phase in 2D.

The phase estimate in region C in Figure 4.8 on page 99 is now:

x_C = \begin{pmatrix} M_C \cos(\pi) \sin(\pi/2) \\ M_C \sin(\pi) \sin(\pi/2) \\ M_C \cos(\pi/2) \end{pmatrix} = \begin{pmatrix} -M_C \\ 0.0 \\ 0.0 \end{pmatrix}    (4.8)

and in region D it is:

x_D = \begin{pmatrix} M_D \cos(0) \sin(-\pi/2) \\ M_D \sin(0) \sin(-\pi/2) \\ M_D \cos(-\pi/2) \end{pmatrix} = \begin{pmatrix} -M_D \\ 0.0 \\ 0.0 \end{pmatrix}    (4.9)

The average phase in a neighborhood is simply the average of the components, x_i, respectively:

x_{aver} = \begin{pmatrix} \frac{1}{2}(x_{1C} + x_{1D}) \\ \frac{1}{2}(x_{2C} + x_{2D}) \\ \frac{1}{2}(x_{3C} + x_{3D}) \end{pmatrix} = \begin{pmatrix} \frac{1}{2}((-M_C) + (-M_D)) \\ 0.0 \\ 0.0 \end{pmatrix}    (4.10)

Note that the direction of the average phase vector is now independent of the signal energy in the two filters, as opposed to the case in Equation (4.2). The average phase angle, \theta_{aver}, can be calculated using Equation (4.7):

\theta_{aver} = \arctan\left( \frac{1}{2}(M_C + M_D) \; ; \; 0 \right) = \pi/2    (4.11)

which is independent of the relationship between M_C and M_D.

Estimating line/edge position and orientation

The local orientation estimates generated from the input images are directly useful for following locally one-dimensional structures in the image since they point out in which direction to move. The phase estimates, however, do not directly point out the direction to move, since it depends on whether the estimates are generated from an edge or a line.
If it is a bright line one should move towards \theta = 0, if it is an edge towards \theta = \pi/2, and so on. To get around this problem, the magnitude of the orientation algorithm is used as input to the 2D-phase algorithm:

m(\xi) = \|z(\xi)\|    (4.12)

The orientation magnitude image, m(\xi), forms a line sketch of the image where lines and edges look the same regardless of whether they are bright lines on a dark background, dark lines on a bright background, bright-to-dark edges or dark-to-bright edges. The 2D-phase estimate will therefore give the distance to the one-dimensional structure. Moving the fixation point towards \theta = 0 will now be correct for both lines and edges in the original image. The 2D-phase is applied on a region of interest covering the center pixels on each level in the fovea. Denote the average phase estimate on level j:

\bar{x}_j = \frac{1}{N} \sum_{i=1}^{N} x(i)    (4.13)

Typically, the four center pixels are used to get the 2D-phase value on each level. The 2D-phase vector magnitude can be visualized as the energy landscape, or potential field, in Figures 4.10 and 4.11. Note how energy valleys follow the locally oriented structures on each level.

Figure 4.10: Potential fields generated by lines and edges. Top: Level 0, 7° view field. Bottom: Level 1, 14° view field. The fixation point is on the edge of the cube on the table.

Figure 4.11: Potential fields generated by lines and edges. Top: Level 2, 28° view field. Middle: Level 3, 53° view field. Bottom: Level 4, 90° view field. The fixation point is on the edge of the cube on the table.

Figure 4.12: Vector representation of rotation symmetries.

4.4.5 The object finder

The third level in the control hierarchy in Figure 4.6 concerns objects, or rather possible objects. Reisfeld et al argue that symmetries are important features for preattentive gaze control [62]. The object finder is based on rotation symmetries.
These symmetries are defined as the rotations of the orientation estimates within a neighborhood [46, 9, 82]. Figure 4.12 shows the vector representation of these symmetries. Note how complex values with opposite signs again represent maximally incompatible patterns. Overlaying the concentric circles on the star gives orthogonal line crossings everywhere. This is also true for the two spiral patterns. It might be hard to see that the pattern transformation is continuous when changing \alpha. By studying the orientation estimates generated from the patterns, the continuity becomes apparent. A consistency algorithm is applied to enhance the neighborhoods that fit the symmetries well [47, 48].

Rotation symmetry estimation

The orientation estimates for the concentric circles pattern can be written as a function of the distance to the center and the angle to the horizontal axis:

f(\xi) = f(\|\xi\|) e^{i 2 \arg(\xi)}    (4.14)

where \xi is the position vector from the center of the symmetry. The corresponding function for the star pattern is:

f(\xi) = f(\|\xi\|) e^{i (2 \arg(\xi) + \pi)}    (4.15)

i.e. a phase shift of \pi. The spiral patterns correspond to phase shifts of \pm\pi/2. The general function is:

f(\xi) = f(\|\xi\|) e^{i \theta_f(\xi)} = f(\|\xi\|) e^{i (2 \arg(\xi) + \alpha)}    (4.16)

where \alpha is determined by the pattern according to Figure 4.12. A filter for detecting these symmetries should display the conjugated symmetry itself:

b(\xi) = b(\|\xi\|) e^{i \theta_b(\xi)} = b(\|\xi\|) e^{-i 2 \arg(\xi)}    (4.17)

where \xi is the position vector from the center of the filter. The magnitude function can be any window function.
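As an illustration of Equations (4.16) and (4.17), and of the consistency algorithm mentioned above, the following sketch evaluates the centered filter response on an ideal spiral pattern sampled on a ring with a flat window. All function names are illustrative; the consistency combination follows the form of Equation (4.29).

```python
import cmath
import math

def ring(radius, n=16):
    """Sample points on a circle around the pattern/filter center."""
    return [(radius * math.cos(2 * math.pi * k / n),
             radius * math.sin(2 * math.pi * k / n)) for k in range(n)]

def pattern(xi, alpha, mag=1.0):
    """Ideal rotation-symmetry orientation data, Equation (4.16),
    with a flat radial profile (any window will do)."""
    return mag * cmath.exp(1j * (2 * math.atan2(xi[1], xi[0]) + alpha))

def filt(xi):
    """The conjugated symmetry filter, Equation (4.17), flat window."""
    return cmath.exp(-1j * 2 * math.atan2(xi[1], xi[0]))

pts = ring(3.0)
alpha = math.pi / 2                       # a spiral pattern
f = [pattern(p, alpha) for p in pts]
b = [filt(p) for p in pts]

# Centered response: the position-dependent phases cancel and the
# argument recovers alpha, cf. Equations (4.19)-(4.24).
s1 = sum(fi * bi for fi, bi in zip(f, b))
alpha_est = cmath.phase(s1)

# The three extra filterings and the consistency combination,
# cf. Equations (4.25)-(4.29):
s2 = sum(fi * abs(bi) for fi, bi in zip(f, b))
s3 = sum(abs(fi) * bi for fi, bi in zip(f, b))
s4 = sum(abs(fi) * abs(bi) for fi, bi in zip(f, b))
s = (s1 * s4 - s2 * s3) / s4              # ~ s1: an ideal pattern passes
```

On the ideal pattern s2 and s3 vanish, so the consistency operation leaves the matched response untouched; off-center and on linear structures they do not vanish, and the combination suppresses the response.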
Here the magnitude is a squared cosine function with zero magnitude in the center:

\|b(\xi)\| = \begin{cases} \cos^2\left( \frac{\pi}{8}(r - 4) \right) & \text{if } 1 \le r \le 7, \quad r = \sqrt{\xi_1^2 + \xi_2^2} \\ 0 & \text{otherwise} \end{cases}    (4.18)

The filter response when centered on a rotation symmetry pattern is:

s(0) = \sum_{\xi} f(\xi_0 - \xi) b(\xi) \Big|_{\xi_0 = 0}    (4.19)
     = \sum_{\xi} f(\|\xi_0 - \xi\|) e^{i \theta_f(\xi_0 - \xi)} b(\|\xi\|) e^{i \theta_b(\xi)} \Big|_{\xi_0 = 0}    (4.20)
     = \sum_{\xi} f(\|\xi\|) b(\|\xi\|) e^{i (\theta_f(-\xi) + \theta_b(\xi))}    (4.21)

Use the definitions in Equations (4.16) and (4.17) in Equation (4.19):

s(0) = \sum_{\xi} f(\|\xi\|) b(\|\xi\|) e^{i (2 \arg(-\xi) + \alpha - 2 \arg(\xi))}    (4.22)
     = \sum_{\xi} f(\|\xi\|) b(\|\xi\|) e^{i \alpha}    (4.23)
     = \|s\| e^{i \alpha}    (4.24)

since 2\arg(-\xi) = 2\arg(\xi) modulo 2\pi, so the position-dependent phases cancel. Equations (4.22)-(4.24) show that the filter, b, estimates the correct rotation symmetry when it is centered on it. Unfortunately, the filter also responds off center and to linear structures. The selectivity can be enhanced by using a consistency algorithm [47, 48, 82]. This algorithm requires three additional filterings with different combinations of filter and data magnitudes. The four filter results are:

s_1 = f * b    (4.25)
s_2 = f * \|b\|    (4.26)
s_3 = \|f\| * b    (4.27)
s_4 = \|f\| * \|b\|    (4.28)

The second convolution, s_2, is obtained by using the filter magnitude as a scalar filter on the input data. Similarly, s_3 comes from using the complex filter with the data magnitude as a scalar image. Finally, the magnitude of the filter is convolved with the magnitude of the data. A consistency operation is obtained if the four outputs are combined as:

s = \frac{s_1 s_4 - s_2 s_3}{s_4}    (4.29)

Figure 4.13 on the following page shows a test pattern with rotation symmetries and the results of the symmetry detector.

Figure 4.13: Left: Rotation symmetry test pattern. Right: The results from the symmetry detector overlaid on the original image. This image pair is borrowed from [81]. The test pattern originally appeared in [9].

Rotation symmetry localization

Objects small enough to be covered in one glance can be seen as imperfect instantiations of the concentric circles pattern, i.e. \theta_s \approx 0. The estimates are therefore attenuated with the argument, \theta_s, i.e.
attenuating the estimates from star-like patterns:

s_o = \begin{cases} \|s\| \cos^2(\theta_s) & \text{if } -\pi/2 \le \theta_s \le \pi/2 \\ 0 & \text{otherwise} \end{cases}    (4.30)

The result is a 'closed area detector'. It marks areas with evidence for being closed, and the intensity is a measure of how much evidence there is. If the concentric circle estimates are attenuated instead, the operation turns into a corner detector.

A vector field pointing towards the local mass center of s_o is produced with three separable filters:

h_m(\xi) = \begin{cases} \cos^2\left( \frac{\pi \xi_1}{16} \right) \cos^2\left( \frac{\pi \xi_2}{16} \right) & \text{if } \xi_1, \xi_2 \in [-7, 7] \\ 0 & \text{otherwise} \end{cases}    (4.31)

h_1(\xi) = h_m(\xi) \, \xi_1    (4.32)
h_2(\xi) = h_m(\xi) \, \xi_2    (4.33)

The output from h_m is used both for normalization and as a rotation symmetry certainty image:

M_m = h_m * s_o    (4.34)

The vector field with vectors pointing to the local mass center is:

V_m = \begin{pmatrix} h_1 * s_o / M_m \\ h_2 * s_o / M_m \end{pmatrix}    (4.35)

Interpreting the vector fields on all levels, V_{mj}, as gradient fields of energy landscapes gives the potential fields in Figures 4.14 and 4.15. Note how a potential well is created wherever there is evidence for a closed contour.

Figure 4.14: Potential fields generated by rotation symmetries. Top: Level 0, 7° view field. Bottom: Level 1, 14° view field. The fixation point is on the edge of the cube on the table.

Figure 4.15: Potential fields generated by rotation symmetries. Top: Level 2, 28° view field. Middle: Level 3, 53° view field. Bottom: Level 4, 90° view field. The fixation point is on the edge of the cube on the table.

Figure 4.16: The quantization of the pan-tilt parameter space used as memory.

4.4.6 Model acquisition and memory

So far, only processes that attract the fixation point have been considered. In Section 4.3, habituation was mentioned as a mechanism needed for an operating focus of attention system. The basis for such a mechanism is some sort of memory.
A first step toward a rudimentary memory of where the system has looked before is shown in Figure 4.16. It is a form of motor memory consisting of an array that quantizes the parameter space spanned by the head pan and tilt angles. Since the robot moves only these two joints, the parameter space is two-dimensional and there is a one-to-one mapping to the possible view directions. The system remembers where it has seen something by marking the positions in the memory array that correspond to the fixation directions in which it has been tracking lines and edges. Bilinear interpolation is used between neighboring bins. The memory is used to indicate that an edge or a line has been tracked before and that the system should move its fixation point elsewhere. In a general system, where many points in the parameter space might correspond to looking at the same thing, an extended approach to memory is needed. It is then important to remember not only where but also what the system has seen. For non-static scenes, when something was seen becomes important as well. This requires a procedure for model acquisition, which is an ultimate goal for this process.

Figure 4.17: State transition network for the test system. The states are track line, search line, avoid object and locate object; the transitions are On Line (O.L.), Line Lost (L.L.), Been Here Before (B.H.B.), Close to Symmetry (C.S.), New Symmetry (N.S.) and Symmetry Lost (S.L.).

4.4.7 System states and state transitions

The potential fields in Figures 4.10, 4.11, 4.14 and 4.15 are weighted together differently depending on what state the robot is currently in. Figure 4.17 shows the states and the possible transitions. The transitions between the states are determined by the type and quality of the data in the fixation point. Before going into the details, an overview of the states and the state transitions is presented. Suppose the system is in the state of locating a possible object. It then uses the rotation symmetry estimates on the coarser levels of the fovea representation.
When the distance to a symmetry is small enough (Close to Symmetry), the system starts to search for the lines and edges of the object. The edge tracking procedure starts when a line or edge is fixated (On Line). If the line or edge is lost (Line Lost), the line search starts again. The system moves away from an object when the fixation point returns to a position where it has tracked before (Been Here Before). When a new symmetry is encountered (New Symmetry), the system starts moving towards it. If the symmetry is lost (Symmetry Lost), the system starts searching for linear structures that hopefully will lead towards new interesting areas.

The camera and head parameters are calculated by defining an image point to be fixated. The image point can be seen as attracting the gaze and is therefore called the attracting point, denoted v. The next fixation point is estimated independently for the right and left cameras. The camera orientation parameters are then calculated from an average of the two fixation points:

v = \frac{1}{2}(v_l + v_r)    (4.36)

The system state transition is also determined for the left and right views independently and then combined. If the transitions are not consistent, the following ranking order is used, from high to low: locate object, avoid object, track line, search line. This means that if one eye wants to switch to locate object while the other wants to continue avoiding, the system will switch to locate object. The reason is that one eye might catch a new object before the other one does.

State: search line

When searching for a line or edge, only the phase information is used. The fixation point should move towards and then along the valleys in Figures 4.10 and 4.11. The 2D-phase information therefore has to be transformed into vectors. This can be accomplished by a coarse-to-fine approach. Figure 4.18 shows the phase filter magnitude for a line located at \xi = 1 on three scales.
The fine scale filter has a larger amplitude than the other two close to the line, while the coarse scale filter has the largest amplitude at a distance. The phase values give the distance to the line, as in the disparity estimation in Chapter 3. Since the magnitude is a certainty estimate, the fovea level with the highest magnitude for a given fixation point should control where to move.

Figure 4.18: The phase filter magnitude on three scales for a line located at \xi = 1.

Let J denote the level with the largest magnitude:

\|\bar{x}_J\| = \max_j \|\bar{x}_j\|    (4.37)

Level J is called the controlling level. The phase angle \theta_J gives the distance to the line, while the reference direction vector, \hat{e}, defined in Equation (4.5) on page 99, gives the direction to it. In order to move along an oriented structure, a vector, \hat{e}_\perp, perpendicular to \hat{e} has to be chosen. There are always two opposite alternatives when choosing \hat{e}_\perp. The alternatives are equally good, so either will do, e.g.

\hat{e}_\perp = \begin{pmatrix} -e_2 \\ e_1 \end{pmatrix}    (4.38)

The problem is now that the direction of \hat{e} may flip from one point to another, cf. regions A and F in Figure 4.8 on page 99. This makes the fixation point move back and forth over the discontinuity. Such behavior is avoided by using the direction of the last fixation point motion, v_{last}:

\text{sign}(v_{last} \cdot \hat{e}_\perp)    (4.39)

If \hat{e}_\perp changes to the opposite direction, the sign of the scalar product above also changes and the fixation point continues to move without changing direction. The vector to the next fixation point, v, consists of one part directed along the linear structure and one perpendicular to it:

v = 2^J \left( \alpha_\perp \, \text{sign}(v_{last} \cdot \hat{e}_\perp) \, \hat{e}_\perp + \alpha \, \theta_J \, \hat{e} \right)    (4.40)

The factor 2^J compensates for the compressed distances on levels with lower resolution. The constants \alpha_\perp and \alpha control the speed of the fixation point motion along and towards the linear structure. \alpha has a natural connection to the filter.
Setting

\alpha = \frac{1}{\rho_c}    (4.41)

where \rho_c is the center frequency of the filter, makes the fixation point move directly to the line if it is an impulse line. Normally, the phase varies more slowly than this, since an image mostly contains lower frequencies. This gives an underestimate of the distance to the line which, from a control point of view, is advantageous since it makes the system stable. \alpha_\perp does not have such a natural interpretation as \alpha does. A rule of thumb is that the fixation point should not move further than the controlling fovea level can reach. When in the search line state, \alpha is set according to Equation (4.41), while \alpha_\perp is set so that the motion towards the line is larger than the motion along it:

\alpha_\perp = 0.25 \, \alpha    (4.42)

If \alpha_\perp is too large and the line bends away from the tangent of the fixation point motion, then the fixation point might never get close to the actual line. Figure 4.17 on page 113 shows that the only state transition from search line is to track line (On Line). There are two conditions that have to be fulfilled in order to make a transition. First, the distance to the line has to be small enough. This can be expressed as a condition on the controlling fovea level:

J \le 1    (4.43)

i.e. if one of the two finest levels of the fovea has the maximum phase magnitude, then the fixation point is close enough to start tracking. Second, a scale consistency condition is used in order to reduce the impact of noise:

\|\hat{e}_\perp^J \cdot \hat{e}_\perp^{J+1}\| \ge T_C    (4.44)

If the scalar product between the orientation estimate on the controlling level and the next coarser level is smaller than T_C, then the estimate is considered to be noise and therefore discarded. Note that if, for instance, level 0 is inconsistent with level 1 but level 1 is consistent with level 2, then level 1 will be used even if level 0 has a larger magnitude. In most experiments, the value of the consistency threshold is:

T_C = \frac{1}{\sqrt{2}}    (4.45)

which is heuristically determined.
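The search-line computation of Equations (4.37)-(4.40) can be sketched as follows. This is an illustrative sketch assuming the per-level phase magnitudes and controlling-level quantities are already extracted; the function name and the numeric constants are placeholders, not the thesis values.

```python
import math

def next_fixation(level_mags, e_hat, theta_J, v_last, a=0.5, a_perp=0.125):
    """Vector to the next fixation point, Equation (4.40): a component
    along the structure (e_perp, Equation (4.38), signed to keep the
    previous direction of motion, Equation (4.39)) plus a component
    towards it, scaled by the phase theta_J on the controlling level J
    (Equation (4.37)). a and a_perp mimic alpha and alpha_perp."""
    J = max(range(len(level_mags)), key=lambda j: level_mags[j])
    e_perp = (-e_hat[1], e_hat[0])
    s = 1.0 if v_last[0] * e_perp[0] + v_last[1] * e_perp[1] >= 0 else -1.0
    k = 2.0 ** J     # compensates compressed distances on coarse levels
    return (k * (a_perp * s * e_perp[0] + a * theta_J * e_hat[0]),
            k * (a_perp * s * e_perp[1] + a * theta_J * e_hat[1]))

# Level 1 controls; previous leftward motion keeps the along-structure
# component leftward while the phase term pulls towards the structure:
v = next_fixation([0.2, 0.9, 0.4], e_hat=(0.0, 1.0), theta_J=0.6,
                  v_last=(-1.0, 0.0))
```

The sign term is what prevents the fixation point from oscillating when the reference direction flips between neighboring points.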
State: track line

The same information is used for tracking lines as for searching for lines. The only difference is that \alpha_\perp is now larger:

\alpha_\perp = \alpha    (4.46)

There are two possible state transitions from track line (Figure 4.17 on page 113). The first one is Line Lost, which returns the system to search line if the conditions in Equations (4.43) and (4.44) are no longer fulfilled. This typically happens when a line or edge bends abruptly. The other state transition, Been Here Before, involves the parameter space memory array (Figure 4.16 on page 112). The memory array is updated during the tracking. For each new fixation the corresponding memory location is incremented by one, or rather, the four closest locations are updated using bilinear interpolation. If the head returns to a position in the parameter space where it has been tracking before, and the memory value is larger than a threshold, T_m, the system changes state to avoid object. The value of the threshold is heuristically set to:

T_m = 3.0    (4.47)

State: avoid object

When avoiding an object, the symmetry information on the three coarsest levels is used. The intensity in the "mass image", M_m(\xi) (Equation (4.34)), is a measure of how much evidence there is for a symmetry. The controlling level, J, is determined by searching from coarse to fine in M_m:

M_{mJ} = \max_j M_{mj}, \quad j \in \{2, 3, 4\}    (4.48)

The vector, v, to the next fixation point is then:

v = -2^J \alpha_s V_{mJ}    (4.49)

where \alpha_s generally is set to unity. There are two state transitions from avoid object. If the rotation symmetry information is lost, i.e. if

M_{mJ} = 0,    (4.50)

the system returns to search line in order to find some structure again (Symmetry Lost). The second state transition concerns detection of a new object (New Symmetry). During avoid object the fixation point is moving contrary to the vector field V_m (Equation (4.49)).
When the fixation point reaches a new symmetry, the vector V_{mJ} switches sign and points towards the new symmetry, i.e. in approximately the same direction as the current motion of the fixation point. Thus, the condition for the state transition to locate object is:

v_{last} \cdot V_{mJ} > 0    (4.51)

State: locate object

When locating a new object, the same information as when avoiding objects is used. The only difference is that the fixation point now moves along the vector field V_m instead of against it. The vector to the next fixation point, v, is:

v = 2^J \alpha_s V_{mJ}    (4.52)

where \alpha_s generally is set to unity. There is one state transition from locate object, and that is to search line. The transition takes place if one of two conditions is fulfilled. First, as in the avoid object case, the system starts to search for lines if the symmetry is lost (Equation (4.50)). The second condition concerns actually arriving at a new symmetry, i.e. a potential new object. When searching for a line, a coarse-to-fine approach is used. This method is not applicable for rotation symmetries, since the positions of symmetry centers are more scale-dependent. As an example, consider a square. On a coarse scale the symmetry center is in the center of the square. On a fine scale there are four symmetry centers close to the corners of the square. Therefore, a threshold on the distance to the symmetry center on the controlling level is used as a state transition condition:

\|v\| < T_s    (4.53)

The value of T_s is half the width of the symmetry filter:

T_s = 7    (4.54)

4.4.8 Calculating camera orientation parameters

In order to derive the orientation control parameters from the position of an image point, a camera model has to be assumed. For physical cameras there are a number of models, ranging from a simple pinhole camera to advanced simulations of light going through aggregates of lenses. In computer graphics the pinhole camera dominates, although more advanced cameras are available on some high end platforms.
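The pinhole conventions and the pan-angle expression derived below reduce to a few lines of code; a sketch (function names are illustrative, the formulas follow Equations (4.58) and (4.61)):

```python
import math

def sensor_width(psi, f=1.0):
    """Sensor width under the convention f = 1.0, Equation (4.58):
    w = 2 * f * tan(psi / 2)."""
    return 2.0 * f * math.tan(psi / 2.0)

def camera_pan(xi, N_e, psi):
    """Change in camera pan angle needed to fixate the image point at
    pixel offset xi from the image center, Equation (4.61), given N_e
    pixels across the sensor and a field of view psi."""
    return math.atan((2.0 * xi / N_e) * math.tan(psi / 2.0))

# A point on the image edge (xi = N_e / 2) needs a pan of half the
# field of view, independently of the distance to the point:
dphi = camera_pan(64.0, N_e=128, psi=math.pi / 2)
```

The distance-independence holds only because the camera pan axis passes through the optical center; the head joints do not have this property, which is the source of the errors analyzed below.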
Kanatani [42] has derived the camera-induced motion field for rotation around the camera optic center. This remains a good approximation if the distance between the lens and the true center of rotation is small compared to the distance perpendicular to the lens to the projected objects [59]. Below, the corresponding equations for how to rotate a camera in order to fixate a certain point are derived.

The camera model

The cameras are pinhole cameras with a square sensor and with equal vertical and horizontal fields of view. The imaging parameters are the field of view, \psi, and the number of sensor elements horizontally (and vertically) across the sensor, N_e. Neither the focal length, f, nor the physical size of the image sensor, w, are explicitly given here. The relationship between the field of view, the image sensor size and the focal length is:

\tan\frac{\psi}{2} = \frac{w}{2f}    (4.55)

Resolving \psi gives:

\psi = 2 \arctan\frac{w}{2f}    (4.56)

It is evident from Equation (4.56) that for a given field of view there is an infinite number of combinations of the focal length and the sensor width. The focal length often appears as a denominator, and it is therefore convenient to adopt the following convention:

f = 1.0    (4.57)
w = 2 \tan\frac{\psi}{2}    (4.58)
w_e = \frac{w}{N_e}    (4.59)

where f is the focal length, w is the "physical" sensor size, and w_e is the sensor element (pixel) size.

Camera pan

The optical center of the cameras coincides with the axes of the individual camera pan joints (Figure 4.19). This makes the change in pan angle, \Delta\varphi_{cp}, needed to fixate a point, P, independent of the distance between the optical center and P:

\tan(\Delta\varphi_{cp}) = \frac{\xi w_e}{f}    (4.60)

where \xi is the image coordinate. Resolving \Delta\varphi_{cp} and using Equations (4.57), (4.58) and (4.59) gives the expression for the pan angle:

\Delta\varphi_{cp} = \arctan\left( \frac{2\xi}{N_e} \tan\frac{\psi}{2} \right)    (4.61)

Figure 4.19: The camera seen from above.
The change in the camera pan angle needed to fixate P, \Delta\varphi_{cp}, is independent of the distance to P, since the camera turns around the optical center.

Head tilt

The head pan and tilt do not rotate the cameras around the optical center. This means that the changes in head pan and tilt angles needed to fixate a point depend on the distance to the point. Although it is possible to use depth estimates, e.g. from stereo, to calculate the correct angles, it is desirable to be as independent of information from other processes as possible. Figure 4.20 shows a side view of the head mounted on the robot arm.

Figure 4.20: The error made when the change in tilt angle, \Delta\varphi_{ht}, is calculated with Equation (4.61). The image of P will not be exactly centered after tilting the head.

If the equivalent of Equation (4.61) is used to control the head tilt angle, \Delta\varphi_{ht}, the image of P is not centered. If the error is small enough, the simplicity of using Equation (4.61) is preferable to calculating the accurate inverse kinematics. Figure 4.20 shows an example where the head is tilted to fixate a point P. The following relations are found:

x_2 = h - h\cos(\Delta\varphi) \qquad d_2 = L - h\sin(\Delta\varphi)    (4.62)

where L = \sqrt{x_1^2 + d_1^2} is the initial distance from the camera center to P. The new position of the image of P can be calculated from the pinhole camera equation:

\xi_{error} \, w_e = f\,\frac{x_2}{d_2}    (4.63)

Resolving \xi_{error} and combining with Equation (4.62):

\xi_{error} = \frac{f}{w_e} \cdot \frac{h(1 - \cos(\Delta\varphi))}{L - h\sin(\Delta\varphi)}    (4.64)

Finally, using Equations (4.57), (4.58) and (4.59):

\xi_{error} = \frac{N_e \left( 1 - \cos(\Delta\varphi) \right)}{2 \tan\left( \frac{\psi}{2} \right) \left( \frac{L}{h} - \sin(\Delta\varphi) \right)}    (4.65)

The following qualitative observations can be made regarding the size of the error:

1. The error decreases when the L/h ratio increases, i.e. the error is smaller for distant points.

2. The error increases when the field of view, \psi, decreases, i.e. the error is larger with a telephoto lens than with a wide angle lens.

3.
The error increases with the change in tilt angle, i.e. a large motion will yield a large error.

Figure 4.21 on the next page shows the error, expressed as a percentage of the sensor width, plotted as a function of L/h for a number of tilt angles. The worst case is when the attracting point is situated on the edge of the current field of view, i.e. when \Delta\varphi_{ht} = \psi/2. The L/h ratio in the experiments is typically between 15 and 30 (h = 100 mm and 1500 mm < L < 3000 mm), which means that the worst case error is about 1%, corresponding to 5 pixels. This might seem like a lot, but the change in tilt angle is mostly much less than half the field of view. The continuous operation of the head also assures a correction in the next iteration, since the correction angle then is very small.

Figure 4.21: The fixation error expressed as a percentage of the sensor width as a function of L/h. The field of view is \psi = \pi/2. The error is plotted for four changes in tilt angle: \Delta\varphi = \psi/2 = \pi/4 (dash-dotted), \Delta\varphi = 3\pi/16 (dotted), \Delta\varphi = \pi/8 (dashed), \Delta\varphi = \pi/16 (solid).

Head pan

Calculating the error made for the head pan is more difficult than for the head tilt, since it depends on the head tilt angle. In Figure 4.20 on page 122, the head pan axis is marked out. This axis is only a true pan axis if the tilt angle is zero. When the head is maximally tilted, the pan axis is parallel with the optical axes of the cameras. When \Delta\varphi_{ht} = 0, Equation (4.65) is applicable if h is set to half the camera baseline (50 mm). The L/h ratio is then typically between 30 and 60, which means that the error caused by panning is half the error caused by tilting the same angle. The analytical head pan error equation is much more complicated than Equation (4.65). Simulating a few representative cases gives an intuitive feeling for the error as a function of the field of view, the distance, and the tilt angle.
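Equation (4.65) is simple enough to evaluate directly. A sketch, assuming the 512-pixel sensor implied by the quoted "1%, corresponding to 5 pixels" (the function name is illustrative):

```python
import math

def fixation_error(N_e, psi, L_over_h, dphi):
    """Pixel error of Equation (4.65): reusing the pan formula for a
    head joint that does not rotate about the optical center leaves
    the point off-center by this many pixels."""
    return N_e * (1.0 - math.cos(dphi)) / (
        2.0 * math.tan(psi / 2.0) * (L_over_h - math.sin(dphi)))

# Worst case from the text: psi = pi/2, L/h = 15, dphi = psi/2 gives
# roughly 1% of a 512-pixel sensor; doubling L/h roughly halves it:
near = fixation_error(N_e=512, psi=math.pi / 2, L_over_h=15,
                      dphi=math.pi / 4)
far = fixation_error(N_e=512, psi=math.pi / 2, L_over_h=30,
                     dphi=math.pi / 4)
```

The same function covers the zero-tilt head pan case by setting L_over_h to the ratio with half the camera baseline, as described above.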
Figure 4.22 on page 126 shows the error made when the attracting point is on the edge of the image, which is the worst case. The field of view is γ = π/6 and γ = π/2 for the left and right columns of plots respectively. The rows of plots show the error for a point at the distance L = 250 mm, L = 1000 mm, and L = 2500 mm, respectively. The solid curve shows the error after one gaze shift. The dashed, dash-dotted and dotted curves show the error after a second, third and fourth gaze shift. Note that the repeated gaze shifts also change the tilt angle. The following qualitative observations can be made regarding the size of the error when changing the head pan angle:

1. The error increases when the distance increases, if the tilt angle is larger than approximately π/4.

2. The error increases with the tilt angle.

Note that although the error is between 40 and 50 percent of the image size after the first gaze shift, it becomes less than 10 percent after a second shift if the field of view is π/2. After three iterations the error is less than 2%. The worst case is fairly rare, and for tilt angles less than π/4 the initial error is only 15 percent of the image width. The initial error can be minimized by using the camera pan joints in combination with the head pan. The cameras can be used for quick pan changes, and when a point is fixated, the head pan can be used to ensure symmetric vergence. This sort of control scheme is inspired by the human visual system and can for instance be found in [21, 60].

4.5 Experimental results

A trajectory showing how the robot moves the fixation point can be found in Figure 4.23 on page 127. The middle picture on the wall shows a clear example of how an object is fixated and then scrutinized. The system makes a saccade from the right picture to the center of the middle picture. A state transition from locate object to track line occurs and the system starts tracking the periphery of the picture.
When returning to the starting point on the frame of the picture, the system saccades to the table and continues there.

Figure 4.22: Head pan errors as a function of tilt angle. The field of view is γ = π/6 and γ = π/2 for the left and right columns of plots respectively. The rows of plots show the error for a point at the distance L = 250 mm, L = 1000 mm, and L = 2500 mm, respectively. The solid curve shows the error after one gaze shift. The dashed, dash-dotted and dotted curves show the error after a second, third and fourth gaze shift.

Figure 4.23: A typical trajectory of the fixation point. The fixation point has followed the structures in the image and moved from object to object.

In these experiments only features extracted from gray scale structures have been used. A natural extension is to incorporate color edges [83, 79, 78], texture gradients [45], etc. in order to get a better segmentation of the image. The potential fields make this fairly easy to do. If, for instance, a color edge and a gray scale edge coincide, the potential well will be much deeper than if it is a gray scale edge on a region with constant color. The results also show that a set of individually simple processes together can produce a complex and purposive behavior. When adding higher levels to the system, they should influence the lower ones by generating potential wells corresponding to interesting directions, and not control the cameras directly. In this way all levels of a hierarchical system control the fixation point simultaneously.
5 ATTENTION CONTROL USING NORMALIZED CONVOLUTION

5.1 Introduction

Chapter 4 describes a gaze control algorithm based on a number of simultaneously working subsystems. In this algorithm, the gaze is attracted by a set of features in the scene, and a kind of motor memory is used to remember earlier fixations. The three basic processes (preattentive control, attentive control and habituation) are pointed out as vital for an active observer. The model, or memory, cannot be seen as implementing the habituation function, since the fixation point has to return to a particular point in order to know that it has already been there. The desired behavior of the system is to act as if certain features and events do not exist, i.e. to be neither attracted nor repelled by them [23]. The straightforward approach of simply "cutting out" or erasing the corresponding areas in the input image or the feature image does not solve the problem. The influence of an image structure reaches over a large area, especially at coarse resolution. Erasing all points in a feature image that are influenced by a certain image feature removes many useful estimates as well. Erasing in the input image creates "objects" that generate new image features. It can be argued that building a repelling potential field around known, or modeled, structures solves the problem with the returning gaze point, but it does not. The attraction of the feature remains, but it is balanced by the repelling potential field. This might cause the fixation point to stop in a local minimum not corresponding to any structure in the field of view, since it is impossible to build a field that exactly cancels the attracting field. The repelling potential field also forces the fixation point to move around an already modeled structure when passing it.
Using a technique termed 'normalized convolution' when extracting the image features allows areas of the input image to be marked as unimportant. The image features from these areas are then 'invisible' and consequently do not attract the attention of the system, which is the desired behavior of a habituation function.

5.2 Normalized convolution

Normalized convolution (NC) is a novel filtering method presented by Knutsson and Westin [48]. The method is designed for filtering of incomplete and/or uncertain data. A comprehensive theory is found in [82]. A central concept in this theory is the separation of signal and certainty, for both the filter and the input signal. Normalized convolution is based on viewing each filter as a set of one or more basis functions, bᵢ, and a weighting window, a, called the applicability function. The window is used for spatial localization of the basis functions, which may be considered to have infinite support. Similarly, the input signal is divided into the actual signal, f, and a certainty function, c. Standard convolution of a signal, s, with a set of filters, hᵢ, can then be expressed as:

    (f₁, ..., f_N)ᵀ = (h₁ * s, ..., h_N * s)ᵀ = ((a b₁) * (c f), ..., (a b_N) * (c f))ᵀ    (5.1)

Let the normalized convolution between a bᵢ, i ∈ {1, ..., N}, and c f be defined by:

    (f′₁, ..., f′_N)ᵀ = G⁻¹ (f₁, ..., f_N)ᵀ = G⁻¹ ((a b₁) * (c f), ..., (a b_N) * (c f))ᵀ    (5.2)

where G is a matrix defined as:

    G = [ (a b̄ᵢ bⱼ) * c ],   i, j ∈ {1, ..., N}    (5.3)

and b̄ᵢ denotes the complex conjugate of bᵢ. The matrix G is a metric that compensates for the non-orthogonality of the basis functions. A number of product filters, a b̄ᵢ bⱼ, are used in order to estimate G. Note that G depends on the certainty function but is independent of the actual signal.
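As an illustration, the construction above can be sketched for 1-D signals. This is a hypothetical toy implementation (a pointwise matrix solve, with zero certainty used as border padding), not the thesis code:

```python
import numpy as np

def normalized_convolution(f, c, a, B):
    """Pointwise normalized convolution, Eqs. (5.2)-(5.3), for 1-D signals.
    f : signal, c : certainty (same length as f),
    a : applicability window (odd length W),
    B : basis functions sampled on the window, shape (N, W)."""
    W = a.size
    r = W // 2
    cf = np.pad(f * c, r)                  # the signal always enters as c*f
    cp = np.pad(c, r)                      # zero certainty outside the border
    out = np.empty((B.shape[0], f.size), dtype=np.result_type(B, f))
    for x in range(f.size):
        cw, fw = cp[x:x + W], cf[x:x + W]
        G = (a * cw * np.conj(B)) @ B.T    # G_ij = sum_w a conj(b_i) b_j c
        U = (a * np.conj(B)) @ fw          # U_i  = sum_w a conj(b_i) c f
        out[:, x] = np.linalg.solve(G, U)
    return out

def output_certainty(c_window, a, B):
    """Output certainty of Eq. (5.8): c_out = (det G / det G0)**(1/N)."""
    N = B.shape[0]
    G = (a * c_window * np.conj(B)) @ B.T
    G0 = (a * np.conj(B)) @ B.T            # full-certainty metric
    return (np.linalg.det(G) / np.linalg.det(G0)).real ** (1.0 / N)

# With a single constant basis function, NC degenerates to a
# certainty-weighted local mean, so missing samples do not disturb it:
a = np.hanning(9)
B = np.ones((1, 9))
f = np.full(32, 2.0)
c = np.ones(32)
c[10:14] = 0.0                             # samples marked as missing
fp = normalized_convolution(f, c, a, B)    # stays at 2.0 across the gap
```

Note how `output_certainty` is linear in the input certainty, which is exactly the equivariance property required of c_out later in this section.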
This means that if the certainty is independent of variations in the signal over time, G⁻¹ can be pre-calculated. Typical examples where G is constant over time are:

Border effect reduction. The signal certainty is set to one inside the input image and to zero outside it. This reduces the effects caused by the image border.

Sparse or heterogeneous sampling. Non-Cartesian or sparse heterogeneous sensor arrays, or sensor arrays with malfunctioning sensor elements, can be handled by setting the signal certainty to zero for the affected points.

Filtering an image with a number of filters can be seen as expanding the image in a set of basis functions. The filter outputs are often used as the coordinates of the image in that particular basis. Strictly, this is not correct, since filter outputs actually are coordinates in the dual basis. Coordinates can be transformed between the basis and the dual basis using the metric. Readers not familiar with dual bases can turn to [82] for an introduction. For orthonormal bases the difference is academic, since the metric is an identity matrix, but for non-orthonormal bases it is important. Note that an orthonormal basis can, locally, turn non-orthonormal due to variations in the signal certainty. The normalized convolution scheme generates the coordinates, f′ᵢ, corresponding to the basis bᵢ. In order to be able to compare with the filter outputs from ordinary filtering, the coordinates have to be transformed into dual coordinates, f⁰ᵢ. As mentioned above, this is done by acting with the metric on the coordinates. Since G⁻¹ has compensated for variations due to the signal certainty, the metric that corresponds to full certainty should be used:

    G₀ = [ (a b̄ᵢ bⱼ) * c₀ ],   i, j ∈ {1, ..., N}    (5.4)

where c₀ is a constant function equal to one. The dual coordinates are then:

    (f⁰₁, ..., f⁰_N)ᵀ = G₀ (f′₁, ..., f′_N)ᵀ = G₀ G⁻¹ ((a b₁) * (c f), ..., (a b_N) * (c f))ᵀ    (5.5)

Note that setting the input signal certainty equal to one gives G₀ G⁻¹ = I, which in turn gives f⁰ᵢ = (a bᵢ) * (c f) = fᵢ, corresponding to the standard convolution in Equation (5.1). Information about the output signal certainty is captured in the determinant of G. Although it is not necessary, it is convenient to have the output certainty in the interval [0, 1], which is accomplished by normalizing with the determinant of G₀. It is also desirable that the output certainty is equivariant with the input certainty, which means that multiplying the input certainty by a factor results in a multiplication of the output certainty by the same factor:

    α c_in → α c_out    (5.6)

But

    det(α M) = α^N det(M)    (5.7)

where N is the dimension of the matrix M. Having N basis functions, the following output certainty has been shown to work well:

    c_out = (det G / det G₀)^(1/N)    (5.8)

Converting a filtering operation to NC

Rewriting a filtering operation as a normalized convolution can be done according to the following cookbook recipe:

1. Define an applicability function. The applicability function, a, has to be the same for all filters. Often it is an ordinary window function, such as a Gaussian or a squared cosine.

2. Define the basis functions. Or rather, define the windowed basis functions, a bᵢ. These are the filters from the original operation. If there is no DC-filter in the original filter set, it has to be added (cf. Section 5.3).

3. Define a signal certainty function. Sometimes the signal certainty is given directly, e.g. for laser ranging using time-of-flight, where the returning light intensity can be used. Points where no signal is captured are then given zero certainty. If the certainty is not given by the device generating the signal, some sort of certainty measure has to be constructed.

4. Define the signal.
As in the basis function case, it is actually the signal multiplied by the certainty, c f, that is used.

5. Define the scalar product functions. In order to measure the scalar products between the basis functions, a set of filters with all pair-wise combinations of basis functions, b̄ᵢ bⱼ, i, j ∈ {1, ..., N}, has to be generated.

In the following section these steps are applied to performing normalized convolution using quadrature filters.

5.3 Quadrature filters for normalized convolution

In earlier chapters quadrature filters are used for orientation and local phase estimation. Some of the filters presented in Chapter 2 are not true quadrature filters, but the analysis below is valid for them as well. A quadrature filter can be seen as consisting of either complex or real basis functions. Both views are described below, taking the non-ring filter in Equation (2.32) as an example.

5.3.1 Quadrature filters for NC using real basis functions

NC quadrature filtering can be performed using three real basis functions: a constant function and the real and imaginary parts of the original filter. The applicability function is a spatial localization function corresponding to a windowing function. Here it is chosen as

    a(ξ) = cos²(πξ/(2R))   if ‖ξ‖ < R
           0               otherwise    (5.9)

since this is the magnitude function of the original filter in Equation (2.32). With this choice of applicability function we get the following basis functions:

    b₁(ξ) = 1    (5.10)
    b₂(ξ) = cos(πξ/R + sin(πξ/R))    (5.11)
    b₃(ξ) = sin(πξ/R + sin(πξ/R))    (5.12)

where b₂ and b₃ are the real and imaginary parts of the filter function. In addition to the complex basis function, a constant basis function is required. The reason for this is that since the original filter is insensitive to the mean value of a signal, it is incapable of estimating the signal certainty level. For example, any constant certainty field would give zero output.
The basis functions can be considered to have infinite support, since they are always multiplied by the applicability function when constructing the filters needed:

    h_a(ξ)  = a b₁ = cos²(πξ/(2R))    (5.13)
    h_ax(ξ) = a b₂ = cos²(πξ/(2R)) cos(πξ/R + sin(πξ/R))    (5.14)
    h_ay(ξ) = a b₃ = cos²(πξ/(2R)) sin(πξ/R + sin(πξ/R))    (5.15)

h_ax and h_ay are the non-ring filters used in the original operation. In addition to the filters in Equations (5.13), (5.14) and (5.15), three product filters have to be generated:

    h_axx(ξ) = a b₂ b₂ = cos²(πξ/(2R)) cos²(πξ/R + sin(πξ/R))    (5.16)
    h_ayy(ξ) = a b₃ b₃ = cos²(πξ/(2R)) sin²(πξ/R + sin(πξ/R))    (5.17)
    h_axy(ξ) = a b₂ b₃ = cos²(πξ/(2R)) cos(πξ/R + sin(πξ/R)) sin(πξ/R + sin(πξ/R))    (5.18)

These three filters are needed to measure how the correlation between the basis functions varies with the signal certainty function. This is necessary in order to adjust for the signal variations induced by the certainty function. Figure 5.1 on the next page shows the filters for radius R = 9.

Figure 5.1: Non-ring NC quadrature filters.

The filter outputs are combined according to Equations (5.2) and (5.3):

    (f′₁, f′₂, f′₃)ᵀ = G⁻¹ ( h_a * (c f), h_ax * (c f), h_ay * (c f) )ᵀ    (5.19)

with

    G = ( h_a * c    h_ax * c    h_ay * c  )
        ( h_ax * c   h_axx * c   h_axy * c )
        ( h_ay * c   h_axy * c   h_ayy * c )    (5.20)

G is symmetric since all functions are real. If the basis functions are complex, G is Hermitian. The scheme results in six filters and nine different convolutions, instead of two filters with one convolution each. Furthermore, a 3×3 matrix has to be inverted, possibly at each point.

5.3.2 Quadrature filters for NC using complex basis functions

In this section NC quadrature filtering using complex basis functions instead of real ones is discussed.
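Sampling these filters gives a quick consistency check: since b₂² + b₃² = 1, the product filters must satisfy h_axx + h_ayy = h_a. A sketch with R = 9, assuming the reconstructed phase function πξ/R + sin(πξ/R):

```python
import numpy as np

R = 9
xi = np.arange(-R + 1, R)                        # |xi| < R
a = np.cos(np.pi * xi / (2 * R)) ** 2            # applicability, Eq. (5.9)
arg = np.pi * xi / R + np.sin(np.pi * xi / R)    # assumed non-ring phase
h_a = a                                          # Eq. (5.13)
h_ax = a * np.cos(arg)                           # Eq. (5.14)
h_ay = a * np.sin(arg)                           # Eq. (5.15)
h_axx = a * np.cos(arg) ** 2                     # product filters,
h_ayy = a * np.sin(arg) ** 2                     # Eqs. (5.16)-(5.18)
h_axy = a * np.cos(arg) * np.sin(arg)

assert np.allclose(h_axx + h_ayy, h_a)           # b2^2 + b3^2 = 1
```

The same identity gives h_ax² + h_ay² = a², i.e. the even-odd pair has the applicability function as its magnitude, as stated for the original non-ring filter.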
It is actually one real and one complex basis function. Using the same applicability function as in Equation (5.13) gives the following basis functions:

    b₁(ξ) = 1    (5.21)
    b₂(ξ) = e^{i(πξ/R + sin(πξ/R))}    (5.22)

The complex filter that corresponds to a b₂ can be realized as the two real filters in Equations (5.14) and (5.15), i.e. the original even-odd filter pair. Since one of the basis functions is constant, the correlation filters are the same as the original ones. The complex versions of Equations (5.19) and (5.20) are:

    (f′₁, f′₂)ᵀ = G⁻¹ ( h_a * (c f), (h_ax + i h_ay) * (c f) )ᵀ    (5.23)

and, using a b̄₂ b₂ = a,

    G = ( h_a * c                (h_ax − i h_ay) * c )
        ( (h_ax + i h_ay) * c    h_a * c            )    (5.24)

The total number of filters is three. Counting one complex convolution as two real ones makes six different convolutions. The computational complexity is reduced further by the fact that G is a 2×2 matrix, instead of 3×3, which facilitates the NC combination of filter results.

5.3.3 Real or complex basis functions?

It is evident that using complex basis functions reduces the computational cost significantly. The real case requires nine convolutions, while the complex case only needs six. An important question is how the two approaches differ. The difference is best illuminated with two examples. In the first example, data is missing, or marked unimportant, in single points. This resembles sensor element failure or some kind of point-wise data drop-out. In the second example, regions larger than the filters are missing. This typically happens around the edges of an image, or when masking an image structure for focus of attention purposes. The test signal is the same as in Section 2.1 and the filters are the ones in Figure 5.1 on page 136. The performance of the two approaches is measured by comparing their output phase with the filter output from an uncorrupted signal, shown in Figure 5.2 on the next page. Let z denote the reference signal, i.e.
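A toy single-position sketch of the complex-basis variant (a hypothetical test patch; the filter forms are the reconstructed ones above, and the patch is evaluated at its center only):

```python
import numpy as np

R = 9
xi = np.arange(-R + 1, R)
a = np.cos(np.pi * xi / (2 * R)) ** 2                        # applicability
b2 = np.exp(1j * (np.pi * xi / R + np.sin(np.pi * xi / R)))  # Eq. (5.22)
B = np.vstack([np.ones_like(b2), b2])                        # {b1, b2}, shape (2, W)

f = np.cos(0.5 * xi + 0.3)                 # hypothetical local signal patch
c = np.ones(xi.size)
c[:4] = 0.0                                # a few samples marked as missing

G = (a * c * np.conj(B)) @ B.T             # 2x2 Hermitian metric, Eq. (5.24)
U = (a * np.conj(B)) @ (c * f)             # {h_a, h_ax + i h_ay} applied to c*f
f1, f2 = np.linalg.solve(G, U)             # coordinates, Eq. (5.23)
phase = np.angle(f2)                       # local phase estimate at the center
```

The 2×2 solve replaces the 3×3 inversion of the real-basis case, which is the complexity reduction referred to above.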
the complex filter response from the original filter on the uncorrupted signal, and let z̃ denote the corresponding response from normalized convolution on the signal with partially missing data. The NC filter response is defined as:

    z̃ = c_out (f′₂ + i f′₃)   for real basis functions
    z̃ = c_out f′₂             for complex basis functions    (5.25)

where c_out is the output certainty given by Equation (5.8). The difference in phase between the reference signal and the tested signal is calculated as:

    dᵢ = z̄ᵢ z̃ᵢ    (5.26)

The magnitude of dᵢ contains the reference signal magnitude, the estimated signal magnitude and the certainty of the NC estimate. This makes it suitable for weighting the errors, since errors in points with both high signal amplitude and high NC certainty are more serious than others.

Figure 5.2: Original input signal and quadrature filter output.

The following four statistics are used:

    m   = (1/n) Σᵢ arg(dᵢ)    (5.27)
    s²  = 1/(n−1) Σᵢ (arg(dᵢ) − m)²    (5.28)
    m_w = Σᵢ ‖dᵢ‖ arg(dᵢ) / Σᵢ ‖dᵢ‖    (5.29)
    s_w² = Σᵢ ‖dᵢ‖ (arg(dᵢ) − m_w)² / Σᵢ ‖dᵢ‖    (5.30)

Point-wise missing data

In this example, data is missing point-wise, and the signal certainty is set to zero in these points. The two top rows of plots in Figure 5.3 on page 142 show the signal and the certainty function. The line at ξ = 20 and the edge at ξ = 60 are missing a sample at the center of the image feature. The line at ξ = 100 and the edge at ξ = 140 have missing data in the neighborhood of, but not at, the actual feature. The two middle rows of Figure 5.3 show the quadrature filter outputs and the error plots. The left and right columns correspond to the real and complex basis functions respectively. The output certainty is plotted in the fifth row. The bottom plots in both figures show the difference between the estimated phase and the true phase. The error is plotted for ‖d‖ > 0.01 only. The difference is close to zero, which shows that both methods handle the missing data well. The error statistics are listed in Table 5.1, which shows that the performance of the two methods is almost equivalent when looking at the absolute errors.

                               m       m_w     s      s_w
    Real basis functions     -0.007  -0.009  0.275  0.030
    Complex basis functions  -0.004  -0.016  0.196  0.054

Table 5.1: Statistics of the angular error in radians for phase estimation on point-wise missing data.

Figure 5.3: Results from normalized convolution using quadrature filters with point-wise missing data. Compare the phase and magnitude plots with Figure 5.2 on page 139. The bottom plots show the difference between the reference phase and the NC phase. Left: Real basis functions. Right: Complex basis functions.

Large missing regions

In the test above the filters reach over the points with missing data. Only a few samples under the filter are missing for each point. In the following test the regions of missing data are larger than the filters.
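The four statistics can be collected in a small helper (a sketch; the weighted deviation in Equation (5.30) is assumed to be taken about m_w):

```python
import numpy as np

def phase_error_stats(z, z_tilde):
    """Error statistics of Eqs. (5.27)-(5.30) for two complex responses,
    with d_i = conj(z_i) * z~_i as in Eq. (5.26)."""
    d = np.conj(z) * z_tilde
    ang, w = np.angle(d), np.abs(d)
    m = ang.mean()                                        # Eq. (5.27)
    s = np.sqrt(((ang - m) ** 2).sum() / (ang.size - 1))  # Eq. (5.28)
    mw = (w * ang).sum() / w.sum()                        # Eq. (5.29)
    sw = np.sqrt((w * (ang - mw) ** 2).sum() / w.sum())   # Eq. (5.30)
    return m, s, mw, sw

# A perfect estimate gives all-zero statistics:
z = np.exp(1j * np.linspace(0, 3, 50))
m, s, mw, sw = phase_error_stats(z, z)
```

A constant phase offset in z̃ shows up directly in both means, while the weighting by ‖d‖ down-weights errors in points of low magnitude or low NC certainty.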
The two top rows of plots in Figure 5.4 on the next page show the signal and the certainty function. As above, the line at ξ = 20 and the edge at ξ = 60 are cut at the center, while the line at ξ = 100 and the edge at ξ = 140 have missing data in the neighborhood. With point-wise missing data, the proper behavior of the phase estimate is often easy to define. When large regions are missing, however, it is not so easy to tell. There is no way to determine the shape of the signal in the masked, or missing, region. Take for instance the line at ξ = 20 and the edge at ξ = 60 in Figure 5.4 on the following page. After masking with the input certainty they look similar. Therefore the phase estimates also look the same. When filtering at the edge of a certainty gap that is broader than the filter, NC extrapolates according to the basis functions. In the quadrature filter case this means that the phase cycles with the same angular velocity as the impulse response. This behavior gives an appropriate phase extrapolation for the line at ξ = 20, but a large error for the edge at ξ = 60. The output certainty is small for these regions, which explains the difference between the unweighted and the weighted statistics. Again, the performances of the two approaches are similar.

                               m      m_w    s      s_w
    Real basis functions     0.168  0.114  0.630  0.280
    Complex basis functions  0.063  0.136  0.611  0.366

Table 5.2: Error statistics for phase estimation on data with large missing regions.

Figure 5.4: Results from normalized convolution using quadrature filters with data missing in large regions. Compare the phase and magnitude plots with Figure 5.2 on page 139. The bottom plots show the difference between the reference phase and the NC phase. Left: Real basis functions. Right: Complex basis functions.

Discussion on basis choice

Both the approach with real basis functions and the one with complex basis functions yield quadrature filter outputs that are close to the outputs from the uncorrupted signal. It is important not only to estimate accurately, but also to know when the estimates are unreliable. Since the weighted errors are smallest for the approach with real basis functions, it is the best approach from this point of view. On the other hand, the difference in accuracy between the approaches is not proportional to the difference in complexity, especially when the missing regions are smaller than the filters. In a scale-space implementation there is likely a level of resolution where the filters reach over the regions with missing data. The conclusion is that complex basis functions give the best performance-complexity ratio for quadrature filtering using normalized convolution.

5.4 Model-based habituation/inhibition

Normalized convolution can be used for disregarding image structures by setting the signal certainty to zero for these structures.
This ability makes NC suitable for directing the focus of attention of an active vision system. Take for instance a robot system with an arm and a camera head. The system is supposed to react to objects entering the scene, or to unpredicted motion in the vicinity of the robot. However, it does not need to react to known objects such as its own arm. This may seem like a simple problem, but in a hierarchical system consisting of a number of more or less independent processes, the impact of known structures is hard to suppress. The different low level processes do not, and cannot, have any knowledge about which structures are already known and which are not. The higher levels, on the other hand, can provide information about this. In the system presented here, high level processes generate certainty masks that are set to zero in areas with known structures. The certainty masks can be generated from estimated information about object geometry and/or from estimated local image features such as color, velocity or orientation. This means that, for instance, all blue objects can be neglected using this technique.

5.4.1 Saccade compensation

Saccades, i.e. fast camera motions made in order to fixate a new point of interest, normally introduce strong erroneous responses for a period of time after the saccade, depending on the filter size in the time dimension. Note that merely shutting off the camera during the saccade introduces strong responses in any filtering with temporal extent when the camera is turned back on. However, such errors can be almost completely eliminated using normalized convolution. When the head makes a saccade, the certainty for the whole field of view is set to zero as long as the head moves. The input frames are then considered to be missing data and do not influence the motion estimation. Both of the examples below contain saccade compensation.
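In code, saccade compensation amounts to zeroing the certainty of entire frames while the head is moving; a toy sketch over a hypothetical frame sequence:

```python
import numpy as np

rng = np.random.default_rng(0)
frames = rng.random((10, 16, 16))        # hypothetical image sequence
head_moving = np.zeros(10, dtype=bool)
head_moving[3:6] = True                  # a saccade during frames 3-5

certainty = np.ones_like(frames)
certainty[head_moving] = 0.0             # saccade frames become missing data
# Temporal NC filters then see c*f = 0 in these frames and the motion
# estimation is unaffected by the camera motion.
```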
5.4.2 Inhibition of the robot arm's influence on low level image processing

Figure 5.5 on the next page shows a system consisting of a simulated Puma 560 arm and a stereo camera head mounted on a neck centered over the "waist joint". Objects are produced in the machine to the left, and a conveyor belt transports them towards a bin to the right. Figure 6.3 on page 161 shows a blueprint of the scene. Figure 5.8 on page 150 shows every 10th frame of a sequence where an object passes the robot on the conveyor belt, while the robot arm moves back and forth along the belt. The two leftmost columns in Figure 5.8 show an overview and the right camera view. The head performs saccadic tracking, i.e. when the object is too far from the center of the image, the head rapidly changes fixation point to center the object. Consequently, the head is not moving continuously. A method based on three-dimensional quadrature filters that can track objects by smooth pursuit is presented in [43], but it is at present not implemented with normalized convolution.

Figure 5.5: A Puma 560 robot arm by a conveyor belt. The robot has a neck with a stereo camera head, cf. the cover.

Figure 5.6: Block diagram for model based inhibition of a Puma robot arm. The arm model uses the Puma and head control parameters to render an arm mask, which enters the low level processing as a certainty mask.

The third column of Figure 5.8 shows the pairwise temporal differences between consecutive frames. The head tracks the local center of mass of the temporal differences. The tracking is similar to the method for locating symmetries described in Subsection 4.4.7. When the head is not moving and the arm is out of sight, the object appears clearly and is easy to track. When the head makes a saccade, as in the fourth row of Figure 5.8, the whole field of view changes, which makes tracking impossible.
In the last three rows the robot arm becomes visible and dominates the field of view, which makes the head track the arm instead of the object. Both the saccade and the robot arm are examples of known events that should be inhibited from affecting the pre-attentive gaze control. Normalized convolution is applicable to both these problems. When the head makes a saccade, the certainty for the whole field of view is set to zero as long as the head moves. The input frames are then considered to be missing data and do not influence the motion estimation. In order to generate a certainty mask that cancels the motion estimates from the robot arm, a model of the arm is needed. The position of the arm in relation to the camera and the joint angles for both the arm and the head are also necessary. The block diagram in Figure 5.6 on the page before shows a system that takes the arm and head control parameters and generates a mask that covers the arm. The geometrical model does not need to be precise; a bounding box representation of each link is sufficient. The bounding boxes are used for rendering a certainty mask directly into a fovea representation, as shown in Figure 5.7. The leftmost image shows the field of view with the fovea level boundaries marked out for clarity. The other images show the certainty mask for the four levels of the fovea. The fourth column in Figure 5.8 on page 150 shows the arm mask. The fovea is combined as described in Chapter 4. The masked area in the lower part of the first three images of the fourth column is due to the bounding box of the shoulder link. The last column in Figure 5.8 shows the temporal differences computed with normalized convolution using the certainty masks in the fourth column. Only the object of interest is visible; the effects of the known events have vanished.

Figure 5.7: The certainty mask generated from the bounding box representation of the robot arm.

In the last row, the object is behind the robot arm, and there is naturally no way of recovering it. Note that if the bounding box representation is too crude and an object is close to the arm, the object might be masked out although it is visible, cf. Subsection 5.4.4. On the other hand, if the bounding box representation is too tight and the parameters and/or arm position are imperfect, the mask might fail to cover the arm completely.

5.4.3 Inhibition of modeled objects

The same procedure as in the previous section can be applied to remove low level responses from already modeled structures. Figure 5.9 on page 152 shows every 10th frame of a sequence where some objects pass the robot on the conveyor belt. The two leftmost columns in this figure show an overview and the right camera view. The large cross-hair in the right camera view shows the current fixation point, and the small cross-hair points out where the head would saccade to if a saccade were to be executed immediately. The third column shows the pairwise temporal differences between consecutive frames. When a new object is detected it is fixated, and a model is invoked. This model contains a bounding box representation, a 3D position and a 3D velocity. When the model is set, a certainty mask is generated, which is used for inhibiting responses from the object. The system is then sensitive only to un-modeled events. The masks from the object models are shown in the fourth column, while the last column contains temporal differences using normalized convolution.

Figure 5.8: Inhibition of motion estimates due to known events.

In the first two rows the system detects a new object behind the two already modeled objects. When the third object is modeled, the system makes a saccade towards the middle object by canceling the appropriate mask. During the saccade the whole field of view is given zero certainty.
In the four last rows the attention is shifted between the objects by simply canceling the mask of the object to be attended to. With this strategy the low level processes do not need to know the difference between a new object entering the scene and a mask being canceled, and a complex communication structure between high and low levels is avoided.

As long as the constant velocity model is appropriate, the projected certainty mask will cover the modeled structures. However, if the model is too simple, the corresponding certainty mask slides off. If this happens, the previously suppressed low level responses will automatically alarm, and the system shifts its attention to updating that particular model. When all moving objects are modeled correctly, the lower levels of the system will be quiet. Only models corresponding to objects that change their behavior alarm and need additional attention.

Figure 5.9: Inhibition of already modeled objects.

Figure 5.10: The robot's view of the conveyor belt with a cylindrical object and a box.

5.4.4 Combining certainty masks

In Subsections 5.4.2 and 5.4.3 the certainty masks are based on geometry only. Although this is often sufficient, it sometimes has undesired side effects. The top left image in Figure 5.11 shows a situation where an object is almost entirely masked by the arm although it is visible. The mask is generated according to the arm and head geometry only; any objects between the arm and the camera are not accounted for. The middle left image in the same figure shows what happens if the mask instead is based on object color only. Regions with the color of the robot arm are masked. The color has to be extracted with methods that compensate for the color of the illumination [56, 55]. The arm is masked correctly, but the floor and the little cylindrical object to the left are masked as well.

The solution in this case is shown in the bottom image of Figure 5.11, where the two previous masks are combined. The total mask should only be zero for red objects within the bounding box representation of the arm, and thus the two masks are combined with a logical OR. By combining masks from different models or image features, a more selective masking can be performed.

5.5 Discussion

It has been shown that normalized convolution can be used for attention control. By suppressing known structures, relatively simple methods can be used in the following processing stages. Guiding attention by means of certainty masks fits in a natural way into a system exploiting the benefits of separating signal from certainty [48, 82]. Since normalized convolution is based on a series of convolutions, the scheme can be implemented using special hardware for convolutions. For all levels of a feature hierarchy the convolutions can be executed in parallel, and normalized convolution can therefore be as fast as a standard convolution.

Figure 5.11: Combination of certainty masks. Top: Geometry only. Middle: Color only. Bottom: Color and geometry.

6 ROBOT AND ENVIRONMENT SIMULATOR

A robot vision simulator has been developed in the Application Visualization System, AVS, to facilitate testing of robot vision and control algorithms [86]. The simulator reduces the need for expensive special purpose hardware since it can be run in "slow motion". A real-time process can be investigated although only limited computational resources are available. It also allows testing of different types of robots and robot configurations without any extra cost. The scene can be varied from a few very simple polyhedral objects to complex, realistic, texture mapped environments. The simulated reality, such as true 3D structure, true distance, etc., can easily be compared with the results obtained by the robot vision system in order to evaluate the performance of the algorithms [20, 19].
6.1 General description of the AVS software

The AVS software is a product from Advanced Visual Systems Inc. (AVS Inc.). It is an interactive visualization environment for scientists, engineers and technical professionals. AVS offers an environment that allows users to construct applications incorporating their own code without graphics programming. The package is easy to learn and use, it provides an intuitive interface for quickly designing prototype applications, and it supplies powerful tools for customizing and tuning production applications.

There are two major ways of using AVS. Firstly, AVS has a number of data visualizers called viewers. These are ready-to-use visualization packages for a variety of data types.

- Image viewer is an interactive 2D image processing and display package.
- Geometry viewer is an interactive 3D geometric data renderer.
- Graph viewer is a tool for plotting functions, measurements and statistics using line, bar, area, scatter and contour plots.

Secondly, AVS can be used as a prototyping tool. New algorithms and methods can be designed and tested using visual programming, i.e. choosing and interconnecting program modules graphically as shown in Figure 6.1. The prototyping subsystems are:

- Network editor, a visual programming tool.
- Module generator, an integration tool for user-supplied code.
- Layout editor, a graphical user interface design tool.
- Command Language Interpreter, a scripting language and callable interface.

AVS users can construct their own visualization applications by combining software components into executable flow networks. The components, called modules, are sub-programs or functions which are called by the AVS kernel. The flow networks are built from a menu of modules by using direct manipulation, a visual programming interface called the AVS Network Editor.
With the Network Editor, the user produces an application by selecting a group of modules and drawing connections between them. The AVS Command Language Interpreter (CLI) is a text language that can drive most of the AVS system. It can be used for making animations, saving networks, remote control, etc. Any module can pass CLI commands to the AVS kernel, i.e. any module can modify parameters, rearrange networks, invoke and delete other modules, etc.

Figure 6.1: The AVS network editor. The window to the left is the network control panel. This is the default location for module widgets. The large window is the network construction window. The upper part contains a network editor menu to the left, and a palette of modules in the current library to the right. The rest of the network construction window is a workspace. It contains a sample network that reads an image from a file, displays it, runs a Sobel operator and displays the result.

AVS includes a rich set of modules for the construction of networks. It allows users to create their own new modules to meet their specific needs and dynamically load them into networks. A module generator can be used to automatically generate module code in FORTRAN or C. Both ANSI C prototypes and C++ classes are available. The module generator creates skeletons for Makefiles and manual pages as well.

6.1.1 Module Libraries

About 150 standard supported modules are included in the AVS software provided by AVS Inc. In addition to this, a number of shareware modules are available. The International AVS Center, housed at the North Carolina Supercomputing Center, is a center for collection and distribution of new modules. Hundreds of user-contributed and ported public domain modules are available there and can be retrieved via ftp (including documentation). Examples of recently ported public domain modules are Khoros (approx. 250 modules) and Sunvision.
A database facilitates searching for modules suited for certain applications. The center also distributes a quarterly magazine, the AVS Network News, presenting novelties concerning AVS software and applications.

The module library developed at the Computer Vision Laboratory contains about 100 modules developed specially for image processing problems. The robot simulator described below is a part of this library.

6.2 Robot vision simulator modules

Figure 6.2: The basic modules used in the Robot Vision Simulator.

Figure 6.2 shows a set of modules for geometric visualization and control of robots and tools. These are the basic modules in the robot vision simulator. In the actual experiments a number of situation specific modules are used for image processing, path planning, etc. Section 6.3 describes an experiment where the modules below are used as building blocks together with a number of special purpose modules.

Puma 560

The "Puma 560" module generates a geometric description of a Puma 560 robot arm with six degrees of freedom (6 DOF). The robot can be positioned by supplying a transformation matrix to one of the input ports. The robot pose is controlled via the joint angle parameters. There are basically three ways of furnishing joint angles. First, they can be given manually using the widgets on the control panel. Second, a controlling module can be attached to the remote control input port, passing joint angles. Finally, there is also a possibility for a module to access the joint parameter widgets using the "AVScommand" function call. The module does not simulate the dynamics of the robot arm. In order to implement robot dynamics and/or other forms of control parameters, a filter, e.g.
"Puma Dynamics", can be inserted between the module supplying the desired joint parameter values and the "Puma 560" module. The transformation matrix describing the location and orientation of the actuator end point is presented on an output port, enabling attachment of different tools, e.g. "Vacuum Gripper" or "Stereo Camera". In fact, it is possible to put one robot arm at the end of another by connecting the output port of one robot to the input port of another.

Stereo Camera

"Stereo Camera" creates a geometric description of a binocular camera head with 4 DOF. The cameras have independent pans, a variable baseline, and a variable field of view. The camera parameters are manipulated in the same manner as those of the robot arm. The camera head is attached to an actuator by connecting the transformation matrix input of "Stereo Camera" to the corresponding output of "Puma 560" or "Monopod". The transformation matrix of the actuator end point is then transferred to the camera head, making it follow the actuator movements.

Note that "Stereo Camera" has three output geometry ports. One "geometry viewer" module can render any number of views from the same scene, but only one can be presented on the output port. Therefore, two geometry viewers are needed to get a stereo image pair, and one extra is used for overview images. Consequently, the world representation is kept in three separate copies.

Monopod

"Monopod" is a camera head platform, or neck, with 3 DOF: pan, tilt and elevation. The hand of the robot arm and the neck have identical geometry. The platform is positioned by furnishing a transformation matrix on one of the module input ports. If the same transformation matrix is used for both the robot arm and the monopod neck, the result is the configuration in Chapter 5. The neck appears to be mounted on the "shoulder link" of the robot (see cover).
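Attaching a tool or camera head to an actuator end point amounts to composing homogeneous transformation matrices: the pose of the attached device is the end point transform multiplied by a fixed mounting offset. A minimal sketch, where the offsets and the angle are arbitrary illustration values:

```python
import math

def matmul(a, b):
    # 4x4 homogeneous matrix product.
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def translation(tx, ty, tz):
    return [[1, 0, 0, tx],
            [0, 1, 0, ty],
            [0, 0, 1, tz],
            [0, 0, 0, 1]]

def rotation_z(angle):
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0, 0],
            [s,  c, 0, 0],
            [0,  0, 1, 0],
            [0,  0, 0, 1]]

# The camera head follows the actuator: its pose is the end point
# transform composed with a fixed mounting offset (arbitrary units).
arm_end = matmul(translation(0, 0, 50), rotation_z(math.pi / 2))
camera_offset = translation(0, 10, 0)
camera_pose = matmul(arm_end, camera_offset)
```

Chaining works the same way for a robot mounted on another robot: the second robot's base transform is simply the first robot's end point transform.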
Vacuum Gripper

"Vacuum Gripper" is intended to be used together with "Conveyor belt scene" below. It creates a geometric description of a suction cup to be positioned at the end of a robot arm. As the name indicates, the function is the same as lifting small objects with a vacuum cleaner hose. The module has a boolean parameter indicating whether the gripper is active or not. When the suction is activated, the transformation matrix is sent to the output port.

Conveyor belt scene

"Conveyor belt scene", shown in Figure 6.3, has two conveyor belts, each connected to a machine. One machine produces objects, and is hence called the Producer; it puts the objects on a conveyor belt. The selection of object type and creation time can either be externally controlled or at random.

Figure 6.3: The scene generated by "Conveyor belt scene". The figure gives the positions and dimensions of the Producer, the Consumer, the bins, the table and the robot. The dashed circle indicates the working area of the robot in the experiment in Chapter 5, but there are no limitations as to where to position it.

At the end of this belt there is a bin collecting manufactured objects. The other belt transports objects placed on it to the other machine, the Consumer, where they are consumed. The "Vacuum Gripper" output can be attached to the module, enabling manipulation of the manufactured objects. For the robot to lift an object, it has to position the gripper close to the bounding box of the object, orient it perpendicularly to the surface, and activate the suction. At present the simulation of physical objects and their interactions is constrained to a few special situations. Objects are transported along the conveyor belts when put on them and fall to the floor if released in the air.
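The lifting condition described above can be caricatured as a boolean test against the object's bounding box. The tolerance value is an illustrative assumption, and the perpendicularity requirement on the gripper orientation is omitted for brevity:

```python
def can_lift(gripper_pos, box_min, box_max, suction_on, tol=1.0):
    # The gripper picks up an object when suction is active and the
    # suction cup lies within `tol` of the object's bounding box.
    # (The orientation test used in the simulator is not modeled here.)
    near = all(lo - tol <= p <= hi + tol
               for p, lo, hi in zip(gripper_pos, box_min, box_max))
    return suction_on and near
```

A check of this kind is all a simplified physics model needs: once it succeeds, the object can be parented to the gripper's transformation matrix.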
The table, or the floor, can be used for temporary storage of objects. Objects can, however, be moved through each other. Piling objects is thus not possible.

Puma Dynamics, Stereo Head Dynamics and Monopod Dynamics

"Puma Dynamics", "Stereo Head Dynamics" and "Monopod Dynamics" are meant to implement a dynamic model of the robot arm, the camera head, and the monopod respectively. For the time being, the modules restrain the maximum speeds with which the joints can rotate. This feature is important, for instance when using the robot to catch the objects in the conveyor belt scene. Not being able to move instantly from one point in the workspace to another, the system is forced to use some kind of predictor.

Puma Inverse

"Puma Inverse" calculates the inverse kinematics [28], translating (if possible) the desired transformation matrix of the end effector to robot joint angles. If both the robot transformation matrix and the end effector matrix are presented, the latter is interpreted as being in world coordinates. If only the end effector matrix is furnished, it is interpreted in robot centered coordinates.

Camera simulation

Figure 6.4: Motion blur created by the "camera simulation" module.

"camera simulation" is a module for introducing noise and motion blur into the images (Figure 6.4). The blur is controlled with a decay rate parameter α, 0 ≤ α ≤ 1, where α = 1 means no blur at all and α = 0 means no image update, i.e. a frozen image. The noise is controlled with a corresponding factor γ, 0 ≤ γ ≤ 1, where γ = 0 means no noise and γ = 1 means noise only. The function generating the output image at time t is:

i_out(t) = (1 − γ) α i_in(t) + (1 − α) i_out(t − Δt) + γ N(t)    (6.1)

where i_out(t − Δt) is the preceding output image, and N(t) is a white noise image. The pixel values in the noise image are uniformly distributed in the interval [0, 1[.
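One update step of this recursion can be sketched per pixel as follows, assuming the reading of the garbled equation with α as the decay rate and γ as the noise factor:

```python
import random

def camera_simulation(i_in, i_out_prev, alpha, gamma, rng=random.random):
    # One step of the update, applied per pixel:
    # i_out = (1 - gamma) * alpha * i_in
    #         + (1 - alpha) * i_out_prev
    #         + gamma * N,   with N uniform in [0, 1[.
    return [(1 - gamma) * alpha * p_in
            + (1 - alpha) * p_prev
            + gamma * rng()
            for p_in, p_prev in zip(i_in, i_out_prev)]
```

With alpha = 1 and gamma = 0 the input frame is passed through unchanged; with alpha = 0 the previous output is held, giving a frozen image; intermediate alpha values produce an exponentially decaying trail, i.e. motion blur.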
Transformations

"Transformations" produces a transformation matrix in homogeneous coordinates from translation, rotation, and scale parameters. The module is used for positioning the robot, monopod, etc.

6.3 Example of an experiment

The following example shows how a complex system can be built in AVS, see Figure 6.5. These AVS networks are used in the experiments described in Chapter 5.

Figure 6.5: This AVS network corresponds to the experiments described in Chapter 5. The "Virtual reality" and "Robot Vision" modules are macro modules that correspond to the networks in Figure 6.6 and Figure 6.7 respectively. The upstream connections to the "Virtual reality" module from "Occular reflexes" are here made invisible in order not to clutter the network.

The net is driven by "animated float" acting as a clock pulse, activating connected modules. Modules not connected to the clock execute when they are called by upstream modules. Note that a module connected both to the "animated float" module and other upstream modules normally waits for the upstream modules to finish before executing, but this can be overridden by the programmer.

The "Virtual reality" module simulates the environment and generates images, while the "Robot Vision" module analyses these images. These modules are macro modules that contain whole networks. They are therefore separately described in more detail in Subsection 6.3.1.

"Object tracker dT" extracts the image point that corresponds to the centroid of the temporal differences in each of the right and left images. The tracking is similar to the method for locating symmetries described in Subsection 4.4.7. The output from the module is the vector to the centroid in image coordinates for each image.
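The centroid computation can be sketched as follows for a plain two-frame difference (the actual module works on fovea data, which is not modeled here):

```python
def difference_centroid(frame_prev, frame_curr):
    # Centroid of the absolute temporal differences between two
    # frames; the returned vector is in image coordinates (row, col).
    total = r_sum = c_sum = 0.0
    for r, (row_p, row_c) in enumerate(zip(frame_prev, frame_curr)):
        for c, (p, q) in enumerate(zip(row_p, row_c)):
            d = abs(q - p)
            total += d
            r_sum += d * r
            c_sum += d * c
    if total == 0:
        return None  # nothing moved
    return (r_sum / total, c_sum / total)
```

A moving object produces differences around its silhouette, so the centroid points at the object; a static scene yields no differences and hence no tracking target.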
"Occular reflexes" orients the head towards the moving object according to the information from "Object tracker dT", using the tilt and pan joints on the monopod. Exactly the same control strategy as in Subsection 4.4.8 is used, since the hand of the Puma 560 is identical to the neck of the monopod. The module also verges the cameras to look at the same point according to the disparity estimates. Note that the upstream connections to the "Virtual reality" module from "Occular reflexes" are invisible, but the input ports where they attach are visible. The reason for this is purely esthetic.

6.3.1 Macro modules

Macro modules are a way of organizing large networks into logical clusters of modules. Collecting a number of modules into a macro module does not affect the execution order of the modules. From the scheduler's point of view all modules might as well be connected in one large network.

Figure 6.6: The network corresponding to the "Virtual reality" macro module in Figure 6.5.

Virtual reality

"Virtual reality" is a macro module containing the network shown in Figure 6.6. It generates the geometric description of the conveyor belt scene, the robot arm, the camera head, etc. It also renders the images from the stereo camera head using the "geometry viewer" module. The "antialias" modules reduce the aliasing effects in the images by means of low pass filtering and subsampling. The ruggedness of edges and lines typical of computer generated images is then reduced. Note that the images have to be rendered with twice the resolution needed. "Conveyor belt scene" manages the transformation of manufactured objects and handles robot interaction.
Note that it is connected to the clock pulse, since it calculates the dynamic behavior of the objects and therefore needs to know the time. The robot hand can be positioned and oriented using any module producing a transformation matrix, e.g. "Transformations". "Puma Inverse" calculates the inverse kinematics, translating (if possible) the desired transformation matrix of the end effector to robot joint angles. The "Scheduler BUG workaround" module makes the network wait for all modules in "Virtual reality" before continuing. Appendix A describes why it is necessary.

Robot Vision

Figure 6.7 shows the network corresponding to the "Robot Vision" macro module. Log-Cartesian fovea representations of the luminance of the left and right images are created in the "Float luminance" and "Create fovea" modules. Certainty masks are generated in "model based inhibition", both for the robot arm and for modeled objects, as described in Chapter 5. Phase based disparity estimates are generated by the "Robot NDCfovea stereo" module. The module uses a fovea version of the method described in Chapter 3 in combination with the normalized convolution scheme presented in Chapter 5.

Figure 6.7: The sub-network corresponding to the "Robot Vision" macro module in Figure 6.5.

A vector field pointing towards the centroid of temporal differences is created by "fovea dT" and "Fovea distmass". The vector field can be interpreted as the gradients of a potential field around moving objects.

6.4 Simulation versus reality

Simulated environments are powerful tools for developing robot vision algorithms.
New robot types and configurations can be tested without having to buy new expensive hardware. Interface, electrical and control problems can be bypassed. Ground truth is known. Dynamic events can be run in slow motion, and so forth. This blessing might, on the other hand, be a curse. Algorithms working perfectly in the simulated environment might fail completely on natural images. Control strategies might break down due to latencies [12] in a real system that did not exist in the simulated world, etc. In order to address this problem, all feature extraction algorithms have been tested on real images. The "camera simulation" module is an attempt in this direction as well.

The control problem is harder to address. The problem of delays in the image processing is still there in a simulated environment. The length of the delay is mostly constant and known, which might not correspond to a real situation. In Figure 6.5 the network is synchronous, meaning that all modules are running according to a common clock, defining "real-time" for the entire system. This type of simulation does not reflect the difficulties of having a real-time running world and dedicated hardware running in its own time connected to general purpose hardware working even slower. Another disadvantage of a synchronous system is that it forces the higher levels to work with the same temporal resolution as the lower levels, which is probably very inefficient.

The setup in Figure 6.8 is a possible solution to this problem. The system is spread over a number of AVS sessions, each with a separate scheduler and thus a separate clock. The sessions execute independently and communicate asynchronously, which makes it more realistic than the system in Figure 6.5. Real-time control loops are kept in "session A", while more time consuming analysis is carried out in "session B".
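The two-session idea can be caricatured by two loops running at different rates and exchanging data through queues, so that the fast control loop never waits for the slow analysis. The tick counts, rates and message format below are invented purely for illustration:

```python
from collections import deque

# Hypothetical sketch: "session A" (fast control) and "session B"
# (slow analysis) exchange data asynchronously through queues.
to_analysis = deque()
to_control = deque()
control_log = []

for tick in range(6):            # fast "session A" clock
    to_analysis.append(tick)     # post the newest observation
    # Pick up an object hypothesis if one is ready; never block.
    hypothesis = to_control.popleft() if to_control else None
    control_log.append(hypothesis)
    if tick % 3 == 2:            # slow "session B" runs every 3rd tick
        latest = to_analysis.pop()
        to_analysis.clear()      # stale frames are simply dropped
        to_control.append(f"model@{latest}")
```

The essential property is that each loop keeps its own notion of time: the control loop polls without blocking, and the analysis loop works on the latest available data, discarding what it cannot keep up with.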
There is no limit to the number of AVS sessions that can be connected. Future research will explore the viability of a system such as that in Figure 6.8.

6.5 Summary

The robot simulator has played a major role in the research that forms the basis of this thesis. It allows the user to gain insight into the problems concerning the generation of robot joint angles from visual input, without drowning in control problems; for instance, how camera motion is enabled but also limited by the robot's degrees of freedom. Hand-eye coordination tasks can be studied as well.

One of the demonstrators in the VAP project is built around the conveyor belt scene in Figure 6.3. A robot system consisting of a Puma 560 arm, a vacuum gripper, and a stereo camera head is used for inspecting manufactured machine parts. The machine parts pass by on one conveyor belt, and only correctly manufactured objects should be allowed to continue to the bin. Objects with small defects are to be returned on another conveyor belt, while severely defective objects and any other objects should be discarded in a special bin. If objects arrive faster than the system can inspect them, a local stack, the table, can be used. This demonstrator contains both low level reactive behaviors and high level planning. The research issue is to design a multi-level control policy that can handle both expected and unexpected events, and adapt its behavior according to the situation. The experience from these experiments, combined with the expertise on real camera heads in other member groups, enables faster progress than doing only simulations or only real experiments.
Figure 6.8: A system consisting of two AVS sessions combined with asynchronous communication modules. Session A runs the environment simulation together with real-time control (vergence, smooth pursuit, hand-eye coordination) and real-time image processing (spatio-temporal filtering, phase based stereo, colour transform); session B performs object hypothesis generation, action planning, world model maintenance, image processing control and goal driven focus of attention.

A AVS PROBLEMS AND PITFALLS

A.1 Module scheduling problems

Modules can run locally on the same host or remotely on other hosts, possibly with different architectures. Modules may also run in the same Unix process, saving resources and enhancing overall performance. In Figure 6.7 the data flow can be divided into four separate streams, which makes it possible to run groups of modules in parallel on four different hosts. In the experiments, four SUN IPX machines were used, and "Robot NDCfovea Stereo" was run on a high performance number cruncher, a Stardent GS2500.

AVS handles the scheduling of which modules to execute and when. Generally it works as expected, but there are two serious bugs in the scheduler. The first bug concerns the "geometry viewer" module. It is a special module in the sense that it is built into AVS and not supplied as a separate executable, as most other modules are. This fact seems to confuse the scheduler. Figure A.1 shows a network with a "geometry viewer" in the middle. If a new geometry is read by "read geometry", the "geometry viewer" executes followed by "field math" and "display image", which is the correct behavior. Similarly, if a new image is read by "read image", the desired behavior is that "field math" waits for "geometry viewer" to finish before executing.
Instead the "field math" module executes before the "geometry viewer", and then once more when the "geometry viewer" has finished. If the "geometry viewer" had been any other module, the network would have executed as expected.

Figure A.1: A sample network that executes in an undesired order if a new image is read with the "read image" module. The "field math" module will execute twice instead of waiting for the "geometry viewer" to finish.

The "Scheduler BUG workaround" module in Figure 6.6 solves the problem by stopping all data from leaving the "Virtual reality" module before all "geometry viewer" modules have executed. This solution is unsatisfactory since it requires all data to be copied from input to output, which is both time and memory consuming.

The second scheduler problem concerns parallel module executions and feedback data streams. AVS can schedule parallel execution or feedback data streams, but not both. A network containing both parallel execution and feedback data can be made to execute correctly by carefully starting up the modules, but it is highly unstable. Any interaction with the network might cause it to start executing in an undesired order. Both these problems might disappear with the new release of AVS (AVS6). The control structure is then completely changed, giving more control to the programmer. Time will tell.

A.2 Texture mapping problems

Figure A.2: Left: A texture is mapped onto an object by linear interpolation between the image points a and b. The result is a non-equidistant mapping onto the object. Right: A texture is mapped onto an object by linear interpolation between the object coordinates A and B.

Texture mapping enhances the naturalistic look of simulated objects since the surface structure of real objects can be captured.
It allows having detailed structures even though the objects are defined with a few polygons. Almost all computer graphics systems, AVS included, use linear interpolation in image coordinates for textures. This may make a flat surface look warped. The reason for this is shown to the left in Figure A.2, where the image of an object between the points A and B is rendered between the image points a and b. The dashed rays show how the texture pixels are projected onto the object when linear interpolation in the image plane is used. The rays go through equidistant points in the image plane, but are not depicted on equidistant points on the object surface.

The effect can be reduced by making a finer tessellation of the surface, which means adding vertices between A and B. The texture will then be correctly mapped at these points as well, and the distortion between the vertices will be smaller. However, for image sequences, e.g. from moving cameras, this method is not satisfactory. Textures seem to float around when the camera is moved. In principle it is possible to add the texture when creating the objects and create one polygon vertex for each pixel in the texture map. The objects will then consist of patches colored according to the texture, as opposed to coloring the projected image of the objects. This method yields better results but is very memory and time consuming.

The appropriate method for texture mapping is shown to the right in Figure A.2. If the depth to the points is taken into account, the texture is mapped onto equidistant points along the object, but not onto equidistant points in the image plane. The reason for not using the proper approach in most computer graphics systems is that it is more complex, and thus more time consuming.

References

[1] A. L. Abbott and N. Ahuja. Surface reconstruction by dynamic integration of focus, camera vergence and stereo. In Proceedings IEEE Conf. on Computer Vision, pages 523-543, 1989.

[2] E. H. Adelson and J. R. Bergen.
Spatiotemporal energy models for the perception of motion. Jour. of the Opt. Soc. of America, 2:284-299, 1985.

[3] J. Y. Aloimonos, I. Weiss, and A. Bandopadhay. Active vision. International Journal of Computer Vision, 1(3):333-356, 1987.

[4] R. Bajcsy. Passive perception vs. active perception. In Proc. IEEE Workshop on Computer Vision, 1986.

[5] R. Bajcsy. Active perception. Proceedings of the IEEE, 76(8):996-1005, August 1988.

[6] D. H. Ballard. Animate vision. Technical Report 329, Computer Science Department, University of Rochester, Feb. 1990.

[7] D. H. Ballard and A. Ozcandarli. Eye fixation and early vision: kinetic depth. In Proceedings 2nd IEEE Int. Conf. on Computer Vision, pages 524-531, December 1988.

[8] S. T. Barnard and M. A. Fischler. Computational stereo. ACM Comput. Surv., 14:553-572, 1982.

[9] J. Bigun. Local Symmetry Features in Image Processing. PhD thesis, Linkoping University, Sweden, 1988. Dissertation No 179, ISBN 91-7870-334-4.

[10] R. Bracewell. The Fourier Transform and its Applications. McGraw-Hill, 2nd edition, 1986.

[11] C. M. Brown. The Rochester robot. Technical Report 257, Computer Science Department, University of Rochester, Aug. 1988.

[12] C. M. Brown. Gaze control with interactions and delays. IEEE Systems, Man and Cybernetics, 20(1):518-527, March 1990.

[13] C. M. Brown. Prediction and cooperation in gaze control. Biological Cybernetics, 63:61-70, 1990.

[14] K. Brunnstrom. Active Exploration of Static Scenes. PhD thesis, Royal Institute of Technology, October 1993. ISRN KTH/NA/P-93/29-SE, ISSN 1101-2250.

[15] K. Brunnstrom, J. O. Eklundh, and T. Lindeberg. Active detection and classification of junctions by foveating with a head-eye system guided by the scale-space primal sketch. Technical Report TRITA-NA-P9131, CVAP, NADA, Royal Institute of Technology, Stockholm, Sweden, 1990.

[16] A. D. Calway, H. Knutsson, and R. Wilson. Multiresolution estimation of 2-d disparity using a frequency domain approach. In Proc.
British Machine Vision Conf., Leeds, UK, September 1992.
[17] A. D. Calway, H. Knutsson, and R. Wilson. Multiresolution frequency domain algorithm for fast image registration. In Proc. 3rd Int. Conf. on Visual Search, Nottingham, UK, August 1992.
[18] A. Chehikian and J. L. Crowley. Fast computation of optimal semi-octave pyramids. In Proceedings of the 7th Scandinavian Conf. on Image Analysis, pages 18-27, Aalborg, Denmark, 1991. Pattern Recognition Society of Denmark and Aalborg University.
[19] C. C. Chen and M. M. Trivedi. SAVIC: A simulation, visualization, and interactive control environment for mobile robots. In H. I. Christensen, K. W. Bowyer, and H. Bunke, editors, Active Robot Vision: Camera Heads, Model Based Navigation and Reactive Control, volume 6 of Series in Machine Perception and Artificial Intelligence, pages 123-144. World Scientific Publishing Co. Pte. Ltd., 1993. ISBN 981-02-1321-2.
[20] ChuXin Chen and Mohan M. Trivedi. Mobile robots with articulated tracks and manipulators: Intelligent control and graphical interface for teleoperation. In Mobile Robots VII, volume 1831, pages 592-603. SPIE, 1992.
[21] H. I. Christensen, K. W. Bowyer, and H. Bunke, editors. Active Robot Vision: Camera Heads, Model Based Navigation and Reactive Control, volume 6 of Series in Machine Perception and Artificial Intelligence. World Scientific Publishing Co. Pte. Ltd., 1993. ISBN 981-02-1321-2.
[22] J. L. Crowley and H. I. Christensen, editors. Vision as Process, ESPRIT Basic Research Series. Springer-Verlag, 1994. ISBN 3-540-58143-X.
[23] S. Culhane and J. Tsotsos. An attentional prototype for early vision. In Proceedings of the 2nd European Conf. on Computer Vision, Santa Margherita Ligure, Italy, May 1992.
[24] M. M. Fleck. A topological stereo matcher. Int. Journal of Computer Vision, 6(3):197-226, August 1991.
[25] D. J. Fleet. Measurement of Image Velocity. Kluwer Academic Publishers, 1992. ISBN 0-7923-9198-5.
[26] D. J. Fleet and A. D. Jepson.
Stability of phase information. In Proceedings of IEEE Workshop on Visual Motion, pages 52-60, Princeton, USA, October 1991. IEEE, IEEE Society Press.
[27] D. J. Fleet, A. D. Jepson, and M. R. M. Jenkin. Phase-based disparity measurement. CVGIP: Image Understanding, 53(2):198-210, March 1991.
[28] K. S. Fu, R. C. Gonzales, and C. S. G. Lee. Robotics. McGraw-Hill Int. Editions, New York, 1987.
[29] D. Gabor. Theory of communication. Proc. Inst. Elec. Eng., 93(26):429-441, 1946.
[30] M. Gokstorp and C-J. Westelius. Multiresolution disparity estimation. In Proceedings of the 9th Scandinavian Conference on Image Analysis, Uppsala, Sweden, June 1995. SCIA.
[31] Mats Gokstorp. Depth Computation in Robot Vision. PhD thesis, Linkoping University, S-581 83 Linkoping, Sweden, 1995. Dissertation No. 377, ISBN 91-7871-522-9.
[32] G. H. Granlund. In search of a general picture processing operator. Computer Graphics and Image Processing, 8(2):155-178, 1978.
[33] G. H. Granlund. Integrated analysis-response structures for robotics systems. Report LiTH-ISY-I-0932, Computer Vision Laboratory, Linkoping University, Sweden, 1988.
[34] G. H. Granlund and H. Knutsson. Signal Processing for Computer Vision. Kluwer Academic Publishers, 1995. ISBN 0-7923-9530-1.
[35] G. H. Granlund, H. Knutsson, C-J. Westelius, and J. Wiklund. Issues in robot vision. Image and Vision Computing, 12(3):131-148, April 1994.
[36] L. Haglund. Adaptive Multidimensional Filtering. PhD thesis, Linkoping University, S-581 83 Linkoping, Sweden, October 1992. Dissertation No 284, ISBN 91-7870-988-1.
[37] O. Hansen and J. Bigun. Local symmetry modeling in multidimensional images. Pattern Recognition Letters, 13(4), 1992.
[38] D. H. Hubel. Eye, Brain and Vision, volume 22 of Scientific American Library. W. H. Freeman and Company, 1988. ISBN 0-7167-5020-1.
[39] A. D. Jepson and D. J. Fleet. Scale-space singularities. In O. Faugeras, editor, Computer Vision - ECCV90, pages 50-55.
Springer-Verlag, 1990.
[40] A. D. Jepson and M. Jenkin. The fast computation of disparity from phase differences. In Proceedings CVPR, pages 386-398, San Diego, California, USA, 1989.
[41] B. Julesz. Early vision and focal attention. Review of Modern Physics, 63(3):735-772, 1991.
[42] K. Kanatani. Camera rotation invariance of image characteristics. Computer Vision, Graphics and Image Processing, 39(3):328-354, Sept. 1987.
[43] J. Karlholm, C-J. Westelius, C-F. Westin, and H. Knutsson. Object tracking based on the orientation tensor concept. In Proceedings of the 9th Scandinavian Conference on Image Analysis, Uppsala, Sweden, June 1995. SCIA.
[44] H. Knutsson. Filtering and Reconstruction in Image Processing. PhD thesis, Linkoping University, Sweden, 1982. Diss. No. 88.
[45] H. Knutsson and G. H. Granlund. Fourier domain design of line and edge detectors. In Proceedings of the 5th International Conference on Pattern Recognition, Miami, Florida, December 1980.
[46] H. Knutsson, G. H. Granlund, and J. Bigun. Apparatus for detecting sudden changes of a feature in a region of an image that is divided into discrete picture elements. Swedish patent 8502571-6 (US Patent 4,747,150, 1988), 1986.
[47] H. Knutsson, M. Hedlund, and G. H. Granlund. Apparatus for determining the degree of consistency of a feature in a region of an image that is divided into discrete picture elements. Swedish patent 8502570-8 (US Patent 4,747,152, 1988), 1986.
[48] H. Knutsson and C-F. Westin. Normalized and differential convolution: Methods for interpolation and filtering of incomplete and uncertain data. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, New York City, USA, June 1993. IEEE.
[49] H. Knutsson, C-F. Westin, and C-J. Westelius. Filtering of uncertain irregularly sampled multidimensional data. In Twenty-seventh Asilomar Conf. on Signals, Systems & Computers, Pacific Grove, California, USA, November 1993. IEEE.
[50] E. Krotkov.
Exploratory visual sensing for determining spatial layout with an agile stereo camera system. PhD thesis, University of Pennsylvania, April 1987.
[51] K. Langley, T. J. Atherton, R. G. Wilson, and M. H. E. Larcombe. Vertical and horizontal disparities from phase. In O. Faugeras, editor, Computer Vision - ECCV90, pages 315-325. Springer-Verlag, April 1990.
[52] J. C. Latombe. Robot Motion Planning. Kluwer Academic Publishers, 1991. ISBN 0-7923-9129-2.
[53] B. Maclennan. Gabor representations of spatiotemporal visual images. Technical Report CS-91-144, Computer Science Department, University of Tennessee, September 1991.
[54] D. Marr. Vision. W. H. Freeman and Company, New York, 1982.
[55] J. Matas, R. Marik, and J. Kittler. Generation, verification and localisation of object hypotheses based on colour. In British Machine Vision Conference, pages 539-548, 1993.
[56] J. Matas, R. Marik, and J. Kittler. Illumination invariant colour recognition. In E. Hancock, editor, British Machine Vision Conference, pages 469-479. BMVA, BMVA Press, 1994.
[57] R. Milanese. Focus of attention in human vision: a survey. Technical Report 90.03, Computing Science Center, University of Geneva, Geneva, August 1990.
[58] R. Milanese. Detection of salient features for focus of attention. In Proc. of the 3rd Meeting of the Swiss Group for Artificial Intelligence and Cognitive Science, Biel-Bienne, October 1991. World Scientific Publishing.
[59] D. Murray and A. Basu. Motion tracking with an active camera. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5):449-459, May 1994.
[60] K. Pahlavan. Active Robot Vision and Primary Ocular Reflexes. PhD thesis, Royal Institute of Technology, May 1993. ISSN 1101-2250.
[61] L. H. Quam. Hierarchical warp stereo. In Proceedings from DARPA Image Understanding Workshop, pages 149-155, 1984.
[62] D. Reisfeld, H. Wolfson, and Y. Yeshurun. Context free attentional operators: the generalized symmetry transform.
International Journal of Computer Vision, 1994. Special issue on qualitative vision.
[63] T. D. Sanger. Stereo disparity computation using Gabor filters. Biological Cybernetics, 59:405-418, 1988.
[64] E. L. Schwartz. Computational anatomy and functional architecture of striate cortex: A spatial mapping approach to perceptual coding. Vision Research, 20:645-669, 1980.
[65] M. Tistarelli and G. Sandini. Direct estimation of time-to-impact from optical flow. In Proceedings of IEEE Workshop on Visual Motion, pages 52-60, Princeton, USA, October 1991. IEEE, IEEE Society Press.
[66] J. K. Tsotsos. Localizing stimuli in a sensory field using an inhibitory attentional beam. Technical Report RBCV-TR-91-37, Department of Computer Science, University of Toronto, October 1991.
[67] J. K. Tsotsos. On the relative complexity of active vs. passive visual search. Int. Journal of Computer Vision, 7(2):127-142, January 1992.
[68] J. van der Spiegel, G. Kreider, C. Claeys, I. Debusschere, G. Sandini, P. Dario, F. Fantini, P. Bellutti, and G. Soncini. A foveated retina-like sensor using CCD technology. In C. Mead and M. Ismael, editors, Analog VLSI Implementation of Neural Systems. Kluwer, 1989.
[69] ESPRIT Basic Research Action 3038, Vision as Process, final report. Project document, April 1992.
[70] J. Y. A. Wang and E. H. Adelson. Layered representation for motion analysis. In IEEE Conference on Computer Vision and Pattern Recognition, pages 361-366, June 1993.
[71] J. Weng. Image matching using the windowed Fourier phase. International Journal of Computer Vision, 11(3):211-236, March 1993.
[72] C-J. Westelius. Preattentive gaze control for robot vision, June 1992. Thesis No. 322, ISBN 91-7870-961-X.
[73] C-J. Westelius and H. Knutsson. Hierarchical disparity estimation using quadrature filter phase. International Journal on Computer Vision, 1995. Special issue on stereo, (submitted).
[74] C-J. Westelius, H. Knutsson, and G. H. Granlund. Focus of attention control.
Report LiTH-ISY-I-1140, Computer Vision Laboratory, Linkoping University, Sweden, 1990.
[75] C-J. Westelius, H. Knutsson, and G. H. Granlund. Focus of attention control. In Proceedings of the 7th Scandinavian Conference on Image Analysis, pages 667-674, Aalborg, Denmark, August 1991. Pattern Recognition Society of Denmark.
[76] C-J. Westelius, H. Knutsson, and G. H. Granlund. Preattentive gaze control for robot vision. In Proceedings of Third International Conference on Visual Search. Taylor and Francis, 1992.
[77] C-J. Westelius, H. Knutsson, and J. Wiklund. Robust vergence control using scale-space phase information. Report LiTH-ISY-I-1363, Computer Vision Laboratory, Linkoping University, Sweden, 1992.
[78] C-J. Westelius and C-F. Westin. A colour representation for scale-spaces. In The 6th Scandinavian Conference on Image Analysis, pages 890-893, Oulu, Finland, June 1989.
[79] C-J. Westelius and C-F. Westin. Representation of colour in image processing. In Proceedings of the SSAB Conference on Image Analysis, Gothenburg, Sweden, March 1989. SSAB.
[80] C-J. Westelius, C-F. Westin, and H. Knutsson. Focus of attention mechanisms using normalized convolution. IEEE Trans. on Robotics and Automation, 1996. Special section on robot vision, (submitted).
[81] C-F. Westin. Feature extraction based on a tensor image description, September 1991. Thesis No. 288, ISBN 91-7870-815-X.
[82] C-F. Westin. A Tensor Framework for Multidimensional Signal Processing. PhD thesis, Linkoping University, S-581 83 Linkoping, Sweden, 1994. Dissertation No 348, ISBN 91-7871-421-4.
[83] C-F. Westin and C-J. Westelius. A colour model for hierarchical image processing. Master's thesis, Linkoping University, Sweden, August 1988. LiTH-ISY-EX-0857.
[84] J. Wiklund and H. Knutsson. A generalized convolver. In Proceedings of the 9th Scandinavian Conference on Image Analysis, Uppsala, Sweden, June 1995. SCIA.
[85] J. Wiklund, C-J. Westelius, and H. Knutsson.
Hierarchical phase based disparity estimation. In Proceedings of 2nd Singapore International Conference on Image Processing. IEEE Singapore Section, September 1992.
[86] J. Wiklund, C-F. Westin, and C-J. Westelius. AVS, Application Visualization System, software evaluation report. Report LiTH-ISY-R-1469, Computer Vision Laboratory, S-581 83 Linkoping, Sweden, 1993.
[87] R. Wilson and H. Knutsson. A multiresolution stereopsis algorithm based on the Gabor representation. In 3rd International Conference on Image Processing and Its Applications, pages 19-22, Warwick, Great Britain, July 1989. IEE. ISBN 0 85296 382 3, ISSN 0537-9989.
[88] A. L. Yarbus. Eye Movements and Vision. Plenum, New York, 1969.
