Motion and Structure. Application to feature-oriented coding

As mentioned in Chapters 1 and 3, the apparent motion vector field observed (and estimated) within an image sequence results from the projection of 3-D objects and of their 3-D motion onto the 2-D image plane. This projection, perspective or orthographic depending on the imaging system selected, creates ambiguities about the apparent 2-D motions perceived and, moreover, does not yield a compact representation of the motion information itself. Indeed, consider a rigid 3-D body undergoing 3-D motion: this motion is wholly specified by a small number of parameters (generally six degrees of freedom) through the kinematic screw (translation + rotation) associated with the object and referenced with respect to an absolute fixed frame. The same 3-D motion, observed through the 2-D apparent motion vector field, is on the other hand much more complex to analyze and to represent. A more compact representation, and a more effective estimation of complex motions which are not purely translational parallel to the image plane, are the two essential arguments in favour of higher-level modelling of the motions and structures of the objects manipulated.

All the motion estimation techniques detailed in the preceding chapters were limited:

to a local estimation per pixel, for which the representation of motion by its apparent motion vector (\dot{x}, \dot{y})^t = (dx/dt, dy/dt)^t = (u, v)^t - two translational components - is adequate. Clearly, it is meaningless to speak of the rotational motion of an object restricted to a single pixel;

to a global estimation of a translation vector (u, v)^t per block (block matching) or per region. This representation of the apparent motion field only makes it possible to model and identify, for each object (region, block, ...), a constant and purely translational motion parallel to the image plane, which constitutes a very restrictive class of the 3-D motions of an actual natural scene. Let us recall that in the case of sensor motions which are not purely translational parallel to the image plane, which is often the case in televisual scenes (tilt, panning, translations parallel to the optical axis, ...), the apparent motion vector field cannot be correctly represented over regions or blocks by a simple 2-D translation.

As far as the modelling and identification of 3-D motion parameters are concerned, several possibilities exist. First (Section 8.1), we recall the geometrical relations between 3-D motions, 3-D structures (i.e., the 3-D geometry of objects) and apparent 2-D motions in the case of the perspective projection system. The particular cases of the description of objects by planar facets and of low-order parametrized approximations of the motion vector field (1st order: affine models, 2nd order: quadratic models) are detailed. As far as the resolution methods and the application frameworks envisaged are concerned, we present separately:

the monocular case, where a unique (possibly moving) sensor perceives the dynamic scene and, through spatio-temporal observations, tries to recover both the motion information and that concerning the structure of the objects. Within coding schemes, the applications concern compression methods ("second" generation, very low bit rates) and analysis/synthesis techniques based on the extraction of high-level global primitives.
the stereoscopic case, where several sensors (2 or even 3) simultaneously perceive the same dynamic scene, which makes it possible to identify, either in parallel or jointly, the structural and motion parameters of the 3-D objects which constitute the scene. Many studies have been devoted to stereo-motion cooperation within the field of Artificial Vision, primarily with the aim of 3-D reconstruction of objects or of robot navigation in complex environments. More recently, for 3-D TV or dynamic stereoscopic sequence restitution applications (CAD of 3-D objects, computer-assisted surgical operations, ...), these techniques have also been studied with the aim of improving image reconstruction quality after an analysis/synthesis or compression/decompression phase. Whilst remaining at the heart of similar motion estimation schemes, the bi- or tri-nocular stereoscopic case makes it possible to enlarge the observation space and to resolve some ambiguities in temporal occlusion regions. Some simulation results for predictive coding schemes with motion compensation will be given, which make it possible to measure the performance of the associated estimators.

1 Models and descriptors of 3-D motions

1.1 Relations between 3-D motions and apparent motions

Let us recall the geometric relations which link the 3-D motion vector \vec{V} = (\dot{X}, \dot{Y}, \dot{Z})^t of a point (X, Y, Z)^t on the surface of a moving object and its projection (\dot{x}, \dot{y})^t = (u, v)^t on the image plane. We examine the case of the perspective projection system where

x = f X/Z,   y = f Y/Z   (1)

In order to simplify the notation, the term f, which designates the ratio focal length/pixel size, is taken with a normalized value of 1. The 3-D motion vector \vec{V} can be expressed using the instantaneous translation vector \vec{T} and the instantaneous rotation vector \vec{\Omega} of the kinematic screw associated with the moving object [26], i.e.,

\vec{V} = \vec{T} + \vec{\Omega} \wedge (X, Y, Z)^t   (2)

which is expressed, by components, as

\dot{X} = T_X + \Omega_Y Z - \Omega_Z Y
\dot{Y} = T_Y + \Omega_Z X - \Omega_X Z   (3)
\dot{Z} = T_Z + \Omega_X Y - \Omega_Y X

In the same way, the components of the apparent motion vector associated with the point (x, y) in the image plane are defined, in the case of perspective projection, by

\dot{x} = u = (\dot{X} - x\dot{Z})/Z,   \dot{y} = v = (\dot{Y} - y\dot{Z})/Z   (4)

which, after substitution in Equation (4) of the expressions defined in Equation (3), gives

\dot{x} = T_X/Z + \Omega_Y - (T_Z/Z)\,x - \Omega_Z y - \Omega_X xy + \Omega_Y x^2
\dot{y} = T_Y/Z - \Omega_X - (T_Z/Z)\,y + \Omega_Z x + \Omega_Y xy - \Omega_X y^2   (5)

The relations (5) are fully specified when the term 1/Z is also expressed as a function of the local pixel coordinates (x, y). In order to retain a maximum quadratic order in Equation (5) as a function of the coordinates (x, y), but above all because the structural terms of a geometric surface of order greater than 1 are difficult to identify without bias on real images, a priori hypotheses concerning the regularity of surfaces are introduced. If the term Z (and therefore the term 1/Z) is expressed by a first-order Taylor expansion,

Z = Z_0 + (\partial Z/\partial X)_0 X + (\partial Z/\partial Y)_0 Y + o^2(X, Y) = Z_0 + Z_1 X + Z_2 Y + o^2(X, Y)   (6)

it leads to

1/Z = (1/Z_0)(1 - Z_1 x - Z_2 y) + o^2(x, y)   (7)

subsequently written as

1/Z = n_X x + n_Y y + n_Z + o^2(x, y)   (8)

where (n_X, n_Y, n_Z) specifies the structure of the local surface, which is approximated here by a planar facet (Equation (6)) around (X_0, Y_0, Z_0).
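As an illustration of Equation (5), the following minimal Python/NumPy sketch (not part of the original text; the function name, array layout and the centering of pixel coordinates are our own assumptions) synthesizes the apparent motion field induced by a given kinematic screw (T, Omega) and a depth map Z, with the focal ratio normalized to 1:

```python
import numpy as np

def apparent_flow(T, Omega, Z):
    """Apparent motion field (u, v) of Equation (5) for a kinematic screw
    (T, Omega) and a depth map Z, with normalized focal length f = 1."""
    TX, TY, TZ = T
    OX, OY, OZ = Omega
    h, w = Z.shape
    # pixel coordinates (x, y), centered so that the optical axis pierces the image center
    y, x = np.mgrid[0:h, 0:w].astype(float)
    x -= w / 2.0
    y -= h / 2.0
    u = TX / Z + OY - (TZ / Z) * x - OZ * y - OX * x * y + OY * x**2
    v = TY / Z - OX - (TZ / Z) * y + OZ * x + OY * x * y - OX * y**2
    return u, v

# example: pure translation along the optical axis over a fronto-parallel plane
# (the resulting field is purely divergent around the image center)
u, v = apparent_flow(T=(0.0, 0.0, 1.0), Omega=(0.0, 0.0, 0.0),
                     Z=np.full((8, 8), 10.0))
```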
In practice, the reference point selected will be the center of gravity of the region over which the planar facet approximation (6) or (8) is carried out, namely

(x_g, y_g)^t = (X_0/Z_0, Y_0/Z_0)^t   (9)

1.2 Affine and quadratic models

Equation (5), which links the apparent motion components (\dot{x}, \dot{y})^t to the pixel coordinates, together with the surface approximation carried out in (8), makes it possible to establish a quadratic relation between \vec{v} = (\dot{x}, \dot{y})^t and the coordinates of the point where this measurement is carried out:

\dot{x} = a_1 + a_2 x + a_3 y + a_7 xy + a_8 x^2
\dot{y} = a_4 + a_5 x + a_6 y + a_8 xy + a_7 y^2   (10)

where

a_1 = T_X n_Z + \Omega_Y
a_2 = T_X n_X - T_Z n_Z
a_3 = T_X n_Y - \Omega_Z
a_4 = T_Y n_Z - \Omega_X
a_5 = T_Y n_X + \Omega_Z   (11)
a_6 = T_Y n_Y - T_Z n_Z
a_7 = -T_Z n_Y - \Omega_X
a_8 = -T_Z n_X + \Omega_Y

1.2.1 Justification of the linear approximation

Two sub-models of the local motion vector field can be introduced naturally from Equation (10).

1. a linear model (dim = 6), restricted to the motion parameters (a_1, a_2, a_3, a_4, a_5, a_6). This model is also called an affine model in so far as it makes it possible to identify an affine pixel-based transformation. In fact, if the pixel p_{t+\Delta t} = (x_{t+\Delta t}, y_{t+\Delta t})^t is matched to the pixel p_t = (x_t, y_t)^t by the affine relation

p_{t+\Delta t} = A p_t + B   (12)

then

(\dot{x}, \dot{y})^t \simeq \frac{1}{\Delta t}(p_{t+\Delta t} - p_t) = \frac{1}{\Delta t}((A - I) p_t + B)   (13)

and we again find a linear relation between the motion vector field and the pixel coordinates. An important consequence of this observation is that, when such a linear motion model is used, the properties of affine transformations are used implicitly: in particular the transformation of a line segment into a line segment, of a polygonal region into a polygonal region, and the preservation of convexity.

2. a quadratic model (dim = 8), using all the parameters {a_i}_{i=1...8} defined in Equation (10). We will see that these models, even if they are more complete, come up against two major problems: it turns out to be difficult to obtain an accurate estimation of the quadratic terms from previously estimated 2-D apparent motion measurements; moreover, the model described by Equation (10) is itself a restriction of a general quadratic model (which would contain six quadratic terms) and is only obtained under a first-order approximation of the local surfaces and a rigid motion hypothesis; finally, the use of a quadratic parametric model in motion compensation brings only minor improvements in regions of complex motion and can even prove less efficient than the use of a lower-order parametric model.

1.2.2 Illustration of particular cases of linear modelling

Case 1: If the instantaneous rotation vector \vec{\Omega} = (\Omega_X, \Omega_Y, \Omega_Z)^t is equal to (0, 0, \Omega_Z)^t, that is to say if only rotations around the center of gravity of the region, with the rotation axis parallel to the optical axis, are allowed, then the development (10) becomes:

\begin{pmatrix} \dot{x} \\ \dot{y} \end{pmatrix} = \begin{pmatrix} T_{Xg} \\ T_{Yg} \end{pmatrix} + \begin{pmatrix} k & -\omega \\ \omega & k \end{pmatrix} \begin{pmatrix} x - x_g \\ y - y_g \end{pmatrix} + \begin{pmatrix} T_X n_X & T_X n_Y \\ T_Y n_X & T_Y n_Y \end{pmatrix} \begin{pmatrix} x - x_g \\ y - y_g \end{pmatrix}   (14)

with

(T_{Xg}, T_{Yg})^t = (a_1, a_4)^t = (T_X n_Z, T_Y n_Z)^t, the translation vector of the center of gravity of the region which, as we note, is only defined up to a factor n_Z with respect to the 3-D translation components (similarity factor along the Z axis);
k = -T_Z n_Z and \omega = \Omega_Z, terms which are very often preponderant, corresponding to translation and rotation along the optical axis; the other terms constitute crossed motion and structure terms along the other axes.

Case 2: Simplified linear model (SLM model)

An even rougher modelling of the structural geometry of objects and regions consists of considering the scene as a succession of planar facets parallel to the image plane, in the same way as a z-buffer in computer graphics. This leads to n_X = n_Y = 0 and, consequently,

\begin{pmatrix} \dot{x} \\ \dot{y} \end{pmatrix} = \begin{pmatrix} T_{Xg} \\ T_{Yg} \end{pmatrix} + \begin{pmatrix} k & -\omega \\ \omega & k \end{pmatrix} \begin{pmatrix} x - x_g \\ y - y_g \end{pmatrix}   (15)

The merit of this modelling is that it provides a compact representation (4 parameters) of the field and a simple interpretation in terms of the 3-D motion components: T_X, T_Y, T_Z and \Omega_Z = \omega.

Case 3: Constant model (CST model)

Finally, let us recall the case of the constant model, the restriction of the linear model to the 0-order terms alone. This model, which is widely used in motion compensation by regions, nevertheless proves limited for identifying complex global 3-D motions.

1.3 Linear approximation of the motion vector field and choice of 2 1/2-D descriptors

The analysis basis used to specify the geometry of the motion vector field in Equation (10) is of course not unique. To convince ourselves of this, it is possible, through differential operators, to return to the general formulation of a vector field with, for example, linear geometry:

\begin{pmatrix} u \\ v \end{pmatrix} = \begin{pmatrix} u_g \\ v_g \end{pmatrix} + \begin{pmatrix} \partial u/\partial x & \partial u/\partial y \\ \partial v/\partial x & \partial v/\partial y \end{pmatrix} \begin{pmatrix} x - x_g \\ y - y_g \end{pmatrix}   (16)

which corresponds to a first-order expansion of the field around the point (x_g, y_g), or

\begin{pmatrix} u \\ v \end{pmatrix} = \begin{pmatrix} u_g \\ v_g \end{pmatrix} + M \begin{pmatrix} x - x_g \\ y - y_g \end{pmatrix}   (17)

Francois and Bouthemy [13], and Simard and Mailloux [44], recall that the matrix M can be rewritten as:

M = \frac{1}{2} \mathrm{trace}(M) I + \frac{1}{2}(M - M^T) + \frac{1}{2}(M + M^T - \mathrm{trace}(M) I)
  = \frac{1}{2} \mathrm{div} \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} + \frac{1}{2} \mathrm{rot} \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix} + \frac{1}{2} \mathrm{hyp}_1 \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix} + \frac{1}{2} \mathrm{hyp}_2 \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}   (18)

which makes it possible to introduce general differential operators for the description of a (not necessarily linear) vector field at each point (x, y):

divergence:   \mathrm{div}(u, v)  = \partial u/\partial x + \partial v/\partial y
rotational:   \mathrm{rot}(u, v)  = \partial v/\partial x - \partial u/\partial y
hyperbolic 1: \mathrm{hyp}_1(u, v) = \partial u/\partial x - \partial v/\partial y   (19)
hyperbolic 2: \mathrm{hyp}_2(u, v) = \partial v/\partial x + \partial u/\partial y

Examples of synthetic fields are provided in Figure 1 and illustrate fairly well the physically interpretable nature of these differential descriptors. Using them, a linear-geometry motion vector field is specified by

\begin{pmatrix} \dot{x} \\ \dot{y} \end{pmatrix} = \begin{pmatrix} u_g \\ v_g \end{pmatrix} + \frac{1}{2} \begin{pmatrix} \mathrm{div} + \mathrm{hyp}_1 & \mathrm{hyp}_2 - \mathrm{rot} \\ \mathrm{rot} + \mathrm{hyp}_2 & \mathrm{div} - \mathrm{hyp}_1 \end{pmatrix} \begin{pmatrix} x - x_g \\ y - y_g \end{pmatrix}   (20)

The analogy with the affine decomposition model defined in Equation (10) makes it possible to define the change of basis between the two descriptor sets:

a_1 = u_g = T_{xg},   a_2 = (\mathrm{div} + \mathrm{hyp}_1)/2,   a_3 = (\mathrm{hyp}_2 - \mathrm{rot})/2,
a_4 = v_g = T_{yg},   a_5 = (\mathrm{rot} + \mathrm{hyp}_2)/2,   a_6 = (\mathrm{div} - \mathrm{hyp}_1)/2
   <=>
u_g = a_1,   v_g = a_4,   \mathrm{div} = a_2 + a_6,   \mathrm{rot} = a_5 - a_3,   \mathrm{hyp}_1 = a_2 - a_6,   \mathrm{hyp}_2 = a_3 + a_5   (21)

According to the estimation method (discussed in Section 8.2) and the intended application (qualitative interpretation and/or use in motion compensation), it is advisable to select whichever set of descriptors proves to be the most effective.
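As a minimal sketch (our own illustration, assuming NumPy; the function names are ours), the change of basis (21) between the affine parameters (a_1, ..., a_6) and the 2 1/2-D descriptors (u_g, v_g, div, rot, hyp_1, hyp_2) is a simple linear mapping:

```python
import numpy as np

def affine_to_descriptors(a):
    """Change of basis (21): affine parameters (a1..a6) -> (ug, vg, div, rot, hyp1, hyp2)."""
    a1, a2, a3, a4, a5, a6 = a
    return np.array([a1, a4,      # ug, vg
                     a2 + a6,     # div
                     a5 - a3,     # rot
                     a2 - a6,     # hyp1
                     a3 + a5])    # hyp2

def descriptors_to_affine(d):
    """Inverse change of basis: (ug, vg, div, rot, hyp1, hyp2) -> (a1..a6)."""
    ug, vg, div, rot, hyp1, hyp2 = d
    return np.array([ug, (div + hyp1) / 2, (hyp2 - rot) / 2,
                     vg, (rot + hyp2) / 2, (div - hyp1) / 2])

# round trip on an arbitrary affine parameter vector
a = np.array([0.5, 0.01, -0.02, -0.3, 0.02, 0.01])
assert np.allclose(descriptors_to_affine(affine_to_descriptors(a)), a)
```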
Finally, let us stress that the particular case of the linear models defined by Equation (15) corresponds to the case in which the hyperbolic terms (hyp_1 and hyp_2) are disregarded, that is to say:

a_2 = a_6 = \frac{1}{2} \mathrm{div},   a_3 = -a_5 = -\frac{1}{2} \mathrm{rot}   (22)

1.4 Design and use of an apparent motion model hierarchy

Up until now, studies carried out in the field of motion estimation-compensation have used a single pre-defined motion model, without seeking to adapt it to the various motions present within the image. As a general rule, it is the region-constant model which is used. Since there are generally several different types of motion in a single natural image sequence, it seems worthwhile to adapt the motion model to be identified locally, essentially for the following two reasons:

the identification of too simple a motion model (for example a constant model) in a region in which the physically observed motions are complex (some kind of 3-D motion of a rigid body, for example) can only lead to a poor reconstruction by motion compensation, or to an over-segmentation of the region (possibly down to pixel level) which is costly in terms of the volume of motion information to estimate and to transmit (see Figure 1).

Figure 1: Illustration of the effect of the selection of a model on segmentation: if a divergence model is used, the whole of the vector field constitutes a single homogeneous region; on the other hand, if a constant model is used, it is necessary to decompose the main region into several sub-regions (thus more descriptors are used), and this for a less effective result.

the identification of a sophisticated motion model (for example a quadratic model) in a region in which a single simple motion is observed (for example a 2-D translation parallel to the image plane) leads to a large estimation bias, including on the significant parameter sub-vector corresponding to the single motion which should naturally be identified. In fact, as we will establish in the next paragraph, the criterion to be minimized in the motion parameter vector estimation scheme is very often global, since it depends simultaneously on all the components of the motion vector to be identified. Thus the components which are not actually observable introduce a bias on the identification of the components of the true motion.

Paragraphs 8.1.2 and 8.1.3 naturally introduced several motion models of increasing complexity. Figure 2 illustrates how these different models can be placed in a hierarchy, from the simplest (zero motion) to the most complex. As in [8] and [39], we have included the possibility of introducing into the motion parameter vector to be identified an estimate of the illumination variation, considered as a potential source of temporal change in the intensity function. Once this model hierarchy has been defined (denoted M), it is advisable to define the path strategy within this hierarchy. The introduction of the notion of local adaptivity of motion models means choosing from the set M the most "probable" model in the sense of a cost or performance criterion for the model.
This cost function very often depends on:

the error due to reconstruction by motion compensation associated with the model;

the cost of representation (indeed of transmission, if the motion vector field is transmitted in accordance with the coding schemes considered) of the motion information (a parameter vector whose dimension varies with the model);

Figure 2: A model hierarchy (null motion, constant motion, rotation, divergence, simplified linear motion, linear motion and equivalent models, affine motion, quadratic motion)

the size of the region considered, in order to avoid an under- or over-segmentation of the image;

the computational cost of the identification of the vector.

Two broad methodologies can be distinguished for the effective use of the set M of motion models:

1. Parallel approach: all motion models are tested in parallel, region by region, in the sense of a MAP criterion, and the most effective model is selected. The clearly formalized mathematical framework of statistical criteria based on information theory [40] makes it possible to solve this problem.

2. Sequential approach: this involves traversing the hierarchy of models M along a pre-defined path, which can be either: from the simplest to the most complex model ("coarse-to-fine" approach); from the most complex to the simplest, by progressive suppression of components of the motion vector ("fine-to-coarse" approach); or from an intermediate model of average complexity (for example the SLM model introduced in paragraph 8.1.2) towards a more complex or a simpler version. For all these sequential approaches, the mathematical framework of hypothesis tests based on likelihood functions appears well adapted: two hypotheses are tested against each other, for example in the sense of maximum likelihood:

Hypothesis H0: the motion of the current region corresponds to a given motion model;

Hypothesis H1: the motion of this same region corresponds to a slightly more complex motion model.

In conclusion, let us note that within the context of the use of such a motion model hierarchy, the representation of the motion information will consist of two information fields: the map of selected models (one label per region) and the motion parameter vector field itself. Let us also recall that the size of the parameter vector varies depending on the model.

2 Estimation methods in the monocular case

2.1 Estimation of the sensor motion for a static scene

Several motion estimation algorithms try, before or at the same time as the estimation of a dense motion information field (at all points or over all regions of the image), to estimate the sensor motion, in order to be able to identify not the relative motions between the camera and the objects, but the absolute motions of the objects with respect to a fixed reference. A priori, the camera has freedom of motion throughout the six dimensions of a true motion (3-D translation and 3-D rotation). Under certain hypotheses (see [16], [50], [39]) involving, in particular, the relative remoteness of the objects present in the scene and the small rotation angles during a panoramic motion of the sensor, the camera motions can be reduced to the following three classes:

translations parallel to the image plane (including panning);

translations perpendicular to the image plane (divergence), analytically equivalent to a change in focal length (zoom);

rotations around the optical axis.
It can thus be seen that a simplified linear motion model (SLM model with \theta_{SLM} = (t_x, t_y, k, \omega)), as introduced by Equation (15), makes it possible to identify such a sensor motion. This sensor motion can be estimated directly by one of the methods introduced in the paragraphs below. The entire image is then considered as a single region whose center of gravity is the center of the image, also identified with the projection of the optical center. Other quantitative information (localization of fixed objects in the scene, whose apparent motion is thus not due to the sensor motion alone) or qualitative information (known nature of the sensor motion model) can easily be injected into the algorithm in order to ease and improve the estimation. A priori, such knowledge is rarely available in the case of communication services (contribution, distribution, storage services, etc.), as opposed to applications which use "closed-loop" dynamic imagery, that is to say where information concerning the sensor motion is available from its own control (e.g., tele-monitoring, vision for robotics, etc.).

The results in Figures 3 to 7 illustrate the performance obtained when sensor motion is taken into account, in terms of compactness of the motion representation and of the error due to reconstruction by motion compensation, in the limiting case in which only this sensor motion estimation is carried out.

Figure 3: (a) and (b), two original frames of the "Kiel harbour" sequence, (c) frame difference image with MSE = 922.5

2.2 Estimation methods of motion descriptors for a moving scene

All the motion estimation methods - closely related to the aspects of motion-based segmentation in the case of region-based motion estimators - were discussed in Chapter 3, essentially using the 2-D constant translation model (t_x, t_y). Let us also recall that the following general classes of motion estimation were presented:

translation of a 2-D region (of which the "block-matching" algorithm is an example);

pel-recursive algorithms;

iterative algorithms;

analysis of spatio-temporal frequencies;

parametric models;

the segmentation/estimation link.

Below we detail how these methods can be extended naturally to more complex parametric motion models (already presented in Section 3.3.2.5). Two cases arise, depending on whether or not a dense apparent motion vector field is available prior to the estimation of the parameters of more global models. We deal only briefly with the case in which such a dense field pre-exists since, clearly, a complete algorithmic scheme, for coding as much as for analysis, will tend to dispense with the calculation of this dense field, which is sometimes computationally very demanding, if it is not useful. Let us note,

Figure 4: (a) Identification of a global (camera) motion using a divergence motion model, (b) optical flow relative to the global motion, (c) differential flows, (d) motion-compensated frame difference image based only on the global motion (a), MSE = 56.3

Figure 5: (a) and (b), two original frames of the "Interview" sequence

however, that through the analytical relations detailed below, it is still possible to pass from a sparse field of motion descriptors to a dense apparent motion vector field and vice versa.

2.2.1 Estimation of a parametric model from a dense motion vector field

As we saw in Chapter 3, many methods make it possible to obtain a dense motion vector field. An illustration is provided below (Figure 8) with the Horn-Schunck algorithm [17].
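As a complement (our own sketch, assuming NumPy; the function name is ours), the passage from a sparse set of descriptors to a dense field is immediate: given the affine parameters of a region, the corresponding dense motion vector field is obtained by evaluating the linear part of Equation (10) at every pixel of the region:

```python
import numpy as np

def affine_to_dense_field(theta, region_mask):
    """Dense field (u, v) generated over a region by an affine model
    theta = (a1, a2, a3, a4, a5, a6), i.e. the linear part of Equation (10)."""
    a1, a2, a3, a4, a5, a6 = theta
    h, w = region_mask.shape
    y, x = np.mgrid[0:h, 0:w].astype(float)
    u = np.where(region_mask, a1 + a2 * x + a3 * y, 0.0)
    v = np.where(region_mask, a4 + a5 * x + a6 * y, 0.0)
    return u, v
```

The reverse direction, fitting the parameters to a pre-computed dense field, is the object of the least-squares formulations (23)-(25) presented next.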
The idea is to use this dense information in order to extract from it the parameters of a more global model (for example an affine or SLM model, as illustrated in Figure 9).

Figure 6: (a) Identification of a global (camera) motion using a constant motion model, (b) optical flow relative to the global motion, (c) differential flow

Figure 7: (a) Frame difference image with MSE = 137.4, (b) motion-compensated frame difference image based only on the global motion, MSE = 100.9

At this stage, we assume that an image segmentation into regions which are homogeneous in the motion sense is available. The parameters are obtained:

by minimization of the mean square error between the initial dense field and the dense field derived from the parametric model ([15], [29], [16]). For example, let us consider an SLM model with parameters \theta_{SLM} = (t_x, t_y, k, \omega)^t for a region R and an initial dense field denoted {(u_i, v_i)} for each pixel of R indexed by i, with coordinates (x_i, y_i); the error to be minimized is then expressed as:

E^2 = \sum_{i \in R} (t_x + k x_i - \omega y_i - u_i)^2 + (t_y + k y_i + \omega x_i - v_i)^2   (23)

Figure 8: Example of an optical flow obtained by the Horn-Schunck method [17] on different areas where "pure" divergent, translational, rotational and affine flows have been synthesized

The least-mean-squares resolution requires the inversion of a 4 x 4 matrix (for such an SLM model). Simplifications can be made concerning the resolution of this system [42]. Writing \bar{x}, \bar{y}, \bar{u}, \bar{v} for the means of x_i, y_i, u_i, v_i over the region R, the normal equations provide the following parameter vector:

k = \sum_{i \in R} [(x_i - \bar{x})(u_i - \bar{u}) + (y_i - \bar{y})(v_i - \bar{v})] \;/\; \sum_{i \in R} [(x_i - \bar{x})^2 + (y_i - \bar{y})^2]
\omega = \sum_{i \in R} [(x_i - \bar{x})(v_i - \bar{v}) - (y_i - \bar{y})(u_i - \bar{u})] \;/\; \sum_{i \in R} [(x_i - \bar{x})^2 + (y_i - \bar{y})^2]   (24)
t_x = \bar{u} - k \bar{x} + \omega \bar{y},   t_y = \bar{v} - k \bar{y} - \omega \bar{x}

by separable identification of the global translation motion and of the rotation/divergence with respect to the center of gravity of the region considered, by simple averaging of local estimates [37]; the following global parameters are obtained:

t_x = \frac{1}{N_R} \sum_i u_i
t_y = \frac{1}{N_R} \sum_i v_i
k = \frac{1}{N_R} \sum_i \frac{x'_i (u_i - t_x) + y'_i (v_i - t_y)}{x'^2_i + y'^2_i}   (25)
\omega = \frac{1}{N_R} \sum_i \frac{x'_i (v_i - t_y) - y'_i (u_i - t_x)}{x'^2_i + y'^2_i}

where (x'_i, y'_i) represents the coordinates relative to the center of gravity of the region considered.

Figure 9: Identification of the affine motion model descriptors on the four regions (velocity field obtained by using the system in Equation (25))

2.2.2 Direct parametric estimation

Least mean square estimation. By extension of the methods introduced in Chapter 3 (paragraph 3.3.2.5), it is quite possible to introduce a more complex model (for example, here, an affine model) into the resolution of the motion constraint equation. For the region R, the optimal estimated motion \theta_R = (a_1, a_2, a_3, a_4, a_5, a_6)^t will be

\hat{\theta}_R = \arg\min_{\theta} \sum_{p \in R} (I_x(p) u(\theta) + I_y(p) v(\theta) + I_t(p))^2   (26)

with u(\theta) = a_1 + a_2 x + a_3 y and v(\theta) = a_4 + a_5 x + a_6 y (affine model). The least-squares resolution is achieved by solving a linear system of six equations. Certain simplifications have been proposed [42], [15].

Estimation by a generalized gradient method (see Chapter 3). Here we seek the solution minimizing the motion compensation mean square error over the whole of the region R by a gradient optimization technique.
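Before detailing that formulation, here is a minimal sketch (ours, assuming NumPy; the function name and the synthetic check are our own) of the closed-form least-squares fit of Equation (23) described above: the SLM parameters (t_x, t_y, k, omega) are obtained from a dense field over a region by solving an overdetermined linear system.

```python
import numpy as np

def fit_slm(x, y, u, v):
    """Least-squares fit of the SLM model u = tx + k*x - w*y, v = ty + k*y + w*x
    (Equation (23)) to flow samples (u, v) observed at pixel coordinates (x, y).
    Returns (tx, ty, k, w)."""
    ones, zeros = np.ones_like(x), np.zeros_like(x)
    # stack the two linear equations per pixel into one overdetermined system
    A = np.block([[ones[:, None], zeros[:, None], x[:, None], -y[:, None]],
                  [zeros[:, None], ones[:, None], y[:, None],  x[:, None]]])
    b = np.concatenate([u, v])
    params, *_ = np.linalg.lstsq(A, b, rcond=None)
    return params  # tx, ty, k, w

# synthetic check: a purely divergent field (k = 0.1) is recovered exactly
x, y = np.meshgrid(np.arange(-4.0, 5.0), np.arange(-4.0, 5.0))
x, y = x.ravel(), y.ravel()
tx, ty, k, w = fit_slm(x, y, 0.1 * x, 0.1 * y)
```

Returning to the gradient-based approach, the criterion to be minimized is: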
\hat{\theta} = \arg\min_{\theta} \sum_{p \in R} DFD^2(p, \theta)
            = \arg\min_{\theta} \sum_{p = (i,j) \in R} (I(i, j, k) - I(i - u(\theta), j - v(\theta), k - 1))^2   (27)

The gradient algorithm ([35], [42], [37]) then generalizes to the following iterative estimation process:

\hat{\theta}^{m+1} = \hat{\theta}^m - \Gamma \frac{\vec{\nabla}^m}{N_R}   (28)

with

\vec{\nabla}^m = \sum_{(i,j) \in R} \left( \frac{\partial DFD^2(i, j, \hat{\theta}^m)}{\partial a_1}, \ldots, \frac{\partial DFD^2(i, j, \hat{\theta}^m)}{\partial a_n} \right)^t

where m designates the iteration index, N_R the size of the region R, and \Gamma a gain matrix which can be either fixed or adaptive, full or limited to a diagonal matrix; the corrective term \Gamma \vec{\nabla}^m applied between two iterations follows the direction of the gradient of each component. In the case of an affine model, where \theta = (a_1, a_2, a_3, a_4, a_5, a_6)^t, the estimate of \theta is obtained iteratively by:

\hat{\theta}^{m+1} = \hat{\theta}^m - \sum_{(i,j) \in R} \Gamma \vec{\nabla}^m(i, j) \, DFD((i, j), \hat{\theta}^m)   (29)

with the displaced gradient vector \vec{\nabla}^m(i, j) equal to

\vec{\nabla}^m(i, j) = \big( I_x(i - u(\hat{\theta}^m), j - v(\hat{\theta}^m), k - 1),
                         i\,I_x(i - u(\hat{\theta}^m), j - v(\hat{\theta}^m), k - 1),
                         j\,I_x(i - u(\hat{\theta}^m), j - v(\hat{\theta}^m), k - 1),
                         I_y(i - u(\hat{\theta}^m), j - v(\hat{\theta}^m), k - 1),
                         i\,I_y(i - u(\hat{\theta}^m), j - v(\hat{\theta}^m), k - 1),
                         j\,I_y(i - u(\hat{\theta}^m), j - v(\hat{\theta}^m), k - 1) \big)^t   (30)

The gain matrix \Gamma is taken diagonal in order to avoid interactions between the different descriptors; otherwise the corrective term \Gamma \vec{\nabla}^m would not follow the direction of the gradient. Moreover, in practice, it is necessary to take account of the difference in scale and in physical meaning which exists between the various components of the vector \theta of motion parameters. Thus the "constant" parameters (a_1 and a_4) of an affine model will be allocated a larger gain than the other descriptors.

The estimation-segmentation link. The identification of the previous motion models requires the definition of a segmentation, either prior to, or concomitant with, the motion estimation phase itself, since the estimation operates on a region R of matched pixels. Generally speaking, two approaches can be used:

1. the definition of a segmentation which is either arbitrary (decomposition of the image into blocks) or independent of motion (a purely spatial segmentation, which has the major inconvenience of constituting an over-segmentation from the motion point of view). This segmentation can be either monogrid, or based on a pyramid of information [15], [42], a quadtree splitting [39] or a splitting/merging into regions [6], [14]. In the case of a pyramidal structure, the elements of this structure inherit the motion parameter vectors calculated at a coarser level, and a correction to this motion prediction is carried out by parametric estimation as described previously. Segmentation into a quadtree allows the progressive decomposition of an image into smaller and smaller regions, making it possible first to identify the more global attributes and then to identify the local motions (even at pixel level, if the quadtree is complete) at the end of the estimation process. Clearly, a splitting criterion has to be defined; it can be based on the following hypothesis tests:

test of a region's homogeneity. The test consists of comparing the motion homogeneity hypothesis (the region R_0 corresponds to a single parametric model \theta_0) with that of inhomogeneity (presence of several motions). Under Gaussian (zero-mean) hypotheses concerning the associated error functions, the search for maximum likelihood leads to comparing with a threshold the following estimated variance:

\hat{\sigma}^2_{R_0} = \frac{1}{N_{R_0}} \sum_{(i,j) \in R_0} DFD(i, j, \hat{\theta}_0)^2   (31)

test of division of a region into L sub-regions. In this context, the test consists of comparing the following hypotheses:

Hypothesis H0: the region R_0 corresponds to a unique parametric model.
Hypothesis H1: the region R_0 can be decomposed into sub-regions R_l, on each of which a parametric model \theta_l must be identified.

Bouthemy and Santillana-Rivero [6] test the case in which the region divides into two sub-regions. Under the same hypotheses as previously, the likelihood test between the two hypotheses (hypotheses H0 and H1, associated with likelihood functions f_0 and f_1) leads to the following test:

\log \frac{f_1(\hat{\theta}_1, \hat{\theta}_2)}{f_0(\hat{\theta}_0)} \;\gtrless_{H_0}^{H_1}\; \text{threshold}   (32)

and we obtain the following criterion:

N_{R_0} \log \hat{\sigma}_0^2 - N_{R_1} \log \hat{\sigma}_1^2 - N_{R_2} \log \hat{\sigma}_2^2 \;\gtrless_{H_0}^{H_1}\; \text{threshold}   (33)

where N_{R_0}, N_{R_1}, N_{R_2} designate respectively the surfaces of the regions R_0, R_1, R_2 and

\hat{\sigma}_i^2 = \frac{1}{N_{R_i}} \sum_{p \in R_i} DFD^2(p, \hat{\theta}_i)

i.e., after linearization,

\hat{\sigma}_i^2 = \frac{1}{N_{R_i}} \sum_{p \in R_i} (I_x(p) u(\hat{\theta}_i) + I_y(p) v(\hat{\theta}_i) + I_t(p))^2

2. Markovian models make it possible to specify effective interaction models between the observations (linked with the spatio-temporal gradients) and the labels (in our case the motion parameters). Francois [14] thus defines a motion-based segmentation by a Markovian approach using an energy function composed of two terms: one term favouring identical labelling of two adjacent sites (region merging approach), and one term seeking to maximize the likelihood of the observations given the labels (same formula as previously for \hat{\sigma}_i^2). A deterministic relaxation scheme makes it possible to propagate the labels.

In conclusion, when a parametric motion model is used in a motion compensation scheme, it seems important:

to select the criterion to be minimized as a direct function of the local compensation errors, i.e., \sum_{(i,j) \in R} DFD(i, j, \theta_R)^2;

to smooth the motion parameter field, to achieve a better compactness of representation;

to avoid the convergence of the estimation process towards local minima of the non-convex functional to be minimized (the latter two constraints are simply addressed by the introduction of a relaxation algorithm);

to proceed with a "coarse-to-fine" analysis, in a pyramidal sense or by progressive region splitting.

Several authors [15], [16], [42], [22], [39] have adopted these principles and obtain interesting results from the point of view of both vector field regularity and motion compensation effectiveness. In Figure 9 we illustrate the example of the algorithm of [37], [38], which will serve as a basis for the results on real sequences in paragraph 8.2.4 and Figures 10 and 11.

2.3 Model hierarchy

In the case where, for a given region R, the adaptation of a model to the region is envisaged, a selection criterion for the optimum model must be defined over the set of parametric models M. Two families of criteria can be used, depending on whether the sequential or the parallel approach is desired (see Section 8.1.4).

1. Likelihood ratio. The procedure is identical to that described previously in the context of region splitting. It is a matter of testing, for the same estimation support (the current region R), two hypotheses:

Hypothesis H1: the use of a "complex" model \theta_1 = (a_1, \ldots, a_r, \ldots, a_n)^t;

Hypothesis H0: the use of a "simple" model, the restriction to r parameters (r < n) of the previous model, \theta_0 = (a_1, \ldots, a_r)^t.

The model will be selected in accordance with the most probable hypothesis, by comparing the likelihood ratio associated with the two hypotheses with a threshold.
Under certain hypotheses (see [14]), it has been shown that this ratio L can be written in the form

L = \frac{N_R}{2} \log(1 + W) = \log \frac{f_1}{f_0}   (34)

where f_0 and f_1 are respectively the likelihood functions under hypotheses H0 and H1, and where W is proportional to a random variable following a Fisher distribution, which makes it possible, once an error probability \alpha has been chosen (for example \alpha = 0.05), to fix a threshold for the hypothesis test. In many coding applications, the likelihood functions are relative to the motion-compensated mean square errors.

2. Statistical information criteria. In this context, it is possible to use the statistical information criteria of Akaike and Rissanen [40] which, for a given model, evaluate both its performance and its complexity. Generally speaking, these two criteria are expressed in the form:

AKAIKE criterion:   C = -2 \log f(y/\theta) + 2 \dim(\theta)   (35)

RISSANEN criterion: C = -2 \log f(y/\theta) + 2 \dim(\theta) \log N_R   (36)

where f(y/\theta) is the likelihood of y conditional on \theta. The first term of these two criteria constitutes the model performance measure (likelihood), whilst the second is a penalization term for complex models.

A practical implementation, aimed at motion compensation using a motion model hierarchy, was tested in [39] by using a measurement criterion derived from the Rissanen criterion and compatible with the function (\frac{1}{N_R} \sum DFD^2) to be minimized, already used in the vector estimation process. This criterion is expressed by:

C_{\phi} = \log \left( \frac{1}{N_R} \sum_{(i,j) \in R} DFD^2((i,j), \theta_{\phi}) \right) + \lambda \frac{r(\theta_{\phi})}{N_R}   (37)

where \lambda is a weighting coefficient (for example \lambda = 0.1) and r(\theta_{\phi}), the motion model encoding rate, represents the volume of binary information (in the entropic sense, for example) required to represent and transmit the parameter vector \theta_{\phi}. If this criterion is applied to two motion models \phi_1 and \phi_2, then the model \phi_1 will be selected if

C_{\phi_1} < C_{\phi_2}   (38)

2.4 Estimation of 3-D motion

The estimation of 3-D motion from image sequences can be carried out using two distinct approaches. The first, called the two-stage method, consists of computing the 3-D motions from a previously estimated 2-D apparent motion vector field. The second, called the direct method, attempts to evaluate these 3-D motions directly from the spatio-temporal derivatives of the intensity function. We describe these two general approaches below.

2.4.1 Two-stage estimation methods

This approach, which is similar to that evoked in paragraph 8.2.2 for the estimation of a 2 1/2-D parametric model from a 2-D motion vector field, is based on the following scheme:

stage 1: estimation of a 2-D displacement vector field, which may be sparse (discrete methods of matching 2-D primitives) or dense (differential methods), by one of the estimation methods described in Chapter 3;

stage 2: through the equations linking the projected 2-D motions and the 3-D motions (see paragraph 8.1.1 in the case of a dense field), this second stage identifies the 3-D motion parameters from the field of 2-D primitives.

We will deal with the case of discrete methods in Section 8.3, since it is very similar to the problem of stereovision-motion cooperation on discrete primitives. Within the context of differential methods, many authors ([1], [45], [55]) pose the problem of the determination of the motion and of the 3-D structure from apparent motion as the minimization of a quadratic criterion based on the equations relating 2-D and 3-D quantities.
Even under the assumption of observing rigid objects, Equation (5) shows that this optimization problem is non-linear. As an example, in the case of differential methods, Adiv [1] breaks this estimation process down into two stages. The first consists of segmenting an apparent motion vector field (assumed to have been computed previously) into regions corresponding to planar facets. The parametric motion models are thus the quadratic models defined by the equations in (11). The estimation technique is based on a generalized Hough transform; from Equation (5), the energy function E is defined by

E = \sum_{R} (u - \alpha - \tilde{T}_x z)^2 + (v - \beta - \tilde{T}_y z)^2   (39)

with

\alpha = -xy\,\Omega_X + (1 + x^2)\,\Omega_Y - y\,\Omega_Z
\beta = -(1 + y^2)\,\Omega_X + xy\,\Omega_Y + x\,\Omega_Z
\tilde{T}_x = (T_X - x T_Z)/\|T\|   (40)
\tilde{T}_y = (T_Y - y T_Z)/\|T\|
z = \|T\|/Z

which amounts to separating the terms which involve the instantaneous translation vector T = (T_X, T_Y, T_Z)^t from those which involve the instantaneous rotation vector \vec{\Omega} = (\Omega_X, \Omega_Y, \Omega_Z)^t. Setting to zero the derivative of the energy function with respect to the relative depth variable z (\partial E/\partial z = 0), we deduce the optimum relative depth

z = \frac{(u - \alpha)\tilde{T}_x + (v - \beta)\tilde{T}_y}{\tilde{T}_x^2 + \tilde{T}_y^2}   (41)

which, carried back into Equation (39), gives

E = \sum_{R} \frac{\big((u - \alpha)\tilde{T}_y - (v - \beta)\tilde{T}_x\big)^2}{\tilde{T}_x^2 + \tilde{T}_y^2}   (42)

The unit vector T/\|T\| can then be parametrized in an angular space (\theta, \psi) such that

T_X/\|T\| = \sin\theta \cos\psi,   T_Y/\|T\| = \sin\theta \sin\psi,   T_Z/\|T\| = \cos\theta   (43)

and the energy function is then parametrized as E(\theta, \psi). The generalized Hough transform makes it possible to compute the optimum couple (\theta_T, \psi_T) such that

(\theta_T, \psi_T) = \arg\min_{(\theta, \psi)} E(\theta, \psi)   (44)

On completion of this first stage, a fusion of adjacent components corresponding to the same parametric transformation is carried out, using least-squares criteria. The algorithm continues by iteratively alternating these motion-structure parameter estimation procedures with the grouping of regions which correspond to a single transformation. Adiv [2] extends this work by addressing the ambiguities inherent in the estimation of 3-D motion and of depth; these ambiguities are essentially of two types:

a single 2-D field can have several 3-D interpretations (non-uniqueness of the representation) [2], [5], [51];

an estimation bias on the 2-D primitive field induces an estimation bias on the 3-D parameters and often creates instability phenomena in the estimates.

2.4.2 Direct estimation methods

These methods seek to mitigate the drawbacks mentioned previously by directly estimating the parameters linked to the motions and 3-D structures, without a previously estimated apparent motion field. In this context, we again find extensions of estimation methods known in the 2-D case, such as recursive estimation methods extended to parametric motion models ([36], [9], [10], [39]) and iterative estimation methods based on the "brightness change equation", or motion constraint equation, extended to the case of 3-D motions and particular 3-D structures (planar or quadratic surfaces, ...) [18], [33].
Dugelay and Pele [9], and Netravali and Salz [36], start from the following three-stage approach: given the Equations (11) defining the relations between the apparent motion description parameters A, the 3-D motion parameters C = (\vec{\Omega}, \vec{T})^t and the structure parameters K = (n_X/n_Z, n_Y/n_Z, 1)^t, and starting from an initial vector or a previous estimate C^{n-1}, K^{n-1}, the following three stages can be iterated:

Stage 1: computation of A^{n-1} from the initial values C^{n-1}, K^{n-1}, using Equation (11);

Stage 2: a differential method of estimating a corrective term \Delta A^{n-1} is applied, by a gradient algorithm, as follows (see Equations (28) to (30)):

\Delta A^{n-1} = \sum_{p \in R} DFD(p, A^{n-1}, t-1) \; \nabla_A DFD(p, A^{n-1}, t-1)   (45)

Stage 3: based on the system of Equations (11), computation of the parameters C^n and K^n as a function of (A^{n-1} + \Delta A^{n-1}).

This system of 8 unknowns and 8 non-linear equations is solved, for example, by successive linearizations (Newton's method).

The second family of approaches ([18], [32], [33]) starts from the assumption of temporal invariance of the intensity function, expressed by the motion constraint equation

I_x u + I_y v + I_t = \vec{\nabla} I \cdot \vec{v} + I_t = 0   (46)

In vector form, Equation (4), deduced from the perspective projection model, can be expressed as:

\vec{v} = (u, v, 0)^t = \frac{\vec{z} \wedge (\vec{V} \wedge \vec{P})}{(\vec{P} \cdot \vec{z})^2}   (47)

where \vec{p} = (x, y, 1)^t, \vec{P} = (X, Y, Z)^t, \vec{V} = (\dot{X}, \dot{Y}, \dot{Z})^t, \vec{v} = (\dot{x}, \dot{y}, 0)^t and \vec{z} is the unit vector along the optical axis, with \vec{p} = \vec{P}/(\vec{P} \cdot \vec{z}). By substituting the expression of \vec{V} (Equation (2)) into Equation (47), we obtain:

\vec{v} = \vec{z} \wedge \left( \left( \vec{\Omega} \wedge \vec{p} + \frac{\vec{T}}{\vec{P} \cdot \vec{z}} \right) \wedge \vec{p} \right)   (48)

The motion constraint equation (46), extended to the 3-D case, is then expressed as:

\vec{\nabla} I \cdot \left( \vec{z} \wedge \left( \left( \vec{\Omega} \wedge \vec{p} + \frac{\vec{T}}{\vec{P} \cdot \vec{z}} \right) \wedge \vec{p} \right) \right) + I_t = 0   (49)

or, in a more compact fashion, introducing \vec{s} = (\vec{\nabla} I \wedge \vec{z}) \wedge \vec{p} and \vec{w} = \vec{s} \wedge \vec{p}, Equation (49) becomes

\frac{\vec{s} \cdot \vec{T}}{\vec{P} \cdot \vec{z}} + \vec{w} \cdot \vec{\Omega} + I_t = 0   (50)

The resolution method often assumes a geometric structure model. For example, in the planar region case, the region of 3-D points \vec{P} is defined by \{\vec{P} : \vec{P} \cdot \vec{N} = 1\}, which is equivalent to \vec{p} \cdot \vec{N} = 1/(\vec{P} \cdot \vec{z}). The motion constraint equation then becomes

(\vec{s} \cdot \vec{T})(\vec{p} \cdot \vec{N}) + \vec{w} \cdot \vec{\Omega} + I_t = 0   (51)

and the resolution in (\vec{T}, \vec{\Omega}, \vec{N}) is carried out by iterative minimization of the functional

J = \int\!\!\int_D \big( (\vec{s} \cdot \vec{T})(\vec{p} \cdot \vec{N}) + \vec{w} \cdot \vec{\Omega} + I_t \big)^2 \, dx \, dy   (52)

These approaches are thus a direct extension of the iterative estimation methods normally used in the 2-D case. Other region models have also been tried [33], such as quadratic patches, cylindrical surfaces, etc.

2.5 Use of motion compensation in a predictive coding scheme

The use of parametric motion models within a predictive coding scheme with motion compensation (see Chapter 4 for an introductory description of these schemes) appears to be a natural extension of the usual case, in which a dense motion vector field compensates the image. As a matter of fact, as illustrated by Equation (10) in the context of a general quadratic model, if, for each region R_m of the image, we have identified the motion parameter vector \theta(\phi_m) corresponding to the motion model \phi, it is always possible to derive a dense apparent motion vector field from the {\theta_m} and use it in a motion-compensated loop. The prediction by motion compensation will be equal to

\hat{I}(i, j, k) = \tilde{I}(i - \hat{u}(\theta_m), j - \hat{v}(\theta_m), k - 1)   (53)

for each pixel with coordinates (i, j)^t, where \tilde{I} indicates the previously reconstructed image, \hat{I} the current image to be predicted, and (\hat{u}, \hat{v}) the dense field predicted from the field {(u, v)} derived from the parameters {\theta_m}.
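To illustrate Equation (53), here is a minimal sketch (ours, assuming NumPy; the nearest-neighbour sampling, the clipping at the image borders and the function name are our own simplifications) of the region-wise prediction from an affine parameter vector:

```python
import numpy as np

def predict_region(I_prev, theta, region_mask):
    """Motion-compensated prediction (Equation (53)): each pixel of the region is
    predicted from the previously reconstructed image I_prev displaced by the
    affine field u = a1 + a2*x + a3*y, v = a4 + a5*x + a6*y.
    Nearest-neighbour sampling; displaced positions are clipped to the image."""
    a1, a2, a3, a4, a5, a6 = theta
    h, w = I_prev.shape
    pred = np.zeros_like(I_prev)
    for i, j in zip(*np.nonzero(region_mask)):   # i = row (y), j = column (x)
        u = a1 + a2 * j + a3 * i
        v = a4 + a5 * j + a6 * i
        jj = int(np.clip(np.rint(j - u), 0, w - 1))
        ii = int(np.clip(np.rint(i - v), 0, h - 1))
        pred[i, j] = I_prev[ii, jj]
    return pred

# example: one region covering a 16x16 image, small divergent motion
I_prev = np.random.rand(16, 16)
mask = np.ones((16, 16), dtype=bool)
pred = predict_region(I_prev, (0.0, 0.02, 0.0, 0.0, 0.0, 0.02), mask)
```

In a complete scheme, the prediction error between the current image and this prediction is then quantized and transmitted together with the segmentation, the model labels and the quantized parameter vectors, as detailed below.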
Because of the compact nature of the representation of the motion information by the {\theta_m}, this information is usually transmitted, and in this case {(\hat{u}, \hat{v})} is taken as the estimated field: \hat{u}(i, j) = u(\theta_m) and \hat{v}(i, j) = v(\theta_m) for each pixel (i, j) \in R_m. Let us recall that in such a scheme the transmitted information has to be decomposed into four parts:

1. the image segmentation into N regions {R_m}, m = 1, ..., N;

2. the type of model \phi_m used for each region R_m;

3. the quantized motion parameter vector \theta(\phi_m) for each region R_m;

4. the quantized motion compensation error.

As far as the coding of the segmentation map is concerned, a compromise has to be found between the following two extreme cases:

1. an a priori known, arbitrary segmentation such as a block decomposition: the coding cost of such a segmentation is null;

2. a spatial segmentation adapted to each image: the consequence is a large coding cost due to the irregularity of the edges obtained. Binary coding schemes adapted to edges (for example Freeman codes) can be used, even if a large bit rate may be needed to encode this map of contours.

Quadtree decomposition allows a good adaptation of the segmentation to the local contents of the image at only a small coding cost, expressed by [24], [39], [41], [43]

R_{quadtree} = \frac{4}{3} N_R - \frac{1}{3} N_{R_{init}} - N_{R_{min}}   (54)

where N_R, N_{R_{init}}, N_{R_{min}} designate respectively the number of regions within the final image after the quadtree decomposition, the number of regions within the initial image (initial grid), and the number of regions of minimal size (quadtree roots).

The coding cost of the label \phi_m designating the motion model selected for the current region R_m clearly only exists in the case of the use of a distinct motion model hierarchy, and can be assessed by an entropy cost. The parameter vector \theta_m is transmitted after quantization. Note that the various components of this vector do not require the same quantization accuracy: adapted quantizers must be designed for each component. Finally, the coding of the prediction error by motion compensation uses all the source-coding techniques (transform coding, entropy coding, ...), again making it possible to decorrelate the information from a spatial or frequency point of view, and thus to reduce by as much the transmission cost of this information field.

Figure 10 shows the motion-compensated error image obtained on the so-called "Interview" sequence when a motion-based quadtree segmentation is used. Moreover, the distortion versus rate trade-off is assessed in Figure 11 for several linear scalar quantization versions of the motion-compensated errors.

Figure 10: Motion compensation of the "Interview" sequence using a "constant motion" model. (a) Motion-compensated differences: MSE = 17.9, (b) quadtree segmentation (44 regions, not illustrated), (c) reconstructed image, (d) motion vector field

Figure 11: "Interview" sequence. Compression ratio and MSE for different values (Q = 5, 10, 15, 128) of the elementary quantization step of the motion-compensated errors.

2.6 Use of an analysis-synthesis coding approach

The estimation schemes previously described lend themselves well to the definition of object-oriented coding schemes by analysis-synthesis. The first work carried out in this field ([3], [11], [12], [20]) assumed an extensive knowledge of the nature of the objects manipulated and restricted itself to a particular category of scenes, such as the
motion of human faces (videophone services or video conferences, for which very low bit rates are envisaged). In this case, the hypotheses of the preceding paragraphs, used to establish the relations between 3-D motion/structure and apparent 2-D motion, were valid: rigid objects decomposed into planar surfaces, small rotation angles, small depth variation between two successive images. Musmann et al [30] and Hötter [19] develop such an analysis-synthesis object-oriented coding approach, using either 2-D motion estimation by linear regression methods or 3-D estimation by prediction/verification methods. The general scheme of the approach is described in Figure 12. The sequence analysis phase concerns the extraction of three types of information:

the shape of the objects (regions);

their motion;

the texture or radiosity information.

Figure 12: Block diagram of an object-oriented analysis-synthesis coder (source model: image analysis producing motion, shape and texture parameters, then parameter coding; transmission channel; receiver model: parameter decoding, memory for object parameters, image synthesis)

These information fields being different in nature, a specific coding procedure is used for each of them. The shape information describes the outline of objects and is naturally coded by contour coding techniques; only the temporal changes in shape are coded predictively. The motion information is also coded predictively, with respect to the motion parameters estimated for the same object in the previous image. Finally, the radiosity information can be compressed by hybrid coding techniques with motion compensation.

In conclusion, let us note that these analysis-synthesis coding approaches are often limited to the identification of 2 1/2-D parametric motion models, without seeking the whole range of 3-D motion + structure parameters. Such a full range would make it possible to synthesize the scene not only from the true viewing angle at the current moment, but also from all intermediate relative sensor-object positions, which would allow efficient temporal or spatial interpolation schemes. This remains difficult to achieve, however, given the current levels of accuracy obtained on the 3-D structural parameters after identification, and given that these parameters are only known up to a relative depth factor. The stereovision-motion cooperation techniques dealt with in the next section can make it possible to overcome these disadvantages in part.

3 Motion estimation methods in the binocular case

3.1 Introduction

Unlike the monocular case, we assume here the availability of several stereoscopic sensors, making it possible to perceive, at different moments (stereoscopic sequences) and from several points of view, a scene composed of 3-D objects undergoing 3-D motions. Various experimental contexts can be studied:

number of sensors: at least two cameras, in order to allow the creation of a stereoscopic effect. This number can be greater (the case of trinocular vision, for example, has been explored) in order to facilitate the matching phase and to resolve certain ambiguities more easily.
geometry of the stereoscopic system: most studies which have dealt with this algorithmic theme of stereo-motion cooperation use a stereoscopic system in which the cameras are set out in parallel in a single plane (i.e., the image planes are identical), which assumes a depth focalization at infinity, and where the geometric baseline separating the sensors is large (i.e., greater than the distance of about 65 mm corresponding to the human visual system). These choices are clearly incompatible with the optimal conditions for the quality of relief perception (see the paragraph concerning the use of these techniques in 3-D TV), for which the respect of different levels of conformity is conventionally introduced.

calibration of the stereoscopic system: this procedure means the prior identification of the intrinsic parameters of each sensor (focal length, coordinates of the optical center, radial distortion factor, ...; see Chapter 1), as well as of the extrinsic parameters relating, through a geometric screw (R_{rl}, T_{rl}) (3-D rotation + 3-D translation), the reference frames attached to each sensor (l = "left" sensor, r = "right" sensor in this paragraph). This calibration phase enables:

the establishment of the equations linking the 2-D pixel coordinates to the 3-D point coordinates

\begin{pmatrix} Z x \\ Z y \\ Z \end{pmatrix} = \begin{pmatrix} F_x & 0 & x_c \\ 0 & F_y & y_c \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \end{pmatrix}   (55)

where (X, Y, Z) designate the coordinates of a 3-D point, (x, y) the 2-D pixel coordinates, and (x_c, y_c, F_x, F_y) are the intrinsic parameters of the sensor (case of a perspective projection sensor model without radial distortion);

the passage from the "left" coordinate frame to the "right" one and vice versa

(X, Y, Z)^t_r = R_{rl} (X, Y, Z)^t_l + T_{rl}   (56)

the definition of the epipoles: the right (resp. left) epipole is the projection onto the right (resp. left) image plane of the optical center of the left (resp. right) camera. Associated epipolar lines pass through these epipoles. This epipolar geometry makes it possible to constrain analytically the geometry of the search window during the matching of primitives between left and right images.

It is clear that in the absence of any calibration, only fairly rough heuristics can be used: selection of the optical center at the center of the image; focal parameters fixed without identification; search window limited to a number of pixels directly in the image plane, together with the hypothesis of horizontal epipolar lines. These heuristic choices naturally introduce large sources of error into the motion estimation and disparity algorithms then used. Tamtaoui [47] carried out a study of the robustness of these algorithms in the face of such errors or inaccuracies in the calibration parameters.

Once these experimental choices have been made, the problem of 3-D or 2-D motion estimation in the context of stereoscopic sequences is posed in the following terms: in the short term, at two successive moments (t, t+1), as illustrated in Figure 13, we have four observation fields (in the binocular case dealt with here) of a 3-D primitive P moving in 3-D space, in the case of a rigid object, according to the kinematic screw \vec{V} = (\vec{T}, \vec{\Omega}). From these four observation fields, various 2-D, 2 1/2-D or 3-D information fields can be identified:

disparity fields ({\delta_t} at time t and {\delta_{t+1}} at time t+1 respectively), by standard primitive matching techniques;

2-D apparent motion vector fields ({\vec{d}_l} on the left sequence and {\vec{d}_r} on the right sequence respectively), by use of a monocular 2-D apparent motion estimation algorithm;

motion descriptor fields (resp.
{\theta_l} and {\theta_r}), dependent on a previously defined motion model;

3-D motion and structure parameter fields, as in the monocular case, applied here to each stereoscopic sequence.

Figure 13: Stereo-motion observation space and associated identifiable information fields (left and right sequences at times t and t+1; disparities; apparent motions \vec{d}_l, \vec{d}_r; 3-D motion \vec{V} = (\vec{T}, \vec{\Omega}) of the point M)

We will not go back over the estimation techniques for these various information fields, which have already been studied in Chapter 3 and at the beginning of this chapter in the monocular case. However, let us remember that the primitives manipulated can be of different levels: pixel primitives, for which the information fields are dense; contour or region primitives, for which the information fields are sparse.

Below we discuss more particularly the various possibilities for sequencing or combining these stereo-motion primitive estimation procedures; three approaches are distinguished:

the first consists of identifying the 3-D motion of objects by temporal matching of 3-D primitives (the "stereo then 3-D motion" approach);

the second consists of starting with 2-D apparent motion fields, independently estimated in each stereoscopic sequence, which are then lifted, through the stereoscopic relations, to 3-D motion and structure information fields (the "3-D motion then stereo" approach);

finally, the third approach, which is meant to be better adapted to the use of these motion estimation techniques in a coding context, carries out the joint estimation of the motion descriptor fields simultaneously in both stereoscopic sequences (the "stereo-constrained 2-D, 2 1/2-D motion" approach), by respecting the constraints due to the intrinsic stereoscopic geometry.

3.2 3-D motion by matching 3-D primitives

This approach can be arranged as follows:

Stage 1: after identification of a disparity field {\delta_t} (resp. {\delta_{t+1}}) throughout the sequence, for every stereoscopic couple of images, a depth map {Z_t(x, y)} (resp. {Z_{t+1}(x, y)}) is produced for every image.

Stage 2: a matching phase links the 3-D primitives obtained from successive depth maps.

Stage 3: the instantaneous depth maps and the matching previously carried out make it possible to deduce the 3-D motions + structure of the manipulated primitives.

Several authors have studied this type of approach, trying to minimize the number of 3-D primitives to be matched. Leung and Huang [23], Netravali et al [34], and Mitiche and Bouthemy [27] worked on 3-D pixel-based primitives: since theoretically three non-colinear points are enough to determine the 3-D motion of a rigid object, a sparse 3-D point depth map is first computed by stereo-matching. A temporal matching on one of the stereoscopic sequences then makes it possible to identify the 3-D motion of these points. Certain ambiguities are then removed by verifying, on the other stereoscopic sequence, the matching of the projected 3-D points. Kim and Aggarwal [21] base their approach on the joint extraction of depth maps on contour primitives extracted by zero-crossings of Laplacians and on pixel-based primitives obtained by the Moravec operator. A two-pass relaxation method (in order to ensure the symmetry of the temporal matching) is used to link the 3-D primitive maps of two successive images (t) and (t+1); the cost function of the relaxation procedure is based on the notion of motion invariants for rigid bodies, such as distance ratios or angles between primitives. Lingxiao et al [25] present a method in which the estimation phases of the instantaneous rotation vector and of the translation are decoupled.
Many other studies have introduced alternative algorithms to those described here. Owing to the sparse nature of the primitive fields processed, these stereo-motion cooperation algorithms are intended mainly for the reconstruction of 3-D objects or as navigation aids for robots using dynamic stereoscopic vision [31], [49]. In stereoscopic sequence coding, it is still necessary to segment and to interpret, in terms of motion and 3-D structure, a complete partition of the images, which makes the two complementary approaches developed below more attractive.

3.3 3-D motion based on 2-D motion fields

Another approach to the calculation of the 3-D motion and structure parameters is based on the combination of 2-D apparent motion fields estimated beforehand, independently, on each of the stereoscopic sequences. Mitiche [28] starts from the hypothesis of the observation of at least four 3-D points in the two stereoscopic sequences. Each point satisfies the equations

$$
\begin{cases}
\begin{bmatrix} x_r & y_r & 1 \end{bmatrix} A \begin{bmatrix} x_l \\ y_l \\ 1 \end{bmatrix} = 0 \\[8pt]
\begin{bmatrix} u_r & v_r & 0 \end{bmatrix} A \begin{bmatrix} x_l \\ y_l \\ 1 \end{bmatrix} +
\begin{bmatrix} x_r & y_r & 1 \end{bmatrix} A \begin{bmatrix} u_l \\ v_l \\ 0 \end{bmatrix} = 0
\end{cases}
\qquad (57)
$$

where $A$, a $3 \times 3$ matrix, depends only on the relative displacement $(R_{rl}, T_{rl})$ between the coordinate systems attached to the stereoscopic cameras. The identification of $A$ (which represents 8 unknowns after normalization) can be carried out by solving the linear system written for the four observed points. By using the apparent motion field itself, this solves the problem of calibrating the stereoscopic system. For every other matched 2-D point pair, it then becomes possible to recover the depth information by simple triangulation, and thus to obtain the 3-D kinematic screw $(\vec T, \vec\Omega)$ by solving the system of Equation (5) (linear in $\vec T$ and $\vec\Omega$) once this depth map is known.
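A minimal sketch of the identification of $A$ in Equation (57): each observed point contributes two linear equations in the nine entries of $A$, and the resulting homogeneous system can be solved in the least-squares sense, here with an SVD and possibly more than the minimal four points. The function names and data layout are illustrative assumptions, not a worked example from the text:

```python
import numpy as np

def estimate_A(pts_l, pts_r, flow_l, flow_r):
    """Identify the 3x3 matrix A of Equation (57) from >= 4 matched points.

    pts_l, pts_r   : (N, 2) matched image points (x, y) in the left/right views.
    flow_l, flow_r : (N, 2) apparent motion vectors (u, v) at those points.
    Returns A up to scale (least-squares null vector of the stacked system).
    """
    rows = []
    for (xl, yl), (xr, yr), (ul, vl), (ur, vr) in zip(pts_l, pts_r, flow_l, flow_r):
        # [x_r y_r 1] A [x_l y_l 1]^t = 0
        rows.append([xr*xl, xr*yl, xr, yr*xl, yr*yl, yr, xl, yl, 1.0])
        # [u_r v_r 0] A [x_l y_l 1]^t + [x_r y_r 1] A [u_l v_l 0]^t = 0
        rows.append([ur*xl + xr*ul, ur*yl + xr*vl, ur,
                     vr*xl + yr*ul, vr*yl + yr*vl, vr, ul, vl, 0.0])
    M = np.asarray(rows)
    _, _, Vt = np.linalg.svd(M)
    return Vt[-1].reshape(3, 3)    # right singular vector of smallest singular value
```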
Waxman et al. [53], [54] studied, in particular, the relations between the two 2-D motion fields. They define the relative flow, or binocular difference flow, by

$$
\delta\vec d(x_l, y_l) = \vec d_r\big(x_l + \Delta(x_l, y_l),\, y_l\big) - \vec d_l(x_l, y_l) \qquad (58)
$$

where $\Delta(x_l, y_l)$ designates the disparity measured at the current point $(x_l, y_l)$ of the left view; in the case of parallel and aligned cameras (i.e., $Z_l = Z_r$ and $y_l = y_r$ at all points), it is expressed by

$$
\Delta(x_l, y_l) = \frac{b}{Z_l(x_l, y_l)} \qquad (59)
$$

where $b$ measures the distance (baseline) between the two stereoscopic sensors. Equation (5) is reformulated, by separating the terms linked to the instantaneous translation $\vec T$ from those linked to the rotation $\vec\Omega$, as

$$
\vec d(x, y) = \begin{pmatrix} \dot x \\ \dot y \end{pmatrix} = \begin{pmatrix} u \\ v \end{pmatrix}
= \frac{1}{Z(x, y)}\, A(x, y)\cdot\vec T + B(x, y)\cdot\vec\Omega \qquad (60)
$$

From Equations (58) to (60), the following analytical relation between disparity fields, relative flow components and 3-D motion is deduced (in the case of aligned cameras):

$$
\begin{cases}
\dfrac{\delta u(x_l, y_l)}{\Delta(x_l, y_l)} = \dfrac{T_Z}{b} + y_l\,\Omega_X - x_l\,\Omega_Y \\[8pt]
\dfrac{\delta v(x_l, y_l)}{\Delta(x_l, y_l)} = 0
\end{cases}
\qquad (61)
$$

If a planar structure hypothesis is used, i.e., $\dfrac{1}{Z(x, y)} = n_X x + n_Y y + n_Z$, then the relations between 3-D motion + structure, disparity fields and relative flow fields can be established simply as

$$
\begin{cases}
\dfrac{\delta u(x_l, y_l)}{\Delta(x_l, y_l)} = n_Z T_Z + (n_X T_Z - \Omega_Y)\,x_l + (\Omega_X + n_Y T_Z)\,y_l \\[8pt]
\dfrac{\delta v(x_l, y_l)}{\Delta(x_l, y_l)} = 0
\end{cases}
\qquad (62)
$$

In order to avoid bias in the estimation of the initial 2-D motion fields, the latter are filtered by adapted filters (radial flow filtering for the relative flow, 2nd-order filtering for the fields themselves) [53]. The 3-D motion estimation method then proceeds according to the following principles:

- stage 1: estimation, segmentation and filtering of the 2-D apparent motion fields;
- stage 2: matching of primitives based on the coherence equations (62);
- stage 3: use of the disparity functions for the reconstruction of surfaces between the discontinuity regions detected during the monocular analysis (stage 1);
- stage 4: estimation of the 3-D motion parameters.

A temporal linking phase is also introduced in order to allow "sub-pixel" accuracy in the estimated disparity field (by temporal interpolation), as well as tracking, along the temporal axis, of the discontinuity regions and of the matched segmented regions.
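As a concrete reading of Equations (58) and (61), the sketch below forms the binocular difference flow from left/right motion fields sampled at corresponding points and fits the three unknowns $T_Z/b$, $\Omega_X$ and $\Omega_Y$ in the least-squares sense. It assumes the aligned-camera relation (61) holds as stated, and all inputs are synthetic placeholders rather than data from the cited experiments:

```python
import numpy as np

def fit_motion_from_relative_flow(x_l, y_l, disparity, d_l, d_r_at_match):
    """Fit (T_Z / b, Omega_X, Omega_Y) from Equation (61).

    x_l, y_l     : (N,) left-image coordinates of the sample points.
    disparity    : (N,) disparity Delta(x_l, y_l) at those points.
    d_l          : (N, 2) left apparent motion vectors (u_l, v_l).
    d_r_at_match : (N, 2) right motion vectors sampled at (x_l + Delta, y_l).
    """
    delta_u = d_r_at_match[:, 0] - d_l[:, 0]      # relative flow component, Eq. (58)
    # Each point gives: delta_u / Delta = T_Z/b + y_l * Omega_X - x_l * Omega_Y
    A = np.column_stack([np.ones_like(x_l), y_l, -x_l])
    rhs = delta_u / disparity
    params, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    TZ_over_b, Omega_X, Omega_Y = params
    return TZ_over_b, Omega_X, Omega_Y
```

The second line of (61), $\delta v / \Delta = 0$, can be used in the same spirit as a consistency check on the matches before the fit.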
3.4 Joint motion estimation under stereoscopic constraints

In several applications, notably stereoscopic sequence coding where 3-D reconstruction is not an aim, it is sometimes not necessary to go as far as the estimation of explicit 3-D motion and structure parameters. On the contrary, it appears interesting to carry out the 2-D or 2½-D motion descriptor estimation phases not independently on each stereoscopic sequence, but jointly, by introducing into the estimation schemes themselves stereoscopic constraints linking the two descriptor fields.

In the case where only dense 2-D primitive fields are estimated (the disparity fields $\{\Delta_t\}$ and $\{\Delta_{t+1}\}$ and the apparent motion fields $\{\vec d_l\}$ and $\{\vec d_r\}$), an available coherence constraint is to impose, at each point of the image plane, the linear relation

$$
\vec d_l + \vec\Delta_{t+1} - \vec d_r - \vec\Delta_t = 0 \qquad (63)
$$

which amounts to forcing the closure of the quadrilateral illustrated in Figure 13. Such a relation makes it possible, when three of the information fields are known, to deduce the fourth; this property is easily exploited in the case where, the dense disparity fields being calculated on each stereoscopic pair, the knowledge of one motion field (for example on the left sequence) makes it possible to deduce the other (on the right sequence). Tamtaoui and Labit [46] tested this estimation approach. It turns out that this constraint, too local and too strong, notably in occlusion regions, can only provide an initial prediction of a field, which must then be refined in order to obtain motion compensation results identical to the monocular case; obviously, this post-processing removes the previous stereoscopic constraint. Furthermore, this scheme remains very sensitive to the estimation bias of each of the information fields introduced.

An interesting alternative [46], [48] is to start from a coherence equation linking the apparent motion fields $\vec d_l = (u_l, v_l)^t$ and $\vec d_r = (u_r, v_r)^t$ under stereoscopic constraints. This relation is established as follows: if

$$
\begin{bmatrix} X_r \\ Y_r \\ Z_r \end{bmatrix} =
R_{rl} \begin{bmatrix} X_l \\ Y_l \\ Z_l \end{bmatrix} + T_{rl} \qquad (64)
$$

with $T_{rl} = (t_1, t_2, t_3)^t$ and $R_{rl} = (r_{ij})$, $i = 1, 2, 3$, $j = 1, 2, 3$, and if we assume that $Z_l = Z_r$ for all matched pixels (parallel cameras hypothesis), then the following relation between the apparent 2-D motion fields can be established:

$$
\Big(r_{21} - \frac{t_2}{t_1} r_{11}\Big) u_l + \Big(r_{22} - \frac{t_2}{t_1} r_{12}\Big) v_l
= -\frac{t_2}{t_1}\, u_r + v_r \qquad (65)
$$

which can be put in the form $\alpha u_l + \beta v_l + \gamma u_r + \delta v_r = 0$ with

$$
\begin{cases}
\alpha = \dfrac{r_{21}}{t_2} - \dfrac{r_{11}}{t_1} \\[6pt]
\beta = \dfrac{r_{22}}{t_2} - \dfrac{r_{12}}{t_1} \\[6pt]
\gamma = \dfrac{1}{t_1} \\[6pt]
\delta = -\dfrac{1}{t_2}
\end{cases}
\qquad (66)
$$

In matrix form this is equivalent to $C \cdot \Theta = 0$, with $\Theta = (u_l, v_l, u_r, v_r)^t$ the motion vector linked to the two stereoscopic sequences and $C = (\alpha, \beta, \gamma, \delta)$ the vector of coherence coefficients.

Tamtaoui and Labit [46] introduce this coherence equation within a pel-recursive type estimation scheme, by minimizing, with gradient techniques, a quadratic reconstruction error function $E(\cdot)$ linked to the left and right sequences, namely

$$
E(\Theta_{p_{lr}}) = \mathrm{DFD}^2(p_l, \vec d_l) + \mathrm{DFD}^2(p_r, \vec d_r) \qquad (67)
$$

with $p_{lr}$ a couple of matched pixels $(p_l, p_r)$. The estimation algorithm is then written

$$
\Theta^{k+1} = \Theta^{k} - P\, \nabla E(\Theta^{k}) \qquad (68)
$$

with $P = I - C^t (C C^t)^{-1} C$. The matrix $P$ is the matrix of projection onto the coherence space

$$
\big\{\, \Theta \in \mathbb{R}^4 \;:\; C \cdot \Theta = 0 \,\big\} \qquad (69)
$$
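A minimal sketch of the constrained update (68): the coherence coefficients of (66) are assembled into the row vector $C$, the projector $P$ of Equation (69) is built, and one gradient step is projected onto the coherence space. The gradient of the DFD cost (67) is assumed to be supplied by the surrounding pel-recursive scheme; here it is just a placeholder argument, and the extrinsic values are hypothetical:

```python
import numpy as np

def coherence_projector(r, t):
    """Build C = (alpha, beta, gamma, delta) of Eq. (66) and P = I - C^t (C C^t)^-1 C."""
    alpha = r[1, 0] / t[1] - r[0, 0] / t[0]
    beta  = r[1, 1] / t[1] - r[0, 1] / t[0]
    gamma = 1.0 / t[0]
    delta = -1.0 / t[1]
    C = np.array([[alpha, beta, gamma, delta]])         # 1 x 4 row vector
    P = np.eye(4) - C.T @ np.linalg.inv(C @ C.T) @ C    # projector onto {C theta = 0}
    return C, P

def constrained_step(theta, grad_E, P):
    """One iteration of Eq. (68): theta <- theta - P grad E(theta)."""
    return theta - P @ grad_E

# Illustrative use with hypothetical extrinsics (r_ij) and (t_1, t_2, t_3):
R_rl = np.eye(3)
T_rl = np.array([0.1, 0.02, 0.0])
C, P = coherence_projector(R_rl, T_rl)
theta = np.zeros(4)                          # (u_l, v_l, u_r, v_r)
grad_E = np.array([0.3, -0.1, 0.2, 0.05])    # placeholder DFD gradient
theta = constrained_step(theta, grad_E, P)
print(C @ theta)                             # remains (numerically) on the coherence space
```

Because $C P = 0$, every iterate produced this way satisfies the coherence constraint exactly, which is the point of projecting the gradient rather than constraining the result afterwards.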
This estimation technique (see Figure 16) compares favourably with independent monocular motion estimation techniques (see Figure 15) and with disparity estimation techniques (see Figure 14) used in compensation schemes. Naturally, this approach on a dense field extends to region-based motion descriptor estimation methods (see Figure 17) through the use of parametric motion models [47]. In addition to the more global nature of these descriptors, such an approach appears more robust to estimation bias on the disparity since, in this context, regions rather than points are matched. Some results below illustrate the performance achieved by these joint estimation algorithms, both in terms of reconstruction quality after motion compensation and in terms of the quality of the motion fields obtained.

Figure 14: (a) Reconstructed "Campagne" image using disparity compensation; (b) corresponding disparity compensation errors (MSE = 54.24).

Figure 15: (a) Reconstructed "Campagne" image using motion compensation (Walker-Rao pel-recursive method); (b) corresponding motion-compensated errors (MSE = 7.92); (c) motion vector field.

Figure 16: (a) Reconstructed "Campagne" right image using joint coherent motion compensation on the two stereoscopic sequences; (b) corresponding motion-compensated errors (MSE = 3.73); (c) motion vector field.

Figure 17: (a) Reconstructed "Campagne" right image using joint coherent quadtree-based affine motion estimation on the two stereoscopic sequences; (b) corresponding motion-compensated errors (MSE = 15.16); (c) motion vector field.

3.5 Application to coding of stereoscopic sequences (3-D TV)

3.5.1 The general context of 3-D TV

As Figure 18 illustrates, a three-dimensional television system (3-D TV) consists of the following elements:

- a stereoscopic capture system (at least two cameras, calibrated or not);
- a coder-decoder implementing a compression phase for the transmission or storage of the stereoscopic sequences;
- a 3-D display, for which various technologies exist: dual screens with polarizing filters, glasses with synchronized shutters, lenticular-plate screens, ...

Figure 18: General scheme of a 3-D TV system.

The motion estimation algorithms using stereovision-motion cooperation, mentioned in the previous paragraphs, integrate naturally into such an applicational context in order to analyze the stereoscopic source sequences and to code them by motion and/or disparity compensation.

3.5.2 Stereoscopic sequence coding strategies

We remain within the context of compatible coding-decoding-restitution approaches, i.e., approaches which permit the restitution of a monocular view if the receiver does not have a 3-D display. Two definitions of compatibility can then be introduced (see Figure 19):

1. In the first approach, we assume the coding of one of the stereoscopic sequences (for example the left one, as illustrated in Figure 19) by a standard monocular sequence compression technique. The second sequence is then coded by:
- disparity compensation [57] (example in Figure 14);
- motion compensation [47], [10] (examples in Figures 15 to 17).
The second coding channel is thus used to transmit the compensation errors and, if the disparity and motion information fields are used non-predictively in the compensation scheme, these fields as well. In this case, an effective stereo-motion cooperation approach makes it possible:
- to compare the two possible types of compensation;
- to restrict the volume of information representing these fields, by taking into account the geometric dependence equations which link them (the coherence equations described just before);
- to minimize the depth-perception artefacts linked to an independent view-by-view reconstruction by purely monocular approaches.
(A minimal sketch of such a disparity-compensated prediction is given after this list.)

2. The second approach appears as an attractive, but more difficult to achieve, extension of the previous notion of compatibility. Prior to any coding of the stereoscopic sequences, a joint stereo-motion analysis is carried out. This processing phase generates, on the one hand, a "compatible" monocular sequence whose viewpoint can be situated at an intermediate position between those of the left and right cameras and, on the other hand, innovation information (identical in nature to the compensation-error information described previously) with respect to this compatible sequence. Such an approach is well adapted to the use of 3-D motion + structure estimation methods which, once carried out, make it possible to synthesize the perceived 3-D scene from any viewing angle. This coding strategy, made difficult by the even more imprecise nature of the 3-D parameter estimates obtained on real stereoscopic sequences, can be considered as a natural extension of the analysis-synthesis or object-oriented coding approaches described in paragraph 8.2.6 for simple objects.

Figure 19: Compatibility approach for the transmission of stereoscopic image sequences.
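As announced in the first strategy above, here is a minimal sketch of disparity-compensated prediction of the right view from the decoded left view, together with the residual that the second channel would carry. It assumes integer horizontal disparities and ignores occlusions, so it only illustrates the principle and is not the variable block-size method of [57]:

```python
import numpy as np

def disparity_compensate(left, disparity):
    """Predict the right view: right_pred(y, x) = left(y, x + d(y, x)).

    left      : (H, W) decoded left image.
    disparity : (H, W) integer horizontal disparities (right-to-left correspondence).
    """
    H, W = left.shape
    xs = np.clip(np.arange(W)[None, :] + disparity.astype(int), 0, W - 1)
    ys = np.repeat(np.arange(H)[:, None], W, axis=1)
    return left[ys, xs]

def residual(right, left, disparity):
    """Compensation error transmitted on the second coding channel."""
    return right.astype(np.int16) - disparity_compensate(left, disparity).astype(np.int16)
```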
References

[1] G. Adiv, "Determining three-dimensional motion and structure from optical flow generated by several moving objects", IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. PAMI-7, pp. 384-401, July 1985.
[2] G. Adiv, "Inherent ambiguities in recovering 3D motion and structures from a noisy flow field", IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. PAMI-11, pp. 477-489, May 1989.
[3] K. Aizawa, H. Harashima, and T. Saito, "Model-based analysis synthesis image coding (MBASIC) system for a person's face", Signal Processing: Image Communication, Vol. 1, pp. 139-152, 1989.
[4] P. Anandan, "A unified perspective on computational techniques for the measurement of visual motion", Proc. of the 1st Int. Conf. on Computer Vision, pp. 219-230, May 1987.
[5] J.L. Barron, A.D. Jepson, and J.K. Tsotsos, "The feasibility of motion and structure from noisy time-varying image velocity information", Int. Journal of Computer Vision, pp. 239-269, 1990.
[6] P. Bouthemy and J. Santillana-Rivero, "A hierarchical likelihood approach for region segmentation according to motion-based criteria", Proc. of the 1st Int. Conf. on Computer Vision, London, pp. 463-467, 1987.
[7] N. Diehl, "Object-oriented motion estimation and segmentation in image sequences", Signal Processing: Image Communication, Vol. 3, No. 1, pp. 23-56, 1991.
[8] E. Dubois, "Motion-compensated filtering of time-varying images", Multidimensional Systems and Signal Processing, No. 3, pp. 211-239, 1992.
[9] J.L. Dugelay and B. Choquet, "A 3D image analysis algorithm and stereoscopic television", Proc. of Festival Int. des Images 3D, Paris, Sept. 1991.
[10] J.L. Dugelay and D. Pele, "Motion and disparity analysis of a stereoscopic sequence: application to 3DTV encoding", European Conference on Signal Processing, EUSIPCO'92, Aug. 1992.
[11] R. Forchheimer and O. Fahlander, "Low bit rate coding through animation", Picture Coding Symposium, PCS'83, Davis, March 1983.
[12] R. Forchheimer, O. Fahlander, and T. Kronander, "A semantic approach to the transmission of face images", Picture Coding Symposium, PCS'84, Rennes, July 1984.
[13] E. Francois and P. Bouthemy, "The derivation of qualitative information in motion analysis", Proc. of the 1st European Conf. on Computer Vision, ECCV'90, pp. 226-230, 1990.
[14] E. Francois, Interprétation qualitative du mouvement à partir d'une séquence d'images, Ph.D. thesis, Université de Rennes-I, June 1991.
[15] R. Hartley, "Segmentation of optical flow fields by pyramid linking", Pattern Recognition Letters, Vol. 3, pp. 253-262, July 1985.
[16] M. Hoetter, "Differential estimation of the global motion parameters zoom and pan", Signal Processing, Vol. 16, pp. 249-265, 1989.
[17] B.K.P. Horn and B. Schunck, "Determining optical flow", Artificial Intelligence, Vol. 17, pp. 185-203, 1981.
[18] B.K.P. Horn and J.R. Weldon, "Direct methods for recovering motion", Int. Journal of Computer Vision, Vol. 2, pp. 51-76, 1988.
[19] M. Hötter, "Object-oriented analysis-synthesis coding based on moving two-dimensional objects", Signal Processing: Image Communication, Vol. 2, pp. 409-429, 1990.
[20] M. Kanado, A. Koike and Y. Hatori, "Codings with knowledge-based analysis of motion pictures", Picture Coding Symposium, PCS'87, Stockholm, June 1987.
[21] Y.C. Kim and J.K. Aggarwal, "Determining object motion in a sequence of stereo images", IEEE Journal of Robotics and Automation, Vol. 3, No. 6, pp. 599-614, Dec. 1987.
[22] C. Labit and H. Nicolas, "Compact motion representation based on global features for semantic image sequence coding", Proc. of the SPIE Conf. on Visual Communication and Image Processing, VCIP'91, Vol. 2, pp. 697-709, Nov. 1991.
[23] M.K. Leung and T.S. Huang, "An integrated approach to 3D motion analysis and object recognition", IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. PAMI-13, No. 10, pp. 1075-1084, Oct. 1991.
[24] S.X. Li and M.H. Loew, "The quadcode and its arithmetic", Communications of the ACM, pp. 621-631, July 1987.
[25] L. Lingxiao, T.S. Huang et al., "Motion estimation from 3-D point sets with and without correspondences", Proc. of the Conf. on Computer Vision and Pattern Recognition, CVPR'86, pp. 194-201, 1986.
[26] H.C. Longuet-Higgins, "A computer algorithm for reconstructing a scene from two projections", Nature, Vol. 293, pp. 133-135, Sept. 1981.
[27] A. Mitiche and P. Bouthemy, "Tracking modelled objects using binocular images", Computer Vision, Graphics and Image Processing, Vol. 32, pp. 384-396, 1985.
[28] A. Mitiche, "On kineopsis and computation of structure and motion", IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. PAMI-8, No. 1, pp. 109-112, Jan. 1986.
[29] Y. Miyamoto and M. Ohta, "Global motion compensation for rotation and zooming image", Proc. of the Picture Coding Symposium, PCS'91, pp. 137-140, Sept. 1991.
[30] H.-G. Musmann, M. Hötter and J. Ostermann, "Object-oriented analysis-synthesis coding of moving images", Signal Processing: Image Communication, Vol. 1, pp. 117-138, 1989.
[31] N. Navab, Z. Zhang and O.D. Faugeras, "Tracking, motion and stereo", Proc. of the Scandinavian Conf. on Image Analysis, SCIA'91, pp. 98-105, 1991.
[32] S. Negahdaripour and A. Yuille, Direct passive navigation, I: analytical solutions for planes, AI Memo 863, MIT Artificial Intelligence Lab, August 1985.
[33] S. Negahdaripour and A. Yuille, "Direct passive navigation, II: analytical solutions for quadratic patches", Conf. on Computer Vision and Pattern Recognition, CVPR'88, pp. 404-410, 1988.
[34] A.N. Netravali, T.S. Huang et al., "Algebraic methods in 3D motion estimation from two-view point correspondences", Int. Journal of Imaging Systems and Technology, Vol. 1, pp. 78-99, 1989.
[35] A.N. Netravali and J.D. Robbins, "Motion compensated television coding: Part I", Bell Syst. Tech. Journal, Vol. 58, No. 3, pp. 631-670, March 1979.
[36] A.N. Netravali and J. Salz, "Algorithms for estimation of three-dimensional motion", AT&T Technical Journal, Vol. 64, No. 2, Feb. 1985.
[37] H. Nicolas and C. Labit, "Global motion identification for image sequence analysis and coding", Proc. of the Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP'91, Vol. 4, pp. 2825-2828, May 1991.
[38] H. Nicolas and C. Labit, "Region-based motion estimation using deterministic relaxation schemes for image sequence coding", Proc. of the Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP'92, Vol. 3, pp. 265-268, March 1992.
[39] H. Nicolas, Hiérarchie de modèles de mouvement et méthodes d'estimation associées. Application au codage de séquences d'images, Ph.D. thesis, Université de Rennes-I, Sept. 1992.
[40] J. Rissanen, "Modeling by shortest data description", Automatica, Vol. 14, pp. 465-472, 1986.
[41] H. Samet, "Quadtree from boundary codes", Communications of the ACM, pp. 163-170, March 1980.
[42] H. Sanson, "Motion affine models identification and application to television image coding", SPIE Conf. on Visual Communication and Image Processing, VCIP'91, Vol. 1605, pp. 570-581, Nov. 1991.
[43] J. Santillana-Rivero, P. Bouthemy and C. Labit, "Hierarchical motion-based image segmentation applied to HDTV", 2nd Int. Workshop on Signal Processing of HDTV, L'Aquila, March 1988.
[44] P.Y. Simard and G.E. Mailloux, "A projection operator for the restoration of divergence-free vector fields", IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. PAMI-10, No. 2, pp. 248-256, 1988.
[45] M. Subbarao and A.M. Waxman, "On the uniqueness of image flow solutions for planar surfaces in motion", Computer Vision, Graphics and Image Processing, Vol. 36, pp. 208-220, 1986.
[46] A. Tamtaoui and C. Labit, "Constrained disparity and motion estimators for 3DTV image sequence coding", Signal Processing: Image Communication, Vol. 4, pp. 45-54, 1991.
[47] A. Tamtaoui, Coopération stéréovision-mouvement pour la compression de séquences stéréoscopiques. Application à la télévision en relief (TV3D), Ph.D. thesis, Université de Rennes-I, Oct. 1992.
[48] A. Tamtaoui and C. Labit, "Constrained motion estimators for 3D sequence coding", Proc. of the European Conf. on Signal Processing, EUSIPCO'92, Brussels, Aug. 1992.
[49] M. Tistarelli, E. Grosso and G. Sandini, "Dynamic stereo in visual navigation", Proc. of the Conf. on Computer Vision and Pattern Recognition, CVPR'91, pp. 186-192, 1991.
[50] Y.T. Tse and R. Baker, "Global zoom/pan estimation and compensation for video compression", Proc. of the Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP'91, Vol. 4, pp. 2725-2728, May 1991.
[51] A. Verri, F. Girosi and V. Torre, "Mathematical properties of the two-dimensional motion field: from singular points to motion parameters", Journal of the Optical Society of America, Vol. 6, No. 5, pp. 698-712, May 1989.
[52] A.M. Waxman and K. Wohn, "Contour evolution, neighborhood deformation, and global image flow: planar surfaces in motion", Int. Journal of Robotics Research, Vol. 4, No. 3, pp. 95-108, 1985.
[53] A.M. Waxman and S. Sinha, "Dynamic stereo: passive ranging to moving objects from relative image flows", IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. PAMI-8, No. 4, pp. 406-412, July 1986.
[54] A.M. Waxman and J.H. Duncan, "Binocular image flows: steps toward stereo-motion fusion", IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. PAMI-8, No. 6, pp. 715-729, Nov. 1986.
[55] A.M. Waxman and K. Wohn, "Image flow theory: a framework for 3-D inference from time-varying imagery", Chapter 3 in Advances in Computer Vision, Erlbaum Associates, London, pp. 164-224, 1988.
[56] S.F. Wu and J. Kittler, "A differential method for simultaneous estimation of rotation, change of scale and translation", Signal Processing: Image Communication, Vol. 2, pp. 69-80, 1990.
[57] M. Ziegler, "Disparity estimation using variable blocksize", Proc. of the 3rd COST 230 Workshop on 3DTV Signal Processing, Rennes, 1992.
