Efficient Video Rectification and Stabilisation for Cell-Phones

Erik Ringaby and Per-Erik Forssén
Linköping University Post Print

N.B.: When citing this work, cite the original article. The original publication is available at www.springerlink.com: Erik Ringaby and Per-Erik Forssén, Efficient Video Rectification and Stabilisation for Cell-Phones, 2012, International Journal of Computer Vision, (96), 3, 335-352. http://dx.doi.org/10.1007/s11263-011-0465-8
Copyright: Springer Verlag (Germany) http://www.springerlink.com/
Postprint available at: Linköping University Electronic Press http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-75277

Efficient, High-Quality Video Stabilisation for Cell-Phones
Erik Ringaby · Per-Erik Forssén

Abstract This article presents a method for rectifying and stabilising video from cell-phones with rolling shutter (RS) cameras. Due to size constraints, cell-phone cameras have constant, or near constant, focal length, making them an ideal application for calibrated projective geometry. In contrast to previous RS rectification attempts that model distortions in the image plane, we model the 3D rotation of the camera. We parameterise the camera rotation as a continuous curve, with knots distributed across a short frame interval. Curve parameters are found using non-linear least squares over inter-frame correspondences from a KLT tracker. By smoothing a sequence of reference rotations from the estimated curve, we can, at a small extra cost, obtain a high-quality image stabilisation. Using synthetic RS sequences with associated ground truth, we demonstrate that our rectification improves over two other methods. We also compare our video stabilisation with the methods in iMovie and Deshaker.

Keywords Cell-phone · Rolling shutter · CMOS · Video stabilisation

1 Introduction

Almost all new cell-phones have cameras with CMOS image sensors.
CMOS sensors have several advantages over the conventional CCD sensors: notably, they are cheaper to manufacture, and typically offer on-chip processing (Gamal and Eltoukhy, 2005), e.g. for automated white balance and auto-focus measurements.

E. Ringaby and P.-E. Forssén
Department of Electrical Engineering, Linköping University
SE-581 83 Linköping, Sweden
Tel.: +46(0)13 281302 (Ringaby)
E-mail: ringaby@isy.liu.se

Fig. 1 Example of rolling shutter imagery. Top left: Frame from an iPhone 3GS camera sequence acquired during fast motion. Top right: Rectification using our rotation method. Bottom left: Rectification using the global affine method. Bottom right: Rectification using the global shift method. Corresponding videos are available on the web (Ringaby, 2010).

However, most CMOS sensors, by design, make use of what is known as a rolling shutter (RS). In an RS camera, detector rows are read and reset sequentially. As the detectors collect light right until the time of readout, each row is exposed during a slightly different time window. The more conventional CCD sensors, on the other hand, use a global shutter (GS), where all pixels are reset simultaneously, and collect light during the same time interval. The downside of a rolling shutter is that since pixels are acquired at different points in time, motion of either camera or target will cause geometric distortions in the acquired images. Figure 1 shows an example of geometric distortions caused by using a rolling shutter, and the result of our rectification step, as well as two other methods.
(A notable exception, where translation is the dominant component, is footage from a moving platform, such as a car.) Many new camcorders thus have mechanical image stabilisation (MIS) systems that move the lenses (some instead move the sensor) to compensate for small pan and tilt rotational motions (image-plane rotations, and large motions, are not handled). The MIS parameters are typically optimised for the frequency range caused by a person holding a camera, and thus work well in such situations. However, since lenses have a certain mass, and thus inertia, MIS has problems keeping up with faster motions, such as those caused by vibrations from a car engine. Furthermore, cell-phones and lower-end camcorders lack MIS, and videos recorded with these will exhibit RS artifacts. For cases where MIS is absent, or non-effective, one can instead do post-capture image rectification. There exist a number of different approaches for dealing with special cases of this problem (Chang et al., 2005; Liang et al., 2008; Nicklin et al., 2007; Cho and Kong, 2007; Cho et al., 2007; Chun et al., 2008). Some algorithms assume that the image deformation is caused by a globally constant translational motion across the image (Chang et al., 2005; Nicklin et al., 2007; Chun et al., 2008). After rectification this would correspond to a constant optical flow across the entire image, which is rare in practice. Liang et al. (2008) improve on this by giving each row a different motion, found by interpolating between constant global inter-frame motions using a Bézier curve. Another improvement is due to Cho and Kong (2007) and Cho et al. (2007). Here, geometric distortion is modelled as a global affine deformation that is parametrised by the scan-line index. Recently, Baker et al. (2010) improved on Cho and Kong (2007) and Liang et al. (2008) by using not one model per frame, but instead blending linearly between up to 30 affine or translational models across the image rows.
This means that their model can cope with motions that change direction several times across a frame. This was made possible by using a high-quality dense optical flow field. They also used an L1 optimisation based on linear programming. However, they still model distortions in the image plane. All current RS rectification approaches perform warping of individual frames to rectify RS imagery. Note that a perfect compensation under camera translation would require the use of multiple images, as the parallax induced by a moving camera will cause occlusions. Single-frame approximations do however have the advantage that the ghosting artifacts caused by multi-frame reconstruction are avoided, and they are thus often preferred in video stabilisation (Liu et al., 2009). Other related work on RS images includes structure and motion estimation. Geyer et al. (2005) study the projective geometry of RS cameras, and also describe a calibration technique for estimation of the readout parameters. The derived equations are then used for structure and motion estimation in synthetic RS imagery. Ait-Aider et al. (2007) demonstrate that motion estimation is possible from single rolling shutter frames if world point-to-point distances are known, or from curves that are known to be straight lines in the world. They also demonstrate that structure and motion estimation can be done from a single stereo pair if one of the used cameras has a rolling shutter (Ait-Aider and Berry, 2009). Video stabilisation has a long history in the literature; an early example is Green et al. (1983). The simplest approaches apply a global correctional image-plane translation (Green et al., 1983; Cho and Kong, 2007). A slightly more sophisticated approach is a global affine model. A special case of this is the zoom-and-translation model used by Cho et al. (2007). A more sophisticated approach is to use rotational models.
Such approaches only compensate for the 3D rotational motion component, and neglect the translations, in a similar manner to mechanical stabilisation rigs (Yao et al., 1995). Rotational models estimate a compensatory rotational homography, either using instantaneous angular velocity (Yao et al., 1995) (differential form), or using inter-frame rotations, e.g. represented as unit quaternions (Morimoto and Chellappa, 1997). Since only rotations are corrected for, there are no parallax-induced occlusions to consider, and thus single-frame reconstructions are possible. The most advanced (and computationally demanding) video stabilisation algorithms make use of structure-from-motion (SfM). An early example is the quasi-affine SfM explored by Buehler et al. (2001). These methods attempt to also correct for parallax changes in the stabilised views. This works well on static scenes, but introduces ghosting when the scene is dynamic, as blending from multiple views is required (Liu et al., 2009). A variant of SfM-based stabilisation is the content-preserving warps introduced by Liu et al. (2009). Here single frames are used in the reconstruction, and geometric correctness is traded for perceptual plausibility. Recently, Liu et al. (2011a) presented a new stabilisation approach based on subspace constraints on 2D feature trajectories. This has the advantage that it does not rely on SfM, which is computationally heavy and sensitive to rolling shutter artifacts. The algorithm does not explicitly model a rolling shutter; instead it is treated as noise. The new algorithm can deal with rolling shutter wobble from camera shake, but not with the shear introduced by a panning motion.

1.2 Contributions

All the previous approaches to rectification of RS video (Chang et al., 2005; Liang et al., 2008; Nicklin et al., 2007; Cho and Kong, 2007; Cho et al., 2007; Chun et al., 2008; Baker et al., 2010) model distortions as taking place in the image plane.
We instead model the 3D camera motion using calibrated projective geometry. In this article, we focus on a distortion model based on 3D camera rotation, since we have previously shown that it outperforms a combined rotation and translation model (Forssén and Ringaby, 2010). Here we extend the rotational model to use multiple knots across a frame. This enables the algorithm to detect non-constant motions during a frame capture. We also introduce a scheme for positioning of the knots that makes the optimisation stable. We demonstrate how to apply smoothing to the obtained rotation sequence to obtain a high-quality video stabilisation, at a low computational cost. This results in a stabilisation similar to rotational models previously used on global shutter cameras. However, as we perform the filtering offline, the amount of smoothing can be decided by the user, post capture. This can e.g. be done using a slider in a video editing application running on a cell-phone. We compare our video stabilisation with the methods in iMovie and Deshaker. We also introduce a rectification technique using forward interpolation. Many new smartphones have hardware for graphics acceleration, e.g. using the OpenGL ES 2.0 application programming interface. Such hardware can be exploited using our forward interpolation technique, to allow rectification and stabilisation during video playback. We recently introduced the first (and currently only) dataset for evaluation of algorithms that rectify rolling shutter video (Ringaby, 2010). We now extend it with another two sequences, and add a more sophisticated comparison between rectifications and ground truth. Using the dataset, we compare our own implementations of the global affine model (Liang et al., 2008), and the global shift model (Chun et al., 2008) to the new method that we propose. Our dataset, evaluation code and supplementary videos are available for download at (Ringaby, 2010).
1.3 Overview

This article is organised as follows: Section 2 describes how to calibrate a rolling-shutter camera. Section 3 introduces the model and cost function for camera ego-motion estimation. Section 4 discusses interpolation schemes for rectification of rolling-shutter imagery. Section 5 describes how to use the estimated camera trajectory for video stabilisation. Section 6 describes the algorithm complexity and cell-phone implementation feasibility. Section 7 describes our evaluation dataset. In Section 8 we use our dataset to compare different rectification strategies, and to compare our 3D rotation model to our own implementations of Liang et al. (2008) and Chun et al. (2008). We also compare our stabilisation to iMovie and Deshaker. In Section 9 we describe the supplemental material and discuss the performance of the algorithms. The article concludes with outlooks and concluding remarks in Section 10.

2 Rolling Shutter Camera Calibration

In this article, we take the intrinsic camera matrix, the camera frame rate and the inter-frame delay to be given. This reduces the number of parameters that need to be estimated on-line, but also requires us to calibrate each camera before the algorithms can be used. On camera-equipped cell-phones, such calibration makes good sense, as the parameters stay fixed (or almost fixed, in the case of variable-focus cameras) throughout the lifetime of a unit. We have even found that transferring calibrations between cell-phones of the same model works well.

2.1 Geometric Calibration

A 3D point, X, and its projection in the image, x, given in homogeneous coordinates, are related according to

x = KX, and X = λK^-1 x,  (1)

where K is a 5-DOF upper triangular 3 × 3 intrinsic camera matrix, and λ is an unknown scaling (Hartley and Zisserman, 2000). We have estimated K for a number of cell-phones using the calibration plane method of Zhang (2000), as implemented in OpenCV. Note that many high-end cell-phones have variable focus.
As this is implemented by moving the single lens of the camera back and forth, the camera focal length will vary slightly. On e.g. the iPhone 3GS the field-of-view varies by about 2°. However, we have found that such small changes in the K matrix make no significant difference in the result.

2.2 Readout Time Calibration

The RS chip frame period 1/f (where f is the frame rate) is divided into a readout time tr, and an inter-frame delay, td, as

1/f = tr + td.  (2)

The readout time can be calibrated by imaging a flashing light source with known frequency (Geyer et al., 2005), see appendix A for details. For a given frame rate f, the inter-frame delay can then be computed using (2). For our purposes it is preferable to use rows as the fundamental unit, and express the inter-frame delay as a number of blank rows:

Nb = Nr td / (1/f) = Nr (1 − tr f).  (3)

Here Nr is the number of image rows.

Fig. 2 Rotations, R2, R3, ..., RM found by non-linear optimisation. Intermediate rotations are defined as interpolations of these, and R1 = I. Readout time tr, and inter-frame delay td are also shown.

3 Camera Motion Estimation

Our model of camera motion is a rotation about the camera centre during frame capture, in a smooth, but time-varying way. Even though this model is violated across an entire video clip, we have found it to work quite well if used on short frame intervals of 2-4 frames (Forssén and Ringaby, 2010). We represent the model as a sequence of rotation matrices, R(t) ∈ SO(3). Two homogeneous image points x and y, that correspond in consecutive frames, are now expressed as:

x = KR(Nx)X, and y = KR(Ny)X,  (4)

where Nx and Ny correspond to the time parameters for point x and y respectively. This gives us the relation:

x = KR(Nx)R^T(Ny)K^-1 y.  (5)

The time parameter is a linear function of the current image row (i.e. x2/x3 and y2/y3). Thus, by choosing the unit of time as image rows, and time zero as the top row of the first frame, we get Nx = x2/x3 for points in the first image. In the second image we get Ny = y2/y3 + Nr + Nb, where Nr is the number of image rows in a frame, and Nb is defined in (3).

Each correspondence between the two views, (5), gives us two equations (after elimination of the unknown scale) where the unknowns are the rotations. Unless we constrain the rotations further, we now have six unknowns (a rotation can be parametrised with three parameters) for each correspondence. We thus parametrise the rotations with an interpolating linear spline with a number of knots placed over the current frame window, see figure 2 for an example with three frames and M = 6 knots. Intermediate rotations are found using spherical linear interpolation (Shoemake, 1985). As we need a reference world frame, we might as well fixate that to the start of frame 1, i.e. set R1 = I. This gives us 3(M − 1) unknowns in total for a group of M knots.

3.1 Motion Interpolation

Due to the periodic structure of SO(3), interpolation is more complicated than regular linear interpolation. We have chosen to represent rotations as three-element vectors, n, where the magnitude, φ, corresponds to the rotation angle, and the direction, n̂, is the axis of rotation, i.e. n = φn̂. This is a minimal parametrisation of rotations, and it also ensures smooth variations, in contrast to e.g. Euler angles. It is thus suitable for parameter optimisation. The vector n can be converted to a rotation matrix using the matrix exponent, which for a rotation simplifies to the Rodrigues formula:

R = expm(n) = I + [n̂]x sin φ + [n̂]x^2 (1 − cos φ),  (6)

where [n̂]x = (1/φ) [ 0 −n3 n2 ; n3 0 −n1 ; −n2 n1 0 ].  (7)

Conversion back to vector form is accomplished through the matrix logarithm in the general case, but for a rotation matrix, there is a closed-form solution. We note that two of the terms in (6) are symmetric, and thus terms of the form rij − rji will come from the antisymmetric part alone.
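The axis-angle conversions and the SLERP step they enable can be sketched in NumPy as follows (a minimal sketch; the function names `expm_rot`, `logm_rot` and `slerp` are our own):

```python
import numpy as np

def expm_rot(n):
    """Rodrigues formula: axis-angle vector n = phi * n_hat -> rotation matrix."""
    phi = np.linalg.norm(n)
    if phi < 1e-12:
        return np.eye(3)
    nx = np.array([[0.0, -n[2], n[1]],
                   [n[2], 0.0, -n[0]],
                   [-n[1], n[0], 0.0]]) / phi  # cross-product matrix of n_hat
    return np.eye(3) + nx * np.sin(phi) + nx @ nx * (1.0 - np.cos(phi))

def logm_rot(R):
    """Closed-form matrix logarithm for a rotation matrix.
    Uses atan2 of the antisymmetric part and the trace, which is
    numerically safe for small angles."""
    n_tilde = np.array([R[2, 1] - R[1, 2],
                        R[0, 2] - R[2, 0],
                        R[1, 0] - R[0, 1]])
    norm = np.linalg.norm(n_tilde)
    if norm < 1e-12:
        return np.zeros(3)
    phi = np.arctan2(norm, np.trace(R) - 1.0)
    return phi * n_tilde / norm

def slerp(n1, n2, tau):
    """Spherical linear interpolation between axis-angle vectors n1, n2,
    with interpolation parameter tau in [0, 1]."""
    n_diff = logm_rot(expm_rot(-n1) @ expm_rot(n2))
    return expm_rot(n1) @ expm_rot(tau * n_diff)
```

For example, interpolating halfway between the identity and a 90° rotation about the optical axis yields a 45° rotation about the same axis.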
Conversely, the trace is only affected by the symmetric parts. This allows us to extract the axis and angle as:

n = logm(R) = φn̂, where  (8)
ñ = (r32 − r23, r13 − r31, r21 − r12)^T,
φ = tan^-1(||ñ||, tr R − 1), and n̂ = ñ/||ñ||.

It is also possible to extract the rotation angle from the trace of R alone (Park and Ravani, 1997). We recommend (8), as it avoids numerical problems for small angles. Using (6) and (8), we can perform SLERP (Spherical Linear intERPolation) (Shoemake, 1985) between two rotations n1 and n2, using an interpolation parameter τ ∈ [0, 1] as follows:

ndiff = logm(expm(−n1) expm(n2))  (9)
Rinterp = expm(n1) expm(τ ndiff)  (10)

3.2 Optimisation

We now wish to solve for the unknown motion parameters, using iterative minimisation. For this we need a cost function:

J = J(n1, ..., nN).  (11)

To this end, we choose to minimise the (symmetric) image-plane residuals of the set of corresponding points xk ↔ yk:

J = Σ_{k=1}^{K} d(xk, Hyk)^2 + d(yk, H^-1 xk)^2,  (12)
where H = KR(xk)R^T(yk)K^-1.  (13)

Here the distance function d(x, y) for homogeneous vectors is given by:

d(x, y)^2 = (x1/x3 − y1/y3)^2 + (x2/x3 − y2/y3)^2.  (14)

For each frame interval we have M knots for the linear interpolating spline. The first knot is at the top row of the first frame, and the last knot at the bottom row of the last frame in the interval. The position of a knot, the knot time (Nm), is expressed in the unit rows, where rows are counted from the start of the first frame in the interval. E.g. a knot at the top row of the second frame would have the knot time Nm = Nr + Nb and correspond to the rotation nm. We denote the evaluation of the spline at a row value Ncurr by:

R = SPLINE({nm, Nm}_1^M, Ncurr).  (15)

The value is obtained using SLERP as:

R = SLERP(nm, nm+1, τ), for  (16)
τ = (Ncurr − Nm)/(Nm+1 − Nm), where Nm ≤ Ncurr ≤ Nm+1.  (17)

Here SLERP is defined in (9,10), Ncurr is the current row relative to the top row in the first frame, and Nm, Nm+1 are the two neighbouring knot times.

For speed, and to ensure that the translation component of the camera motion is negligible, we have chosen to optimise over short intervals of N = 2, 3 or 4 frames. We have chosen to place the knots at different heights for the different frames within the frame interval, see figure 3 for two examples. If the knots in two consecutive frames have the same height, the optimisation might drift sideways without increasing the residuals.

Fig. 3 Top: Frame interval of 2 frames with 6 knots. Bottom: Frame interval of 3 frames with 9 knots.

There is a simple way to initialise a new interval from the previous one. Once the optimiser has found a solution for a group of frames, we change the origin to the second camera in the sequence (see figure 2). This shift of origin is described by the rotation:

Rshift = SPLINE({nm, Nm}_1^M, Nr + Nb).  (18)

Then we shift the interval one step, by re-sampling the M spline knots {nm}_1^M with an offset of Nr + Nb:

Rm = SPLINE({nk, Nk}_1^M, Nm + Nr + Nb),  (19)

and finally correct them for the change of origin, using

n'm = logm(R^T_shift Rm), for m = 1, ..., L,  (20)

where NL is the last knot time inside the frame interval. As initialisation of the rotations in the newly shifted-in frame, we copy the parameters for the last valid knot, n'L.

3.3 Point Correspondences

The point correspondences needed to estimate the rotations are obtained through point tracking. We use Harris points (Harris and Stephens, 1988) detected in the current frame, and the chosen points are tracked using the KLT tracker (Lucas and Kanade, 1981; Shi and Tomasi, 1994). The KLT tracker uses a spatial intensity gradient search which minimises the Euclidean distance between the corresponding patches in consecutive frames. We use the scale-pyramid implementation of the algorithm in OpenCV. Using pyramids makes it easier to detect large movements.
To increase the accuracy of the point tracker, a crosschecking procedure is used (Baker et al., 2007). When the points have been tracked from the first image to the second, the tracking is reversed, and only the points that return to the original position (within a threshold) are kept. The computational cost is doubled, but outliers are removed effectively.

3.3.1 Correspondence Accuracy Issues

If the light conditions are poor, as is common indoors, and the camera is moving quickly, the video will contain motion blur. This is a problem for the KLT tracker, which will have difficulties finding correspondences, and many tracks will be removed by the crosschecking step. If the light conditions are good enough for the camera to have a short exposure, and a frame has severe rolling shutter artifacts, the KLT tracker may also have problems finding any correspondences, due to local shape distortions. We have however not seen this in any videos, not even during extreme motions. The tracking can on the other hand result in a small misalignment, or a lower number of correspondences, when two consecutive frames have very different motions (and thus different local appearances). It may be possible to reduce the influence of rolling-shutter distortion on the tracking accuracy, by detecting and tracking points again, in the rectified images. This is however difficult, because the inverse mapping back to the original image is not always one-to-one, and it may thus not be possible to express these new correspondences in the original image grid. How to improve the measurements in this way is an interesting future research possibility.

4 Image Rectification

Once we have found our sequence of rotation matrices, we can use them to rectify the images in the sequence. Each image row has experienced a rotation according to (15). This rotation is expressed relative to the start of the frame, and applying it to the image points would thus rectify all rows to the first row.
We can instead align them all to a reference row, with rotation Rref (we typically use the middle row), using a compensatory rotation:

R'(x) = Rref R^T(x).  (21)

This gives us a forward mapping for the image pixels:

x' = K Rref R^T(x) K^-1 x.  (22)

This tells us how each point should be displaced in order to rectify the scene. Using this relation we can transform all the pixels to their new, rectified locations. We have chosen to perform the rectifying interpolation by utilising the parallel structure of the GPU. A grid of vertices can be bound to the distorted rolling shutter image, and rectified with (22) using an OpenGL Shading Language (GLSL) vertex shader. On a modern GPU, a dense grid with one vertex per pixel is not a problem, but on those with less computing power, e.g. mobile phones with OpenGL ES 2.0, the grid can be heavily down-sampled without loss of quality. We have used a down-sampling factor of 8 along the columns, and 2 along the rows. This gave a similar computational cost as inverse interpolation on an Nvidia G80, but without noticeable loss of accuracy compared to dense forward interpolation. By defining a grid larger than the input image, we can also use the GPU texture out-of-bound capability to get a simple extrapolation of the result. It is also tempting to use regular, or inverse, interpolation, i.e. to invert (22) to obtain:

x = K R(x) R^T_ref K^-1 x'.  (23)

By looping over all values of x', and using (23) to find the pixel locations in the distorted image, we can interpolate these cubically. If we have large motions, this approach will however have problems, since different pixels within a row should be transformed with different homographies, see figure 4. Every pixel within a row in the distorted image does however share the same homography, as we described in the forward interpolation case. For a comparison between the results of forward and inverse interpolation, see figure 5.
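The forward mapping (22) over a down-sampled vertex grid can be sketched on the CPU as follows, mirroring what the GLSL vertex shader computes (a minimal sketch; the helper name and grid steps are our own, and `R_rows` is assumed to hold one estimated rotation matrix per image row):

```python
import numpy as np

def forward_rectify_grid(K, R_rows, R_ref, h, w, step_y=8, step_x=2):
    """Rectified positions x' = K R_ref R(row)^T K^-1 x for a sparse
    vertex grid. R_rows[r] is the rotation estimated for image row r,
    R_ref the reference-row rotation."""
    K_inv = np.linalg.inv(K)
    grid = {}
    for y in range(0, h, step_y):
        # All pixels within one row share the same homography.
        H = K @ R_ref @ R_rows[y].T @ K_inv
        for x in range(0, w, step_x):
            p = H @ np.array([x, y, 1.0])
            grid[(x, y)] = (p[0] / p[2], p[1] / p[2])
    return grid
```

When the per-row rotations all equal the reference rotation (a static camera), the homography reduces to the identity and each vertex maps to itself; in general, the GPU then interpolates the image between the displaced vertices.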
5 Video Stabilisation

Our rectification algorithm also allows for video stabilisation, by smoothing the estimated camera trajectory. The rotation for the reference row Rref in section 4 is typically set to the current frame's middle row. If we instead do a temporal smoothing of these rotations for a sequence of frames, we get a more stabilised video.

Fig. 4 Left: RS distorted image. Right: Output image.

Fig. 5 Example of forward and inverse interpolation with large motion. Top left: Rolling shutter frame. Top right: Ground truth. Bottom left: Rectification using inverse interpolation. Bottom right: Rectification using forward interpolation.

Fig. 6 Overview of algorithm complexity. The width of each box roughly corresponds to the fraction of time the algorithm spends there during one pass of optimisation, and one pass of smoothing and playback.

5.1 Rotation Smoothing

Averaging on the rotation manifold is defined as:

R* = arg min_{R ∈ SO(3)} Σ_k dgeo(R, Rk)^2,  (24)

where

dgeo(R1, R2)^2 = ||logm(R1^T R2)||^2_fro  (25)

is the geodesic distance on the rotation manifold (i.e. the relative rotation angle), ||·||_fro is the Frobenius matrix norm, and logm is defined in (8). Finding R* using (24) is iterative, and thus slow, but Gramkow (2001) showed that the barycentres of either the quaternions or the rotation matrices are good approximations. We have tried both methods, but only discuss the rotation matrix method here. The rotation matrix average is given by:

R̃k = Σ_{l=−n}^{n} wl Rk+l,  (26)

where the temporal window is 2n + 1, and wl are weights for the input rotations Rk+l. We have chosen wl as a Gaussian filter kernel. To avoid excessive correction at the start and end of the rotation sequence, we extend the sequence by replicating the first (and last) rotation a sufficient number of times. The output of (26) is not guaranteed to be a rotation matrix, but this can be enforced by constraining it to be orthogonal (Gramkow, 2001):

R̂k = USV^T, where UDV^T = svd(R̃k), and S = diag(1, 1, |U||V|).  (27)

5.2 Common Frame of Reference

Since each of our reference rotations Rref (see section 4) has its own local coordinate system, we have to transform them to a common coordinate system before we can apply rotation smoothing to them. We do this using the Rshift,k matrices in (18), which shift the origin from one frame interval to the next. Using these, we can recursively compute shifts to the absolute coordinate frame as:

Rabs,k = Rabs,k−1 Rshift,k, and Rabs,1 = I.  (28)

Here Rabs,k is the shift to the absolute coordinate system, and Rshift,k is the local shift computed in (18). We can now obtain reference rotations in the absolute coordinate frame:

R'ref,k = Rabs,k Rref,k.  (29)

After smoothing these, using (26) and (27), we change back to the local coordinate system by multiplying them with R^T_abs,k.

6 Algorithm Complexity

The rectification and stabilisation pipeline we have presented can be decomposed into five steps, as shown in figure 6. We have analysed the computational complexity of these steps on our reference platform, which is an HP Z400 workstation, running at 2.66 GHz (using one core). The graphics card used in the rectification step is an Nvidia GeForce 8800 GTS. As can be seen in figure 6, most computations are done in three preprocessing steps. These are executed once for each video clip: (1) Interest-point detection, (2) KLT tracking and crosschecking, and (3) Spline optimisation. We have logged execution times of these steps during processing of a 259 frame video clip, captured with an iPhone 3GS (the stabilisation example video in our CVPR paper (Forssén and Ringaby, 2010)). This video results in a highly variable number of trajectories (as it has large portions of textureless sky), and thus gives us good scatter plots. We present these timings in figure 7. The remaining two steps are (4) Rotation smoothing, and (5) Playback of stabilised video. Although these are much less expensive, these steps must be very efficient, as they will be run during interaction with a user when implemented in a cell-phone application. In the following subsections, we will briefly analyse the complexity of each of the five steps.

Fig. 7 Execution times for algorithm steps. Top left: Detection as function of number of points. Top right: Tracking as function of number of points. Bottom left: Optimisation as function of number of accepted trajectories. Dots are individual measurements, solid lines are least-squares line fits. Bottom right: Rotation smoothing time as function of σ in the Gaussian kernel.

6.1 Interest-Point Detection

The time consumption for the Harris interest point detection is fairly insensitive to the number of points detected. Instead, it is roughly linear in the number of pixels. If needed, it can with a small effort be implemented in OpenGL. Average absolute timing: 11.4 ms/frame.

6.2 KLT Tracking

The KLT tracker (see section 3.3) has a time complexity of O(K × I), where K is the number of tracked points, and I is the number of iterations per point. The average absolute timing for this step is 26.9 ms/frame, including crosschecking rejection. The tracking time as a function of the number of interest points is plotted in figure 7.

6.3 Spline Optimisation

The spline optimisation (see section 3.2) has a time complexity of O(N × I × M × F), where N is the number of points remaining after crosschecking rejection, I is the number of iterations, M is the number of knots used in the frame interval, and F is the number of frames in the interval. Average absolute timing: 14.4 ms/frame, when using M = 3 and the levmar library (http://www.ics.forth.gr/~lourakis/levmar/). The optimisation time as a function of the number of accepted trajectories is plotted in figure 7.

6.4 Rotation Smoothing

Rotation smoothing (see section 5.1) is run each time the user changes the amount of stabilisation desired. It is linearly dependent on the length of the video clip, and on the kernel size. Figure 7, bottom right, shows a plot of execution times over a 259 frame video clip (averaged over 10 runs).

6.5 Playback of Stabilised Video

The rectification and visualisation step is run once for each frame during video playback. This part is heavily dependent on the graphics hardware and its capability to handle calculations in parallel. The execution time can be kept approximately constant, even when the resolution of the video is changed, by using the same number of vertices for the vertex shader.

6.6 Cell-phone Implementation Feasibility

Our aim is to allow this algorithm to run on mobile platforms. It is designed to run in a streaming fashion, by making use of a sliding frame window. This has the advantage that the number of parameters to be estimated is kept low, together with a low memory requirement, which is important on mobile platforms. The most time-consuming part of our algorithm is usually the non-linear optimisation step, when optimising for many knots. Our sliding window approach does however enable us to easily initialise the new interval from the previous one. Since cell-phone cameras have constant focal length, we do not need to optimise for this. The KLT tracking step also takes a considerable amount of the total time. The time for both tracking and optimisation can however be regulated by changing the Harris detector to select fewer points.
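The rotation smoothing of section 5.1, which is the step re-run interactively, amounts to the Gaussian-weighted barycentre (26) followed by the SVD projection (27). A minimal NumPy sketch (the helper name and default σ are our own):

```python
import numpy as np

def smooth_rotations(Rs, sigma=3.0):
    """Gaussian-weighted barycentre of rotation matrices, projected back
    onto SO(3) via SVD. Rs: list of 3x3 rotation matrices, all expressed
    in a common (absolute) coordinate frame."""
    n = int(np.ceil(3 * sigma))
    w = np.exp(-0.5 * (np.arange(-n, n + 1) / sigma) ** 2)
    w /= w.sum()
    # Replicate the first/last rotation to avoid excessive correction
    # at the ends of the sequence.
    padded = [Rs[0]] * n + list(Rs) + [Rs[-1]] * n
    out = []
    for k in range(len(Rs)):
        R_tilde = sum(w[l] * padded[k + l] for l in range(2 * n + 1))
        # The weighted sum is generally not a rotation; project it back.
        U, _, Vt = np.linalg.svd(R_tilde)
        S = np.diag([1.0, 1.0, np.linalg.det(U) * np.linalg.det(Vt)])
        out.append(U @ S @ Vt)
    return out
```

Changing σ corresponds to the user-controlled amount of stabilisation; a larger kernel gives a smoother, more locked-down trajectory.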
The most time critical part is the stabilisation and the visualisation, because when the camera motion has been estimated, the user wants a fast response when changing the stabilisation settings. The stabilisation part is already fast, and since the visualisation is done on a GPU it is possible to play it back in real-time by downsampling the vertex grid to match the current hardware capability. 7 Synthetic Dataset In order to do a controlled evaluation of algorithms for RS compensation we have generated eight test sequences, available at (Ringaby, 2010), using the Autodesk Maya software package. Each sequence consists of 12 RS distorted frames of size 640 × 480, corresponding ground-truth global shutter (GS) frames, and visibility masks that indicate pixels in the ground-truth frames that can be reconstructed from the corresponding rolling-shutter frame. In order to suit all algorithms, the ground-truth frames and visibility masks come in three variants: for rectification to the time instant when the first, middle and last row of the RS frame were imaged. Each synthetic RS frame is created from a GS sequence with one frame for each RS image row. One row in each GS image is used, starting at the top row and sequentially down to the bottom row. In order to simulate an inter-frame delay, we also generate a number of GS frames that are not used to build any of the RS frames. The camera is however still moving during these frames. We have generated four kinds of synthetic sequences, using different camera motions in a static scene, see figure 8. The four sequence types are generated as follows: #1 In the first sequence type, the camera rotates around its centre in a spiral fashion, see figure 8 top left. Three different versions of this sequence exist to test the importance of modelling the inter-frame delay. The different inter-frame delays are Nb = 0, 20 and 40 blank rows (i.e. the number of unused GS frames). Fig. 8 The four categories of synthetic sequences. 
Left to right, top to bottom: #1 rotation only, #2 translation only, #3 full 3DOF rotation, and #4 3DOF rotation and translation.

#2 In the second sequence type, the camera makes a pure translation to the right and has an inter-frame delay of 40 blank rows, see figure 8 top right.

#3 In the third sequence type the camera makes an up/down rotating movement, with a superimposed rotation from left to right, see figure 8 bottom left. There is also a back-and-forth rotation with the viewing direction as axis.

#4 The fourth sequence type is the same as the third, except that a translation parallel to the image plane has been added, see figure 8 bottom right. There are three different versions of this type, all with different amounts of translation.

For each frame in the ground-truth sequences, we have created visibility masks that indicate pixels that can be reconstructed from the corresponding RS frame, see figure 9. These masks were rendered by inserting one light source for each image row into an otherwise dark scene. The light sources had a rectangular shape that illuminates exactly the part of the scene that was imaged by the RS camera when located at that particular place. To acquire the mask, a global shutter render is triggered at the desired location (e.g. corresponding to the first, middle or last row in the RS frame).

Fig. 9 Left to right: Rendered RS frame from sequence of type #2, with Nb = 40 (note that everything is slightly slanted to the right), corresponding global-shutter ground truth, and visibility mask with white for ground-truth pixels that were seen in the RS frame.

8 Experiments

8.1 Choice of Reference Row

It is desirable that as many pixels as possible can be reconstructed to a global shutter frame, and for this purpose, rectifying to the first row (as done by Chun et al. (2008) and Liang et al. (2008)), the middle row, or the last image row makes a slight difference. The dataset we have generated allows us to compare these three choices. In figure 10, we have plotted the fraction of visible pixels for all frames in one sequence from each of our four camera motion categories. As can be seen in this plot, reconstruction to the middle row gives a higher fraction of visible pixels for almost all types of motion. This result is also intuitively reasonable, as the middle row is closer in time to the other rows, and thus more likely to have a camera orientation close to the average.

Fig. 10 Comparison of visibility masks for sequences #1, #2, #3, and #4 with Nb = 40. Plots show the fraction of non-zero pixels of the total image size.

8.2 Rectification Accuracy

We have compared our methods to the global affine model (GA) (Liang et al., 2008), and the global shift model (GS) (Chun et al., 2008) on our synthetic sequences, see section 7.

8.2.1 Contrast Invariant Error Measure

When we introduced the RS evaluation dataset (Forssén and Ringaby, 2010) we made use of a thresholded Euclidean colour distance to compare ground-truth frames with the rectification output. This error measure has the disadvantage that it is more sensitive in high-contrast regions than in regions with low contrast. It is also overly sensitive to sub-pixel rectification errors. For these reasons, we now instead use a variance-normalised error measure:

e(I_rec) = Σ_{k=1}^{3} (μ_k − I_rec,k)² / (σ_k² + ε μ_k²) .  (30)

Here μ_k and σ_k are the means and standard deviations of each colour band in a small neighbourhood of the ground-truth image pixel (we use a 3×3 region), and ε is a small value that controls the amount of regularisation. We use ε = 2.5·10⁻³, which is the smallest value that suppresses high values in homogeneous regions such as the sky. The error measure (30) is invariant to a simultaneous scaling of the reconstructed image and the ground-truth image, and thus automatically compensates for contrast changes.

8.2.2 Statistical Interpretation of the Error

If we assume that the colour channels are Gaussian and independent, the error in (30) is chi-square distributed with three degrees of freedom, e ∼ χ²₃, and this allows us to define a threshold in terms of probabilities. E.g. a threshold t that accepts 75% of the probability mass is found from:

0.75 = ∫₀ᵗ p_{χ²₃}(e) de = ∫₀ᵗ √(e/(2π)) exp(−e/2) de .  (31)

This results in t = 4.11, which is the threshold that we use.

8.2.3 Sub-Pixel Shifts

An additional benefit of using (30) is that the sensitivity to errors due to sub-pixel shifts of the pixels is greatly reduced. We used bicubic resampling to test sub-pixel shifts in the range Δx, Δy ∈ [−0.5, 0.5] (step size 0.1) on sequence #2. For (30), we never saw more than 0.05% pixels rejected in a shifted image. The corresponding figure for the direct Euclidean distance is 2.5% (using a threshold of 0.3 as in (Forssén and Ringaby, 2010)).

8.2.4 Accuracy Measure

As accuracy measure we use the fraction of pixels where the error in (30) is below t. The fraction is only computed within the visibility mask. For clarity of presentation, we only present a subset of the results on our synthetic dataset. As a baseline, all plots contain the errors for uncorrected frames, represented as continuous vertical lines for the means and dashed lines for the standard deviations.

8.2.5 Results

As our reconstruction solves for several cameras in each frame interval, we get multiple solutions for each frame (except in the outermost frames).
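The error measure (30) and the threshold computation (31) can be sketched in numpy as follows. The function name, the edge padding at image borders, and the numerical integration grid are our own choices.

```python
import numpy as np

def vn_error(gt, rec, eps=2.5e-3):
    """Variance-normalised error of eq. (30).

    gt, rec: H x W x 3 float images. mu_k and sigma_k are computed over a
    3x3 neighbourhood of each ground-truth pixel; eps regularises flat regions.
    """
    err = np.zeros(gt.shape[:2])
    pad = np.pad(gt, ((1, 1), (1, 1), (0, 0)), mode='edge')
    for k in range(3):
        # 3x3 local mean and variance of colour band k
        win = np.lib.stride_tricks.sliding_window_view(pad[..., k], (3, 3))
        mu = win.mean(axis=(-2, -1))
        var = win.var(axis=(-2, -1))
        err += (mu - rec[..., k])**2 / (var + eps * mu**2)
    return err

# Threshold from eq. (31): the value t below which 75% of the
# chi-square(3) probability mass lies, by numerical integration
# of the density sqrt(e / 2pi) * exp(-e / 2).
x = np.linspace(1e-9, 30.0, 300001)
pdf = np.sqrt(x / (2 * np.pi)) * np.exp(-x / 2)
cdf = np.cumsum(pdf) * (x[1] - x[0])
t = x[np.searchsorted(cdf, 0.75)]
# t comes out near 4.11, the threshold used in the text
```

Pixels with err below t are counted as accepted, which is the accuracy measure of section 8.2.4.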
If the model accords with the real motion, the different solutions for the current frame are similar, and we have chosen to present all results for the first reconstruction. The size of the temporal window used in the optimisation has been studied. The longer the window, the less the rotation-only assumption will hold, and a smaller number of points can be tracked through the frames. In figure 11 the results for two of the sequences can be seen, where the number of frames varied between 2 and 4. In each row, the mean result over all available frames is represented by a diamond centre; left and right diamond edges indicate the standard deviation. The result for different numbers of knots can also be seen in figure 11. With fewer knots the optimisation is faster, but the assumption of constant motion between sparsely spaced knots is less likely to hold. With many knots, the optimisation time will not only increase (and the optimisation become more unstable); the probability of having enough good points within the corresponding image region also declines. On pure rotation scenes it is also interesting to compare the output rotations with the corresponding ground truth. In figure 12 an example from sequence type #3 with four knots spaced over four frames is shown, together with the ground-truth rotations. The rotations are represented with the three parameters described in section 3.1. A comparison between different numbers of knots and the ground truth is shown in figure 13, using the geodesic distance defined in (25). From figure 11 we see that 6 knots over 2 frames, 9 knots over 3 frames and 12 knots over 4 frames give good results. We have also seen good results on real data with these choices, and have chosen to do the remaining experiments with the 2 and 3 frame configurations. For the pure rotation sequences (type #1 and #3) we get almost perfect results, see figures 14 and 15. The GS and GA methods, however, do not improve on the unrectified frames.
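The comparison of estimated and ground-truth rotations above uses the geodesic distance of equation (25), which is not reproduced in this excerpt. Assuming it is the standard angular metric on SO(3), a minimal sketch:

```python
import numpy as np

def geodesic_distance(R1, R2):
    """Angle (radians) of the relative rotation R1^T R2.

    Assumes the metric in (25) is the standard geodesic distance on SO(3):
    d(R1, R2) = arccos((trace(R1^T R2) - 1) / 2).
    """
    c = (np.trace(R1.T @ R2) - 1.0) / 2.0
    return float(np.arccos(np.clip(c, -1.0, 1.0)))
```

The clip guards against floating-point values marginally outside [−1, 1], which would otherwise make arccos return NaN for near-identical rotations.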
Figure 16 shows the results from three sequences of type #4 with different amounts of translation. Our methods work best with little translation, but they are still better than the unrectified baseline, as well as the GS and GA methods. In the bottom plot, the amount of translation is large, and could not be achieved while holding a device in the hand. An even more extreme case is sequence type #2, where the camera motion is purely translational. Results for this can be seen in figure 17. This kind of motion only gives rolling shutter effects if the camera translation is fast, e.g. video captured from a car. To better cope with this case, a switching of models could be integrated.

The GA method performs better than the GS method on average. This is because the GS method finds the dominant plane in the scene (the house) and rectifies the frame based on this. The ground does not follow this model, and thus the rectification there differs substantially from the ground truth. Worth noting is that if the rolling shutter effects arise from a translation, objects at different depths from the camera will get different amounts of geometric distortion. For a complete solution one needs to compensate for this locally in the image.

Fig. 11 Top: sequence type #1. Bottom: sequence type #3. Two examples of rectification results with different numbers of knots and window sizes. Diamond centres are mean accuracy; left and right edges indicate the standard deviation. The continuous line is the unrectified mean and the dashed lines show the standard deviation.

Fig. 12 Sequence type #3 with four knots spaced over four frames. Continuous line: ground truth rotations. Dashed lines: result rotations. Vertical black lines define beginnings of frames and vertical magenta lines define ends of frames.

Fig. 13 Sequence type #3. Comparison between different numbers of knots and the ground truth rotations. Vertical black lines define beginnings of frames and vertical magenta lines define ends of frames.

Fig. 14 Rectification results for sequence type #1, Nb = 40.

Fig. 15 Rectification results for sequence type #3.

Fig. 16 Rectification results for sequences of type #4. Different amounts of translation, starting with the least translation at the top and the most translation at the bottom.

Fig. 17 Rectification results for sequence type #2.

8.3 Stabilisation of Rolling-Shutter Video
Due to occlusions, several frames may also have to be used in the reconstruction.

A fair evaluation of video stabilisation is very difficult to make, as the goal is to simultaneously reduce image-plane motions and to maintain a geometrically correct scene. We have performed a simple evaluation that only computes the amount of image-plane motion. In order to see the qualitative difference in preservation of scene geometry, we strongly encourage the reader to also have a look at the supplemental video material.

We evaluate image-plane motion by comparing all neighbouring frames in a sequence using the error measure in (30). Each frame thus gets a stabilisation score, which is the fraction of pixels that were accepted by (30) over the total number of pixels in the frame. An example of this accuracy computation is given in figure 18. The comparisons are run on a video clip from an iPhone 3GS, where the person holding the camera was walking forward, see (Online Resource 01, 2011). The walking motion creates noticeable rolling shutter wobble at the end of each footstep.

We compare our results with the following stabilisation algorithms:

#1 Deshaker v2.5 (Thalin, 2010). This is a popular and free rolling-shutter aware video stabiliser. We have set the rolling shutter amount in Deshaker to 92.52% (see appendix A), and consistently chosen the highest quality settings.

#2 iMovie'09 v8.0.6 (Apple Inc., 2010). This is a video stabiliser that is bundled with MacOS X. We have set the video to 4:3, 30 fps, and stabilisation with the default zoom amount 135%.

The rotation model is optimised over 2 frame windows, with M = 6 knots. The stabilised version using a Gaussian filter with σ = 64 can be seen in (Online Resource 03, 2011), and with extrapolation and 135% zoom in (Online Resource 04, 2011).

In figure 19, left, we see a comparison with stabilisation in the input grid. This plot shows results from resampling single frames only; borders with unseen pixels remain in these clips (see also figure 18 for an illustration). As no extrapolation is used here, this experiment evaluates stabilisation in isolation. Borders are of similar size in both algorithms, so no algorithm should benefit from border effects.

The second experiment, see figure 19, right, shows results with border extrapolation turned on, and all sequences are zoomed to 135%. This setting produces video with few noticeable border effects, but at the price of discarding a significant amount of the input video. The evaluation plots in figure 19 show that our algorithm is better than Deshaker 2.5 at stabilising image-plane motion. iMovie'09, which is not rolling-shutter aware, falls significantly behind. However, we emphasise that this comparison tests for image-plane motion only. It is also our impression that the stabilised video has more of a 3D feel to it after processing with the 3D rotation method than with Deshaker, see (Online Resource 04, 2011; Online Resource 06, 2011; Online Resource 07, 2011). The reason for this is probably that Deshaker uses an image-plane distortion model.

Fig. 18 Stabilisation accuracy on iPhone 3GS sequence. Left to right: Frame #12, frame #13, accepted pixels (in white).

Fig. 19 Stabilisation accuracy on iPhone 3GS sequence. Left: stabilisation on original frames. Right: with extrapolation, and zoom to 135%. Diamond centres are median accuracy; tops and bottoms indicate the 25% and 75% percentiles.

9 Video examples

In the supplemental material we show results from three videos captured with three different cell-phones, and our result on Liu et al. (2011b) supplemental video set 1, example 1.
In section 8.3 we used a sequence captured with the iPhone 3GS, where Online Resource 01 (2011) is the original video, Online Resource 02 (2011) is a 135% zoomed version, Online Resource 03 (2011) is rectification and stabilisation with our method, Online Resource 04 (2011) adds zoom and extrapolation, Online Resource 05 (2011) shows the Deshaker results, Online Resource 06 (2011) the Deshaker results with zoom and extrapolation, and Online Resource 07 (2011) the results from iMovie. The supplemental material also contains videos captured from an HTC Desire at 25.87 Hz, see figure 20 left, and a SonyEricsson Xperia X10 at 28.54 Hz, right. In the HTC Desire sequence the camera is shaken sideways while recording a person, see Online Resource 08 (2011) for the original video. Online Resource 09 (2011) is the result with our rectification and stabilisation method, while Online Resource 10 (2011) is stabilisation without rolling-shutter compensation. Online Resource 11 (2011) and Online Resource 12 (2011) are the results from Deshaker and iMovie, respectively. In the SonyEricsson Xperia X10 sequence the person holding the camera is walking while recording sideways, see Online Resource 13 (2011) for the original video. Online Resource 14 (2011) contains the result from our method, Online Resource 15 (2011) the result from Deshaker, and Online Resource 16 (2011) the result from iMovie. From these sequences it is our impression that our 3D rotation model is better at preserving the geometry of the scene. We can also see that Deshaker is significantly better than iMovie, which is not rolling-shutter aware. Especially on Online Resource 15 (2011) we observe good results for Deshaker; artifacts are mainly visible near the flag poles. Liu et al. (2011a) demonstrated their stabilisation algorithm on rolling shutter video. The result can be seen on the supplemental material website, video set 1, example 1 (Liu et al., 2011b).
Their algorithm is not rolling-shutter aware, and distortions are instead treated as noise. The authors report that their algorithm does not handle artifacts like shear introduced by a panning motion, but reduces the wobble from camera shake while stabilising the video (Liu et al., 2011a). Our algorithm needs a calibrated camera, but the parameters are unfortunately not available for this sequence. To obtain a useful K matrix, the image centre was used as the projection of the optical centre, and different values for the focal length and readout time were tested. As can be seen in Online Resource 17 (2011), the 3D structure of the scene is kept, whereas the output of Liu et al. (2011a) still has a noticeable amount of wobble. The walking up-and-down motion can still be seen in our result; this is an artifact of not knowing the correct camera parameters.

Fig. 20 Left: First frame from the HTC Desire input sequence. Right: First frame from the SonyEricsson Xperia X10 input sequence.

10 Concluding Remarks

In this article, we have demonstrated rolling-shutter rectification by modelling the camera motion, and shown this to be superior to techniques that model movements in the image plane only. We even saw that image-plane techniques occasionally perform worse than the uncorrected baseline. This is especially true for motions that they do not model, e.g. rotations for the global shift model (Chun et al., 2008). Our stabilisation method is also better than iMovie and Deshaker when we compare the image-plane motion. The main advantage of our stabilisation is that the visual appearance of the result videos, and the geometry of the scene, is much better. Our model also corrects for more types of camera motion than does mechanical image stabilisation (MIS).

In future work we plan to improve our approach by replacing the linear interpolation with a higher order spline defined on the rotation manifold, see e.g. Park and Ravani (1997). Another obvious improvement is to optimise parameters over full sequences. However, we wish to stress that our aim is currently to allow the algorithm to run on mobile platforms, which excludes optimisation over longer frame intervals than the 2-4 that we currently use. For the stabilisation we have used a constant Gaussian filter kernel. For a dynamic video with different kinds of motion, e.g. camera shake and panning, the result would benefit from adaptive filtering. In general, the quality of the reconstruction should benefit from more measurements. In MIS systems, camera rotations are measured by MEMS gyro sensors (Bernstein, 2003). Such sensors are now starting to appear in cell-phones, e.g. the recently introduced iPhone4. It would be interesting to see how such measurements could be combined with measurements from KLT tracking when rectifying and stabilising video.

A Readout Time Calibration

The setup for calibration of the readout time is shown in figure 21, top left. An LED is powered from a function generator set to produce a square pulse. The frequency f_o of the oscillation is measured by an oscilloscope. Due to the sequential readout on the RS chip, the acquired images will have horizontal stripes corresponding to the on and off periods of the LED, see figure 21, top right. If we measure the period T of the vertical image oscillation in pixels, the readout time can be obtained as:

t_r = N_r / (T f_o) ,  (32)

where N_r is the number of image rows, and f_o is the oscillator frequency, as estimated by the oscilloscope. As can be seen in figure 21, top right, the images obtained in our setup suffer from inhomogeneous illumination, which complicates estimation of T. In their paper, Geyer et al. (2005) suggest removing the lens, in order to get a homogeneous illumination of the sensor.
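Equation (32) can be exercised on synthetic stripe data. The sketch below simulates the column-averaged, DC-free row profiles of a flashing-LED sequence (all constants are made up for the demo), finds the stripe frequency from an averaged row-spectrum in the spirit of the spectrum (33) described below, and recovers the readout time. Only the integer-bin peak is used here, without the non-integer refinement of the full method.

```python
import numpy as np

Nr, Nf = 480, 40          # image rows and number of frames (the text uses Nf = 40)
fo = 100.0                # LED oscillator frequency (Hz), made up for the demo
tr_true = 0.030           # readout time (s) that the sketch should recover
T = Nr / (tr_true * fo)   # stripe period in pixels, by inverting eq. (32)

rng = np.random.default_rng(0)
y = np.arange(Nr)
# f(y, x): column-averaged, DC-free row profile of frame x, with a
# random stripe phase per frame (the LED is not synchronised to readout)
f = np.stack([np.sin(2 * np.pi * y / T + rng.uniform(0, 2 * np.pi))
              for _ in range(Nf)], axis=1)          # shape (Nr, Nf)

# Per-frame Fourier coefficients along y, magnitudes averaged over frames
F = np.abs(np.fft.fft(f, axis=0) / Nr).mean(axis=1)
u_star = int(np.argmax(F[1:Nr // 2]) + 1)           # strongest non-DC peak

# Eq. (32) with T = Nr / u_star simplifies to t_r = u_star / f_o
tr_est = u_star / fo

# Deshaker's "rolling shutter amount" convention: a = t_r * frame rate
a = tr_est * 30.0
```

For the synthetic t_r = 30 ms at 30 fps this gives a = 0.9; the iPhone4 readout times of about 32 ms reported in table 1 give a ≈ 0.97, matching the Deshaker value quoted in the text.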
Removing the lens is difficult to do on cell-phones, and thus we instead recommend collecting a sequence of images of the flashing LED, and then subtracting the average image from each of these. This removes most of the shading seen in the input camera image, see figure 21, middle left, for an example. We then proceed by averaging all columns in each image, and removing their DC component. An image formed by stacking such rows from a video is shown in figure 21, middle right. We compute Fourier coefficients for this image along the y-axis, and average their magnitudes to obtain a 1D Fourier spectrum, F(u), shown in figure 21, bottom left:

F(u) = (1/N_f) Σ_{x=1}^{N_f} | (1/N_r) Σ_{y=1}^{N_r} f(y, x) e^{−i2πuy/N_r} | .  (33)

Here N_f is the number of frames in the sequence. We have consistently used N_f = 40. We refine the strongest peak of (33) by also evaluating the function for non-integer values of u, see figure 21, bottom right. The oscillation period is found from the frequency maximum u*, as T = N_r/u*.

As the Geyer calibration is a bit awkward (it requires a function generator, an oscilloscope and an LED), we have reproduced the calibration values we obtained for a number of different cell-phone cameras in table 1. The "<30" value for frame rate in the table signifies a variable frame rate camera with an upper limit somewhere below 30. The oscillator frequency is only estimated with three significant digits by our oscilloscope, and thus we have averaged the obtained readout times for several different oscillator frequencies in order to improve the accuracy. Reported readout times marked with an "*" are one out of several used by the camera. We have seen that a change of focus or gain may provoke a change in readout time on those cameras. We have found that rectifications using the listed value look better than those from other readout times, and thus it appears that this value is used most of the time.

The Deshaker webpage (Thalin, 2010) reports rolling shutter amounts (a = t_r × f) for a number of different rolling shutter cameras. The only cell-phone currently included is the iPhone4, which is reported as a = 0.97 ± 0.02. Converted to a readout time range, this becomes t_r ∈ [31.67 ms, 33 ms], which is the range where we find our two iPhone4 units.

Table 1 Calibrated readout times for a selection of cell-phone cameras with rolling shutters. f (Hz): frame rate reported by manufacturer. t_r (ms): average readout time. N: number of estimates. σ: standard deviation of the t_r measurements. Readout times marked by an "*" are variable; see text for details.

Camera               | resolution | f      | t_r     | N | σ
iPhone 3GS           | 640 × 480  | 30     | 30.84   | 7 | 0.36
iPhone 4 (#1)        | 1280 × 720 | 30     | 31.98   | 6 | 0.25
iPhone 4 (#2)        | 1280 × 720 | 30     | 32.37   | 4 | 0.27
iPhone 4 (#1, front) | 640 × 480  | 30     | 29.99   | 5 | 0.16
iTouch 4             | 1280 × 720 | 30     | 30.07   | 5 | 0.25
iTouch 4 (front)     | 640 × 480  | 30     | 31.39   | 5 | 0.07
HTC Desire           | 640 × 480  | <30    | 57.84*  | 3 | 0.051
HTC Desire           | 1280 × 720 | <30    | 43.834* | 4 | 0.034
SE W890i             | 320 × 240  | 14.706 | 60.78   | 4 | 0.16
SE Xperia X10        | 640 × 480  | <30    | 28.386* | 4 | 0.024
SE Xperia X10        | 800 × 480  | <30    | 27.117* | 5 | 0.040

Fig. 21 Calibration of the readout time for a rolling-shutter camera. Top left: used equipment; an oscilloscope, a function generator and a flashing LED. Top right: one image from the sequence. Middle left: corresponding image after subtraction of the temporal average. Middle right: image of average columns for each frame in the sequence. Bottom left: 1D spectrum from (33). Bottom right: refinement of the peak location.

Acknowledgements We would like to thank SonyEricsson Research Centre in Lund for providing us with a SonyEricsson Xperia X10, and Anders Nilsson at Computer Engineering for lending us the equipment for readout time calibration.
This work is funded by the CENIIT organisation at Linköping Institute of Technology, and by the Swedish Research foundation.

References

Ait-Aider O, Berry F (2009) Structure and kinematics triangulation with a rolling shutter stereo rig. In: IEEE International Conference on Computer Vision
Ait-Aider O, Bartoli A, Andreff N (2007) Kinematics from lines in a single rolling shutter image. In: CVPR'07, Minneapolis, USA
Apple Inc (2010) iMovie'09 video stabilizer. http://www.apple.com/ilife/imovie/
Baker S, Scharstein D, Lewis JP, Roth S, Black MJ, Szeliski R (2007) A database and evaluation methodology for optical flow. In: IEEE International Conference on Computer Vision (ICCV07), Rio de Janeiro, Brazil
Baker S, Bennett E, Kang SB, Szeliski R (2010) Removing rolling shutter wobble. In: IEEE Conference on Computer Vision and Pattern Recognition, IEEE Computer Society, IEEE, San Francisco, USA
Bernstein J (2003) An overview of MEMS inertial sensing technology. Sensors Magazine 2003(1)
Buehler C, Bosse M, McMillan L (2001) Non-metric image-based rendering for video stabilization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'01), pp 609–614
Chang LW, Liang CK, Chen H (2005) Analysis and compensation of rolling shutter distortion for CMOS image sensor arrays. In: International Symposium on Communications (ISCOM05)
Cho WH, Kong KS (2007) Affine motion based CMOS distortion analysis and CMOS digital image stabilization. IEEE Transactions on Consumer Electronics 53(3):833–841
Cho WH, Kim DW, Hong KS (2007) CMOS digital image stabilization. IEEE Transactions on Consumer Electronics 53(3):979–986
Chun JB, Jung H, Kyung CM (2008) Suppressing rolling-shutter distortion of CMOS image sensors by motion vector detection. IEEE Transactions on Consumer Electronics 54(4):1479–1487
Forssén PE, Ringaby E (2010) Rectifying rolling shutter video from hand-held devices. In: IEEE Conference on Computer Vision and Pattern Recognition, IEEE Computer Society, IEEE, San Francisco, USA
Gamal AE, Eltoukhy H (2005) CMOS image sensors. IEEE Circuits and Devices Magazine
Geyer C, Meingast M, Sastry S (2005) Geometric models of rolling-shutter cameras. In: 6th OmniVis WS
Gramkow C (2001) On averaging rotations. International Journal of Computer Vision 42(1/2):7–16
Green R, Mahler H, Siau J (1983) Television picture stabilizing system. US Patent 4,403,256
Harris CG, Stephens M (1988) A combined corner and edge detector. In: 4th Alvey Vision Conference, pp 147–151
Hartley R, Zisserman A (2000) Multiple View Geometry in Computer Vision. Cambridge University Press
Liang CK, Chang LW, Chen H (2008) Analysis and compensation of rolling shutter effect. IEEE Transactions on Image Processing 17(8):1323–1330
Liu F, Gleicher M, Jin H, Agarwala A (2009) Content-preserving warps for 3D video stabilization. In: ACM Transactions on Graphics
Liu F, Gleicher M, Wang J, Jin H, Agarwala A (2011a) Subspace video stabilization. ACM Transactions on Graphics 30(1)
Liu F, Gleicher M, Wang J, Jin H, Agarwala A (2011b) Subspace video stabilization, supplemental video set 1. http://web.cecs.pdx.edu/∼fliu/project/subspacestabilization/
Lucas B, Kanade T (1981) An iterative image registration technique with an application to stereo vision. In: IJCAI81, pp 674–679
Morimoto C, Chellappa R (1997) Fast 3D stabilization and mosaic construction. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR97), San Juan, Puerto Rico, pp 660–665
Nicklin SP, Fisher RD, Middleton RH (2007) Rolling shutter image compensation. In: Robocup 2006 LNAI 4434, pp 402–409
Online Resource 01 (2011) iPhone 3GS input sequence, ESM 1.mpg
Online Resource 02 (2011) iPhone 3GS 135% zoomed input, ESM 2.mpg
Online Resource 03 (2011) iPhone 3GS rectification and stabilisation, ESM 3.mpg
Online Resource 04 (2011) iPhone 3GS rectification and stabilisation with 135% zoom and extrapolation, ESM 4.mpg
Online Resource 05 (2011) iPhone 3GS Deshaker, ESM 5.mpg
Online Resource 06 (2011) iPhone 3GS Deshaker with 135% zoom and extrapolation, ESM 6.mpg
Online Resource 07 (2011) iPhone 3GS iMovie with 135% zoom and extrapolation, ESM 7.mpg
Online Resource 08 (2011) HTC Desire input sequence, ESM 8.mpg
Online Resource 09 (2011) HTC Desire rectification and stabilisation, ESM 9.mpg
Online Resource 10 (2011) HTC Desire stabilisation only, ESM 10.mpg
Online Resource 11 (2011) HTC Desire Deshaker, ESM 11.mpg
Online Resource 12 (2011) HTC Desire iMovie with 135% zoom and extrapolation, ESM 12.mpg
Online Resource 13 (2011) SonyEricsson Xperia X10 input sequence, ESM 13.mpg
Online Resource 14 (2011) SonyEricsson Xperia X10 rectification and stabilisation, ESM 14.mpg
Online Resource 15 (2011) SonyEricsson Xperia X10 Deshaker, ESM 15.mpg
Online Resource 16 (2011) SonyEricsson Xperia X10 iMovie with 135% zoom and extrapolation, ESM 16.mpg
Online Resource 17 (2011) Rectification and stabilisation results on sequence: Liu et al. 2011 Supplemental Video Set 1, example 1, ESM 17.mpg
Park F, Ravani B (1997) Smooth invariant interpolation of rotations. ACM Transactions on Graphics 16(3):277–295
Ringaby E (2010) Rolling shutter dataset with ground truth. http://www.cvl.isy.liu.se/research/rs-dataset
Shi J, Tomasi C (1994) Good features to track. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR'94, Seattle
Shoemake K (1985) Animating rotation with quaternion curves. In: Int. Conf. on CGIT, pp 245–254
Thalin G (2010) Deshaker video stabilizer plugin v2.5 for VirtualDub. http://www.guthspot.se/video/deshaker.htm
Whyte O, Sivic J, Zisserman A, Ponce J (2010) Non-uniform deblurring for shaken images. In: IEEE Conference on Computer Vision and Pattern Recognition, IEEE Computer Society, IEEE, San Francisco, USA
Yao YS, Burlina P, Chellappa R, Wu TH (1995) Electronic image stabilization using multiple visual cues. In: International Conference on Image Processing (ICIP'95), Washington DC, USA, pp 191–194
Zhang Z (2000) A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(11):1330–1334
