www.ietdl.org
Published in IET Computer Vision
Received on 30th November 2012; Revised on 20th June 2013; Accepted on 20th June 2013
doi: 10.1049/iet-cvi.2012.0287
ISSN 1751-9632
© The Institution of Engineering and Technology 2014

Depth order estimation for video frames using motion occlusions

Guillem Palou, Philippe Salembier
Department of Signal Theory and Communications, Technical University of Catalonia (UPC), Barcelona, Spain
E-mail: guillem.palou@upc.edu

Abstract: This study proposes a system to estimate the depth order of regions belonging to a monocular image sequence. For each frame, the regions are ordered according to their relative depth using information from the previous and following frames. The algorithm estimates occlusions relying on a hierarchical region-based representation of the image by means of a binary tree. This representation is used to define the final depth order partition, which is obtained through an energy minimisation process. Finally, to achieve a global and consistent depth ordering, a depth order graph is constructed and used to eliminate contradictory local cues. The system is evaluated and compared with state-of-the-art figure/ground labelling systems, showing very good results.

1 Introduction

Depth perception in human vision relies on several depth cues. For close objects, humans accurately estimate depth by making use of both eyes and inferring disparity between the two views. However, when objects are distant or when only one viewpoint is available, it is still possible to partially estimate the scene structure through the so-called 'monocular depth cues'. In static images, T-junctions or convexity cues are classical depth cues. In video sequences, motion information can also be used to obtain depth information. For example, occlusion of moving objects, size change or motion parallax are used to structure the scene [1]. Nowadays, motivated by the film industry, many research works focus on depth map generation.
Most approaches make use of several viewpoints to compute disparity, as it offers a reliable cue for depth estimation [2]. However, disparity estimation assumes that two images captured at the same time instant are available and, in many situations, this assumption cannot be fulfilled. For example, most commercial cameras have only one photographic lens and only record monocular sequences. Moreover, one critical issue is the large amount of material which has already been acquired in the past as monocular sequences and which needs to be converted to some extent to a three-dimensional (3D) format. In such cases, depth information can only be inferred through monocular cues. The film industry is seriously tackling this problem. For example, Disney or Microsoft have designed supervised systems supporting the creation of depth maps for monocular sequences [3, 4]. These systems rely heavily on human interaction. However, there is a clear interest in defining unsupervised systems because of their reduced cost in time and money [5-7].

Depth order maps can be seen as an intermediate state between 2D images, where no depth information is defined, and full 3D maps. The depth order map specifies an image partition where regions are ordered by their relative depth. State-of-the-art depth ordering systems include [5, 7], in which a layered representation of a sequence is obtained by finding occlusions between pairs of regions. However, the final depth order is obtained by a simple aggregation of local cues with no global reasoning. As a result, the final map is not globally consistent. In [8], a global depth order is obtained through the estimation of 3D movements. The approach processes pixels individually and lacks the concept of regions. Therefore the resulting partitions involve many small regions and the decision process is not robust. Karsch et al.
[6] attempt to find a full depth map by matching parts of the input video to similar videos and then by propagating depth information to unmatched regions. This approach works well for known scenes but its generalisation to arbitrary scenes is very difficult. There is an attempt in [9, 10] to retrieve a full depth map from a monocular image sequence. However, these works involve important assumptions and restrictions about the scene structure which may not be fulfilled in many typical situations.

Other state-of-the-art systems do not try to create a depth partition but focus on the estimation of the depth order around contours. In this context, the contours may not be closed and therefore do not specify regions. For example, contours are detected in [11] by assuming that the scene is static and that the occlusions are caused by disparity. Interesting detection results are shown but, if relative depth is needed, another approach should be followed. Sundberg et al. [12] define a figure/ground (f/g) labelling on occlusion contours by computing the motion boundaries and assigning the closer (figure) side to the region that moves similarly to the contour. The main drawback of this scheme is that the relative depth is assigned based on a set of local characteristics of the contour and avoids global reasoning on the depth structure of the scene. However, the f/g labels are attractive because they offer a good way to compare systems by taking into account the number of correctly labelled contour pixels, and an objective evaluation methodology is defined in [12].

The system proposed here addresses the main problems of the state-of-the-art solutions dealing with depth order estimation in video sequences.
The three main challenges we address are: to allow moving objects to be present in the scene, to provide a complete depth order partition (in contrast to defining depth order only on occlusion boundaries) and to ensure that this partition is globally coherent. Our approach is to use motion occlusion to determine the depth order within a frame given its previous and next frames. We make no assumption of scene stillness. Therefore the main cue we can rely on is motion occlusion. When objects move relative to the camera, background areas may appear and disappear, providing a reliable cue to determine the depth order. Note that motion occlusion appears when the apparent motion of two overlapping objects/regions is different. This situation occurs either when:

† The real motion of the two objects is different (e.g. two cars on a road).
† The scene is static and the object depths are different (e.g. a building occluding the sky).

To exploit this idea, the system first computes the forward and backward optical flows ('optical flow estimation' block of Fig. 1). Then, a hierarchical region-based representation of the image is computed and stored in a binary partition tree, BPT ('tree construction' block). The goal of this representation is to support robust estimation and global reasoning about relative depth. The use of such a representation is essential in our approach. In this paper, we will use and compare two ways to construct this representation: one based on colour, shape and motion features (CSM) [13] and one based on the ultrametric contour map (UCM) [14]. The created BPT is used to retrieve two partitions using specific graph cut techniques called 'prunings'. The first partition allows us to fit parametric flow models to regions, finding reliable flow values at occlusion points ('parametric flow fitting pruning' block) and then obtaining occlusion relations. The second partition is obtained by exploiting these occlusion relations and defines regions which can be depth ordered ('depth ordering pruning' block). Since occlusion relations provide depth relations between pairs of regions, a final step is needed to ensure global consistency and to obtain a final depth order map.

Besides the algorithm definition, this work's contributions concern the formalisation of the energy minimisation originally presented in [15] as an efficient way to retrieve partitions from BPTs, and the study of motion occlusions as a reliable cue for depth ordering on video frames, showing that dynamic cues perform better than static ones [16].

The paper is organised as follows: Section 2 defines the optical flows used in the system. An overview of the hierarchical segmentation tools is given in Section 3, whereas Section 4 discusses the specific graph cut technique, called pruning, used to extract an optimal partition from the trees. The motion occlusion estimation is presented in Section 5. Finally, the definition of the partition involving the regions to be ordered and the global reasoning leading to a complete depth order map are detailed in Section 6. The evaluation of the proposed scheme is performed in Section 7. Finally, Section 8 concludes the paper.

2 Optical flows

To determine the depth order of a frame $I_t$, the previous and following frames $I_{t-1}$, $I_{t+1}$ are used. The forward flows $w_{t-1,t}$, $w_{t,t+1}$ and backward flows $w_{t,t-1}$, $w_{t+1,t}$ (see Fig. 2) are estimated using the technique presented in [17]. This is a classical motion estimation algorithm which provides good results with a reduced computational load. The optical flow $w_{t_a,t_b}$ maps each pixel of $I_{t_a}$ to one pixel of $I_{t_b}$. The flows $w_{t,t+1}$ and $w_{t,t-1}$ are used together with colour information to create the BPT (Section 3). The two remaining flows are also used to estimate the occlusions (Section 5). Let us now discuss the construction of the BPT.
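As a toy illustration of how a flow field maps pixels between frames, the following sketch builds a constant-translation forward flow and its backward counterpart and checks that, in non-occluded areas, the round trip returns to the starting pixel. This is not the estimator of [17]; the flow values and the helper `map_pixel` are invented for illustration.

```python
import numpy as np

# Toy flows on a 4x6 grid: w_fwd maps pixels of I_t to I_{t+1},
# w_bwd maps I_{t+1} back to I_t. Here everything translates by 1 px.
H, W = 4, 6
w_fwd = np.zeros((H, W, 2)); w_fwd[..., 0] = 1.0   # move 1 px right
w_bwd = np.zeros((H, W, 2)); w_bwd[..., 0] = -1.0  # move 1 px left

def map_pixel(p, flow):
    """Map pixel p = (x, y) with the given flow field (x = column index)."""
    x, y = p
    dx, dy = flow[y, x]
    return (int(x + dx), int(y + dy))

p = (2, 1)
q = map_pixel(p, w_fwd)           # position of p in I_{t+1}
assert map_pixel(q, w_bwd) == p   # round trip returns to p
```

In occluded areas this round-trip property breaks down, which is precisely what Section 5 exploits.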
3 BPT to represent hierarchical segmentations

The algorithm developed in this paper relies on a hierarchical region-based representation of the image using a binary tree structure. A classical way to build such a BPT is a bottom-up region merging technique. A region tree structure is attractive as it allows a more global and robust image interpretation compared with the original pixel-based representation. Moreover, the representation is multi-scale: small details as well as very large areas are described by the tree. Note that arbitrary trees could be used, but we restrict ourselves to the binary case for two reasons: (i) binary trees allow a fine control of the image under/over-segmentation and (ii) pruning algorithms are easier to define and to handle on binary trees than on arbitrary trees.

Fig. 1 Proposed system: three consecutive frames are used to estimate a depth order map. The system involves an optical flow estimation step and a tree construction. Then, two pruning (graph cut) strategies are applied to extract one partition providing a region-based representation of the optical flow and a second partition involving regions which can be depth ordered. Finally, a global reasoning is used to define a consistent depth order map.

Fig. 2 Top row: three consecutive frames $I_{t-1}$, $I_t$ (outlined) and $I_{t+1}$. Bottom row, from left to right: the $w_{t-1,t}$, $w_{t,t-1}$, $w_{t,t+1}$ and $w_{t+1,t}$ flows.

The BPT construction begins with an initial partition of the image and iteratively merges pairs of neighbouring regions until only one region is left. The merging order is defined by a region distance describing the similarity between two regions. In the case of static images, the distance is usually a combination of similarity measures relying on simple characteristics such as colour, area, shape or contour strength.
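The bottom-up merging can be sketched on a toy one-dimensional "image" whose regions are adjacent intervals. The distance here is a simple difference of region means, standing in for the CSM and UCM distances discussed below; the encoding and all names are illustrative, not the paper's implementation.

```python
# Minimal BPT built by iterative binary merging: at each step the pair of
# neighbouring regions with the smallest distance is merged into a new node.

def build_bpt(values):
    # leaves 0..N-1; each node stores its mean and its two children (if any)
    nodes = {i: {"mean": float(v), "children": None} for i, v in enumerate(values)}
    order = list(range(len(values)))   # active regions in spatial order
    next_id = len(values)
    while len(order) > 1:
        # toy distance: difference of region means between adjacent regions
        dists = [abs(nodes[order[k]]["mean"] - nodes[order[k + 1]]["mean"])
                 for k in range(len(order) - 1)]
        k = dists.index(min(dists))
        a, b = order[k], order[k + 1]
        nodes[next_id] = {"mean": (nodes[a]["mean"] + nodes[b]["mean"]) / 2,
                          "children": (a, b)}
        order[k:k + 2] = [next_id]     # the merged region replaces the pair
        next_id += 1
    return nodes, order[0]             # all nodes and the root id

nodes, root = build_bpt([10, 11, 50, 52])
assert len(nodes) == 7                 # N leaves -> 2N - 1 nodes in a BPT
# the two similar dark values merge first, before joining the bright ones
assert set(nodes[4]["children"]) == {0, 1}
```

The resulting tree has the structure described in the text: leaves are the initial regions and every internal node is the union of its two children.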
In the video case, using motion flows, it is also possible to differentiate between regions of similar colour which move in different directions. In any case, the resulting BPT is composed of nodes representing image regions and edges describing the inclusion relationship between regions, whereas the leaves represent regions belonging to the initial partition. Any oversegmentation can be used as the initial partition. In the following, we will assume that the initial partition is made of regions involving only one pixel.

For the purposes of this work, two possible trees have been considered: the BPT_CSM created using [18] and the BPT_UCM using the UCM of [14]. The only difference in the construction of the two BPTs is the region distance $d(R_i, R_j)$. The BPT_CSM uses a combination of colour, shape and motion information, whereas the BPT_UCM considers the mean strength of the common contour between $R_i$ and $R_j$. The formal expressions are

$d_{\mathrm{CSM}}(R_i, R_j) = d_a \left[ \alpha\, d_{cm} + (1 - \alpha)\, d_s \right]$ (1)

$d_{\mathrm{UCM}}(R_i, R_j) = \frac{1}{|\Gamma_{ij}|} \sum_{x \in \Gamma_{ij}} gPb(x)$ (2)

For the BPT_CSM distance, $d_s$ is the shape distance defined as in [19] and $d_a$ is a logarithmic weight of the area as defined in [16]. $d_{cm}$ is a distance measuring the region similarity in terms of colour and motion: essentially, each region is represented by a limited number of dominant colours and motion vectors, and the Earth Mover's distance is used to compare the descriptions of two regions. See [13] for more details. For the BPT_UCM distance, $\Gamma_{ij}$ is the common region contour and gPb is the contour detector of [14].

Once the BPT has been constructed, it can be used to retrieve many different partitions. The next section discusses this point.

4 Optimum tree pruning

Independent of the distance used to create the tree, the technique extracting a partition from it can be viewed as a 'pruning' [15, 19, 20].
The BPT is a particular graph where each node represents a region and the tree branches describe the region inclusion relationship. A partition can be naturally defined from a BPT by selecting the regions represented by the tree leaves. If this is done on the original tree, the initial partition where each region involves only one pixel is extracted. However, if we prune the tree, that is, if we cut branches at one location to reduce their length, a new tree, called a 'pruned BPT', is created. The leaves of the pruned BPT define a non-trivial partition. This pruning is a particular graph cut: if the tree root is the 'source' of the graph and the leaves are connected to a 'sink' node, the pruning cuts the tree into two connected components, one including the source and the other the sink. Note that, following this approach, the partitions observed during the merging sequence can obviously be obtained, but the interest of the pruning is that a much richer set of partitions can be extracted. Of course, the key point is to define an appropriate pruning rule. Here, an optimum pruning based on energy minimisation is proposed.

A partition P extracted by pruning can be represented by a 'partition vector' x of binary variables $x_i \in \{0, 1\}$ with i = 1, ..., N assigned to each BPT region $R_i$. If $x_i = 1$, $R_i$ belongs to the partition; otherwise $x_i = 0$. Only a reduced subset of vectors, called 'valid' vectors, actually represents a partition extracted by pruning. A vector x is valid if every BPT branch involves one and only one region with $x_i = 1$. A branch is a sequence of regions from a leaf to the root of the tree. For example, the tree of Fig. 3 involves four branches. Each branch l can be represented by a 'branch vector' $b_l = (b_{l1}, \ldots, b_{lN})^T$, where $b_{li} = 1$ if region $R_i$ is in the branch and $b_{li} = 0$ otherwise.
Fig. 3 Left: set of valid partition vectors representing prunings, and an invalid partition vector. Centre: BPT with light grey nodes indicating the cut described by $x_3$. Right: BPT with grey nodes representing the regions described by $x_I$, which does not define a pruning.

In the example of Fig. 3, the four branch vectors are $b_1 = (1, 0, 0, 0, 1, 0, 1)^T$, $b_2 = (0, 1, 0, 0, 1, 0, 1)^T$, $b_3 = (0, 0, 1, 0, 0, 1, 1)^T$ and $b_4 = (0, 0, 0, 1, 0, 1, 1)^T$. With this notation, a partition vector x is valid if, for every branch l, $b_l^T x = 1$. In Fig. 3, $x_I = (1, 1, 0, 0, 1, 0, 0)^T$ is not valid because $b_1^T x_I = 2$. The constraint can be globally expressed as a matrix product Ax. In the case of Fig. 3, the constraint is

$Ax = \begin{pmatrix} b_1^T \\ b_2^T \\ b_3^T \\ b_4^T \end{pmatrix} x = \begin{pmatrix} 1 & 0 & 0 & 0 & 1 & 0 & 1 \\ 0 & 1 & 0 & 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 1 & 0 & 1 & 1 \end{pmatrix} x = \mathbf{1}$ (3)

where $\mathbf{1}$ is a vector containing all ones. An efficient way to extract a partition from the BPT is to find the one that minimises an energy function of the type

$x^{*} = \arg\min_{x} E(x) = \arg\min_{x} \sum_{R_i \in \mathrm{BPT}} E_r(R_i)\, x_i$ (4)

s.t. $Ax = \mathbf{1}$, $x_i \in \{0, 1\}$ (5)

where $E_r(R_i)$ is a function which depends only on the internal characteristics of $R_i$. In that case, the optimum partition $x^{*}$ can be efficiently found by the dynamic programming Algorithm 1 (see Fig. 4). The algorithm benefits from the fact that the energy $E_r(R_i)$ does not depend on the regions $R_{j \neq i}$ and that the global energy is the sum of the energy values assessed on each region. Therefore locally optimum decisions lead to a global optimum. More precisely, if $R_i$ is a region with two child regions $R_l$ and $R_r$, the local decision to take is whether $R_i$ or $R_l \cup R_r$ has to belong to the partition, as both solutions cover the same image area. If $E_r(R_i)$ is smaller (larger) than $E_r(R_l) + E_r(R_r)$, the locally optimum solution selects $R_i$ ($R_l \cup R_r$). The complete tree is analysed in a bottom-up fashion (from the leaves to the root) to define the complete partition, as outlined in Algorithm 1 (see Fig. 4).
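The bottom-up rule can be sketched as follows: at each internal node, keep the node itself or the union of its children's optimal prunings, whichever has lower accumulated energy. The tree encoding and the energy values are ours, chosen only to exercise both branches of the decision; this is a sketch of the idea, not the paper's implementation.

```python
# Sketch of Algorithm 1: dynamic programming over a BPT. Each node is
# {"energy": E_r(R_i), "children": (left, right) or None for a leaf}.

def optimal_pruning(tree, node):
    """Return (regions, cost) of the optimal pruning of the subtree at node."""
    n = tree[node]
    if n["children"] is None:
        return [node], n["energy"]
    l, r = n["children"]
    regs_l, cost_l = optimal_pruning(tree, l)
    regs_r, cost_r = optimal_pruning(tree, r)
    # local decision: R_i against the union of its children's optima
    if n["energy"] <= cost_l + cost_r:
        return [node], n["energy"]
    return regs_l + regs_r, cost_l + cost_r

# 7-node BPT as in Fig. 3: leaves 0-3, internal 4 = (0,1), 5 = (2,3), root 6
tree = {0: {"energy": 1.0, "children": None},
        1: {"energy": 1.0, "children": None},
        2: {"energy": 5.0, "children": None},
        3: {"energy": 5.0, "children": None},
        4: {"energy": 3.0, "children": (0, 1)},   # worse than 1 + 1 = 2
        5: {"energy": 4.0, "children": (2, 3)},   # better than 5 + 5 = 10
        6: {"energy": 9.0, "children": (4, 5)}}   # worse than 2 + 4 = 6
regions, cost = optimal_pruning(tree, 6)
assert sorted(regions) == [0, 1, 5] and cost == 6.0
```

Note that the selected regions {R_0, R_1, R_5} hit every branch exactly once, i.e. the corresponding partition vector satisfies the constraint of (3)-(5).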
This algorithm will be used twice in the proposed system: once for the identification of occluded and disoccluded areas and once for the extraction of the regions to be depth ordered. The first issue is discussed in the next section.

5 Estimation of occlusion relations

5.1 Occluded and disoccluded areas

As discussed in the introduction, motion occlusion is used here as the basis of depth order estimation. Using three frames $I_{t-1}$, $I_t$, $I_{t+1}$, it is possible to detect pixels becoming invisible from $I_t$ to $I_{t+1}$, called 'occluded pixels', and pixels becoming invisible from $I_t$ to $I_{t-1}$, called 'disoccluded pixels'. Here, we describe the detection of occluded pixels, as the detection of disoccluded pixels can be performed similarly by working on the past frame $I_{t-1}$ instead of the next frame $I_{t+1}$.

When there is no occlusion, the optical flow $w_{t,t+1}$ locally creates a bijection between $I_t$ and $I_{t+1}$. However, in case of occlusion, two different pixels of $I_t$, $p_a$ and $p_b$, are projected onto the same location $p_m$ in $I_{t+1}$. This situation is illustrated in the left part of Fig. 5. Therefore an occlusion is detected if

$p_a + w_{t,t+1}(p_a) = p_b + w_{t,t+1}(p_b) = p_m$ with $p_a \neq p_b$ (6)

Fig. 4 Algorithm 1, optimal partition selection: OPTIMALSUBTREE(region $R_i$) contains the set of regions belonging to the subtree rooted at $R_i$ which have been selected to be part of the partition, together with the sum of their associated energies.

This equation states that either $p_a$ or $p_b$ is an occluded pixel. To decide which one is the actual occluded pixel, we rely on a comparison of patches centred around $p_a$, $p_b$ and $p_m$. Indeed, it is likely that the patch around the non-occluded pixel ($p_b$ in Fig. 5) is very similar to the patch centred around its projected point ($p_m$).
Fig. 5 Left: detection of occluded pixels (black area). Right: detection of occluding pixels (white area). In both cases, the image on the left (right) is $I_t$ ($I_{t+1}$).

Therefore the decision is based on the distance between patches

$D(p_x, p_m) = \sum_{d \in \Gamma} \left[ I_{t+1}(p_m + d) - I_t(p_x + d) \right]^2$ (7)

with $p_x = p_a$ or $p_b$, and where $\Gamma$ is a 5 × 5 square window. The pixel with the highest $D(p_x, p_m)$ is declared to be the occluded pixel. Following this strategy, all pixels belonging to the black area of Fig. 5 are defined as occluded pixels. An example on a real image can be seen in the central image of Fig. 6, where occluded pixels are shown.

Once the occluded pixels have been defined, we need to find the 'occluding pixels', that is, the pixels which will cover the occluded pixels in the next frame. Indeed, it is the relation between occluded and occluding pixels that provides a depth cue. However, as can be seen in Fig. 5, the optical flow associated with occluded pixels ($p_a$) is particularly unreliable. To deal with this issue, the previously created BPT is used to define an optical flow partition $P_f$ where a parametric motion model is assigned to each region, allowing us to obtain a reliable flow for occluded pixels. Of course, a similar detection has to be performed for disoccluded and disoccluding pixels.

5.2 Pruning for parametric flow fitting

To obtain a region-based modelling of the optical flow, a parametric projective model [21] is used. The flows $w^{R_i}_{t,q} = (u, v)$ with $q = t \pm 1$, associated with region $R_i$, can be expressed as a quadratic model in the x and y coordinates

$u(x, y) = a_1 + a_2 x + a_3 y + a_7 x^2 + a_8 xy$ (8)

$v(x, y) = a_4 + a_5 x + a_6 y + a_7 xy + a_8 y^2$ (9)

where $(x, y) \in R_i$. The $a_1, \ldots, a_8$ parameters are estimated with robust regression using iterative least squares [22]

$w^{R_i}_{t,q} = \arg\min_{\hat{w}_{t,q}} \sum_{p = (x,y) \in R_i} \Psi\!\left( \left\| \hat{w}_{t,q}(p) - w_{t,q}(p) \right\|^2 \right)$ (10)

with the robust penaliser $\Psi(z) = \sqrt{z^2 + \epsilon^2}$, $\epsilon \ll 1$. An example of flow fitting can be seen in the right part of Fig. 6. To limit the computational load, this flow fitting is applied only to the tree nodes that are close to the tree root. Typically, the nodes corresponding to the last thousand merging steps are kept and the remaining nodes corresponding to earlier merging steps are discarded.
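The robust fit of (10) can be sketched with iteratively reweighted least squares on synthetic data generated from the quadratic model (8)-(9). For brevity the sketch applies a scalar Charbonnier-type penaliser to the per-equation residual rather than to the squared vector norm, and all data (true parameters, grid, outlier) are invented; it illustrates the robustness idea, not the exact solver of [22].

```python
import numpy as np

def design(x, y):
    # rows of the linear system for (u, v): 8 parameters a1..a8 of (8)-(9)
    du = [1, x, y, 0, 0, 0, x * x, x * y]
    dv = [0, 0, 0, 1, x, y, x * y, y * y]
    return du, dv

a_true = np.array([0.5, 0.1, 0.0, -0.2, 0.0, 0.05, 0.0, 0.0])
A, b = [], []
for x in range(6):
    for y in range(6):
        du, dv = design(x, y)
        A += [du, dv]
        b += [np.dot(du, a_true), np.dot(dv, a_true)]
A, b = np.array(A), np.array(b)
b[0] += 50.0                               # one gross outlier, as at occlusions

a = np.linalg.lstsq(A, b, rcond=None)[0]   # initial plain least-squares fit
for _ in range(30):                        # IRLS with sqrt-penaliser weights
    res = A @ a - b
    w = 1.0 / np.sqrt(res ** 2 + 1e-6)     # small residuals count, outliers don't
    a = np.linalg.lstsq(A * w[:, None], b * w, rcond=None)[0]

assert np.allclose(a, a_true, atol=0.05)   # the outlier is suppressed
```

A plain least-squares fit would be dragged towards the outlier; the reweighting makes the estimated flow model track the consistent majority of the region, which is exactly why robust fitting is needed near occlusion points.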
Once the parametric flow is estimated, a partition $P_f$ representing the regions that best fit these models is computed using the optimal pruning Algorithm 1 (see Fig. 4) with the following energy $E_r(R_i)$

$E_r(R_i) = \sum_{q = t \pm 1} \sum_{(x,y) \in R_i} \left\| w_{t,q}(x, y) - w^{R_i}_{t,q}(x, y) \right\| + \lambda_f$ (11)

The constant $\lambda_f = 4 \times 10^3$ is used to prevent oversegmentation. It was found experimentally and proved not to be critical for the overall system performance.

5.3 Occlusion relation estimation

Once the partition $P_f$ has been defined and a parametric optical flow model is available for each region, the occluding pixels can be found by projecting the occluded pixels into $I_{t+1}$ with $w_{t,t+1}$ and by coming back to the current frame following the backward flow $w_{t+1,t}$. This is illustrated in the right part of Fig. 5, where occluding pixels appear in the white area. Hence, for each occluded pixel $p_u$, the corresponding occluding pixel $p_o$ is given by

$p_o = p_u + w^{R_i}_{t,t+1}(p_u) + w_{t+1,t}\!\left( p_u + w^{R_i}_{t,t+1}(p_u) \right)$ (12)

Fig. 6 Example of occlusion estimation for two regions. From left to right: original frame with the region contours in white; frame with occluded and occluding pixels; forward (top) and backward (bottom) estimated and modelled flows.

The central image of Fig. 6 also shows these occluding pixels. At this point, we know that the occluding pixels are in front of the occluded pixels and, similarly, that the disoccluding pixels are in front of the disoccluded pixels.
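Equation (12) simply composes the fitted forward flow with the backward flow: the occluded pixel is pushed into the next frame and the backward flow at the landing point leads back to the pixel that will cover it. A toy sketch with constant flows (the flow functions and coordinates are invented for illustration):

```python
def occluding_pixel(p_u, w_fwd, w_bwd):
    """p_o = p_u + w_fwd(p_u) + w_bwd(p_u + w_fwd(p_u)), as in (12)."""
    q = (p_u[0] + w_fwd(p_u)[0], p_u[1] + w_fwd(p_u)[1])  # project into I_{t+1}
    return (q[0] + w_bwd(q)[0], q[1] + w_bwd(q)[1])       # trace back to I_t

# the occluded background is static (fitted region flow 0), while the
# occluding object moved +2 in x, so the backward flow at the landing
# point is -2: the object pixel two columns to the left will cover p_u
w_fwd = lambda p: (0, 0)
w_bwd = lambda p: (-2, 0)
assert occluding_pixel((5, 3), w_fwd, w_bwd) == (3, 3)
```

The resulting pair (p_u, p_o) = ((5, 3), (3, 3)) is exactly the kind of occluded/occluding pixel pair that carries a depth cue: the region containing (3, 3) is in front of the region containing (5, 3).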
This information is used in the second BPT pruning described in the following section.

6 Depth order map definition

6.1 Depth ordering pruning

Equation (12) creates a set of pixel pairs $(p_u, p_o)$ for which depth information is available. If both pixels belong to the same region, the pair is discarded; but if they belong to two different regions, we can conclude that there is one piece of evidence that the two regions belong to different depth planes. In the context of regions described by the BPT, if we deal with regions that are close to the root, many $(p_u, p_o)$ pairs are discarded because the regions are very large. By contrast, if the regions are close to the leaves, many $(p_u, p_o)$ pairs are preserved.

To extract from the BPT a partition $P_d$ involving regions which can be depth ordered, an optimal pruning is used. Here, the energy to be optimised should be a compromise between the number of occlusion relations, that is, of $(p_u, p_o)$ pairs, that are kept and the simplicity of the partition in terms of number of regions. As a result, the pruning is performed with Algorithm 1 (see Fig. 4) with the following energy

$E_r(R_i) = \sum_{(p_u, p_o) \in R_i} \frac{1}{N_o} + \lambda_o$ (13)

where $N_o$ is the total number of estimated occlusion relations. To avoid oversegmented solutions, $\lambda_o = 4 \times 10^{-3}$ is used (see Section 7).

6.2 Final depth ordering

Once the final partition $P_d$ is obtained through BPT pruning, a global ordering can be computed. The problem could be viewed as a rank aggregation problem, as used for web ranking [23] or photosequencing [24]. Here, the goal is to achieve a fully ordered list from a set of partial orders by minimising a given cost function. Normally, rank aggregation works with fully ordered lists, where two elements cannot have the same order. Since, in an image, two different regions may be at the same depth (and thus have the same order), we state the problem as a network reliability problem [25]. A graph G = (V, E) is constructed where the vertices V represent the regions of $P_d$.
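The construction of G from the occlusion pairs can be sketched as follows: each pair votes "region of the occluding pixel precedes region of the occluded pixel", and each directed edge is weighted by its fraction of the total votes. Pixel identifiers, region labels and the helper names are invented for illustration.

```python
from collections import Counter

def depth_graph(pairs, region_of):
    """pairs: list of (p_u, p_o) pixel ids; region_of: pixel id -> region label.
    Returns {(a, b): p} meaning region a precedes region b with weight p."""
    votes = Counter()
    for p_u, p_o in pairs:
        a, b = region_of[p_o], region_of[p_u]
        if a != b:                      # same-region pairs carry no depth cue
            votes[(a, b)] += 1          # a occludes, hence is in front of, b
    n_o = sum(votes.values())
    return {edge: count / n_o for edge, count in votes.items()}

region_of = {1: "sky", 2: "sky", 3: "car", 4: "car", 5: "car"}
pairs = [(1, 3), (2, 4), (1, 5), (3, 4)]    # last pair is intra-region
G = depth_graph(pairs, region_of)
assert G == {("car", "sky"): 1.0}           # all votes: car in front of sky
```

On real frames the vote mass is split over many edges, possibly forming cycles; the cycle removal and topological sort described next turn this weighted graph into a global depth order.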
A directed edge $e_i = (a, b, p_i)$ is defined between node a and node b if there are occlusion relations between region $R_a$ and region $R_b$. The weight of $e_i$ is $p_i = N_{ab}/N_o$, where $N_{ab}$ is the number of pixels of $R_a$ which have been estimated as occluding pixels of $R_b$ and $N_o$ is the total number of occluding pixels. The graph G can be seen as a network of (un)reliable links, with the edge $e_i = (a, b, p_i)$ connecting a and b with probability $p_i$. In this context, a precedes b in depth (a is in front of b) with probability $p_i$. For two arbitrary nodes of G, the probability of precedence (PoP) can be computed even if there is no edge directly connecting them. If there exists more than one path from a node a to a node b, the probability of 'a preceding b', called $\rho_{ab}$, is the probability that at least one path between a and b is reliable. $\rho_{ab}$ can be computed by complete state enumeration and the inclusion-exclusion principle [25].

To define a globally consistent depth order between regions, G should be acyclic. To break cycles in G (if any), the algorithm iteratively eliminates the edge of minimum PoP. Once all cycles have been removed from G, a topological partial sort [26] is applied and each region is assigned a depth order. Regions which have no depth relation are assigned the depth of their most similar adjacent region according to the distance used in the BPT construction. The complete process is illustrated in Fig. 7 with a simple example.

7 Results

System evaluation is performed on keyframes of several classical sequences. In order to obtain an objective evaluation, we propose two classes of experiments:

Figure/ground assignment: We assess the figure/ground (f/g) label assignment on contours as discussed in [11, 12]. In our context, the assignment is performed as
Fig. 7 Depth ordering example. (a) Depth order partition; top: region numbers and contours, bottom: estimated occluded and occluding points. (b) Initial graph G with all occlusion relations. (c) Final graph where cycles have been removed; removed edges are dashed. (d) Depth order map (brighter regions are closer to the viewer).

follows: when two depth planes meet, the part of the contour belonging to the closest region is assigned the foreground label and the other side of the contour is assigned the background label, see Fig. 8. It is important to note that the proposed system defines the depth information on a region basis, whereas the f/g algorithm of [12] only labels contour points. These contours are not necessarily closed; therefore no regions are defined and these f/g algorithms do not allow the creation of a complete depth order map. Nevertheless, the existence of a ground truth f/g database makes the comparison with these systems attractive. The datasets used are the Carnegie Mellon dataset (CMU) [27] and the Berkeley dataset (BDS) [12]. We follow the same evaluation procedure as [12], which essentially measures the precision of f/g labels on matched contour pixels against a ground truth database containing depth order partitions.

Segmentation evaluation: Equation (13) establishes a region energy depending on a factor $\lambda_o$ which has a direct effect on the granularity of the extracted partition $P_d$. For large values of $\lambda_o$, only prominent occlusion relations are kept and thus only a few regions are conserved. On the contrary, for small values of $\lambda_o$, the generated $P_d$ also preserves low-confidence occlusion relations, generating a partition with more regions.
If $P_d$ only includes regions corresponding to highly confident occlusion relations, a high f/g precision rate is expected, at the expense of a low boundary recall (BR) on ground-truth segmentations. If $P_d$ is formed by regions corresponding to low-confidence occlusion relations, the BR is expected to improve, although the f/g precision can decrease. To this end, jointly with the f/g precision, we present the BR of the given algorithm.

Table 1 shows the performance of f/g assignment and BR on the CMU and BDS datasets for the BPT_CSM and BPT_UCM trees and compares it with the techniques proposed in [12, 16]. For [12], we only report the published f/g results, as it was impossible to reproduce the complete algorithm; therefore the BR for this technique is not available. For the remaining techniques, we have used $\lambda_o$ values providing a similar BR to make a fair evaluation of the f/g results.

Table 1 Percentage of correct f/g assignments and BR on the CMU and BDS datasets

                                            CMU [27]          BDS [12]
                                            f/g, %    BR      f/g, %    BR
  figure/ground from optical flow [12]      83.8      -       68.6      -
  still image depth ordering [16]           63.4      0.5     63.6      0.4
  depth ordering with BPT_CSM               88.0      0.48    80.9      0.37
  depth ordering with BPT_UCM               67.9      0.48    68.9      0.37

Of all the presented techniques, the BPT_CSM is the one with the best f/g assignment performance, outperforming [12] on both datasets. The one performing worst is [16], mainly because it does not use motion cues at all and its depth ordering is based only on monocular static cues (T-junctions and convexity). BPT_UCM performs worse than BPT_CSM: although UCMs have excellent performance in terms of defining distances between regions in static images, they do not involve motion features. The effect of introducing motion information in the BPT construction can be seen in Fig. 9 on images with various objects of very similar colours. In cases where the colour information is ambiguous, motion successfully helps to identify regions moving coherently.
As stated in [28], prominent contours are easy to detect and to assign the correct depth gradient, whereas ambiguous contours are much more difficult to deal with. Therefore it is expected that, as the BR increases, the f/g assignment loses performance. This can be seen in the table reported in Fig. 9. These results are obtained on the BDS dataset, which we found more challenging than the CMU dataset, by varying $\lambda_o$ for the BPT_CSM and BPT_UCM techniques. Depending on the application, one can set a high $\lambda_o$ and let the system behave like a foreground/background segregation system with high f/g performance. If more complex scenes have to be processed, a low $\lambda_o$ retrieves multiple regions, although the f/g assignment is less precise.

A subjective evaluation of the ability of the proposed system to create a depth order map can be seen in Figs. 8 and 10, showing that motion occlusions work over a variety of situations: static scenes, moving foregrounds, moving backgrounds or even multiple moving objects.

Fig. 8 Results on the CMU dataset and f/g assignment with the BPT_CSM system. From left to right (on both columns): (1) processed keyframe; (2) occlusion relations; (3) estimated depth partition, where white regions are closer and black regions are further; (4) f/g assignment on contours.

Fig. 9 Left: comparison of BPT_CSM and BPT_UCM for images with objects of similar colour; original frames are shown on the left, results of BPT_CSM in the centre column and results of BPT_UCM in the right column. The rightmost table shows the f/g assignment and the BR when varying the parameter $\lambda_o$.

Fig. 10 Results on a subset of the BDS dataset with the BPT_CSM system. For each column, the right image corresponds to the analysed frame with the f/g assignment overlaid on contours, and the left image corresponds to the final depth order partition.
8 Conclusions

In this work, a system inferring the relative depth order of the regions of a frame has been described. Combining a variational approach to optical flow estimation with a hierarchical region-based representation of the image, we have developed a reliable system to detect occlusion relations and to create depth order partitions using only motion occlusions. The system also allows us to deal with the classical foreground/background (f/g) contour labelling problem. In this context, comparison with the state-of-the-art shows that motion occlusions are very reliable cues. The presented approach, although using only motion information to detect boundaries, achieves better results on f/g assignment than the state-of-the-art technique [12].

Many extensions of the system are possible. First, a longer temporal window could be used to retrieve motion occlusions more precisely. Secondly, we could take advantage of other monocular depth cues, such as T-junctions and convexity, to help in the case of motionless depth relations. Although Table 1 shows that they are less reliable than motion occlusions, they could be useful when motion occlusions are not present (e.g. a static background), as in some cases in Fig. 10. We also believe that motion occlusions can be propagated throughout the sequence to infer a consistent depth order. Sequence depth ordering seems plausible because the results on individual frames are promising.

9 References

1 Ono, M.E., Rivest, J., Ono, H.: 'Depth perception as a function of motion parallax and absolute-distance information', J. Exp. Psychol., Hum. Percept. Perform., 1986, 12, pp. 331–337
2 Qian, N., Qian, D.N.: 'Binocular disparity and the perception of depth', Neuron, 1997, 18, pp.
359–368
3 Ward, B., Bing Kang, S., Bennett, E.P.: 'Depth director: a system for adding depth to movies', IEEE Comput. Graph. Appl., 2011, 31, (1), pp. 36–48
4 Wang, O., Lang, M., Frei, M., Hornung, A., Smolic, A., Gross, M.: 'StereoBrush: interactive 2D to 3D conversion using discontinuous warps'. Proc. Eighth Eurographics Symp. on Sketch-Based Interfaces and Modeling (SBIM'11), New York, NY, USA, 2011, pp. 47–54
5 Bergen, L., Meyer, F.: 'A novel approach to depth ordering in monocular image sequences'. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2000, vol. 2, pp. 536–541
6 Karsch, K., Liu, C., Kang, S.B.: 'Depth extraction from video using nonparametric sampling'. ECCV, 2012
7 Turetken, E., Alatan, A.A.: 'Temporally consistent layer depth ordering via pixel voting for pseudo 3D representation'. 3DTV Conf., 2009, pp. 1–4
8 Chang, J.-Y., Cheng, C.-C., Chien, S.-Y., Chen, L.-G.: 'Relative depth layer extraction for monoscopic video by use of multidimensional filter'. Proc. IEEE Int. Multimedia and Expo Conf., 2006, pp. 221–224
9 Li, P., Farin, D., Gunnewiek, R.K., de With, P.H.N.: 'On creating depth maps from monoscopic video using structure from motion'. Proc. 27th Symp. on Information Theory in the Benelux, 2006, pp. 508–515
10 Zhang, G., Jia, J., Wong, T.-T., Bao, H.: 'Consistent depth maps recovery from a video sequence', IEEE Trans. Pattern Anal. Mach. Intell., 2009, 31, (6), pp. 974–988
11 He, X., Yuille, A.: 'Occlusion boundary detection using pseudo-depth'. ECCV, 2010 (LNCS, 6314), pp. 539–552
12 Sundberg, P., Brox, T., Maire, M., Arbelaez, P., Malik, J.: 'Occlusion boundary detection and figure/ground assignment from optical flow'. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), Washington, DC, USA, 2011, pp. 2233–2240
13 Palou, G., Salembier, P.: 'Depth ordering on image sequences using motion occlusions'. Proc. 19th IEEE Int. Conf.
Image Processing, Florida, USA, September 2012, pp. 1217–1220
14 Arbeláez, P., Maire, M., Fowlkes, C., Malik, J.: 'Contour detection and hierarchical image segmentation', IEEE Trans. Pattern Anal. Mach. Intell., 2011, 33, (5), pp. 898–916
15 Salembier, P., Garrido, L.: 'Binary partition tree as an efficient representation for image processing, segmentation, and information retrieval', IEEE Trans. Image Process., 2000, 9, (4), pp. 561–576
16 Palou, G., Salembier, P.: 'Monocular depth ordering using T-junctions and convexity occlusion cues', IEEE Trans. Image Process., 2013, 22, (5), pp. 1926–1939
17 Brox, T., Bruhn, A., Papenberg, N., Weickert, J.: 'High accuracy optical flow estimation based on a theory for warping'. European Conf. Computer Vision, Prague, Czech Republic, May 2004, vol. 3024, pp. 25–36
18 Palou, G., Salembier, P.: '2.1 depth estimation of frames in image sequences using motion occlusions', in Fusiello, A., Murino, V., Cucchiara, R. (Eds.): 'ECCV Workshops' (Springer, 2012) (LNCS, 7585), pp. 516–525
19 Vilaplana, V., Marques, F., Salembier, P.: 'Binary partition trees for object detection', IEEE Trans. Image Process., 2008, 17, (11), pp. 2201–2216
20 Calderero, F., Marques, F.: 'Region merging techniques using information theory statistical measures', IEEE Trans. Image Process., 2010, 19, (6), pp. 1567–1586
21 Kanatani, K.: 'Transformation of optical flow by camera rotation', IEEE Trans. Pattern Anal. Mach. Intell., 1988, 10, (2), pp. 131–143
22 Andersen, R.: 'Modern methods for robust regression', Number 152 in Quantitative Applications in the Social Sciences (Sage Publications, 2008)
23 Dwork, C., Kumar, R., Naor, M., Sivakumar, D.: 'Rank aggregation methods for the web'. Proc. 10th Int. Conf. World Wide Web (WWW'01), New York, NY, USA, 2001, pp. 613–622
24 Basha, T., Moses, Y., Avidan, S.: 'Photo sequencing', in Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (Eds.): 'ECCV' (Springer, Berlin, Heidelberg, 2012) (LNCS, 7577), pp.
654–667
25 Terruggia, R.: 'Reliability analysis of probabilistic networks'. PhD thesis, Università degli Studi di Torino, 2010
26 Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: 'Introduction to algorithms' (MIT Press, 2001, 2nd edn.)
27 Stein, A.N., Hebert, M.: 'Occlusion boundaries from motion: low-level detection and mid-level reasoning', IJCV, 2009, 82, (3), pp. 325–357
28 Maire, M.R.: 'Contour detection and image segmentation'. PhD thesis, University of California, Berkeley, 2009

