INAUGURAL DISSERTATION
submitted for the degree of Doctor of Natural Sciences (Doktorwürde)
to the Combined Faculty of Natural Sciences and Mathematics
of Ruprecht-Karls-Universität Heidelberg

presented by
Dipl.-Phys. Sven Wanner
from Sindelfingen

Date of oral examination: 06.02.2014
Orientation Analysis in 4D Light Fields
Referees:
Prof. Dr. Bernd Jähne
PD Dr. Christoph Garbe
Zusammenfassung
This thesis deals with the analysis of 4D light fields. In this context, a light field denotes a series of digital 2D images of a scene captured on a planar, regular grid of camera positions. It is essential that the scene is captured from many camera positions with constant spacing between them. In this way, the light rays emitted by a point in the scene are sampled as a function of the camera position. This yields the four-dimensionality of the data mentioned above since, in contrast to a classical image, directional information of the light intensity is recorded in addition to the spatial information.
Light fields are a relatively new research area for image processing; their modern origin lies rather in computer graphics. There they were used to avoid the laborious modeling of 3D geometry and, by interpolating viewpoints, to achieve an interactive 3D impression even without information about the geometry. The present work has the opposite intention and aims to use captured light fields to reconstruct the geometry of the scene. The reason is that light fields offer a much richer information content than existing 3D reconstruction methods. Due to the regular sampling of the light field, material properties are imaged in addition to information about the geometry. Surfaces whose visual appearance does not remain constant under a change of viewing angle cause severe problems for known passive reconstruction methods. In light fields, however, the behavior of such surfaces under viewpoint changes is sampled and thus becomes directly analyzable.
The scientific contribution of this thesis consists of several parts. A new method is presented that generates a 4D light field representation from the raw data of a light field camera (plenoptic camera 2.0) without an explicit pixel-wise pre-computation of depth information. This special representation, also called a Lumigraph, gives access to 2D subspaces of this data structure called epipolar plane images. A method is presented that, by analyzing these epipolar plane images, enables a robust depth estimation under the assumption of Lambertian surfaces. Building on this, an extension of this method to more complicated materials, for example specular or partially transparent surfaces, is developed. As application examples for the depth information inherently available in light fields, well-known techniques such as super-resolution and object segmentation are extended to light fields and compared with results on single images. Furthermore, in the course of this work a large benchmark database consisting of simulated and real light fields has been created, with whose help the methods presented here are evaluated and which is intended to serve as a basis of comparison for future research in this field.
Summary
This work is about the analysis of 4D light fields. In the context of this work a light
field is a series of 2D digital images of a scene captured on a planar regular grid of
camera positions. It is essential that the scene is captured over several camera positions
having constant distances to each other. This results in a sampling of light rays emitted
by a single scene point as a function of the camera position. In contrast to traditional
images, which measure the light intensity only in the spatial domain, this approach additionally captures directional information, leading to the four-dimensionality mentioned above.
For image processing, light fields are a relatively new research area. In computer graphics,
they were used to avoid the work-intensive modeling of 3D geometry by instead using
view interpolation to achieve interactive 3D experiences without explicit geometry. The
intention of this work is vice versa, namely using light fields to reconstruct geometry of
a captured scene. The reason is that light fields provide much richer information content
compared to existing approaches of 3D reconstruction. Due to the regular and dense
sampling of the scene, material properties are imaged in addition to geometry. Surfaces whose visual appearance changes with the viewing direction cause problems for known approaches of passive 3D reconstruction. Light fields instead sample this change in appearance and thus make it accessible to analysis.
This thesis covers different contributions. We propose a new approach to convert raw
data from a light field camera (plenoptic camera 2.0) to a 4D representation without
a pre-computation of pixel-wise depth. This special representation – also called the
Lumigraph – enables an access to epipolar planes which are sub-spaces of the 4D data
structure. We propose an approach that analyzes these epipolar plane images to achieve a robust depth estimation for Lambertian surfaces. Based on this, we present an extension that also handles reflective and transparent surfaces. As examples of the usefulness of this
inherently available depth information we show improvements to well known techniques
like super-resolution and object segmentation when extending them to light fields.
Additionally, a benchmark database was established during the research for this thesis. We test the proposed approaches on this database and hope that it helps to drive future research in this field.
Acknowledgements
I would like to thank Prof. Bernd Jähne for supervising me during this thesis and even
more for trusting in me when offering this PhD position. In a PhD project you can
often only be as good as the people teaching, supporting and guiding you.
I would also like to thank Dr. Janis Fehr, who was my postdoc during the important first months of the project and who helped me a lot in finding my way into the topic. Unfortunately
he left the institute about half a year after I joined the project due to a lucky event in
his life, the birth of his first child.
A few months later, the position of Janis Fehr was filled by Dr. Bastian Goldlücke. During the rest of my thesis I had a very fruitful working relationship
with Dr. Goldlücke. Our strengths and interests fit perfectly together resulting in
eight publications in two years and I would like to thank him for this nice, knowledge
expanding and interesting time and wish him the best for his upcoming professorship.
I thank PD Dr. Christoph Garbe for agreeing to be my second referee.
I would also like to thank the collaborators from Robert Bosch GmbH, especially Dr.
Ralf Zink for supporting me and the fruitful discussions we had.
Many thanks to Dr. Bastian Goldlücke, Dr. Harlyn Baker, Christoph Straehle and Dr. Michel Janus for their reviews and the feedback they gave.
My special thanks go to the Heidelberg Collaboratory for Image Processing in general, and to all the amazing colleagues there who made it more than just a workplace. The wonderful secretaries Barbara Werner, Karin Kruljac and Evelyn Wilhelm always lent a helping hand when needed. I really enjoyed the time at this institute and will definitely miss it in the future.
Contents
1 Introduction
  1.1 Motivation
  1.2 Outline
  1.3 Contribution

2 Light Fields
  2.1 The Plenoptic Function
  2.2 The Lumigraph Parametrization
  2.3 Epipolar Plane Images
  2.4 Acquisition Of Light Fields
    2.4.1 The Plenoptic Camera (1.0)
      2.4.1.1 Optical Design
      2.4.1.2 Rendering Views From Raw Data
    2.4.2 The Focused Plenoptic Camera (2.0)
      2.4.2.1 Optical Design
      2.4.2.2 Rendering Views From Raw Data
      2.4.2.3 Refocusing
      2.4.2.4 Generating All-In-Focus Views
    2.4.3 Gantry
    2.4.4 Camera Arrays
    2.4.5 Simulation

3 Lumigraph Representation from Plenoptic Camera Images
  3.1 Rendering All In Focus Views Without Pixel-wise Depth
  3.2 Merging Views from Different Lens Types
  3.3 The EPI Generation Pipeline
  3.4 Results

4 Data Sets - The 4D Light Field Archive
  4.1 The Light Field Archive
    4.1.1 The Main File
    4.1.2 Blender Category
    4.1.3 Segmentation Ground Truth
    4.1.4 Gantry category
  4.2 Generation of the light fields
    4.2.1 Blender category
    4.2.2 Gantry category

5 Orientation Analysis in Light Fields
  5.1 Single Orientation Analysis
    5.1.1 The Structure Tensor
    5.1.2 Disparities On Epipolar Plane Images
      5.1.2.1 Local Disparity Estimation
      5.1.2.2 Limits of the Local Orientation Estimation
      5.1.2.3 Consistent Disparity Labeling
    5.1.3 Disparities On Individual Views
      5.1.3.1 Fast Denoising Scheme
      5.1.3.2 Global Optimization Scheme
    5.1.4 Performance Analysis for Interactive Labeling
    5.1.5 Comparison to Multi-View Stereo
    5.1.6 Experiments and Discussion
  5.2 Double Orientation Analysis
    5.2.1 EPI Structure for Lambertian Surfaces
    5.2.2 EPI Structure for Planar Reflectors
    5.2.3 Analysis of Multiorientation Patterns
    5.2.4 Merging into Single Disparity Maps
    5.2.5 Results
      5.2.5.1 Synthetic Data Sets
      5.2.5.2 Real-World Data Sets

6 Inverse Problems on Ray Space
  6.1 Spatial and Viewpoint Superresolution
    6.1.1 Image Formation and Model Energy
    6.1.2 Functional Derivative
    6.1.3 Specialization to 4D Light Fields
    6.1.4 View Synthesis in the Light Field Plane
    6.1.5 Results
  6.2 Rayspace Segmentation
    6.2.1 Regularization on Ray Space
    6.2.2 Optimal Label Assignment on Ray Space
    6.2.3 Local Class Probabilities
    6.2.4 Experiments

7 Conclusion

8 Outlook

Bibliography
The body of the air is full of an infinite number of radiant pyramids caused by the objects
located in it. These pyramids intersect and interweave without interfering with each
other during the independent passage throughout the air in which they are infused.
Leonardo da Vinci (1452-1519)
I say that if the front of a building –or any open piazza or field– which is illuminated by
the sun has a dwelling opposite to it, and if, in the front which does not face that sun,
you make a small round hole, all the illuminated objects will project their images through
that hole and be visible inside the dwelling on the opposite wall which may be made white;
and there, in fact, they will be upside down, and if you make similar openings in several
places in the same wall you will have the same result from each. Hence the images
of the illuminated objects are all everywhere on this wall and all in each minutest part of it.
Leonardo da Vinci (1452-1519)
1 Introduction

1.1 Motivation
Depth imaging has been a highly active research area for decades. Considering the vast
number of application areas, this is not very surprising. These range from industrial
inspection to robotics, from automotive to surveillance – to name only a few that are
long established. However, in recent years, new areas of interest have emerged to drive the developments in that field. Recent advances in the mobile and gaming industry offer more and more depth range data, and the upcoming era of 3D printing and rapid prototyping is currently opening a new field of interest in 3D reconstruction. This great demand for depth imaging has resulted in a wealth of techniques and devices. One of the first established techniques is so-called stereo imaging, or triangulation. Inspired by the visual
system of mammals, two cameras can be placed next to each other looking in the same
direction. The resulting images can be used to determine the shift between objects in
the corresponding images which is related to the distance of the object to the image
planes. Stereo imaging is one of the most well developed approaches considering the
number of existing setups and algorithms. The reason for this success is the simplicity
of the system and that the algorithms are relatively straightforward, at least for the
basic approaches.
Figure 1: Stereo camera setup. An object at P = (X, Y, Z) is mapped onto two camera sensors: in the left camera at x_l and in the right camera at x_r. The camera sensor centers have a distance b from each other, also called the baseline. For objects far away from the camera lens, the parameter f is equivalent to the focal length.
As depicted in figure 1, a stereo setup consists of two cameras at a distance b. A point
P is then projected onto different pixel positions on the image plane. The difference of
the relative projections of P is x_r − x_l, known as the parallax or disparity d. The disparity is inversely proportional to the distance Z of the object (see equation 1) [45].
\[
d = x_r - x_l = f \left( \frac{X + \frac{b}{2}}{Z} - \frac{X - \frac{b}{2}}{Z} \right) = \frac{b f}{Z} \tag{1}
\]
The statistical uncertainty can be derived using Gaussian error propagation:
\[
\Delta Z = \frac{Z^2}{b f} \, \Delta d \tag{2}
\]
In equation 2 one can see that the uncertainty ΔZ of the measured depth increases with the square of the distance Z [45].
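To make these two relations concrete, the following short Python sketch evaluates equations 1 and 2 for a hypothetical stereo rig; the baseline, focal length and disparity uncertainty are illustrative assumptions, not the parameters of any particular setup.

b = 0.10          # baseline in meters (assumed)
f = 1200.0        # focal length in pixels (assumed)
delta_d = 0.25    # disparity uncertainty in pixels (assumed)

for d in (60.0, 30.0, 15.0):           # example disparities in pixels
    Z = b * f / d                      # equation (1): depth from disparity
    dZ = Z ** 2 / (b * f) * delta_d    # equation (2): propagated depth uncertainty
    print(f"d = {d:5.1f} px  ->  Z = {Z:4.1f} m,  dZ = {dZ:5.3f} m")

Halving the disparity doubles the estimated depth, while the depth uncertainty grows with the square of the depth, as stated above.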
Algorithmically – due to the vast number of methods handling the stereo problem – we
will only discuss a very basic approach as an example. A well-known method consists of three computation steps [52, 83]:
1. Compute the matching costs of the intensities I_l, I_r at disparity d using, for example, one of the following cost functions: the Sum of Absolute Differences (SAD), the Sum of Squared Differences (SSD), or the Normalized Cross Correlation (NCC).
\[
\mathrm{SAD}(x, y, d) = \sum_{x, y \in W} \left| I_l(x, y) - I_r(x, y - d) \right|
\]
\[
\mathrm{SSD}(x, y, d) = \sum_{x, y \in W} \left( I_l(x, y) - I_r(x, y - d) \right)^2 \tag{3}
\]
\[
\mathrm{NCC}(x, y, d) = \frac{\sum_{x, y \in W} I_l(x, y) \cdot I_r(x, y - d)}{\sqrt{\left( \sum_{x, y \in W} I_l^2(x, y) \right) \cdot \left( \sum_{x, y \in W} I_r^2(x, y - d) \right)}}
\]
2. For each disparity hypothesis, sum the matching costs over a square window W.
3. Select the optimal disparity as the minimal aggregated cost (AGC) at each pixel:
\[
d_{\mathrm{opt}}(x, y) = \operatorname*{arg\,min}_{d} \mathrm{AGC}(x, y, d), \quad \text{where AGC is SAD, SSD or NCC} \tag{4}
\]
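To illustrate these three steps, the following Python/NumPy sketch implements the SAD variant of this basic block-matching scheme for a rectified gray value image pair; the window size, disparity range and the direction of the disparity shift are assumptions made for the example, not taken from a specific implementation.

import numpy as np
from scipy.ndimage import uniform_filter

def block_matching_sad(I_l, I_r, max_disparity=32, window=5):
    # Basic winner-takes-all SAD block matching (equations 3 and 4).
    h, w = I_l.shape
    costs = np.empty((h, w, max_disparity), dtype=np.float32)
    for d in range(max_disparity):
        # Shift the right image by d pixels along the scan line direction;
        # border columns without a correspondence are simply replicated.
        shifted = np.empty_like(I_r)
        shifted[:, d:] = I_r[:, :w - d]
        shifted[:, :d] = I_r[:, :1]
        # Steps 1 and 2: per-pixel absolute difference, averaged over a square window
        # (averaging instead of summing only rescales the cost, the minimum is unchanged).
        costs[:, :, d] = uniform_filter(np.abs(I_l - shifted).astype(np.float32), size=window)
    # Step 3 / equation 4: pick the disparity with the minimal aggregated cost per pixel.
    return np.argmin(costs, axis=2)

Applied to two rectified float images, block_matching_sad(left, right) returns an integer disparity map that can be converted to depth via equation 1.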
From equation 3, it is quite obvious that to match correspondences, the presence of
texture variance is obligatory. If no high frequency textures are available, other triangulation techniques have been developed which use active light sources to replace
the missing texture. Those can be implemented as a series of stripe patterns matching
the pattern deformation or as a projection of random patterns onto the objects, to
give only two examples. Disadvantages of the active methods are that they do not
work under arbitrary lighting conditions or on reflective materials. However, the main
problem of all triangulation based algorithms is the fact that the underlying principle is a
correspondence search. This means that the same features or regions in all corresponding
images need to be found to determine the relative shift between them. However, the
basic prerequisite for this is that the appearance or the color of those regions stays the
same from both viewpoints. This so-called Lambertian assumption, namely that the
observed color of a 3D point is independent of the point of view, is the main problem of
correspondence search because most materials do not behave like this.
Another technique to measure the depth of a scene is Time-of-Flight (ToF) imaging.
ToF is an active range estimation method based on measuring the time a light pulse
requires to travel from the emitter to the object and back to the camera. A quite old
and famous 1D realization is the LIDAR (Light Detection And Ranging) [87], often
used in the field of self-driving cars.
The important equation for a ToF system is
\[
\tau = \frac{2z}{c}, \tag{5}
\]
where τ is the travel time, z the distance of an object to the camera and c the speed of
light. The cameras consist of four main components, an illumination unit, an optic, a
sensor and a complex electronic read out unit. The illumination unit often consists of
LEDs or laser diodes emitting in the near infrared spectrum. There are two possible
operation modes: either the LEDs emit light pulses, or they emit a continuously modulated wave. In the second case, a phase shift is measured instead of the travel time.
Figure 2: A simplified sketch of a Time-of-Flight camera. A signal is emitted by the illumination unit, reflected by an object and measured on a sensor element that can resolve the arrival time of the incoming intensities. The signal is then processed by the read-out unit to estimate the distance of the reflected signals measured at each pixel.
Time-of-Flight cameras can measure distances between a few centimeters and a few dozen meters. For periodically modulated light sources, there is a natural limitation of the maximum measurable range due to the ambiguity of periodic signals with a phase shift of
\[
\Delta\phi = 2\pi k, \qquad k \in \mathbb{N}. \tag{6}
\]
The maximum range that can be measured then depends on the modulation frequency ν of the light source:
\[
d_{\mathrm{max}} = \frac{c}{2\nu}. \tag{7}
\]
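To get a sense of scale for equation 7 (using an illustrative modulation frequency, not the specification of any particular camera), a modulation frequency of ν = 20 MHz yields
\[
d_{\mathrm{max}} = \frac{c}{2\nu} = \frac{3 \times 10^{8}\,\mathrm{m/s}}{2 \cdot 20 \times 10^{6}\,\mathrm{Hz}} = 7.5\,\mathrm{m},
\]
and doubling the modulation frequency halves the unambiguous range accordingly.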
These ranges can be extended, e.g. by using combined measurements of multiple modulation frequencies [36]. ToF cameras using pulsed illumination have similar problems. The depth range limitation is then not caused by an ambiguity of the signal as in equation 6, but by the integration time necessary to wait for the returning light pulses.
The sensors in ToF cameras are much more complicated than in a standard digital
camera. Every pixel must be able to measure the light travel time separately. Thus
the pixels are huge (around 100µm) compared to the pixels of a CCD sensor which are
around 10µm. This leads to one of the main disadvantages of this type of range camera,
quite poor resolution. Currently they achieve sensor resolutions of around 200 × 200
pixels. Another disadvantage is that the distance measurement only works on materials
able to reflect the light frequency of the illumination unit. Also multiple reflections in
the scene as well as mutual interferences between different ToFs are known problems.
The accuracy of the depth measurement theoretically does not depend on the distance
of the objects, but in practice it does. Due to the fact that the light intensity I drops off with 1/z^2, the signal-to-noise ratio decreases for objects with increasing distance to the sensor, which of course affects the accuracy.
A third important technique to estimate depth ranges is the so called Interferometry.
This is a method based on measuring the interference of a reference beam and the beam
reflected by an object. The principle is sketched in figure 3. A source emits light which
first goes through an aperture and a collimator lens to create a planar and coherent
wavefront. A beam splitter separates the wavefront into a measuring and a reference
beam. The reference beam is reflected by a mirror back to the beam splitter where
it is reunited with the measuring beam backscattered by the object surface. If both
path lengths from the reference and the measuring beam are the same, by constructive
interference, the reunited beam causes a maximum intensity signal in the CCD camera
measuring it. By moving the reference or the object arm, a scanning of the surface
can be achieved. The accuracy of measurement is in the range of the wavelength used,
which means on the scale of nanometers. The price for this precision is that much effort is
necessary to stabilize the system. Mechanical and thermal disturbances are critical. It
is also very hard to apply this technology to measure bigger objects, thus interferometry
is widely used in scientific and industrial environments to measure small objects very
precisely, but it is not a very flexible range estimation technique.
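As a brief reminder of the underlying relation (a standard textbook result, not a formula taken from this thesis), for two coherent beams of equal amplitude the recombined intensity depends on the optical path difference ΔL between the reference and the measuring arm as
\[
I \propto 1 + \cos\!\left(\frac{2\pi\,\Delta L}{\lambda}\right),
\]
so intensity maxima occur whenever ΔL is an integer multiple of the wavelength λ. This is why changes of the object surface elevation can be resolved on the scale of the wavelength.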
Aside from the depth imaging techniques discussed up to now, light field photography is
developing as a new technology for high quality passive range estimation. A light field
is a dense and regular sampling of a scene. This enables a measurement of spatial and
direction dependent intensities instead of only spatially measured intensities. In fact,
this is a quite old idea which goes back to the early 20th century, and Gabriel Lippmann
Figure 3: Sketch of an interferometer setup. A light source emits light through a pinhole and a collimator lens to guarantee planar and coherent wavefronts. The light rays are then separated by a beam splitter into a measuring and a reference beam, which are reunited again in the beam splitter before being captured by a CCD camera. Equal path lengths of the measuring and the reference beam cause constructive interference, which can be used to measure the object surface elevation.
who first thought of this idea named it Integral Photography [61]. His method of light
field capture was more or less ignored for 100 years before being rediscovered about
10 years ago. First, researchers in computer graphics discovered light fields as a
possible solution to skip the 3D geometry modeling stage by instead using a collection
of images to interpolate intermediate views of a scene resulting in an interactive 3D
experience. In fact, this statement is simplified because there are various techniques
established in Image-based rendering [23, 24, 41, 54, 58, 65, 85, 86, 89] which can be
classified as techniques based on rendering without geometry, with implicit geometry
and with explicit geometry. However, the main goal is usually the generation of novel
views from existing images of a scene with or without geometry present. Reviews of these techniques can be found in Kang [48] and Shum [88].
In the computer vision community, people are more interested in sampling a light field of a scene to explicitly reconstruct the geometry. Ways to capture light fields are
diverse, but the principle is always to achieve a dense sampling of the cones of light
rays emitted by each point on the surface of a captured object. We will see in this work
that a dense and regular sampling of a scene, which we call a light field, allows more than just a reconstruction of geometry. Due to the mentioned sampling constraints, material properties of the captured objects are also mapped onto the sensor(s). This becomes
clear when realizing that material properties can be described using the Bidirectional
Reflectance Distribution Function (BRDF). The BRDF is a function describing the
measured intensity of an opaque surface depending on the incoming and outgoing light
rays and the normal vector of the surface. The fundamental assumption of algorithms
based on triangulation is that the observed scene point behaves Lambertian, which is
equivalent to a constant BRDF. In other words, the color of an observed object point
does not depend on the observation direction. In reality not many existing materials
fulfill this assumption. A lot of research has been done in the area of stereo vision to
design algorithms more robust against such glossiness effects. Although most objects
in natural scenes can be seen as Lambertian, the problems increase the more highly resolved the objects are or the nearer they are to the camera. Especially for tasks of high-quality 3D object
reconstruction, playing a role for example in industrial inspection, non-Lambertian
effects gain in importance and need to be handled. To gain robustness, stereo setups can
be extended to multiple cameras, providing more views of the same object point and thus more possible correspondences. But due to the fact that all such algorithms are still based on searching for corresponding features in different images, this only comes at the cost of ever higher algorithmic complexity and increasing computation time.
All these methods try to combat a lack of, or an ambiguity in, information with ever-improving error handling. If instead a light field camera samples a subset of the light rays emitted by an object, it actually performs a sparse sampling of the BRDF. This makes reconstruction of the geometry and also of the BRDF possible. Methods analyzing light fields thus should inherently be more robust against non-Lambertian effects, because information about the material is actually measured rather than merely causing ambiguities.
The goal of this thesis is an analysis of light fields from the computer vision point of
view. We use a specific parametrization throughout the entire work, called a 4D light
field or Lumigraph, which is a well suited representation, for example, giving access
to effects caused by the BRDF. The main contributions are methods to analyze 4D
light fields primarily aimed at geometry reconstruction of objects under Lambertian and
non-Lambertian assumptions.
The work presented in this thesis was funded by Robert Bosch GmbH, Stuttgart.
1.2 Outline
Section 2: We give an overview of light fields in the context of image processing. In
contrast to computer graphics, where light fields were developed to avoid the necessity
of geometry, we are interested in acquiring them to reconstruct geometry as a primary
goal. Due to that, we first introduce the most general definition of light fields before
discussing the parametrization and data representation we use in this work. After that,
we recap different methods of real world light field acquisition as well as simulation
using computer graphics.
Section 3: After an introduction of epipolar plane images and their benefits for the
analysis of light fields as well as a discussion of the problems of Focused Plenoptic Cameras, we present an algorithm to compute a real 4D representation from Plenoptic 2.0
Camera raw data without a pre-computation of an explicit pixel-wise depth estimation.
This section is based on the publication ”Generating EPI representations of 4D Light
fields with a single lens focused plenoptic camera” [104].
Section 4: This section discusses the 4D light field data used in this work. We
introduce our benchmark data set consisting of simulated and real world light field
data providing ground truth depth at least for the center view. The corresponding
publication is ”Datasets and Benchmarks for Densely Sampled 4D Light Fields” [105], a
joint work with Bastian Goldlücke and Stephan Meister.
Section 5: In this section, we discuss geometry reconstruction using light fields, in particular epipolar plane images. In contrast to methods based on correspondence search, we
propose an orientation analysis on epipolar plane images which can be implemented via
fast and robust filtering approaches. The section splits into two parts, single orientation
and double orientation estimation. The range estimation using single orientation analysis
is based on the Lambertian assumption that the color of a scene point is independent
of the viewpoint. This part is based on the publications ”Globally consistent depth
labeling of 4D light fields” [100] and ”Variational Light Field Analysis for Disparity
Estimation and Super-Resolution” [103]. In general the Lambertian assumption is not a
valid description for real world objects. If materials show a shininess or, even worse, act
as a mirror, this has an effect on the structure of the epipolar plane images. How this effect looks and how these more complex epipolar plane images can be analyzed is discussed in the second part of this section, the double orientation analysis. This part is based on the
publication ”Reconstructing Reflective and Transparent Surfaces from Epipolar Plane
Images” [102]. Along with the local analysis of the epipolar planes we discuss variational
frameworks to improve the results and to guarantee global consistency. The theoretical part of these optimization techniques, as well as the fast CUDA [71] implementations of
the developed algorithms, gathered in the cocolib [37], are the work of Bastian Goldlücke.
Section 6: With the range information that light fields readily provide,
further scene analysis can be done. We develop two frameworks based on light field
processing. First we discuss a super-resolution framework tailored to light fields. The
corresponding publications are ”Spatial and angular variational super-resolution of 4D
light fields” [101] and ”Variational Light Field Analysis for Disparity Estimation and
Super-Resolution” [103]. A second project is object segmentation in light fields. Here,
we will see that light fields are highly suitable for segmentation tasks. Problems of
classifiers acting on single images can be overcome, and by labeling rays consistently over the entire light field, much better results compared to single-pixel labeling can be
achieved. The publication corresponding to this part is ”Globally Consistent Multi-Label
Assignment on the Ray Space of 4D Light Fields” [106]. The publication is a joint
work with Bastian Goldlücke and Christoph Straehle, whereby, as above, the theoretical background as well as the fast GPU implementations of the variational methods are the work of Dr. Goldlücke.
1.3 Contribution
The following is a list of what the author believes to be the novel contributions of this
thesis:
• a novel algorithm to convert raw data of Plenoptic 2.0 Cameras into the Lumigraph
representation without pre-computing a pixel-wise distance measure.
• a benchmark database consisting of real-world and simulated 4D light fields
providing ground truth depth and partly ground truth object labels.
• a new approach for range estimation using orientation analysis in light fields.
• an extension of the single orientation analysis to double orientation patterns for
reconstructing reflections and transparencies.
• an evaluation of applications of the orientation analysis such as super-resolution
of light fields and ray-space segmentation.
2 Light Fields

2.1 The Plenoptic Function
One of the fundamental papers introducing the concept of light fields is ”The Plenoptic
Function and Elements of Early Vision” by Adelson and Bergen [2]. They ask what information about the world is contained in the light filling the space an observer is looking at. Starting from this question, they develop the theory of the plenoptic function.
Figure 4: A widely spread visualization of the plenoptic function. We cite the original caption of
the figure: ”The plenoptic function describes the information available to an observer at any point
in space and time. Shown here are two schematic eyes-which one should consider to have punctuate
pupils-gathering pencils of light rays. A real observer cannot see the light rays coming from behind,
but the plenoptic function does include these rays.” (Adelson and Bergen [2])
If we capture a gray value image of a scene using a pinhole camera, we select a cone-shaped bundle of rays at a specific position in space V_0 and accumulate their intensities on the sensor of our camera. Thus we measure an intensity distribution P(θ, φ) or P(x, y), depending on the type of coordinate system we use. Taking the light's wavelength λ into account, we can add another dimension, P(x, y, λ). If in a next step we
measure the whole space V ∈ ℝ³ instantaneously, we gain three more dimensions, and when also including the time t, we end up with a seven-dimensional function describing the entire information about the light filling the space over time:
\[
P(x, y, \lambda, V_x, V_y, V_z, t). \tag{8}
\]
There might be even more dimensions if we consider the polarization states of the light
rays, but we neglect this here and concentrate in the following on the plenoptic function
in equation 8. In general, this function definition seems a bit abstract, but in fact,
every imaging device samples sparse subsets of this function. Considering this, the
concept of a plenoptic function can serve as a general framework to think about possible
imaging modalities. For example, besides the standard pinhole camera, which can be described as P(x, y), this work is about a sampling of the plenoptic function of the type P(x, y, V_x, V_y) [107].
2.2 The Lumigraph Parametrization
In the previous section 2.1, we discussed the plenoptic function as describing the entire information on the light filling the space around an object. Due to the fact that
all imaging techniques are sparse samples of this general function, in this section we
introduce the sparse sampling or parametrization this work concentrates on.
If we assume a sampling of the plenoptic function using a gray value camera, we can first
neglect the wavelength λ dependency in equation 8. Furthermore this work is about
static scene reconstruction, so we are not interested in optical flow estimation or any
other time dependent properties of the scene and thus can cancel out the dimension
t as well. Another reduction in dimensionality can be achieved if we assume that the
intensity of a light ray does not depend on the actual position on the ray, which is
equivalent to the assumption that we parametrize the light field on a surface Σ outside
of the convex hull of the scene (compare figure 5 left).
Several ways to represent light fields have been proposed. Here, we adopt the light field
parametrization from early works in motion analysis from Bolles et al. [16] and the work
about light field sampling from Gortler et al. [41]. The idea of a convex hull to reduce
the plenoptic function has also been used by Benton [10] and similar ideas can be found
in Ashton [5], where the movement of the camera is restricted to a spherical surface for
an illumination analysis.
One way to look at a 4D light field is to consider it as a collection of pinhole views from
several view points parallel to a common image plane (see figure 5). The 2D plane Π
contains the focal points of the views, which we parametrize by the coordinates (s, t),
and the image plane Ω is parametrized by the coordinates (x, y). A 4D light field or
Lumigraph [41] is then a map
\[
L : \Omega \times \Pi \to \mathbb{R}, \qquad (x, y, s, t) \mapsto L(x, y, s, t). \tag{9}
\]
It can be viewed as an assignment of an intensity value to the ray passing through
(x, y) ∈ Ω and (s, t) ∈ Π.
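In discrete form, such a Lumigraph is conveniently stored as a 4D array indexed by the two view coordinates and the two image coordinates. The following minimal NumPy sketch (the array shape and the index order are illustrative conventions, not prescribed by the thesis) shows this indexing:

import numpy as np

# Hypothetical Lumigraph: 9 x 9 views of 512 x 512 gray value images,
# stored as L[t, s, y, x] so that L[t, s] is the pinhole view at camera position (s, t).
L = np.zeros((9, 9, 512, 512), dtype=np.float32)

center_view = L[4, 4]       # the view rendered from the central camera position
ray = L[4, 4, 100, 200]     # intensity of the ray through (x, y) = (200, 100) and (s, t) = (4, 4)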
(a) Opaque surface only assumption
(b) Two plane parametrization
Figure 5: (a) Dimensionality reduction of the plenoptic function through parameterizing the light
field on a surface Σ by assuming that the intensity of light rays does not change in the free space
between object surfaces and the imaging device. In other words, the intensity of the rays from the
object at P which are occluded by the object at P ′ is not of interest. (b) Each light ray can be
parametrized by the intersection point with two planes. Each camera location (s∗ , t∗ ) in the 2D
plane Π yields a different pinhole view of the scene. Together with the second intersection (x∗ , y ∗ ) at
the image plane Ω we can parametrize a light field as a four dimensional subspace L(x, y, s, t) of the
plenoptic function (see equation 9).
2.3 Epipolar Plane Images
For the problem of estimating the 3D structure of a sampled scene, we consider the
structure of the light field, in particular on 2D slices through the field. We fix a horizontal
line of constant y* in the image plane and a constant camera coordinate t*, and restrict the light field to an (x, s)-slice Σ_{y*,t*}, or respectively to a (y, t)-slice Σ_{x*,s*}. The resulting map is called an epipolar plane image (EPI); this idea goes back to Bolles et al. [16]:
\[
S_{y^*, t^*} : \Sigma_{y^*, t^*} \to \mathbb{R}, \qquad (x, s) \mapsto S_{y^*, t^*}(x, s) := L(x, y^*, s, t^*). \tag{10}
\]
Let us consider the geometry of this map (compare figures 5 and 6). A point P = (X, Y, Z)
within the epipolar plane corresponding to the slice projects to a point in Ω depending
on the chosen camera center in Π. If we vary s, the coordinate x changes according to
\[
\Delta x = \frac{f}{Z} \, \Delta s, \tag{11}
\]
where f is the distance between the parallel planes. Note that to obtain this formula ∆x
has to be corrected by the translation ∆s to account for the different local coordinate
systems of the views. Interestingly, a point in 3D space is thus projected onto a line
in Σy∗ ,t∗ , where the slope of the line is related to its depth. This means that the intensity
Figure 6: The left side depicts a collection of images sampling a 3D scene. The images are captured on a planar 2D grid with constant baselines. This is what is called the Lumigraph parametrization (section 2.2). By fixing an angular dimension (visualized via a red box) we extract a 3D subspace of the Lumigraph. If we imagine this image sequence as a volume (x, y, s) and cut out a slice
along the s-axis, which is equivalent to fixing another spatial dimension (visualized via a green line),
the result is an epipolar plane image. In this subspace a point in the world is mapped onto a line
whose slope corresponds to the distance of the point to the camera.
of the light field should not change along such a line, provided that the objects in the
scene are Lambertian. Thus, computing depth is essentially equivalent to computing
the slope of level lines in the epipolar plane images. Of course, this is a well-known
fact, which has already been used for depth reconstruction in previous works [16, 27].
In sections 5.1 and 5.2, we describe and evaluate novel approaches on how to obtain
slope estimates under Lambertian and non-Lambertian assumptions.
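To make the slicing of equation 10 and the depth relation of equation 11 concrete, the following NumPy sketch extracts a horizontal EPI from a 4D array stored as in the sketch above and converts a measured EPI slope into a depth value; the numerical values are illustrative assumptions.

# Extract the EPI S_{y*,t*}: fix the image row y* and the camera row t*,
# keep all image columns x and all camera positions s.
y_star, t_star = 256, 4
epi = L[t_star, :, y_star, :]      # shape: (number of views in s, image width in x)

# Equation 11: a scene point traces a line of slope dx/ds = f / Z on this EPI,
# so an estimated slope (in pixels per view step) directly gives the depth.
f = 1200.0                          # distance between the two planes (assumed)
slope = 2.5                         # measured EPI slope in pixels per view (assumed)
Z = f / slope                       # depth of the corresponding scene point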
Figure 7: Sketch of a linear camera array. Cameras are lined up with constant baselines b. This leads to a linear mapping of a 3D point onto the sensors. The slope of these lines depends on the distance Z of P to the image plane.
2.4 Acquisition Of Light Fields

2.4.1 The Plenoptic Camera (1.0)
The early beginnings of light field imaging or Plenoptic Cameras are strongly connected
to Ives (1903) [44] and Lippmann (1908) [61]. Lippmann realized that classical photography, like drawings, only shows one part of the whole and dreamed of an imaging
device able to render ”the full variety offered by the direct observation of objects”. One of his drawings from 1908, depicted in figure 8, already gives some insight into today's
Figure 8: Early drawing of Lippmans so-called integral camera (1908) [61]
The modern approaches of building Plenoptic Cameras are mainly influenced by the
works of Adelson et al. [3] and Ng et al. [70]. Due to the fact that the principles in
detail are well described in those publications and that this work does not deal with
data of early versions of Plenoptic Cameras, we will here only give a short overview of
the basic concepts of the realization and the algorithms for rendering images from the sensor raw data, following Ng et al. [70].
2.4.1.1 Optical Design. The Plenoptic Camera 1.0 is based on a usual camera
with a digital sensor, a main optics and an aperture. The difference from a normal
camera is a micro-lens array placed on the focal plane of the main lens exactly at a
distance f_MLA from the sensor (see figure 9). This means the micro-lenses themselves
are focused at infinity. In contrast to a usual camera, which integrates the focused light of the main lens on a single sensor element, the micro-lenses split the incoming light cone by the direction of the rays, mapping them onto the sensor area below the corresponding micro-lens.
This means that one has direct access to the intensity of a light ray L(x∗ , y ∗ , s∗ , t∗ ) of
the light field by choosing the micro-image of the micro-lens at (x∗ , y ∗ ) – encoding the
spatial position– and a pixel of the corresponding micro-image (s∗ , t∗ ) – encoding the
direction. It should be noted that the size of each micro-lens is coupled to the aperture
or f-number of the main optics. If the micro-lenses are too small compared to the main
Figure 9: Left: One-dimensional sketch of a Plenoptic Camera 1.0 setup. Light rays emitted by the object are focused by the main lens (ML). The micro-lens array (MLA) is placed at the image plane (IP) of the main lens and thus separates the rays by their direction, mapping them onto the sensor. Right: Illustrates the rendering of a single view point, here the center view, by collecting the center pixels of each micro-image m_i.
aperture the micro-images overlap each other or – the other way around – it is a waste
of sensor area if the micro-lenses are too big. Due to the fact that light passing the
main aperture also has to pass a micro-lens before being integrated on a square pixel,
what actually happens is that the camera measures small 4D boxes of the light field
entering the camera instead of single rays.
2.4.1.2 Rendering Views From Raw Data. Rendering a projective view from
sensor raw data is quite simple as depicted in figure 9.
1. Determine a specific projective view by choosing a relative position p_{s*,t*} within the micro-images, for example the center position p_center.
2. Define an output image I_{s*,t*} as an M × N matrix, where M, N ∈ ℕ are the numbers of micro-lenses in the vertical and horizontal directions.
3. Assign to the pixel (x, y) in I_{s*,t*} the intensity of the pixel p_{s*,t*} in the micro-image corresponding to the micro-lens q_{x,y}.
Changing the relative position p_{s*,t*} for rendering means changing the position of the virtual aperture, which results in a projective view from a slightly different viewpoint. An integration of images from neighboring viewpoints is used to create a depth of field and enables computational refocusing by varying the relative positions of these neighboring images. ”In quantized form, this corresponds to shifting and adding the sub-aperture images ...” Ng et al. [70].
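The three rendering steps above can be sketched in a few lines of NumPy. The sketch assumes an idealized raw image consisting of a regular grid of square micro-images of size d pixels and ignores the calibration issues discussed next.

import numpy as np

def render_view(raw, d, ps, pt):
    # raw: sensor image composed of an M x N grid of d x d micro-images
    # (ps, pt): relative pixel position inside each micro-image, i.e. the chosen viewpoint
    M, N = raw.shape[0] // d, raw.shape[1] // d
    view = np.empty((M, N), dtype=raw.dtype)
    for y in range(M):
        for x in range(N):
            # take the same relative pixel from every micro-image (steps 1 to 3)
            view[y, x] = raw[y * d + pt, x * d + ps]
    return view

# The center view corresponds to the central pixel of each micro-image,
# e.g. center = render_view(raw, d=10, ps=5, pt=5).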
In fact the description above neglects a very important calibration step necessary
beforehand. Rendering views from the camera raw data is only that easy if the data are
rectified and undistorted to satisfy the conditions necessary for a successful rendering. A
detailed description of a possible calibration process is out of the scope of this work but
can be found in Dansereau et al. [28].
2.4.2 The Focused Plenoptic Camera (2.0)
Besides the Plenoptic Camera (see section 2.4.1), another optical setup for a compact
light field camera has recently been developed, the Focused Plenoptic Camera, often also
called the Plenoptic Camera 2.0 [62, 64, 73]. The main disadvantage of the Plenoptic
Camera 1.0 is the poor spatial resolution of the rendered views, which is equal to the
number of micro-lenses. By slightly changing the optical setup, one can increase the spatial resolution dramatically.
2.4.2.1 Optical Design. The main difference in the optical setup between the
Plenoptic Camera 1.0 and 2.0 is the relative position of the micro-lens array. The
micro-lenses are no longer placed at the focal plane of the main lens and focused at infinity, but are now focused onto the image plane of the main lens. The result is
that each micro-lens then acts as a single pinhole camera, ”looking” at a small part
of the virtual image inside the camera. This small part is then imaged with a high
spatial resolution onto the sensor as long as the imaged scene point is in the valid region
between the principal plane of the main lens and the image sensor. Scene features
behind the principal plane cannot be resolved. The effect is that scene points – that
are not in focus of the main lens but within this valid region – are imaged multiple
times over several neighboring micro-lenses, thus encoding the angular information over
several micro-images (see also figure 11 and Lumsdaine et al. [62] or Perwass [73]). This
makes it possible to encode angular information and preserve high resolution at the
same time. But this comes at a price. First, the light field encoding is more complicated and, second, due to the multiple imaging of scene features, rendered images from this camera also have a much lower resolution than the inherent sensor resolution promises.
2.4.2.2 Rendering Views From Raw Data. The rendering process requires a one-time, scene-independent calibration, which extracts for all micro-lens images (micro-images) their positions as well as their diameter d_ML. In this work, we use a commercially available camera [72], which has a micro-lens array where the lenses are arranged in a
hexagonal pattern.
Due to this lens layout, we also use a hexagonal shape for the micro-images and address
them with coordinates (i, j) on the sensor plane. We define an image patch p̂_ij as a
micro-image or a subset of it. Projective views are rendered by tiling these patches
together [33, 63].
The center of a micro-image (i, j), determined in the coordinate system given by the
initial camera calibration process, is denoted by c_ij. The corresponding patch images
Figure 10: Left: One-dimensional sketch of a Plenoptic Camera 2.0 setup. Light rays emitted
by the object are focused by the main lens (ML) onto the image plane (IP). The micro-lens array
(MLA) is placed so that the micro-lenses are focused onto the image plane of the main lens, mapping
fractions of the virtual image onto the sensor. Green rays are coming from an object in focus of the
main lens (FP), blue rays from an object away from the principal plane of the main lens.
Right: Illustrates the resulting micro-images of an object in and out of focus.
are defined as ω_ij(δ, o), where δ denotes the size of the micro-image patch p̂_ij(δ, o) in pixels and o is the offset on the sensor plane of the micro-image patch center from c_ij. We define ω_ij(δ, o) as an m × n matrix which is zero except for the positions of the pixels of the corresponding micro-image patch p̂_ij(δ, o):
\[
\omega_{ij}(\delta, \vec{o}) =
\begin{pmatrix}
0 & \cdots & & 0 \\
\vdots & \hat{p}_{ij}(\delta, \vec{o}) & & \vdots \\
0 & \cdots & & 0
\end{pmatrix} \tag{12}
\]
Here, m × n is the rendered image resolution and (i, j) is the index of a specific image patch, imaged from micro-lens (i, j) (see figure 11). A projective view Ω(δ, o) of a scene is then rendered as
\[
\Omega(\delta, \vec{o}) = \sum_{i=1}^{N_y} \sum_{j=1}^{N_x} \omega_{ij}(\delta, \vec{o}),
\qquad \delta \in \mathbb{N} \,|\, 1 < \delta \le d_{ML},
\qquad \vec{o} \in \mathbb{N}^2 \,|\, 0 \le \lVert \vec{o} \rVert \le \frac{d_{ML}}{2} - \frac{\delta}{2}, \tag{13}
\]
where (Nx , Ny ) is the number of micro-lenses on the sensor in x- and y-directions. The
choice of the parameters δ and ~o directly controls the image plane and point of view of
the rendered view.
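A strongly simplified NumPy sketch of this patch-tiling rendering is given below. It assumes, unlike the real hexagonal layout, a square micro-lens grid with idealized micro-image centers and ignores calibration; the parameters delta and (ox, oy) play the roles of δ and the offset o in equation 13.

import numpy as np

def render_patch_view(raw, d_ml, delta, ox=0, oy=0):
    # raw: sensor image covered by an Ny x Nx grid of square micro-images of size d_ml
    Ny, Nx = raw.shape[0] // d_ml, raw.shape[1] // d_ml
    c = d_ml // 2                                   # idealized micro-image center
    view = np.empty((Ny * delta, Nx * delta), dtype=raw.dtype)
    for j in range(Ny):
        for i in range(Nx):
            # cut a delta x delta patch around the (possibly offset) micro-image center ...
            y0 = j * d_ml + c + oy - delta // 2
            x0 = i * d_ml + c + ox - delta // 2
            patch = raw[y0:y0 + delta, x0:x0 + delta]
            # ... and tile it into the output image (equation 13)
            view[j * delta:(j + 1) * delta, i * delta:(i + 1) * delta] = patch
    return view

Choosing a larger delta corresponds to a virtual depth plane closer to the principal plane of the main lens, while a nonzero offset (ox, oy) selects a slightly different viewpoint.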
(a) Micro images
(b) Plenoptic 2.0 Camera raw data example
Figure 11: (a) Micro-images and their centers c_ij are indicated, as well as the resulting image patches p̂_ij and their border pixels b(p̂_ij). (b) Raw data from a Plenoptic Camera 2.0 [73]. All possible optical effects are visible here. The green box in the upper left corner shows the transition from
a region in the scene behind the principal plane of the camera main lens to a region exactly on the
focal plane so that the imaged fragments perfectly fit together. The red boxes show magnified regions of the scene between the principal plane of the main lens and the sensor so that scene features
are imaged multiple times over several neighboring micro-images. The amount of multiple feature
occurrence depends on the distance to the image plane.
2.4.2.3 Refocusing. It is obvious that rendering a projective view here is a bit more
complex than it is for the Plenoptic Camera 1.0, where one can simply extract single
pixels from the raw data (see section 2.4.1). The reason is the different sampling of the
light field in the devices. While a micro-lens in a Plenoptic Camera 1.0 is focused at
infinity and thus decomposes the light rays emitted by a 3D point into their directions,
a micro-lens of a Focused Plenoptic Camera acts as a single pinhole camera looking at a
small subset of the virtual image of the scene. This leads to a much higher resolution but
spreads the directional information over multiple micro-images. This causes so-called
plenoptic artifacts during rendering. The choice of the patch size δ defines a specific
image or virtual depth plane in the 3D scene. Neighboring patches p̂_ij of size δ fit perfectly together for all imaged scene features from the corresponding virtual depth plane. Patches whose content is an imaged region not lying on this virtual depth plane either lack information or still contain the multiple occurrences of scene features. These
are the mentioned plenoptic artifacts which occur for a fixed patch size δ all over the
rendered image, except for the specific virtual depth plane (compare figure 12, or as
another example, Lumsdaine et al. [64] Fig. 11). Due to the fact that the multiple
occurrence of image features over the micro-images depends on the distance to the
camera, image planes nearer to the camera need to be rendered with smaller patch sizes
δ and thus show a loss in resolution. Full resolution is only present at the image plane
of maximum δ, which is the principal plane of the main lens. A possibility to handle
Figure 12: Top: Rendering different image planes and views illustrated on the basis of the 1D
sketch depicted in figure 10. On the left side the effect of different patch sizes δ is depicted, on the
right the effect of changing the offset o. Bottom: Illustrates on the left a micro-image patch as well as the effect of different values of δ and offsets o. On the right the reason for plenoptic artifacts is
visualized.
the plenoptic artifacts is described in Lumsdaine et al. [25]. They call it blending and achieve much more realistic-looking refocusing results with this technique. Another
approach can be found in Georgiev et al. [34]. We will propose our approach in this
work in section 3.
2.4.2.4 Generating All-In-Focus Views. An important aspect of Focused Plenoptic Cameras is that, besides the opportunity of computational refocusing, one could also be interested in rendering images with the largest possible depth of field. This
means removing the plenoptic artifacts or in other words, eliminating all duplicated
scene features captured by the individual micro-lenses.
The common thread of this work is an analysis of light fields based on the analysis of
epipolar plane images (see section 2.3). We will see in the following sections how to
create and analyze them in detail. In this context it is only important to know that
from a Focused Plenoptic Camera, we need to render all possible All-In-Focus Views to
get access to them. Therefore, we recap in this section some related work on rendering those full depth-of-field views and discuss a new approach in section 3.
We will now quickly discuss two approaches treating this issue. The first is from Perwass
et al. [73]. It is mainly based on triangulation to obtain a pixel-wise depth map over
the micro-images (an overview of existing triangulation algorithms can be found in
Scharstein and Szeliski [83]). For this, the correlation or sum-of-absolute differences
(SAD) is computed over micro-image pairs. This of course works only where a local
contrast is present. The computed virtual depth value per pixel gives a hypothesis
for a projection cone defining the occurrence of the same image feature in neighboring
micro-images. By integrating all connected pixels over those cones, a final image without
multiple occurrence of scene features can be rendered. A quite similar approach using
multi-view stereo is described in Bishop et al. [15].
Another approach is from Georgiev et al. [33]. They define a sub-window patch of a micro-image, similar to those depicted in figure 11, and compute the cross correlation of this patch along the x-axis with the left and right neighboring micro-images as well as along the y-axis with the top and bottom neighboring micro-images. This results in a shift
from one micro-image to the other. Knowing this shift and the chosen window size of
the initial patch used for searching, an optimal patch size for this micro-image can be
computed. By tiling all patches with optimal patch size together, a full depth-of-field
view can be rendered.
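A minimal sketch in the spirit of this second approach (not the authors' implementation; the function name, the square micro-image layout, and the template size are assumptions) estimates the horizontal shift between two neighboring micro-images via normalized cross correlation:

import numpy as np

def estimate_shift(m_left, m_right, template_size=8):
    # Correlate a central template of the left micro-image against all
    # horizontal positions in its right neighbor (normalized cross correlation).
    c = m_left.shape[0] // 2
    r0 = c - template_size // 2
    t = m_left[r0:r0 + template_size, r0:r0 + template_size]
    t = (t - t.mean()) / (t.std() + 1e-8)
    best_shift, best_score = 0, -np.inf
    for x in range(m_right.shape[1] - template_size + 1):
        w = m_right[r0:r0 + template_size, x:x + template_size]
        w = (w - w.mean()) / (w.std() + 1e-8)
        score = float((t * w).sum())
        if score > best_score:
            best_score, best_shift = score, x - r0
    return best_shift

The measured shift can then be translated into an optimal patch size for this micro-image; the exact relation depends on the camera geometry and is not reproduced here.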
2.4.3 Gantry
One of the simplest and most inexpensive ways to sample a light field is a gantry
(see figure 13 left). This is a precise xy-axis stepper motor driving a normal camera
along a regular 2D grid. The benefits besides the simplicity are that baselines down to
a millimeter are realizable and no color or optics correction over several cameras is
necessary. A disadvantage is that only static scenes under static lighting conditions can
be captured, and the mobility is restricted.
2.4.4 Camera Arrays
Due to the fact that the analysis in this work deals with epipolar plane images, the
output of camera arrays is the most convenient data structure. They offer a fast and
direct access to the 4D Lumigraph (compare section 2.2) and the EPIs (section 2.3).
Additionally camera arrays are also suitable for capturing dynamic scenes. Another
benefit is that a camera array does not necessarily limit the spatial resolution as Plenoptic Cameras (sections 2.4.1 and 2.4.2) do. The main disadvantages are that they are costly, due to the number of cameras but also due to the hardware necessary for synchronization
and data access. Additionally they also need a more complicated calibration process.
Besides the external and internal calibration it is also important to apply a color and
noise calibration due to the different sensor behavior of individual cameras.
(a) Gantry - xy stepper motor
(b) Two possible realizations of a camera array
Figure 13: (a) Depicts our precise gantry xy-axis stepper motor. The x and y axes are indicated by the red arrows, as well as the camera platform. Many thanks to the Robert Bosch GmbH for loaning this device. (b) Depicts on the left a prototype of a 6 × 6 camera array. Many thanks to Harlyn Baker for providing this image. The right side shows a camera array from Stanford [97].
2.4.5 Simulation
In this work, we often make use of simulated light fields. Besides inexpensive data generation, they offer easy access to interesting properties like ground truth for the geometry or the objects themselves, the material properties, as well as the opportunity to simulate the sensor's noise behavior. This makes simulation a great tool for algorithm development and evaluation. We use the open source software Blender [77]. Blender offers an API accessible via Python [98], which allows scripting of all objects in the 3D environment. We simulate the light fields by scripting the Blender camera to sample the 3D scene on a regular 2D grid. This is exactly the data format of a gantry (section 2.4.3) or a camera array (section 2.4.4) device. It should be noted here that with the rendered Lumigraph (compare section 2.2) and the ground truth depth provided through Blender, a simulation of Plenoptic Camera 1.0 (section 2.4.1) data is also quite easy to achieve. Simulation of a Focused Plenoptic Camera (section 2.4.2) is not so trivial and needs a more complex simulation of the optics.
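As an illustration, sampling a light field on a regular 2D grid can be scripted through Blender's Python API roughly as follows. This is a minimal sketch only; it assumes a camera object named "Camera" whose sensor plane is parallel to the xy movement plane, and the baseline b, grid size and file paths are hypothetical.

import bpy

b = 0.05                      # assumed baseline between grid positions (Blender units)
n_s, n_t = 9, 9               # assumed 9 x 9 grid of view points
cam = bpy.data.objects["Camera"]
scene = bpy.context.scene
base = cam.location.copy()    # grid is centered on the current camera position

for t in range(n_t):
    for s in range(n_s):
        # move the camera on the regular grid, parallel to its sensor plane
        cam.location = (base.x + (s - n_s // 2) * b,
                        base.y + (t - n_t // 2) * b,
                        base.z)
        scene.render.filepath = f"//lf_{t:02d}_{s:02d}.png"
        bpy.ops.render.render(write_still=True)

cam.location = base           # restore the original camera position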
Together with light fields captured using the gantry mentioned in section 2.4.3, we offer a benchmark database for light field analysis consisting of real-world and simulated 4D light fields (see section 4 and figures 18 and 19).
3
Lumigraph Representation from Plenoptic Camera Images
In the following sections of this work, we will see that having access to the epipolar
plane images (see section 2.3) of a light field can be very valuable for the analysis of
a captured scene. But, if our light field is sampled with a Plenoptic Camera 2.0 (see
section 2.4.2), this access is not trivial.
Basically, the generation of a Lumigraph representation from a sampled 4D Light Field is
simple – at least using camera arrays [108] (see also sections 2.3, 2.4.3 and 2.4.4) – where
the projective transformations of the views of the individual cameras only have to be
rectified and unified into one epipolar coordinate system requiring a precise calibration
of all cameras.
Due to the optical properties of the micro-lenses – with the image plane of the main lens
defining the epipolar coordinate system – these projective transformations are, in the
case of Focused Plenoptic Cameras, reduced to simple translations [63] of the patches
pbij within each micro-image, given by an offset ~o (see section 2.4.2). Hence, one simply
has to rearrange the viewpoint-dependent rendered views from plenoptic raw data into
the 4D EPI representation (see equation 9).
However, the necessarily small depth of field of the micro-lenses causes other problems.
For most algorithms, the EPI structure can only be effectively evaluated in areas with
high-frequency textures - which of course is only possible for parts of a scene which are
in focus.
Another problem is the different focal lengths of the micro-lenses the camera vendor uses to increase the depth of field [73]. A last, but most important, problem is that
Focused Plenoptic Cameras suffer from imaging artifacts in out-of-focus areas. Hence,
in order to generate EPIs which can be used to analyze the entire scene at once, we
have to generate the EPIs from all-in-focus (i.e. full depth-of-field) views for each focal
length separately.
3.1
Rendering All In Focus Views Without Pixel-wise Depth
We already discussed some existing methods addressing the topic of rendering all-in-focus
views in section 2.4.2. Now we discuss our contribution based on the publication [104].
To generate a Lumigraph (section 2.2, equation 9) free of plenoptic artifacts from the raw data of a Focused Plenoptic Camera, we need images of all available viewpoints without plenoptic artifacts, and thus we need to render all full depth-of-field images from the raw data.
The primary objective in analyzing light fields is to reconstruct the inherently available
depth information since further computations can benefit from available range data.
Here we want to make use of the epipolar plane images (section 2.3) to reconstruct
the scene geometry (section 5.1). Thus the generation of a Lumigraph from plenoptic raw data is a chicken-and-egg problem, because we already need the depth to render the all-in-focus views (section 2.4.2) and to generate artifact-free epipolar plane images.
The computation of full depth-of-field images from a series of views with different image
planes usually requires depth information of the given scene: [15], [73] and [13] applied
a depth estimation based on cross-correlation. The main disadvantage of this approach
is that one would have to solve a major problem, namely the depth estimation for
non-Lambertian scenes, in order to generate the EPI representation, which is intended
to be used to solve the problem in the first place – as already mentioned, a classical chicken-and-egg problem. To overcome this dilemma, we propose an alternative approach.
We actually do not need to determine the depths explicitly - all we need are the correct
patch sizes δm to ensure a continuous view texturing without plenoptic artifacts.
We propose to find the best δm via a local minimization of the gradient magnitude at the patch borders b(p̂ij) (see figure 11, section 2.4.2) over all possible focal images Ωm. Since the effective patch resolution changes with δm, we have to apply a low-pass filtering to ensure a fair comparison. In practice, this is achieved by downscaling each patch to the smallest size δmin, using bilinear interpolation. We denote the band-pass filtered focal images by Ω̄m. Assuming a set of patch sizes ~δ = [δ0, . . . , δm, . . . , δM], we render a set Γ of border images using a Laplacian filter (see figure 14):

\vec{\Gamma} = \left[ \nabla^2 \bar{\Omega}_0, \ldots, \nabla^2 \bar{\Omega}_m, \ldots, \nabla^2 \bar{\Omega}_M \right], \qquad \nabla^2 = \frac{\partial^2}{\partial x^2} + \frac{\partial^2}{\partial y^2}. \qquad (14)
From Γ, we determine the gradients for each hexagon patch by integrating along its borders b(p̂ij), considering only gradient directions orthogonal to the edges of the patch (see figure 14). The norm of the gradients orthogonal to the border of each micro-image patch p̂ij and each image plane m is computed as

\Sigma(m, i, j) = \oint_{b(\hat{p}_{ij})} \vec{n}_b \cdot \nabla \Gamma_m \, ds. \qquad (15)

Here, \vec{n}_b denotes the normal vector of each hexagon border b (compare figure 11). Furthermore, we define the lens-specific image plane map z[i, j] as the minimum of Σ over m for each micro-lens image (i, j):

z(i, j) = \operatorname{argmin}_m \Sigma[m, i, j]. \qquad (16)
The image plane map z (i, j) has a resolution of (Ny , Nx ) (number of micro-lenses) and
encodes the patch size value δij for each microimage (i, j). Using z (i, j), we render
full depth of field views Ω (z). This approach works nicely for all textured regions of
a scene. Evaluating the standard deviation of Σ for each (i, j) can serve to further
improve z (i, j): micro-images without (or with very little) texture are characterized by
a small standard deviation in Σ. We use a threshold to replace the affected z (i, j) with
the maximum patch size δmax . This is a valid approach, since the patch size (i.e. the
focal length) does not matter for untextured regions. Additionally, we apply a gentle
median filter to remove outliers from z (i, j).
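The procedure of equations 14 to 16 can be sketched as follows. Note that this is only an illustrative simplification: it assumes square instead of hexagonal patches, takes the already band-pass filtered focal images and a grid of micro-image centers as hypothetical inputs, and uses the largest index as a stand-in for the maximum patch size δmax.

import numpy as np
from scipy.ndimage import laplace, sobel, median_filter

def image_plane_map(focal_imgs, centers_y, centers_x, delta_min, std_thresh):
    """Sketch of eqs. (14)-(16): per micro-image, pick the focal image m whose
    (square) patch border shows the smallest gradient magnitude."""
    gammas = [laplace(np.asarray(img, dtype=float)) for img in focal_imgs]  # eq. (14)
    M, half = len(gammas), delta_min // 2
    sigma = np.zeros((M, len(centers_y), len(centers_x)))
    for m, g in enumerate(gammas):
        gx, gy = sobel(g, axis=1), sobel(g, axis=0)
        for i, cy in enumerate(centers_y):
            for j, cx in enumerate(centers_x):
                top, bot = cy - half, cy + half
                left, right = cx - half, cx + half
                # eq. (15): integrate the gradient component orthogonal to each border edge
                sigma[m, i, j] = (np.abs(gy[top, left:right]).sum()
                                  + np.abs(gy[bot, left:right]).sum()
                                  + np.abs(gx[top:bot, left]).sum()
                                  + np.abs(gx[top:bot, right]).sum())
    z = sigma.argmin(axis=0)                      # eq. (16)
    z[sigma.std(axis=0) < std_thresh] = M - 1     # untextured: fall back to delta_max
    return median_filter(z, size=3)               # gentle outlier removal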
Figure 14: Left: Part of a raw data image from a Focused Plenoptic Camera and a zoomed part
of it. Right: Three examples of the border image set (see eq. 14) and the gradient magnitude set
(equation 15) with different patch sizes δ are depicted. The example in the center shows the correct
focal length.
3.2
Merging Views from Different Lens Types
The full depth-of-field views for each lens type have the same angular distribution (if the
same offsets o~q have been used), but are translated relative to each other. We neglect
that these translations are not completely independent of the depth of the scene. Due
to the very small offset (baseline), these effects are in the order of sub-pixel fractions.
The results shown in figures 6, 7 and 8 are merged by determining the relative shifts Tn
via normalized cross-correlation and averaging over the views with the same offset.
\Omega_{merged}(z, \vec{o}_q) = \frac{1}{3} \sum_{n=1}^{3} T_n \Omega_n(z, \vec{o}_q), \qquad T_n \in \mathbb{N} \times \mathbb{N}. \qquad (17)
Due to the fact that each lens type has an individual focal length, the sharpness of the results can be improved by a weighted averaging depending on the optimal focal range of each lens type and the information from the focal maps z[i, j]:

\Omega_{merged}(z, \vec{o}_q) = \sum_{n=1}^{3} \alpha_n(z)\, T_n \Omega_n(z, \vec{o}_q), \qquad \alpha_n \in \mathbb{R} \quad \text{and} \quad \sum_{n=1}^{3} \alpha_n = 1. \qquad (18)

3.3
The EPI Generation Pipeline
1. View rendering: Rendering of all possible full depth-of-field images Ω(z, ~o) for
different view points of the scene using the focal plane map z (i, j) of optimal
patch sizes and the patch offset vector ~o.
2. View merging: Merging of the corresponding views of different lens types. This
step is only necessary for cameras with several micro-lens types, such as the camera
used in our experiments [72].
3. View stacking: After the merging process, a single set of rendered views remains. These have to be arranged in a 4D volume according to their view angles, resulting in the EPI structure L(x, y, s, t) (section 2.2, equation 9); a minimal stacking sketch is given below.
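A minimal sketch of the stacking step (hypothetical array shapes; the views are assumed to be ordered row-major by view angle):

import numpy as np

def stack_views(views, n_s, n_t):
    """Arrange merged all-in-focus views into the 4D EPI structure L(x, y, s, t)."""
    h, w, c = views[0].shape
    L = np.zeros((n_t, n_s, h, w, c), dtype=views[0].dtype)
    for t in range(n_t):
        for s in range(n_s):
            L[t, s] = views[t * n_s + s]
    return L

# an epipolar plane image S_{y*,t*} is then the fixed-(y*, t*) slice L[t_star, :, y_star, :, :]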
3.4
Results
For the experimental evaluation, we use a commercially available Focused Plenoptic
Camera (the R11 by the camera manufacturer Raytrix GmbH [72]). The camera captures
raw images with a resolution of 10 megapixels and is equipped with an array of roughly 11000 micro-lenses. The effective micro-image diameter is 23 pixels. The array holds
three types of lenses with different focal lengths, nested in a 3 × 65 × 57 hexagon layout,
which leads to an effective maximum resolution of 1495 × 1311 pixels for rendered
projective views at the focal length of the main lens. Due to this setup with different
micro-lens types, we compute the full depth of field view for each lens type independently
and then apply a merging algorithm.
A qualitative evaluation is shown in figure 15. We compare the results of our proposed
algorithm with the output of commercial software from the camera vendor, which
computes the full depth of field projective views via an explicit depth estimation based
on stereo matching on the camera raw data [72]. We present the raw output of both
methods. It should be noted that the results of the depth estimates are not directly comparable – the emphasis of our qualitative evaluation lies on the full depth of scene reconstruction.
Figure 15: Estimation of the focal length. The left side shows a typically dense image plane map z[i, j] (see equation 16), computed with our algorithm. On the right, the center views of the reconstructed all-in-focus Lumigraph as well as exemplary extracted epipolar plane images are depicted.
Figure 16: Top row: focal plane reconstruction vs. the stereo-based range estimation of the camera vendor [72]. Bottom row: the all-in-focus rendering of (left) the proposed method and (right) the stereo-based method of the camera vendor.
4
Data Sets - The 4D Light Field Archive
The driving force for successful algorithm development is the availability of suitable
benchmark datasets with ground truth data in order to compare results and initiate
competition. Light field datasets and, in particular, the type of light fields used in this
work – namely densely sampled 4D Lumigraphs (see section 2.2) – are not yet widely deployed. There are a few, but none of the existing ones fulfills all of our needs, and thus we
decided to establish a new benchmark database.
The current public light field databases we are aware of are the following.
• Stanford Light Field Archive
http://lightfield.stanford.edu/lfs.html
The Stanford Archives provide more than 20 light fields sampled using a camera
array [109], a gantry and a light field microscope [60], but none of the datasets
includes ground truth disparities.
• UCSD/MERL Light Field Repository
http://vision.ucsd.edu/datasets/lfarchive/lfs.shtml
This light field repository [47] offers video as well as static light fields, but there
is also no ground truth depth available, and the light fields are sampled in a
one-dimensional domain of view points only.
• Synthetic Light Field Archive
http://web.media.mit.edu/~gordonw/SyntheticLightFields/index.php
The synthetic light field archive [66] provides many interesting artificial light
fields including some nice challenges like transparencies, occlusions and reflections.
Unfortunately, there is also no ground truth depth data available for benchmarking.
• Middlebury Stereo Datasets
http://vision.middlebury.edu/stereo/data/
The Middlebury Stereo Dataset [43, 82, 83, 84] includes a single 4D light field
which provides ground truth data for the center view, as well as some additional 3D
light fields including depth information for two out of seven views. The main issue
with the Middlebury light fields is that they are designed with stereo matching in mind; thus the baselines are quite large, not representative of compact light field cameras, and unsuitable for direct epipolar plane image analysis.
While there is a lot of variety and the data is of high quality, we observe that all of the
available light field databases either lack ground truth disparity information or exhibit
large camera baselines and disparities, which is not representative for compact light
field cameras such as Plenoptic Cameras. Furthermore, we believe that a large part of
what distinguishes light fields from standard multi-view images is the ability to treat
the view point space as a continuous domain. There is also emerging interest in light
field segmentation [31, 50, 92, 106], so it would be highly useful to have ground truth
segmentation data available to compare light field labeling schemes. The above data sets lack this information as well.
To alleviate the above shortcomings, we present a new benchmark database which
consists at the moment of 13 high quality densely sampled light fields. The database
offers seven computer graphics generated data sets providing complete ground truth
disparity for all views. Four of these data sets also come with ground truth segmentation
information and pre-computed local labeling cost functions to compare global light field
labeling schemes. Furthermore, there are six real world data sets captured using a single
Nikon D800 camera mounted on a gantry. Using this device, we sampled objects which
were pre-scanned with a structured light scanner to provide ground truth ranges for the
center view. An interesting special data set contains a transparent surface with ground
truth disparity for both the surface as well as the object behind it - we believe it is the
first real-world data set of this kind with ground truth depth available.
4.1
The Light Field Archive
Our light field archive (www.lightfield-analysis.net) is split into two main categories, Blender and Gantry. The Blender category consists of seven scenes rendered
using the open source software Blender [77] and our own light field plug-in, see figure 18
for an overview of the data sets. The Gantry category provides six real-world light fields
captured with a commercially available standard camera mounted on a gantry device,
see figure 19. More information about all the data sets can be found in the overview in
figure 17.
Each data set is split into different files in the HDF5 format [95]; exactly which of these are present depends on the available information. Common to all data sets is a main file
called lf.h5, which contains the light field itself and the range data. In the following,
we will explain its content as well as that of the different additional files, which can be
specific to the category.
4.1.1
The Main File
The main file lf.h5 for each scene consists of the actual light field image data as well as
the ground truth depth, see figure 17. Each light field is 4D, and sampled on a regular
grid. All images have the same size, and views are spaced equidistantly in horizontal
and vertical directions, respectively. The general properties of the light field can be
accessed in the following attributes:
dataset name    category   resolution    GTD    GTL
buddha          Blender    768x768x3     full   yes
horses          Blender    576x1024x3    full   yes
papillon        Blender    768x768x3     full   yes
stillLife       Blender    768x768x3     full   yes
buddha2         Blender    768x768x3     full   no
medieval        Blender    720x1024x3    full   no
monasRoom       Blender    768x768x3     full   no
couple          Gantry     898x898x3     cv     no
cube            Gantry     898x898x3     cv     no
maria           Gantry     926x926x3     cv     no
pyramide        Gantry     898x898x3     cv     no
statue          Gantry     898x898x3     cv     no
transparency    Gantry     926x926x3     2xcv   no
Figure 17: Overview of the datasets in the benchmark. dataset name: The name of the dataset.
category: Blender (rendered synthetic dataset) or Gantry (real-world dataset sampled using a
single moving camera). resolution: spatial resolution of the views, all light fields consist of 9x9
views. GTD: indicates completeness of ground truth depth data, either cv (only center view) or full
(all views). A special case is the transparency dataset, which contains ground truth depth for both
background and transparent surface. GTL: indicates if object segmentation data is available.
HDF5 attribute   description
yRes             height of the images in pixels
xRes             width of the images in pixels
vRes             number of images in vertical direction
hRes             number of images in horizontal direction
channels         light field is RGB (3) or grayscale (1)
vSampling        relative camera position grid, vertical
hSampling        relative camera position grid, horizontal
The actual data is contained in two HDF5 data sets:
HDF5 dataset   size
LF             vRes x hRes x xRes x yRes x channels
GT DEPTH       vRes x hRes x xRes x yRes
These store the separate images in RGB or gray-scale (range 0-255), as well as the
associated depth maps, respectively.
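Access to the main file is straightforward with standard HDF5 tools, for instance h5py. The sketch below assumes the dataset and attribute names listed above (in particular that the depth dataset is stored as GT_DEPTH with an underscore):

import h5py
import numpy as np

with h5py.File("lf.h5", "r") as f:                # main file of a data set
    LF = np.asarray(f["LF"])                      # vRes x hRes x xRes x yRes x channels
    gt_depth = np.asarray(f["GT_DEPTH"])          # vRes x hRes x xRes x yRes
    v_res, h_res = int(f.attrs["vRes"]), int(f.attrs["hRes"])

center_view = LF[v_res // 2, h_res // 2]          # center view of the 9 x 9 grid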
Conversion between depth and disparity. To compare disparity results to the
ground truth depth, the latter has to first be converted to disparity. Given a depth Z,
the disparity or slope of the epipolar lines d in pixels per grid unit is
d = \frac{B \cdot f}{Z} - \Delta x, \qquad (19)
where B is the baseline or distance between two cameras, f the focal length in pixel and
∆x the shift between two neighboring images relative to an arbitrary rectification plane
(in case of light fields generated with Blender, this is the scene origin). The parameters
in equation 19 are given by the following attributes in the main HDF file:
symbol   attribute     description
B        dH            distance between two cameras
f        focalLength   focal length
∆x       shift         shift between neighboring images
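A short helper for equation 19, using the attribute names from the table above (file and dataset names are hypothetical, see the remark on GT_DEPTH earlier):

import h5py
import numpy as np

def depth_to_disparity(Z, dH, focal_length, shift):
    # eq. (19): d = B*f/Z - delta_x, with B = dH, f = focalLength, delta_x = shift
    return dH * focal_length / Z - shift

with h5py.File("lf.h5", "r") as f:
    gt_disparity = depth_to_disparity(np.asarray(f["GT_DEPTH"]),
                                      f.attrs["dH"],
                                      f.attrs["focalLength"],
                                      f.attrs["shift"])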
The following sections describe differences and conventions about the depth scale for
the two current categories.
4.1.2
Blender Category
The computer graphics generated scenes all provide ground truth depth over the entire light field. This information is given as the orthogonal distance of
the 3D point to the image plane of the camera, measured in Blender units [BE]. The
Blender main files have an additional attribute camDistance which is the base distance
of the camera to the origin of the 3D scene, and used for the conversion to disparity values.
Conversion between Blender depth units and disparity. The above HDF5
camera attributes in the main file for conversion from Blender depth units to disparity
are calculated from Blender parameters via
dH = b \cdot xRes, \qquad focalLength = \frac{1}{2 \tan(fov/2)}, \qquad shift = \frac{1}{2\, Z_0 \tan(fov/2)\, b}, \qquad (20)
where Z0 is the distance between the Blender camera and the scene origin in [BE],
fov is the field of view in radians and b the distance between two cameras in [BE].
Since all light fields are rendered or captured on a regular equidistant grid, it is sufficient
to use only the horizontal distance between two cameras to define the baseline.
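The attribute computation of equation 20 translates directly into a small helper (a sketch; the function and parameter names are ours):

import math

def blender_conversion_attributes(b, x_res, fov, z0):
    """Eq. (20): b = camera spacing in [BE], x_res = image width in pixels,
    fov = field of view in radians, z0 = camera-to-origin distance in [BE]."""
    dH = b * x_res
    focal_length = 1.0 / (2.0 * math.tan(fov / 2.0))
    shift = 1.0 / (2.0 * z0 * math.tan(fov / 2.0) * b)
    return dH, focal_length, shift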
4.1.3
Segmentation Ground Truth
Some light fields have segmentation ground truth data available, see figure 17, and offer
five additional HDF5 files:
• labels.h5:
This file contains the HDF5 dataset GT LABELS which is the segmentation
ground truth for all views of the light field and the HDF5 dataset SCRIBBLES
which are user scribbles on a single view.
• edge weights.h5:
Contains an HDF5 data set called EDGE WEIGHTS which holds edge probabilities [106] for all views. These are not only useful for segmentation, but for any algorithm which might require edge information, and they can help with comparability since all methods can use the same reference edge weights.
• feature single view probabilities.h5:
The HDF5 data set Probabilities contains the prediction of a random forest classifier
trained on a single view of the light field without using any feature requiring light
field information [106].
• feature depth probabilities.h5:
The HDF5 data set Probabilities contains the prediction of a random forest classifier
trained on a single view of the light field using estimated disparity [100] as an
additional feature [106].
• feature gt depth probabilities.h5:
The HDF5 data set Probabilities contains the prediction of a random forest
classifier trained on a single view of the light field using ground truth disparity as
an additional feature [106].
4.1.4
Gantry category
In the Gantry category, each scene always provides a single main lf.h5 file, which
contains an additional HDF5 data set GT DEPTH MASK. This is a binary mask indicating valid regions in the ground truth GT DEPTH. Invalid regions in the ground
truth disparity have mainly two causes. First, there might be objects in the scene for
which no 3D data is available, and second, there are parts of the mesh not covered by
the structured light scan and thus having unknown geometry. See section 4.2.2 for details.
A special case is the light field transparency, which has two depth channels for a transparent surface and an object behind it, respectively. Therefore, there also exist two
mask HDF5 data sets, see figure 20. We believe this is the first benchmark light field
for multi-channel disparity estimation.
Here, the HDF5 data sets are named:
• GT DEPTH FOREGROUND,
• GT DEPTH BACKGROUND,
• GT DEPTH FOREGROUND MASK,
• GT DEPTH BACKGROUND MASK.
4.2
Generation of the light fields
The process of light field sampling is very similar for both synthetic as well as real world
scenes. The camera is moved on an equidistant grid parallel to its own sensor plane and
an image is taken at each grid position. Although not strictly necessary, an odd number
of grid positions is used for each movement direction as there then exists a well-defined
center view which makes the processing simpler. An epipolar rectification on all images
is performed to align individual views to the center one. The source for the internal and
external camera matrices needed for this rectification depends on the capturing system
used.
4.2.1
Blender category
For the synthetic scenes, the camera can be moved using a script for the Blender engine.
As camera parameters can be set arbitrarily and the sensor and movement plane coincide
perfectly, no explicit camera calibration is necessary. Instead, the values required for
rectification can be derived directly from the internal Blender settings.
4.2.2
Gantry category
For real-world light fields, a Nikon D800 digital camera is mounted on a stepper-motor
driven gantry manufactured by Physical Instruments. A picture of the setup can be seen
in figure 13. Accuracy and repositioning error of the gantry are well in the micrometer range. The capturing time for a complete light field depends on the number of images; about 15 seconds are required per image. As a consequence, this acquisition method is
limited to static scenes. The internal camera matrix must be estimated beforehand by
capturing images of a calibration pattern and invoking the camera calibration algorithms
of the OpenCV library [17], (see next section for details). Experiments have shown that
the positioning accuracy of the gantry actually surpasses the pattern based external
calibration as long as the differences between the sensor and movement planes are kept
minimal.
Ground Truth for the Gantry Light Fields. This section is the work of Stephan
Meister [68]. Ground truth for the real world scenes was generated using standard
pose estimation techniques. First, we acquired 3D polygon meshes for an object in
the scene using a Breuckmann SmartscanHE structured light scanner. The meshes
contain between 2.5 and 8 million faces with a stated accuracy of down to 50 microns.
The object-to-camera pose was estimated by hand-picking 2D-to-3D feature points
from the light field center view and the 3D mesh, and then calculating the external
camera matrix using an iterative Levenberg-Marquardt approach from the OpenCV
library [17]. This method is used for both the internal and external calibration. An
example set of correspondence points for the scene pyramide can be observed in figure 21.
The re-projection error for all scenes was typically 0.5 ± 0.1 pixels. The depth is then
defined as the distance between the sensor plane and the mesh surface visible in each
pixel. The depth projections are computed by importing the mesh and measured camera
parameters into Blender and performing a depth rendering pass. At depth discontinuities (edges) or due to the fact that the meshes’ point density is higher than the lateral
resolution of the camera, one pixel can contain multiple depth cues. In the former case,
the pixel was masked out as an invalid edge pixel and, in the latter case, the depth of
the polygon with the biggest area inside the pixel was selected. The error is generally
negligible as the geometry of the objects is sufficiently smooth at these scales. Smaller
regions where the mesh contained holes were also masked out and not considered for
the final evaluations.
For an accuracy estimation of the acquired ground truth, we perform a simple error
propagation on the projected point coordinates. Given an internal camera matrix C
and an external matrix R, a 3D point P~ = (X, Y, Z, 1) is projected onto the sensor pixel
(u, v) according to

\begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = C\, R \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}.
For simplicity, we assume that the camera and object coordinate systems coincide, save
for an offset tz along the optical axis. Given focal length fx , principal point cx and
re-projection error ∆u, this yields for a pixel on the v = 0 scan-line
t_z = Z - \frac{f_x X}{u - c_x},

resulting in a depth error \Delta t_z of

\Delta t_z = \frac{\partial t_z}{\partial u} \Delta u = \frac{f_x X}{(c_x - u)^2} \Delta u.
Calculations for pixels outside of the center scan-line are performed analogously. The error estimate above depends on the distance of the pixel from the camera's principal point. As the observed objects are rigid, we assume that the distance error ∆tz between camera and object corresponds to the minimum observed ∆tz among the selected 2D-3D correspondences. For all gantry scenes, this value is in the range of 1 mm, so we assume this to be the approximate accuracy of our ground truth.
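The error propagation above can be verified symbolically; a small, purely illustrative check with sympy:

import sympy as sp

Z, fx, X, u, cx, du = sp.symbols("Z f_x X u c_x Delta_u", real=True)
t_z = Z - fx * X / (u - cx)                   # offset along the optical axis
delta_tz = sp.diff(t_z, u) * du               # error propagation with respect to u
print(sp.simplify(delta_tz))                  # f_x*X*Delta_u/(u - c_x)**2, as above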
Figure 18: Data sets in the category Blender. Top: light fields with segmentation information available; from left to right: buddha, papillon, stillLife, horses. The first row shows the center view, the second the depth ground truth and the third the label ground truth. Bottom: light fields without segmentation information; from left to right: buddha2, monasRoom, medieval. The first row shows the center view, the second the depth ground truth.
Figure 19: Data sets in the category Gantry. From left to right: center view, depth channel, and mask indicating regions with valid depth information. The ordering of the data sets is the same as in figure 17.
Figure 20: Data set transparency. Left: center view; middle top: depth of the background; middle bottom: depth of the foreground; right top: background mask for valid depth ground truth pixels; right bottom: foreground mask for valid depth ground truth pixels.
Figure 21: Selected 2D correspondences for pose estimation for the
pyramide dataset. In theory, four points are sufficient to estimate the
six degrees of freedom of an external camera calibration matrix, but
more points increase the accuracy in case of outliers.
5
Orientation Analysis in Light Fields
5.1
Single Orientation Analysis
A main benefit of light fields compared to traditional images or stereo pairs is the
expansion of the disparity space to a continuous space. This becomes apparent when
considering epipolar plane images (section 2.3), which can be viewed as 2D slices of
constant angular and spatial direction through the Lumigraph (section 2.2). Due to a
dense sampling in angular direction, corresponding pixels are projected onto lines in
EPIs, which can be detected more robustly and faster than point correspondences.
EPIs were introduced to the analysis of scene geometry by Bolles et al. [16]. They detect
edges, peaks and troughs with a subsequent line fitting in the EPI to reconstruct 3D
structure. Later, Baker used zero crossings of the Laplacian [6, 7]. Another approach is
presented by Criminisi [27], who use an iterative extraction procedure for collections of
EPI-lines of the same depth, which they call an EPI-tube. Lines belonging to the same
tube are detected via shearing the EPI and analyzing photo-consistency in the vertical
direction. They also propose a procedure to remove specular highlights from already
extracted EPI-tubes.
There are also two less heuristic methods which work in an energy minimization framework. In Matousek et al. [67], a cost function is formulated to minimize a weighted
path length between points in the first and the last row of an EPI, preferring constant
intensity in a small neighborhood of each EPI-line. However, their method only works
in the absence of occlusions.
Berent et al. [11] deal with the simultaneous segmentation of EPI-tubes by a region
competition method using active contours, imposing geometric properties to enforce
correct occlusion ordering.
In contrast to the above works, we propose a local gradient based orientation analysis of
the EPIs and additionally can perform a labeling for all points in the EPI simultaneously
by using a state-of-the-art continuous convex energy minimization framework. We
enforce globally consistent visibility across views by restricting the spatial layout of the
labeled regions.
Compared to methods of Bolles [16] and Criminisi [27] which extract EPI information
sequentially, this is independent of the order of extraction and does not suffer from an
associated propagation of errors. While a simultaneous extraction is also performed by
Berent et al. [11], they perform local minimization only and require good initialization,
as opposed to our convex relaxation approach. Furthermore, they use a level set approach, which makes it expensive and cumbersome to deal with a large number of regions.
In this section we propose a range estimation approach using a 4D light field parametrized
as Lumigraph (see section 2.2 and equation 9). The basic idea is as follows. We first compute local slope estimates on epipolar plane images for the two different slice directions, the (x, s)-slice Σy∗,t∗ and the (y, t)-slice Σx∗,s∗ (section 2.3), using the structure tensor (section 5.1.1).
This gives two local disparity estimates for each pixel in each view. These can be merged
into a single disparity map in different ways: just locally choosing the estimate with
the higher reliability, optionally smoothing the result (which is very fast), or solving
a global optimization problem (which is slow). In the experiments, we will show that,
fortunately, the fast approach leads to estimates which are slightly more accurate. The
content in this section is published in Wanner et al. [100], [103]; the theory as well as the fast GPU implementations [37] of the optimization techniques are the work of Bastian Goldlücke.
5.1.1
The Structure Tensor
A common technique to estimate orientations is the structure tensor introduced by Bigun et al. [12]. The derivations below follow the chapter "The Structure Tensor" in Jähne [45].
If we assume a unit vector n ∈ RD as the preferred local orientation of the gray value
changes of a function g : Ω → R, Ω ⊂ RD , the following must be satisfied:
\left( \nabla g^T n \right)^2 = |\nabla g|^2 \cos^2\!\left( \sphericalangle(\nabla g, n) \right). \qquad (21)

This will become a maximum if ∇g is parallel or anti-parallel to n, and zero if ∇g is orthogonal to n. Thus we need to maximize the following expression in a local environment:

\int w(x - x') \left( \nabla g(x')^T n \right)^2 d^D x', \qquad (22)

where w is a window function determining size and shape of the averaging region around x. Equation (22) can be reformulated to

n^T J\, n \to \text{maximum}, \qquad (23)

J = \int w(x - x') \left( \nabla g(x')\, \nabla g(x')^T \right) d^D x', \qquad (24)

which results in a symmetric D × D tensor

J_{pq}(x) = \int_{-\infty}^{\infty} w(x - x') \frac{\partial g(x')}{\partial x'_p} \frac{\partial g(x')}{\partial x'_q} \, d^D x'. \qquad (25)
An Eigenvalue decomposition of J, in case of D = 2, gives two Eigenvalues λ1 ,λ2 .
Without limiting the generality, we assume that λ1 > λ2 . Due to the orthogonality of
the Eigenvectors v1, v2, it is obvious that v1 is parallel (or anti-parallel) to n and v2 is orthogonal to n.

Table 1: The meaning of different Eigenvalue conditions of the structure tensor for 2D images.

condition           rank   meaning
λ1 = λ2 = 0         0      constant local environment
λ1 > 0, λ2 = 0      1      ideal local orientation
λ1 > 0, λ2 > 0      2      isotropic environment
The relationship between the Eigenvalues gives a quality measure of the local orientation pattern. Jähne [45] defines the coherence c, which varies between zero for isotropic structures and one for ideal orientations:

c = \frac{\lambda_1 - \lambda_2}{\lambda_1 + \lambda_2} = \frac{\sqrt{(J_{11} - J_{22})^2 + 4 J_{12}^2}}{J_{11} + J_{22}}, \qquad c \in [0, 1]. \qquad (26)
Implementation. In general the computation of the structure tensor consists of
four steps. An initial (Gaussian) smoothing to reduce noise and high frequencies, the
gradient computation, the computation of the structure tensor components, and a final
(Gaussian) smoothing of these components. A widely used approach to compute the
gradients is the so-called Sobel-operator [75]:

S_x = \begin{pmatrix} 1 & 0 & -1 \\ 2 & 0 & -2 \\ 1 & 0 & -1 \end{pmatrix}, \qquad S_y = \begin{pmatrix} 1 & 2 & 1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{pmatrix}. \qquad (27)

Another approach to compute the gradients is the Scharr-operator [81]:

S_x = \begin{pmatrix} 3 & 0 & -3 \\ 10 & 0 & -10 \\ 3 & 0 & -3 \end{pmatrix}, \qquad S_y = \begin{pmatrix} 3 & 10 & 3 \\ 0 & 0 & 0 \\ -3 & -10 & -3 \end{pmatrix}. \qquad (28)
Scharr optimized the filter coefficients to guarantee an optimal rotational symmetry,
leading to much better orientation estimations compared to the Sobel-operator. In this
work we use a variant of the structure tensor combining the initial smoothing step
and the gradient computation using a Gaussian derivative filter as implemented in
the VIGRA Computer Vision Library [49]. We discuss the reason for this choice in
section 5.1.2.2 and figure 23.
The Gaussian derivative filter is defined as

G_{\sigma,n}(x) = \frac{\partial^n}{\partial x_1^{n_x} \partial x_2^{n_y}} \frac{1}{2\pi\sigma^2}\, e^{-\frac{x_1^2 + x_2^2}{2\sigma^2}}, \qquad (29)
where n = n_x + n_y is the order of the derivative and σ > 0 the standard deviation of the Gaussian. The kernel radius r_K is computed as

r_K = 3\sigma + \frac{n}{2}, \qquad (30)

rounded up to the next integer. The kernel size \delta_K is then

\delta_K = 2 r_K + 1. \qquad (31)
The gradient of an image I is defined as

\nabla I = (S_{x,\sigma}, S_{y,\sigma}) = \left( G_{\sigma,1}(I)\big|_{n_x=1},\; G_{\sigma,1}(I)\big|_{n_y=1} \right). \qquad (32)
The algorithm to compute the structure tensor J_{\sigma,\rho} on a gray-scale image I is then as follows:
1. compute the gradients S_{x,\rho}, S_{y,\rho}
2. compute the structure tensor components

J_{\rho,\sigma}(I) = \begin{pmatrix} G_{\sigma,0}(S_{x,\rho} S_{x,\rho}) & G_{\sigma,0}(S_{x,\rho} S_{y,\rho}) \\ G_{\sigma,0}(S_{y,\rho} S_{x,\rho}) & G_{\sigma,0}(S_{y,\rho} S_{y,\rho}) \end{pmatrix}. \qquad (33)
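A compact sketch of this algorithm, using Gaussian derivatives from scipy instead of the VIGRA library employed in this work (function and parameter names are ours):

import numpy as np
from scipy.ndimage import gaussian_filter

def structure_tensor(I, inner_scale, outer_scale):
    """Eq. (33): Gaussian-derivative gradients at the inner scale, followed by
    component-wise Gaussian smoothing at the outer scale."""
    I = np.asarray(I, dtype=float)
    Sx = gaussian_filter(I, inner_scale, order=(0, 1))   # derivative along x
    Sy = gaussian_filter(I, inner_scale, order=(1, 0))   # derivative along y
    Jxx = gaussian_filter(Sx * Sx, outer_scale)
    Jxy = gaussian_filter(Sx * Sy, outer_scale)
    Jyy = gaussian_filter(Sy * Sy, outer_scale)
    return Jxx, Jxy, Jyy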
5.1.2
Disparities On Epipolar Plane Images
5.1.2.1 Local Disparity Estimation We first consider how we can estimate the
local direction of a line at a point (x, s) in an epipolar plane image Sy∗ ,t∗ (see section 2.3),
where y ∗ and t∗ are fixed. The case of vertical slices is analogous. The goal of this step
is to compute a local disparity estimate dy∗ ,t∗ (x, s) for each point of the slice domain, as
well as a reliability estimate ry∗ ,t∗ (x, s) ∈ [0, 1] (eq. 38), which is the coherence of the
structure tensor (eq. 26) and gives a measure of how reliable the local disparity estimate
is. Both local estimates will be used in subsequent sections to obtain a consistent
disparity map in a global optimization framework.
In order to obtain the local disparity estimate, we need to estimate the direction of lines
on the slice. This is done using the structure tensor J (see eq. 33) of the epipolar plane
image S = Sy∗ ,t∗ ,
J_{\rho,\sigma}(S) = \begin{pmatrix} J_{xx} & J_{xy} \\ J_{xy} & J_{yy} \end{pmatrix}. \qquad (34)
The direction of the local level lines can then be computed via Bigun et al. [12]:

n_{y^*,t^*} = \begin{pmatrix} \Delta x \\ \Delta s \end{pmatrix} = \begin{pmatrix} \sin\!\left( \tfrac{1}{2} \arctan \tfrac{J_{yy} - J_{xx}}{2 J_{xy}} \right) \\ \cos\!\left( \tfrac{1}{2} \arctan \tfrac{J_{yy} - J_{xx}}{2 J_{xy}} \right) \end{pmatrix}, \qquad (35)
Figure 22: (a) Using grid search, we find the ideal structure tensor parameters over a range of both angular and spatial resolutions. Blue colored data points show the optimal outer scale, red points the optimal inner scale; the thick streaks are added only for visual orientation. (b) An example of a single grid search on the dataset buddha. Colour-coded is the amount of pixels with a relative error to the ground truth of less than 1%, which is the target value to be optimized for in (a).
from which we derive the local depth estimate via

Z = -f \frac{\Delta s}{\Delta x}. \qquad (36)

Frequently, a more convenient unit is the disparity

d_{y^*,t^*} = \frac{f}{Z} = \frac{\Delta x}{\Delta s} = \tan\!\left( \frac{1}{2} \arctan \frac{J_{yy} - J_{xx}}{2 J_{xy}} \right), \qquad (37)
which describes the pixel shift of a scene point when moving between the views. We will usually use disparity instead of depth in the remainder of this work. According to equation 26, we use the coherence of the structure tensor as the natural reliability measure:

r_{y^*,t^*} := \frac{\sqrt{(J_{yy} - J_{xx})^2 + 4 J_{xy}^2}}{J_{xx} + J_{yy}}. \qquad (38)
Using the local disparity estimates dy∗ ,t∗ , dx∗ ,s∗ and reliability estimates ry∗ ,t∗ , rx∗ ,s∗ for
all the EPIs in horizontal and vertical directions, respectively, one can now proceed to
directly compute disparity maps in a global optimization framework, which is explained
in section 5.1.3.2. However, it is possible to first enforce global visibility constraints
separately on each of the EPIs, which we explain in section 5.1.2.3.
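Equations 37 and 38 turn the structure tensor components directly into local disparity and reliability estimates per EPI pixel; a minimal sketch (the small constant guarding against division by zero is our addition):

import numpy as np

def epi_disparity(Jxx, Jxy, Jyy, eps=1e-12):
    """Local disparity (eq. 37) and coherence/reliability (eq. 38) on an EPI."""
    disparity = np.tan(0.5 * np.arctan((Jyy - Jxx) / (2.0 * Jxy + eps)))
    coherence = np.sqrt((Jyy - Jxx) ** 2 + 4.0 * Jxy ** 2) / (Jxx + Jyy + eps)
    return disparity, coherence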
5.1.2.2 Limits of the Local Orientation Estimation In the following, we will
perform a detailed evaluation of the orientation analysis by applying the structure tensor
on synthetically generated epipolar planes. These EPIs are initialized with random
stripe patterns of parallel lines. Random in this context means that we vary the stripe
thickness and the assigned gray-scale to simulate the structure of real epipolar plane
images. Below we list the parameters to control the generated EPI appearance in our
experiments.
• h: height or number of pixels in y direction representing the number of cameras.
• d: the pixel shift or slope of the epipolar lines, equivalent to the disparity in real
light fields. The EPIs are initialized with a disparity of zero, which can be changed
by applying affine transformations with sub-pixel accuracy simulating a refocusing
or change in depth.
• σn : the noise level. We add random Gaussian noise with a standard deviation of
σn [px] to the images.
• wmax : maximum width of the epipolar lines, whereby width means the number of
pixels of a line in the x-direction having assigned the same intensity value. This
simulates low- or non-textured regions in the image domain.
• δc : the color variance. Epipolar lines have a random intensity value of 128 ± δc
simulating low contrasts.
The general procedure of the experiments is as follows (a minimal generator sketch is given after the list):
1. generate an EPI with a random stripe pattern with respect to the parameters
described above.
2. evaluate orientation on the EPI generated in step 1.
3. extract the estimated disparity from the center row h/2 (neglecting 10% of the
pixels at the left and right borders to avoid border artifacts) and calculate the
mean over the remaining pixels.
4. change the disparity d of the EPI by applying sub-pixel shifts on each row of the
epipolar plane image
5. compute the disparity deviation δd between the evaluated dm and the ground
truth disparity d: δd = d − dm .
6. repeat steps 2 to 4 over the desired parameter range evaluated in the experiment.
7. repeat steps 1 to 5 a number of N times and return the mean disparity deviation
to achieve statistically reliable results over N randomly generated orientation
patterns. A value of N = 200 proved to be suitable to stabilize the results. This
step is included in all of our experiments without explicitly being mentioned every
time.
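A minimal generator along the lines of steps 1 and 4 could look as follows (a sketch under assumed conventions; stripe widths and intensities are drawn uniformly, and the sub-pixel shear is applied with bilinear interpolation):

import numpy as np
from scipy.ndimage import shift as nd_shift

def random_epi(h=15, width=128, w_max=4, delta_c=128, d=0.0, sigma_n=0.0, rng=None):
    """Synthetic EPI: vertical stripes of random width and intensity 128 +- delta_c,
    sheared row-wise by the disparity d, with optional additive Gaussian noise."""
    rng = np.random.default_rng() if rng is None else rng
    row = np.empty(width)
    x = 0
    while x < width:
        w = int(rng.integers(1, w_max + 1))
        row[x:x + w] = 128 + rng.integers(-delta_c, delta_c + 1)
        x += w
    epi = np.tile(row, (h, 1))
    for s in range(h):                           # sub-pixel shift per row (step 4)
        epi[s] = nd_shift(epi[s], (s - h // 2) * d, order=1, mode="nearest")
    return epi + rng.normal(0.0, sigma_n, epi.shape)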
Our goal with these experiments is to show the ideal behavior and theoretical abilities
of the orientation estimation on synthetic epipolar plane images. The results do not
allow conclusions to be drawn about the behavior on real epipolar planes, since they do not cover effects like continuously changing disparity related to non-planar objects, occlusions or non-Lambertian effects. However, they give an insight into the raw orientation estimation
ability of the structure tensor and its behavior under certain conditions.
Comparison of gradient filters. The first experiment motivates our choice to
compute the gradients of the structure tensor using the Gaussian derivatives filter in
equation 29. We compute the structure tensor in three different variants, using the Sobel operator (eq. 27), the Scharr -operator (eq. 28) and the Gaussian derivative operator
(eq. 29) and compare their behavior on synthetic EPIs. We use epipolar planes with
h = 15px, σn = 0 and δc = 128. On the one hand, we are interested in the orientation
estimation accuracy of all variants, but also in the robustness against untextured regions
in the EPIs. As a reminder, we obtain an EPI by fixing a row/column index in the image domain and stacking these rows/columns over a collection of images of different viewpoints. This means the texture of the objects along these rows/columns is mapped into the
epipolar space as lines whose slope corresponds to the distance of the object to the
camera (compare section 2.3). As a result, the thickness of an epipolar line depends
on the intensity variance of texture mapped onto the epipolar space. We simulate this
in our experiment by generating random EPIs with epipolar lines of random widths
with a maximum width of wmax . To evaluate the accuracy in orientation estimation we
compute the structure tensor on an EPI with a fixed wmax over an orientation range
from d = [−1, 1] by applying affine transformations to the EPI to create the different
slopes with sub-pixel accuracy. The accuracy is then computed as mentioned in step 3
of the general procedure above as the mean over all orientations. We evaluate this mean
error over the whole orientation range for values of wmax = [2, 19] and plot the result in
figure 23 (a) with the mean orientation estimation error over the maximum epipolar
line width. The result is that the Scharr-operator leads to more accurate orientation
estimations but due to the extensible kernel size, the overall performance of the Gaussian
derivative filter is more robust against increasing epipolar line widths or, in other words,
against the presence of decreasing frequencies in the EPIs.
Optimal structure tensor scales. In the next experiment we use epipolar planes
with h = 15px, wmax = 4, σn = 0, d = [−1, 1] and δc = 128. We vary the disparity
d from −1px to 1px in 0.1px steps and compute for each disparity the orientation on
a parameter grid of the inner scale ρ and outer scale σ of the structure tensor. We
varied ρ from 0.4 to 0.9 in steps of 0.01 and σ from 0.6 to 2.5 in 0.01 steps as well.
The result is depicted in figure 23. Outer and inner scale behave nearly constantly. The stronger variations in the outer scale signal are primarily due to a lower sensitivity of the outer scale over a wider range – thus some randomness in the exact position of the absolute minimum occurs (compare also figure 25). The slight indentation at disparity zero for both signals can be explained by the fact that orientation estimation works perfectly for vertical lines. Based on this experiment, we use an inner scale ρ = 0.75 and an outer scale σ = 1.0 as optimal scales for this epipolar plane configuration in later experiments.
Figure 23: a) Comparison of gradient filters. On synthetic epipolar plane images we evaluate the structure tensor using three different approaches to compute the gradients (Sobel, Scharr, Gaussian derivative). We create synthetic EPIs with random epipolar lines of maximum widths wmax, where width means the number of pixels of a line in x direction having assigned the same intensity value. We vary wmax, drawn on the x-axis, and compute for each wmax the orientations over a slope range d = [−1, 1]. The average of the estimation errors for each gradient filter is drawn on the y-axis. We see that the Scharr-operator has the best orientation estimation abilities, but with increasing untextured regions the Gaussian derivative filter shows a better overall performance due to the fact that its kernel size is not restricted to 3 × 3. b) Here, we evaluate the optimal scale parameters of the structure tensor. In this experiment we generate synthetic EPIs and compute for each slope in the range of d = [−1, 1] the inner and outer scale using a grid search, and plot the resulting scale parameters with the lowest estimation error for each slope. They behave more or less constantly over the range of slopes; the slight indentation at disparities 0 and ±1 can be explained by the fact that orientation estimation works ideally for vertical and horizontal lines even with small kernel sizes.
Inner and outer scale limits. Using the optimal scale parameters of the structure tensor from our second experiment, we now examine what happens when we fix one scale and vary the other, to see the operative range of the corresponding second parameter. Again we use epipolar planes with h = 15px, wmax = 4, σn = 0, d = [−1, 1] and δc = 128. Results are depicted in figures 24 and 25. The first shows results for a constant σ and varying ρ, the second the opposite. We observe that the inner scale ρ has much narrower tolerances than σ.
Figure 24: Inner Gaussian kernel variation when fixing the outer scale σ to 1.0. The left side shows a 3D view of the disparity deviation d − dm on the z-axis, computed over the disparity d and the inner scale ρ; the right side shows the same plot viewed from above. It is obvious that the region of minimal disparity deviation ρopt ≈ [0.7, 0.8] is quite narrow.

Minimal number of cameras. In the next experiment we change the parameter h of an epipolar plane image, which is equivalent to the number of cameras or sampling steps used to acquire the light field. The other parameters are wmax = 4, σn = 0, d = [−1.5, 1.5]
and δc = 128. Results are depicted in figure 26 and show that a number of 7 cameras seems optimal for the method. This can be explained with error diffusion caused by border effects. To calculate the structure tensor we need to apply three convolutions. If we use the minimal kernel size of 3 × 3, each convolution diffuses an error – caused by the image borders – one pixel towards the center row. This adds up to 2 · 3 + 1 = 7 pixels if we want a center row pixel which is unaffected by border errors. In this experiment we adapted the kernel sizes for the structure tensor evaluation to make optimal use of the EPI height h, in order to see whether bigger kernel sizes lead to further increasing estimation accuracy. But we see in figure 26 that the estimation accuracy does not increase anymore above h = 11px. Therefore, we propose that at least 7 cameras are necessary and more than 11 are superfluous. This statement is only fully valid if one is only interested in an optimal estimation for the center view of a light field, which is equivalent in these experiments to only taking the center row into account. If an optimal depth estimation for more than the center view is desired, an increasing number of cameras can be useful.
Influence of noise on accuracy and coherence. A 1D plot for h = 7px and inner and outer scales of ρ = 0.75 and σ = 1.0 is depicted in figure 27. The left side shows the evaluation on a noise-free EPI and the right side an EPI with a noise level σn = 11. The plots show the coherence and the disparity deviation with its standard deviation. We observe that the orientation analysis seems to work perfectly for disparities ±1 and 0 and is worse for disparities ±0.5. However, the overall accuracy is in the range of 0.01 px. The error increases quickly if the slope of the EPI lines goes beyond 45°. This is quite clear when realizing that the incline of a line on epipolar plane images is caused by horizontal shifts of the image rows instead of a rotation, which of course leads to a disruption of the line above 45°. Adding noise (figure 27, right) leads to an increasing uncertainty but does not affect the overall accuracy that much.

Figure 25: Outer Gaussian kernel variation when fixing the inner scale ρ to 0.75. The left side shows a 3D view of the disparity deviation d − dm on the z-axis, computed over the disparity d and the outer scale σ; the right side shows the same plot viewed from above. It is obvious that the region corresponding to a minimal disparity deviation σopt ≈ [0.5, 1.3] is much wider than for the inner scale.
Noise and contrast variation. In two more experiments, depicted in figure 28, we
further check the sensitivity to noise and to contrast changes. The results are that the
structure tensor is quite robust against decreasing contrast. Also, noise up to a certain amount does not affect the overall accuracy that much, but it increases the uncertainty of the estimation, leading to noisy results.
Figure 26: Variation of the EPI height, i.e. the number of cameras, shown as a 3D view (left) and a top view (right). We varied the disparity d of an epipolar plane image from −1.5 to 1.5 as well as h from 4px up to 14px. The z-axis is the deviation of the measured disparity from the ground truth disparity d − dm. The measured disparity dm is calculated using the method described in section 5.1.2.1, whereby the kernel size and standard deviation of the outer Gaussian kernel were adapted to the actual height h of each EPI to make use of the increasing EPI line length. As a result we see that 7px seems to be the first EPI height covering the entire range of ±1px with acceptable accuracy. We also see that the accuracy does not increase further above h = 11px.
Figure 27: Comparison of orientation analysis on noise-free and noisy epipolar plane images (example EPIs shown for d = 1, together with plots of the disparity deviation and coherence over d). The left side shows an EPI of height h = 9px; the smoothing parameters are an inner scale of ρ = 0.75 and an outer scale of σ = 1.0. The same parameters are used on the right side, but with additive Gaussian noise of σn = 31. Plotted are the disparity deviation d − dm with its standard deviation and the coherence or reliability r. It is obvious that noise does not affect the mean accuracy that much, but the certainty is affected through a much lower coherence.
Figure 28: Top (contrast): variation of the image contrast. We generate random EPI lines with random integer intensities (128 − δc, 128 + δc) where δc ∈ [1, 128], and vary the color contrast δc and the disparity to compute the disparity deviation d − dm. The results of the orientation estimation are contrast independent down to very low contrasts; only at the lowest contrast of ±2 do we see significant outliers. Bottom (noise): noise variation. We see an evaluation of the disparity deviation under increasing additive Gaussian noise σn ∈ [0, 31].
5.1.2.3 Consistent Disparity Labeling The computation of the local disparity
estimates using the structure tensor only takes into account the immediate local structure
of the light field. In truth, the disparity values within a slice need to satisfy global
visibility constraints across all cameras for the labeling to be consistent. In particular, a
line which is labeled with a certain depth cannot be interrupted by a transition to a label
corresponding to a greater depth, since this would violate occlusion ordering, figure 29.
Figure 29: (a) Global labeling constraints on an EPI: if depth λi is less than λj and corresponds to direction ni, then the transition from λi to λj is only allowed in a direction orthogonal to ni in order not to violate occlusion ordering. (b) With the consistent labeling scheme one can enforce global visibility constraints in order to improve the depth estimates for each epipolar plane image.

In the conference paper [100], a joint work of Dr. Goldlücke and the author, we have shown that by using a variational labeling framework based on ordering constraints [93],
one can obtain globally consistent estimates for each slice which take into account all
views simultaneously. While this is a computationally very expensive procedure, it yields
convincing results, see figure 29. In particular, consistent labeling greatly improves
robustness to non-Lambertian surfaces, since they typically lead only to a small subset
of outliers along an EPI-line. However, at the moment this is only a proof of concept,
since it is far too slow to be usable in any practical applications. For this reason, we do
not pursue this method further in this work, and instead evaluate only the interactive
technique, using results from the local structure tensor computation directly.
5.1.3
Disparities On Individual Views
After obtaining EPI disparity estimates dy∗ ,t∗ and dx∗ ,s∗ from the horizontal and vertical slices, respectively, we integrate those estimates into a consistent single disparity
map u : Ω → R for each view (s∗ , t∗ ). This is the objective of the following section.
5.1.3.1 Fast Denoising Scheme Obviously, the fastest way to obtain a sensible
disparity map for the view is to just point-wise choose the disparity estimate with the
higher reliability rx∗,s∗ or ry∗,t∗, respectively. However, the result is still quite noisy; furthermore, edges are not yet localized very well, since computing the structure tensor
entails an initial smoothing of the input data.

Figure 30: Analysis of the error behaviour from two different points of view. In (a), we plot the percentage of pixels which deviate from the ground truth (gt) by less than a given threshold over the angular resolution. Very high accuracy (i.e. more than 50% of pixels deviate by less than 0.1%) requires an angular resolution of the light field of at least 9 × 9 views. In (b), we show the relative deviation from ground truth over the disparity value in pixels per angular step for the dataset buddha. Results are plotted for local depth estimations calculated from the original (clean) light field, local depth estimated from the same light field with additional Poisson noise (noisy), as well as the same result after TV-L1 denoising, respectively. While the ideal operational range of the algorithm is disparities within ±1 pixel per angular step, denoising significantly increases overall accuracy outside of this range.

For this reason, a fast method to obtain
quality disparity maps is to employ a TV-L1 smoothing scheme, where we encourage
discontinuities of u to lie on edges of the original input image by weighting the local
smoothness with a measure of the edge strength. We use
g(x, y) = 1 − rs∗ ,t∗ (x, y),
(39)
where rs∗ ,t∗ is the coherence measure for the structure tensor of the view image, defined
similarly as in (38). Higher coherence means a stronger image edge, which thus increases
the probability of a depth discontinuity.
We then minimize the weighted TV-L1 smoothing energy
E(u) = \int_\Omega g\, |Du| + \frac{1}{2\lambda} |u - f| \; d(x, y), \qquad (40)
where f is the noisy disparity estimate and λ > 0 a suitable smoothing parameter. The
minimization is implemented in the open-source library cocolib [37] by Dr. Goldlücke
and performs in real-time.
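A rough sketch of the fast scheme is given below: the point-wise selection by reliability is exactly as described, while the denoising step uses an unweighted TV filter from scikit-image only as a crude stand-in for the weighted TV-L1 minimization, which in this work is solved with cocolib [37].

import numpy as np
from skimage.restoration import denoise_tv_chambolle

def fast_disparity_map(d_h, r_h, d_v, r_v, smoothing=0.1):
    """Point-wise choice of the more reliable horizontal/vertical EPI estimate,
    followed by a simple (unweighted, TV-L2) denoising stand-in for eq. (40)."""
    d = np.where(r_h >= r_v, d_h, d_v)
    return denoise_tv_chambolle(d, weight=smoothing)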
5.1.3.2 Global Optimization Scheme From a modeling perspective, a more sophisticated way to integrate the vertical and horizontal slice estimates is to employ a
globally optimal labeling scheme in the domain Ω, where we minimize a functional of
the form
E(u) = \int_\Omega g\, |Du| + \rho(u, x, y) \; d(x, y). \qquad (41)
In the data term, we want to encourage the solution to be close to either dx∗ ,s∗ or dy∗ ,t∗ ,
while suppressing impulse noise. Also, the two estimates dx∗ ,s∗ and dy∗ ,t∗ shall be
weighted according to their reliability rx∗ ,s∗ and ry∗ ,t∗ . We achieve this by setting
ρ(u, x, y) := min(ry∗ ,t∗ (x, s∗ ) |u − dy∗ ,t∗ (x, s∗ )| ,
rx∗ ,s∗ (y, t∗ ) |u − dx∗ ,s∗ (y, t∗ )|).
(42)
We compute globally optimal solutions to the functional (41) using the technique of
functional lifting described in [74], which is also implemented in cocolib [37]. While
being more sophisticated modeling-wise, the global approach requires minutes per view
instead of being real-time, and a discretization of the disparity range into labels, which
might even lead to a loss instead of a gain in accuracy.
5.1.4
Performance Analysis for Interactive Labeling
In this section, we perform detailed experiments with the local disparity estimation
algorithm to analyze both quality and speed of this method. The aim is to investigate
how well our disparity estimation paradigm performs when the focus lies on interactive
applications, as well as to find out more about the requirements regarding light field
sampling and the necessary parameters.
Optimal Parameter Selection. In a first experiment, we establish guidelines to
select optimal inner and outer scale parameters of the structure tensor. As a quality
measurement, we use the percentage of depth values below a relative error
ǫ = |u(x, y) − r(x, y)|/r(x, y)
(43)
where u is the depth map for the view and r the corresponding ground truth. Optimal
parameters are then found with a simple grid search strategy, where we test a number of
different parameter combinations. Results are depicted in figure 22 and determine the
optimal parameters for each light field resolution and data set. All following evaluations
are carried out with these optimal parameters. In general, it can be noted that an inner scale
parameter of 0.08 is always reasonable, while the outer scale should be chosen larger
with larger spatial and angular resolution to increase the overall sampling area. It is
worth noting that applying a median filter to the results allows a smaller outer-scale
parameter than the behaviour in figure 22 suggests, which preserves edges better, but
at the price of a higher computational cost. Here, we ran the experiments without any
post-processing to show the raw capability of the local method.

Figure 31: Results of disparity estimation on the datasets Buddha (top), Mona (center) and Conehead (bottom). (a) shows ground truth data, (b) the local structure tensor disparity estimate described in section 5.1.2.1 and (c) the result after TV-L1 denoising according to section 5.1.3. In (d)
and (e), one can observe the amount and distribution of error, where green labels mark pixels deviating by less than the given threshold from ground truth, red labels pixels which deviate by more.
Most of the larger errors are concentrated around image edges.
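The parameter search itself is straightforward; the sketch below illustrates it in numpy, where estimate_fn is a placeholder for the local structure tensor estimator of section 5.1.2.1 and the error threshold is an illustrative choice.

import itertools
import numpy as np

def relative_error_score(u, gt, threshold=0.01):
    """Fraction of pixels whose relative depth error (eq. 43) is below the threshold."""
    eps = np.abs(u - gt) / np.abs(gt)
    return np.mean(eps < threshold)

def grid_search(light_field, gt, inner_scales, outer_scales, estimate_fn):
    """Exhaustive search over (inner, outer) structure tensor scales.
    estimate_fn(light_field, inner, outer) stands in for the local estimator."""
    best_params, best_score = None, -1.0
    for s_in, s_out in itertools.product(inner_scales, outer_scales):
        u = estimate_fn(light_field, s_in, s_out)
        score = relative_error_score(u, gt)
        if score > best_score:
            best_params, best_score = (s_in, s_out), score
    return best_params, best_score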
Minimum Sampling Density. In a second step, we investigate what sampling density we need for an optimal performance of the algorithm on light fields instead of
synthetically created EPIs (compare section 5.1.2.2 and figure 26). To achieve this, we
evaluated three simulated light fields over the full angular resolution range with the
optimal parameter selection found in figure 22. The results are illustrated in figure 30,
and show that for very high accuracy, i.e. less than 0.1% deviation from ground truth,
we require about nine views in each angular direction of the light field.
Moreover, the performance degrades drastically when the disparities become larger than
around ±1 pixels, which makes sense from a sampling perspective since the derivatives
in the structure tensor are computed on a 3 × 3 stencil. Together with the characteristics
of the camera system used (baseline, focal length, resolution), this places constraints
on the depth range where we can obtain estimates with our method. For the Raytrix
Plenoptic Camera we use in the later experiments, for example, it turns out that we can
reconstruct scenes which are roughly contained within a cube-shaped volume, whose
size and distance are determined by the main lens we choose.
Noisy Input. Another interesting fact is observable on the right hand side of figure 30,
where we test the robustness against noise (compare also figure 27). Within a disparity
range of ±1, the algorithm is very robust, while the results quickly degrade for larger
disparity values when impulse noise is added to the input images. However, when we
apply TV-L1 denoising, which requires insignificant extra computational cost, we can
see that the deviation from ground truth is on average reduced below the error resulting
from a noise-free input. Unfortunately, denoising always comes at a price: since it
naturally incurs some averaging, some sub-pixel details can be lost even though accuracy
is globally increased. In figure 31 we observe the distribution of the errors, and can see
that almost all large-scale error is concentrated around depth discontinuities.
Disparity Range Limitation. The histogram plot on the right in figure 30 depicts
the effect of rapidly increasing orientation estimation errors if the disparities exceed
a range of ±1. The reason is that the slope of an epipolar line depends on shifts in
the image domain relative to the center view (see figure 32). The pixels of an epipolar
line with an absolute slope greater than 1 are torn apart and thus can no longer be matched as a line
using convolution operations. Reconstructing scenes with disparity ranges above 2 pixels
makes iterative processing necessary.

(Panels: initial EPI with epipolar line slopes mr = 3 and mb = 1; after a 2 px shift, mr = 1 and mb = −1; after a 3.5 px shift, mr = −0.5 and mb = −2.5.)
Figure 32: Visualization of a refocusing operation on the EPI domain. The left image sketches an
EPI with a red and a blue epipolar line initially having slopes of mr = 3 and mb = 1, respectively.
By shifting the rows of the EPI in opposite directions relative to the center view, the slope of each line
changes as depicted in the middle and right images.

One simply repeats the following steps over the
entire disparity range:
• refocus the light field.
• compute orientations.
• store valid disparities between ±1.
• add total pixel shift to the disparities.
Merging the corresponding results from each iteration step can be done for example by
choosing the disparity with the highest coherence (see eq. 38).
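A minimal numpy sketch of this sweep over pre-shifts, assuming a grayscale EPI and a placeholder estimate_fn that returns the per-pixel disparity and coherence for the center row, could look as follows.

import numpy as np

def sheared_epi(epi, shift_per_view, center_row):
    """Refocus an EPI by shifting each row proportionally to its distance from
    the center view (linear interpolation of the shifted row)."""
    out = np.empty_like(epi)
    cols = np.arange(epi.shape[1], dtype=float)
    for s in range(epi.shape[0]):
        delta = (s - center_row) * shift_per_view
        out[s] = np.interp(cols + delta, cols, epi[s])
    return out

def iterative_disparity(epi, shifts, estimate_fn, center_row):
    """Sweep a set of pre-shifts over the disparity range and keep, per pixel,
    the estimate with the highest coherence (eq. 38).  estimate_fn(epi) is a
    stand-in for the structure tensor orientation estimator and must return
    (disparity, coherence) arrays for the center row."""
    best_d = np.zeros(epi.shape[1])
    best_r = np.zeros(epi.shape[1])
    for d0 in shifts:
        d, r = estimate_fn(sheared_epi(epi, d0, center_row))
        valid = (np.abs(d) <= 1.0) & (r > best_r)
        best_d[valid] = d[valid] + d0      # add back the total pixel shift
        best_r[valid] = r[valid]
    return best_d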
5.1.5
Comparison to Multi-View Stereo
We compute a simple local stereo matching cost for a single view as follows. Let
V = {(s1 , t1 ), ..., (sN , tN )} be the set of N view points with corresponding images
I1 , ..., IN , with (sc , tc ) being the location of the current view Ic for which the cost
function is being computed. We then choose a set Λ of 64 disparity labels within an
appropriate range. For our test we choose equidistant labels within the ground truth
range for optimal results. The local cost ρAV (x, l) for label l ∈ Λ at location x ∈ Ic
computed on all neighboring views is then given by
ρAV(x, l) := Σ_{(sn,tn) ∈ V} min(ǫ, ‖In(x + l vn) − Ic(x)‖),
(44)
where vn := (sn − sc , tn − tc ) is the view point displacement and ǫ > 0 is a cap on
the error to suppress outliers. To test the influence of the number of views, we also
compute a cost function on a crosshair of view points along the s- and t-axis from the
view (sc , tc ), which is given by
ρCH(x, l) := Σ_{(sn,tn) ∈ V, sn = sc or tn = tc} ‖In(x + l vn) − Ic(x)‖.
(45)
In effect, this cost function thus uses exactly the same number of views as required
for the local structure tensor of the center view. The results of these two purely local
methods can be found under ST AV L for all views, and ST CH L for all views or
just a crosshair, respectively.
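For reference, a direct and deliberately unoptimized numpy version of the capped matching cost in equation 44 could look as follows; grayscale views and nearest-neighbour sampling are simplifying assumptions. Restricting the inner loop to views with sn = sc or tn = tc yields the crosshair variant of equation 45.

import numpy as np

def multiview_cost(views, positions, center_idx, labels, eps_cap):
    """Capped matching cost of eq. 44 for every disparity label, accumulated
    over all views except the center one (views[i] is a grayscale image,
    positions[i] its (s, t) view coordinate)."""
    Ic = views[center_idx]
    H, W = Ic.shape
    ys, xs = np.mgrid[0:H, 0:W]
    sc, tc = positions[center_idx]
    cost = np.zeros((len(labels), H, W))
    for k, l in enumerate(labels):
        for n, In in enumerate(views):
            if n == center_idx:
                continue
            vs, vt = positions[n][0] - sc, positions[n][1] - tc
            xq = np.clip(np.rint(xs + l * vs).astype(int), 0, W - 1)
            yq = np.clip(np.rint(ys + l * vt).astype(int), 0, H - 1)
            cost[k] += np.minimum(eps_cap, np.abs(In[yq, xq] - Ic))
    return cost

A purely local disparity map would then pick the cheapest label per pixel, e.g. labels[np.argmin(cost, axis=0)].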
Results of both multi-view data terms are denoised with a simple TV-L2 scheme,
algorithms ST AV S and ST CH S. Finally, they were also integrated into a global
energy functional
E(u) = ∫_Ω ρ(x, u(x)) dx + λ ∫_Ω |Du|
(46)
for a labeling function u : Ω → Λ on the image domain Ω, which is solved to global
optimality using the method in [74]. The global optimization results can be found under
algorithms ST AV G and ST CH G. We compare to our approach. First, we start
with the purely local method EPI L, which estimates orientation using the Eigensystem
analysis of the structure tensor discussed in section 5.1.2. The second method, EPI S,
just performs a TV-L2 denoising of this result, while EPI G employs the globally
optimal labeling scheme of section 5.1.3.2. Finally, the method EPI C performs a
constrained denoising on each epipolar plane image, which takes into account occlusion
ordering constraints [39]. All results are depicted in figure 33.
5.1.6
Experiments and Discussion
The table in figure 33 and the figures 34, 35 show detailed visual and quantitative
disparity estimation results on our benchmark datasets. Algorithm parameters for
all methods were tuned for an optimal structural similarity (SSIM) measure. Strong
arguments why this measure should be preferred to the MSE are given in [99], but we
also have computed a variety of other quantities for comparison (however, the detailed
results vary when parameters are optimized for different quantities).
First, one can observe that our local estimate is always more accurate than any of the
multi-view stereo data terms, while using all of the views gives slightly better results
for multi-view than using only the crosshair. Second, our results after applying the
TV-L1 denoising scheme (which takes altogether less than two seconds for all views)
are more accurate than all other results, even those obtained with global optimization
schemes (which take minutes per view). A likely reason why our results do not become
better with global optimization is that the latter requires a quantization into a discrete
set of disparity labels, which of course leads to an accuracy loss. Notably, after either
smoothing or global optimization, both multi-view stereo data terms achieve the same
accuracy, see figure 33: it does not matter that the crosshair data term makes use of
fewer views, likely because information is propagated across the views in the second step.
This also justifies our use of only two epipolar plane images for the local estimate.
Our method is also the fastest, achieving near-interactive performance for computing
disparity maps for all of the views simultaneously. Note that by construction, the
disparity maps for all views are always computed simultaneously. Performance could
further be increased by restricting the computation on each EPI to a small stripe if only
the result of a specific view is required.
Obviously, when analyzing epipolar plane images, our approach does not use the full
4D light field information around a ray to obtain the local estimates; we just work on
two different 2D cuts through this space. The main reason is performance: in order
to achieve close to interactive speeds, which is necessary for most practical
applications, the amount of data used locally must be kept to a minimum.
Moreover, in experiments with a multi-view stereo method, it turns out that using all of
the views for the local estimate, as opposed to only the views in the two epipolar plane
images, does not lead to overall more accurate estimates. While it is true that the local
data term becomes slightly better, the result after optimization is the same. A likely
reason is that the optimization or smoothing step propagates the information across the
views.
Orientation Analysis

light field    EPI L   EPI S   EPI C   EPI G
buddha         0.81    0.57    0.55    0.62
buddha2        1.22    0.87    0.87    0.89
horses         3.60    2.12    2.21    2.67
medieval       1.69    1.15    1.10    1.24
monasRoom      1.15    0.90    0.82    0.93
papillon       3.95    2.26    2.52    2.48
stillLife      3.94    3.06    2.61    3.37
couple         0.40    0.18    0.16    0.19
cube           1.27    0.85    0.82    0.87
maria          0.19    0.10    0.10    0.11
pyramide       0.56    0.38    0.38    0.39
statue         0.88    0.33    0.29    0.35
average        1.64    1.07    1.04    1.18

Multi-View Stereo

light field    ST AV L  ST AV S  ST AV G  ST CH L  ST CH S  ST CH G
buddha         1.20     0.78     0.90     1.01     0.67     0.80
buddha2        2.26     1.05     0.68     3.08     1.31     0.75
horses         5.29     1.85     1.00     6.14     2.12     1.06
medieval       7.22     0.91     0.76     12.14    1.08     0.79
monasRoom      2.25     1.05     0.79     2.28     1.02     0.81
papillon       4.84     2.92     3.65     4.85     2.57     3.10
stillLife      5.08     4.23     4.04     4.48     3.36     3.22
couple         0.60     0.24     0.30     1.10     0.24     0.30
cube           1.28     0.51     0.56     2.25     0.51     0.55
maria          0.34     0.11     0.11     0.51     0.11     0.11
pyramide       0.72     0.42     0.42     1.30     0.43     0.42
statue         1.56     0.21     0.21     3.39     0.29     0.21
average        2.72     1.19     1.12     3.54     1.14     1.01

Figure 33: Detailed evaluation of all disparity estimation algorithms described in section 5.1.5
on all of the data sets in our benchmark. The values in the tables show the mean squared error in
pixels times 100, i.e. a value of "0.81" means that the mean squared error in pixels is "0.0081".
(Rows: buddha, buddha2, horses, medieval, monasRoom, papillon, stillLife. Columns: EPI_L and EPI_S for the orientation analysis, ST_AV_L and ST_AV_S for multi-view stereo.)
Figure 34: Comparison of the orientation analysis and multi-view stereo (see section 5.1.5) using
the synthetic data of the benchmark database (see section 4). First two columns depict the local
estimated disparity using the structure tensor (EPI L) described in section 5.1.2.1 and the results
after applying a TV-denoising (EPI S). Third and fourth columns depict results from the multi-view
stereo algorithm (ST AV L) described in section 5.1.5 and a TV-denoised version (ST AV S) as well.
(Rows: the datasets buddha, buddha2, couple, cube, horses, maria, medieval, monasRoom, papillon, pyramide, statue and stillLife. Columns: EPI_L and EPI_S for the orientation analysis, ST_AV_L and ST_AV_S for multi-view stereo.)
Figure 35: Comparison of the orientation analysis and multi-view stereo (see section 5.1.5) using
the real world data of the benchmark database (see section 4). First two columns depict the local
estimated disparity using the structure tensor (EPI L) described in section 5.1.2.1 and the results
after applying a TV-denoising (EPI S). Third and fourth columns depict results from the multi-view
stereo algorithm (ST AV L) described in section 5.1.5 and a TV-denoised version (ST AV S) as well.
5.2
Double Orientation Analysis
While there has been progress in the field of non-Lambertian reconstruction under
controlled lighting conditions [4, 29, 40, 79], it remains quite hard to generalize the
standard matching models to more general reflectance functions if only a set of images
under unknown illumination is available. Previous attempts employ a rank constraint
on the radiance tensor [46] to derive a discrepancy measure for non-Lambertian scenes.
While this improves upon the standard Lambertian matching models and allows to
reconstruct surface reflection parameters, the results still somewhat lack in robustness.
An interesting alternative approach is Helmholtz stereopsis from Zickler et al. [112],
which makes use of the symmetry of reflectance or Helmholtz reciprocity principle in
order to eliminate the view dependency of specular reflections in restricted imaging
setups. By alternating light source and camera at two different locations, one can
obtain a stereo pair where specularities are exactly identical and thus classical matching
techniques can be employed for non-Lambertian scenes. Other works try to remove
reflection data from images using prior assumptions or user input [55, 56].
The works which are most closely related to ours are Sinha et al. [96] and Tsin et al. [90].
They also separate a reflecting surface from the reflection in an epipolar volume data
structure. At their heart, these works still rely on classical correspondence matching,
since they optimize for two overlaid matching models in a nested plane sweep algorithm
using graph cuts or semi-global matching, respectively.
In contrast, in our proposed method we do not try to optimize for correspondence.
Instead, we build upon early ideas in camera motion analysis [16] and investigate directional patterns in epipolar space. In our case, reflections and transparencies manifest
as overlaid structures, which we investigate with higher order structure tensors [1] as a
consequent generalization of section 5.1.
As a result, we obtain a direct continuous method which requires no discretization into
depth labels, and which is highly parallelizable and quite fast: a center view disparity
map for both layers can be obtained in less than two seconds for a reasonably sized light
field, which is around a hundred times faster than even the shortest run-times reported
in [90]. The content in this section is published in Wanner et al. [102], whereby the
theory of the optimization techniques as well as fast CUDA [71] implementations of the
algorithms published in cocolib [37] are the work of Dr. Bastian Goldlücke.
5.2.1
EPI Structure for Lambertian Surfaces
Before discussing the mapping of reflections into the epipolar space let us quickly recap
the model for single orientation, which is equivalent to the assumption of Lambertian
material properties.
(a) Light field parametrization. (b) Mirror plane geometry.
Figure 36: (a) Each camera location (s, t) in the view point plane Π yields a different pinhole view
of the scene. The two thick dashed black lines are orthogonal to both planes, and their intersection
with the plane Ω marks the origins of the (x, y)-coordinate systems for the views (s1 , t) and (s2 , t),
respectively. (b) Geometry of reflection on a planar mirror. All cameras view the reflections of a
scene point p at a planar mirror M as the image of a virtual point p′ which lies behind the mirror
plane. We assume the intensity measured by the sensor has two contributions, an intensity or color
c(m) – the contribution of the reflector m – and a color c(p) – the contribution of the mirrored
object p.
Let P ∈ R3 be a scene point. It is easy to show that the projection of P on each epipolar
plane image is a straight line with slope f/Z, where Z is the depth of P, i.e. the distance
of P to the plane Π, and f the focal length, i.e. the distance between the planes Π and
Ω (compare sections 2.2, 2.3 and figure 36 a). The quantity f/Z (equation 11) is called
the disparity of P . In particular, the above means that if P is a point on an opaque
Lambertian surface, then for all points on the epipolar plane image where the point P
is visible, the light field L (equation 9) must have the same constant intensity. This
is the reason for the single pattern of solid lines which we can observe in the epipolar
plane images of a Lambertian scene. In section 5.1, this well-known observation was
the foundation for a novel approach to depth estimation, which leveraged the structure
tensors of the epipolar plane images in order to estimate the local orientation and thus
the disparity of the observed point visible in the corresponding ray. While in conjunction
with visibility constraints this leads to a certain robustness against specular reflections,
the image formation model implicitly underlying this method is still the Lambertian
one, thus the method cannot deal correctly with reflecting surfaces. Furthermore, it is
not possible to infer information for both the surface and a possible reflection. The
following sections will propose a more general model to remedy this.
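For completeness, a compact numpy sketch of this single-orientation estimate on one epipolar plane image is given below; the inner and outer scales are illustrative values, and the coherence measure follows the spirit of equation 38.

import numpy as np
from scipy.ndimage import gaussian_filter

def epi_disparity(epi, inner=0.8, outer=1.6):
    """First-order structure tensor on one EPI (axes: s = rows, x = columns).
    The level-line direction (eigenvector of the smaller eigenvalue) yields the
    local slope dx/ds, i.e. the disparity."""
    f = gaussian_filter(epi, inner)             # inner (noise) scale
    fx = np.gradient(f, axis=1)
    fs = np.gradient(f, axis=0)
    Jxx = gaussian_filter(fx * fx, outer)       # outer (integration) scale
    Jss = gaussian_filter(fs * fs, outer)
    Jxs = gaussian_filter(fx * fs, outer)
    lam_minus = 0.5 * (Jxx + Jss) - np.sqrt(0.25 * (Jxx - Jss) ** 2 + Jxs ** 2)
    # eigenvector for lam_minus is (Jxs, lam_minus - Jxx), up to scale
    disparity = Jxs / (lam_minus - Jxx - 1e-12)
    coherence = ((Jxx - Jss) ** 2 + 4.0 * Jxs ** 2) / ((Jxx + Jss) ** 2 + 1e-12)
    return disparity, coherence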
5.2.2
EPI Structure for Planar Reflectors
We now introduce an idealized appearance model for the epipolar plane images in the
presence of a planar mirror - a translucent surface is an obvious specialization where
a real object takes the place of the virtual one behind the mirror. It is kept simple in
order to arrive at a computationally tractable model, but we will see that it captures
the characteristics of reflective and translucent surfaces well enough to be able to
cope with real-world data. A similar appearance model was successfully employed in [90].
Let M ⊂ R3 be the surface of a planar mirror. We fix coordinates (y ∗ , t∗ ) and consider
the corresponding epipolar plane image Ly∗ ,t∗ . The idea of the appearance model is
to define the observed color for a ray at location (x, s) which intersects the mirror at
m ∈ M . Our simplified assumption is that the observed color is a linear combination
of two contributions. The first is the base color c(m) of the mirror, which describes
the appearance of the mirror without the presence of any reflection. The second is the
color c(p) of the reflection, where p is the first scene point where the reflected ray intersects
the scene geometry, see Figure 36(b). We do not consider higher order reflections, and
assume the surface at p to be Lambertian. We also assume the reflectivity α > 0 is a
constant independent of viewing direction and location. The epipolar plane image itself
will then be a linear combination
V
Ly∗ ,t∗ = LM
y ∗ ,t∗ + αLy ∗ ,t∗
(47)
V
of a pattern LM
y ∗ ,t∗ from the mirror surface itself as well as a pattern Ly ∗ ,t∗ from the
virtual scene behind the mirror. In each point (x, s) as above, both constituent patterns
have a dominant direction corresponding to the disparities of m and p. The next section
shows how to extract these two dominant directions.
Figure 37: Illustration of the overlaid signals L^M_{y∗,t∗}, α L^V_{y∗,t∗} and L_{y∗,t∗} = L^M_{y∗,t∗} + α L^V_{y∗,t∗}.
5.2.3
Analysis of Multiorientation Patterns
We briefly summarize the theory for the analysis of superimposed patterns described in
Aach et al. [1]. A region R ⊂ Ω of an image f : Ω → R has orientation v ∈ R2 if and
only if
f (x) = f (x + αv) ∀ x, x + αv ∈ R.
(48)
Analysis shows that the orientation v is given by the Eigenvector corresponding to the
smaller Eigenvalue of the structure tensor [12] of f.

Figure 38: Exemplary epipolar plane images showing double orientation patterns from reflections.

However, the model fails if the image f is a superposition of two oriented images, f = f1 + f2, where f1 has orientation
u and f2 has orientation v. In this case, the two orientations u, v need to satisfy the
conditions
uT ∇f1 = 0 and vT ∇f2 = 0
(49)
individually on R. Analogous to the single orientation case, the two orientations in
a region R can be found by performing an Eigensystem analysis of the second order
structure tensor, see Aach et al. [1],

T = ∫_R σ [ fxx², fxx fxy, fxx fyy ; fxx fxy, fxy², fxy fyy ; fxx fyy, fxy fyy, fyy² ] d(x, y),
(50)
where σ is a (usually Gaussian) weighting kernel on R which essentially determines the
size of the sampling window. Since T is symmetric, we can compute Eigenvalues and
Eigenvectors in a straight-forward manner using the explicit formulas in [91]. Analogous
to the Eigenvalue decomposition of the 2D structure tensor, the Eigenvector a ∈ R3
corresponding to the smallest Eigenvalue of T , the so-called MOP vector, encodes the
orientations. Indeed, the two disparities are equal to the Eigenvalues λ+ , λ− of the 2 × 2
matrix
[ a2/a1, −a3/a1 ; 1, 0 ],
(51)
from which one can compute the orientations u = [λ+ 1]T and v = [λ− 1]T .
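The following numpy sketch walks through these steps for one epipolar plane image; the smoothing scales are illustrative, and the thesis implementation instead uses the closed-form eigen-solvers of [91] on the GPU.

import numpy as np
from scipy.ndimage import gaussian_filter

def double_orientation(epi, inner=0.8, outer=2.0):
    """Two overlaid orientations per pixel of an EPI via the second-order
    structure tensor of eq. 50 and the 2x2 companion matrix of eq. 51
    (here the image axes (x, y) of eq. 50 correspond to (x, s) of the EPI)."""
    f = gaussian_filter(epi, inner)
    fx = np.gradient(f, axis=1)                  # x: spatial axis
    fs = np.gradient(f, axis=0)                  # s: angular axis
    fxx = np.gradient(fx, axis=1)
    fxs = np.gradient(fx, axis=0)
    fss = np.gradient(fs, axis=0)
    T = np.empty(epi.shape + (3, 3))
    prods = {(0, 0): fxx * fxx, (0, 1): fxx * fxs, (0, 2): fxx * fss,
             (1, 1): fxs * fxs, (1, 2): fxs * fss, (2, 2): fss * fss}
    for (i, j), p in prods.items():              # smoothed tensor components
        T[..., i, j] = T[..., j, i] = gaussian_filter(p, outer)
    # MOP vector a: eigenvector for the smallest eigenvalue of T
    _, vecs = np.linalg.eigh(T)
    a = vecs[..., :, 0]
    a0 = np.where(np.abs(a[..., 0]) < 1e-12, 1e-12, a[..., 0])
    # the two disparities are the eigenvalues of [[a2/a1, -a3/a1], [1, 0]]
    M = np.zeros(epi.shape + (2, 2))
    M[..., 0, 0] = a[..., 1] / a0
    M[..., 0, 1] = -a[..., 2] / a0
    M[..., 1, 0] = 1.0
    lam = np.linalg.eigvals(M)                   # complex where the model fails
    return np.real(lam[..., 0]), np.real(lam[..., 1])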
5.2.4
Merging into Single Disparity Maps
From the steps sketched above, we obtain three different disparity estimates for both
the horizontal as well as vertical epipolar images: one from the single orientation model,
and two from the double orientation model. It is clear that the closer estimate in the
double orientation model will always correspond to the primary surface, regardless of
whether it is a mirror or translucent object. Unfortunately, we do not know yet of a
reliable mathematical measure which tells us whether the two-layer model is valid or
not. We therefore impose a simple heuristic: if at a given point, the disparity values
of horizontal and vertical EPIs agree up to a small error for both the primary and
secondary orientation, we flag the double orientation model as valid, and choose its
contribution in the disparity maps. Otherwise, we choose the estimate from the single
orientation model.

(Image panels: center view and the single, mirror and reflection channels for reflection coefficients α = 0.1, 0.5 and 0.9.)

                  Point-wise result                                   Result after TV-L2 denoising
                  Single orientation    Double orientation            Single orientation    Double orientation
                  mirror    reflection  mirror    reflection          mirror    reflection  mirror    reflection
α = 0.1           0.0034    0.7409      0.0078    0.1191              0.0025    0.7392      0.0036    0.09924
α = 0.3           0.0236    0.5994      0.0061    0.0349              0.0086    0.6273      0.0032    0.02371
α = 0.5           0.0869    0.3735      0.0066    0.0236              0.0252    0.5111      0.0036    0.01377
α = 0.7           0.1807    0.1547      0.0101    0.0239              0.1434    0.1821      0.0060    0.01053
α = 0.9           0.2579    0.0365      0.0389    0.0473              0.2557    0.0312      0.0249    0.00980

Figure 39: Influence of reflectivity on accuracy. The table shows mean squared disparity error in
pixels of the single and double orientation model for both the mirror plane as well as the reflection.
While the single orientation model shifts from reconstruction of mirror to reflection with growing
reflectivity α, the double orientation model can still reconstruct both when even a human observer
has difficulties separating them. The images show the point-wise results.
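A minimal sketch of this merging heuristic, assuming the two layers of each EPI direction are already ordered by distance and using an illustrative tolerance and a simple average of the two directions, is given below.

import numpy as np

def merge_double_single(dh_near, dh_far, dv_near, dv_far, d_single, tol=0.1):
    """Accept the double-orientation result only where horizontal and vertical
    EPI estimates agree (within tol) for both layers; otherwise fall back to
    the single-orientation disparity.  The secondary layer is masked out
    wherever the double model is not valid."""
    valid = (np.abs(dh_near - dv_near) < tol) & (np.abs(dh_far - dv_far) < tol)
    primary = np.where(valid, 0.5 * (dh_near + dv_near), d_single)
    secondary = np.where(valid, 0.5 * (dh_far + dv_far), np.nan)
    return primary, secondary, valid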
5.2.5
Results
We compare our method primarily to the single orientation method (Wanner et al. [100]
and section 5.1) based on the first order structure tensor, which is similar in spirit and an
initial step in our algorithm in any case. However, it is clear that any multi-view stereo
method will have similar problems as the single orientation method if the underlying
model is also the Lambertian world.
(Columns: center view; single orientation; proposed double orientation model: mirror channel, reflection channel, detected mask.)
Figure 40: In the absence of a structured background, the reflecting surface can of course only
reliably be detected where a reflection of a foreground object is visible. The blue region indicates
where the double orientation model returns valid results.
5.2.5.1 Synthetic Data Sets. Figure 39 shows reconstruction accuracy on a synthetic light field with varying amounts of reflectivity α. The scene was ray-traced
in a way which exactly fits the image formation model. As expected, the disparity
reconstructed with the single orientation model is close to the disparity of the mirror
surface if α is small, and close to the disparity of the reflection if α is large. In between,
the result is a mixture between the two, depending on whose texture is stronger. In
contrast, the double orientation model can reliably reconstruct both reflection as well as
mirror surface for the full range of reflectivities α, even when it is already difficult for a
human to still observe both. While the point-wise results are already very accurate, they
are still quite noisy and can be greatly improved by adding a small amount of TV-L2
denoising [20]. We deliberately do not employ more sophisticated global optimization
in this step to showcase only the raw output from the model and what is possible at
interactive performance levels. For all of the light fields shown, at image resolutions
upwards of 512 × 512 with 9 × 9 views, the point-wise disparity computation for the
whole center view takes less than 1.5 seconds on an nVidia GTX 680 GPU.
5.2.5.2 Real-World Data Sets. In Figures 41, 40, and 42, we show reconstruction
results for light fields recorded with our gantry, see Figure 36(b). Each one has 9 × 9
views at resolutions between 0.5 and 1 mega-pixels. For both reflective and transparent
surfaces, a reconstruction of a single disparity based on the Lambertian assumption
produces major artifacts and is unusable in the region of the surface. In contrast, the
proposed method always produces a very reliable estimate for the primary surface, as
well as a reasonably accurate one for the reflected or transmitted objects, respectively.
For the results in the figures, we employed a global optimization scheme [74, 100] to
reach maximum possible quality, which takes about 3 minutes per disparity map. The
same scheme and parameters were used for both methods and all data sets. To show
what is possible in near real-time, we also provide the raw point-wise results in the
additional material.
The results show that certain apparent limitations of the model are not practically relevant.

(Columns: center view; single orientation; double orientation model: front layer, back layer.)
Figure 41: Reconstructing a transparent surface. The single orientation model cannot distinguish
the two signals from the dirty glass surface and the objects behind it. In contrast, multi-orientation
analysis correctly separates both layers.

In particular, reflectivity α is certainly not constant everywhere due to
influences of e.g. the Fresnel term, but since all estimates are strictly local and the
angular range small, the variations do not seem to impact the final result by much. A
stronger limitation, however, is the planarity of the reflecting or transparent surface.
We predict that it can be considerably weakened, since the main assumption of the
existence of an object “behind” the primary surface (which is of course only virtual
in case of a mirror) also holds for more general geometries. However, exploring this
direction is left for future work.
(Columns: center view; single orientation; double orientation model: mirror disparity, reflection disparity.)
Figure 42: Reconstructing a mirror. Like multi-view stereo algorithms, the single orientation model
cannot distinguish the two signals from mirror plane and reflection and reconstructs erroneous disparity for the mirror plane. In contrast, the proposed double orientation analysis correctly separates
the data for the mirror plane from the reflection. The reflection channel is masked out where the
double orientation model does not return valid results as specified in section 5.2.4, and the results
for this channel have been increased in brightness and contrast for better visibility (raw results and
many more data sets can be observed in the additional material).
6
Inverse Problems on Ray Space
In this section, we discuss some applications showing that when using light fields and their
inherently available depth information, much better results can be achieved compared to
classical approaches. The first application is an adaptation of super-resolution techniques
to light fields which additionally is – as a side effect – a proof for the high accuracy of
the orientation analysis because a necessary condition for the proposed super-resolution
algorithm are depth maps of sub-pixel accuracy.
Another application is object segmentation where we will see that by adapting classical
methods to light fields we can improve segmentation accuracy compared to segmentation
using single images.
6.1
Spatial and Viewpoint Superresolution
Here, we propose a variational model for the synthesis of super-resolved novel views.
The theoretical background of the variational methods used in this section is the work
of Dr. Bastian Goldlücke. Fast GPU implementations of the algorithms can be found
in his open source library cocolib [37]. The content is already published in Wanner et
al. [101] and Wanner et al. [103].
Since the model is continuous, we will be able to derive Euler-Lagrange equations which
correctly take into account foreshortening effects of the views caused by variations in
the scene geometry. This makes the model essentially parameter-free. The framework
is in the spirit of [38], which computes super-resolved textures for a 3D model from
multiple views, and shares the same favorable properties. However, it has substantial
differences, since we do not require a complete 3D geometry reconstruction and costly
computation of a texture atlas. Instead, we only make use of disparity maps on the
input images, and model the super-resolved novel view directly.
The following mathematical framework is formulated for views with arbitrary projections.
However, an implementation in this generality would be quite difficult to achieve. We
therefore specialize to the scenario of a 4D light field in the subsequent section, and
leave a generalization of the implementation for future work.
For the remainder of the section, assume we have images vi : Ωi → R of a scene available,
which are obtained by projections πi : R3 → Ωi . Each pixel of each image stores the
integrated intensities from a collection of rays from the scene. This sub-sampling process
is modeled by a blur kernel b for functions on Ωi , and essentially characterizes the point
spread function for the corresponding sensor element. It can be measured for a specific
imaging system [8]. In general, the kernel may depend on the view and even on the
specific location in the images. We omit the dependency here for simplicity of notation.
The goal is to synthesize a view u : Γ → R of the light field from a novel view point,
represented by a camera projection π : R3 → Γ, where Γ is the image plane of the
novel view. The basic idea of super-resolution is to define a physical model for how the
sub-sampled images vi can be explained using high-resolution information in u, and then
solve the resulting system of equations for u. This inverse problem is ill-posed, and is thus
reformulated as an energy minimization problem with a suitable prior or regularizer on u.
Figure 43: Transfer map τi from an input image plane Ωi to the image plane Γ of the novel view
point. The scene surface Σ can be inferred from the depth map on Ωi . Note that not all points
x ∈ Ωi are visible in Γ due to occlusion, which is described by the binary mask mi on Ωi . Above,
mi (x) = 1 while mi (x′ ) = 0.
6.1.1
Image Formation and Model Energy
In order to formulate the transfer of information from u to vi correctly, we require
geometry information [19]. Thus, we assume we know (previously estimated) depth
maps di (see section 5) for the input views. A point x ∈ Ωi is then in one-to-one
correspondence with a point P which lies on the scene surface Σ ⊂ R3 . The color of the
scene point can be recovered from u via u ◦ π(P ), provided that x is not occluded by
other scene points, see figure 43.
The process explained above induces a backwards warp map τi : Ωi → Γ which tells
us where to look on Γ for the color of a point, as well as a binary occlusion mask
mi : Ωi → {0, 1} which takes the value 1 if and only if a point in Ωi is also visible in Γ.
Both maps only depend on the scene surface geometry as seen from vi , i.e. the depth
map di . The different terms and mappings appearing above and in the following are
visualized for an example light field in figure 44.
Having computed the warp map, one can formulate a model of how the values of vi within
the mask can be computed, given a high-resolution image u. Using the down-sampling
kernel, we obtain vi = b ∗ (u ◦ τi ) on the subset of Ωi where mi = 1, which consists
of all points in vi which are also visible in u. Since this equality will not be satisfied
exactly due to noise or inaccuracies in the depth map, we instead propose to minimize
the energy

E(u) = σ² ∫_Γ |Du| + Σ_{i=1}^{n} E^i_data(u),   with   E^i_data(u) := ∫_{Ωi} (1/2) mi (b ∗ (u ◦ τi) − vi)² dx,
(52)
which is the MAP [30] (maximum a posteriori) estimate under the assumption of Gaussian noise with standard deviation σ on the input images.

(Panels: ground truth disparity map di, low-resolution input view vi on Ωi, forward warp vi ◦ βi, backward warp u ◦ τi, visibility mask mi, weighted mask m̃i, and the high-resolution novel view u on Γ.)
Figure 44: Illustration of the terms in the super-resolution energy. The figure shows the ground
truth depth map for a single input view and the resulting mappings for forward and backward
warps as well as the visibility mask mi. White pixels in the mask denote points in Ωi which are
visible in Γ as well.

It resembles a classical
super-resolution model [8], which is made slightly more complex by the inclusion of the
warp maps and masks.
In the energy, formulated in equation 52, the total variation acts as a regularizer or
objective prior on u. Its main tasks are to eliminate outliers and enforce a reasonable
in-painting of regions for which no information is available, i.e. regions which are not
visible in any of the input views. It could be replaced by a more sophisticated prior for
natural images, however, the total variation [78] leads to a convex model which can be
very efficiently minimized. Furthermore, the regularization weight λ, which is the only
free parameter of the model, is usually set very low in order not to destroy any details
in the reconstruction. We set it to 0.0001 in all experiments, which makes the exact
choice of regularizer not very significant.
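To make the data term concrete, the sketch below applies the forward operator b ∗ (u ◦ τi) for one view in numpy; the warp is passed in as precomputed coordinate arrays, and the box point spread function and bilinear sampling are simplifying assumptions rather than the cocolib implementation.

import numpy as np
from scipy.ndimage import map_coordinates, uniform_filter

def apply_forward_model(u, tau_i, mask_i, psf_size=3):
    """Sample the high-resolution view u at the warped positions tau_i
    (a pair of coordinate arrays over Omega_i, given in pixels of Gamma),
    apply a box PSF b, and mask occluded points."""
    warped = map_coordinates(u, [tau_i[1], tau_i[0]], order=1, mode='nearest')
    v_model = uniform_filter(warped, size=psf_size)       # b * (u o tau_i)
    return mask_i * v_model

def data_term_residual(u, tau_i, mask_i, v_i, psf_size=3):
    """Residual m_i (b * (u o tau_i) - v_i) whose squared L2 norm is E^i_data(u)."""
    return apply_forward_model(u, tau_i, mask_i, psf_size) - mask_i * v_i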
6.1.2
Functional Derivative
The functional derivative for the inverse problem above is required in order to find
solutions. It is well-known in principle, but one needs to take into account complications
caused by the different domains of the integrals. Note that τi is one-to-one when
restricted to the visible region Vi := {mi = 1}, thus we can compute an inverse forward
warp map βi := (τi |Vi )−1 , which we can use to transform the data term integral back to
the domain Γ, see figure 44. We obtain for the derivative of a single term of the sum in
equation 52
i
dEdata
(u) = |det Dβi | mi b̄ ∗ (b ∗ (u ◦ τi ) − vi ) ◦ βi .
(53)
The determinant is introduced by the variable substitution of the integral during the
transformation. A more detailed derivation for a structurally equivalent case can be
found in [38].
The term |det Dβi | in equation 53 introduces a point-wise weight for the contribution of
each image to the gradient descent. However, βi depends on the depth map on Γ, which
needs to be inferred and is not readily available. Furthermore, for efficiency it needs
to be pre-computed, and storage would require another high-resolution floating point
matrix per view. Memory is a bottleneck in our method, and we need to avoid this.
For this reason, it is much more efficient to transform the weight to Ωi and multiply it
with mi to create a single weighted mask. Note that
|det Dβi| = |det D(τi^{−1})| = |det Dτi|^{−1} ◦ βi .
(54)
Thus, we obtain a simplified expression for the functional derivative,
dE^i_data(u) = [ m̃i · b̄ ∗ (b ∗ (u ◦ τi) − vi) ] ◦ βi
(55)
with m̃i := mi |det(Dτi)|^{−1} . An example weighted mask is visualized in figure 44. In
total, only the weighted mask m̃i needs to be pre-computed and stored for each view.
In the scenario we present in the next section, the warp maps will be simple and can be
computed on the fly from just the disparity map.
6.1.3
Specialization to 4D Light Fields
The model introduced until now is hard to implement efficiently in fully general form.
Thus we focus on the setting of a 4D light field, where we can make a number of significant
simplifications. The main reason is that the warp maps between the views are given by
parallel translations in the direction of the view point change. The amount of translation
is proportional to the disparity of a pixel, which is in one-to-one correspondence with
the depth, as explained in sections 2.2, 5.1.2. How the disparity maps are obtained
does not matter, but in this work, naturally, they will be computed using the technique
described in section 5.
6.1.4
View Synthesis in the Light Field Plane
The warp maps required for view synthesis become particularly simple when the target
image plane Γ lies in the common image plane Ω of the light field, and π resembles
the corresponding light field projection through a focal point c ∈ Π. In this case, τi is
simply given by a translation proportional to the disparity,
τi (x) = x + di (x)(c − ci ),
(56)
see figure 45. Thus, one can compute the weight in equation 55 to be
|det Dτi|^{−1} = |1 + ∇di · (c − ci)|^{−1}
(57)
There are a few observations to make about this weight. Disparity gradients which
are not aligned with the view translation ∆c = c − ci do not influence it, which makes
sense since it does not change the angle under which the patch is viewed. Disparity
gradients which are aligned with ∆c and tend to infinity lead to a zero weight, which
also makes sense since they lead to a large distortion of the patch in the input view and
thus unreliable information.
A very interesting result is the location of maximum weight. The weights become larger
when ∆c · ∇di approaches −1. An interpretation can be found in figure 45. If ∆c · ∇di
gets closer to −1, then more information from Ωi is being condensed onto Γ, which
means that it becomes more reliable and should be assigned more weight. The extreme
case is a line segment with a disparity gradient such that ∆c · ∇di = −1, which is
projected onto a single point in Γ. In this situation, the weight becomes singular. This
does not pose a problem: From a theoretical point of view, the set of singular points is a
null set according to the theorem of Sard [80], and thus not seen by the integral. From
a practical point of view, all singular points lead to occlusion and the mask mi is zero
anyway. Note that formula 57 is non-intuitive, but the correct one to use when geometry
is taken into account. We have not seen anything similar being used in previous work.
Instead, weighting factors for view synthesis are often imposed according to measures
based on distance to the interpolated rays or matching similarity scores, which certainly
work, but are also somewhat heuristic strategies [35, 51, 59, 76].
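In code, the weighted mask of equations 55 and 57 amounts to a few lines; the numerical guard against the singular points is an implementation detail of this sketch.

import numpy as np

def weighted_mask(mask_i, disparity_i, dc):
    """Weighted visibility mask m_i |det D tau_i|^{-1} of eq. 55, with the
    light-field warp of eq. 56 so that |det D tau_i| = |1 + grad(d_i) . dc|
    (eq. 57); dc = c - c_i is the 2D view-point displacement."""
    dy, dx = np.gradient(disparity_i)          # gradient of the disparity map
    jac = 1.0 + dx * dc[0] + dy * dc[1]
    return mask_i / np.maximum(np.abs(jac), 1e-6)   # singular points are occluded anyway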
(Sketch: two image points x, y ∈ Ω with x + di(x)Δc = y + di(y)Δc, i.e. Δc · (di(x) − di(y))/(x − y) = −1, are projected onto the same point of the novel view.)
Figure 45: The slope of the solid blue line depends on the disparity gradient in the view vi . If
∆c · ∇di = −1, then the line is projected onto a single point in the novel view u.
6.1.5
Results
For the optimization of the (convex) energy in equation 52, we transform the gradient
to the space of the target view via equation 55, discretize, and employ the fast iterative
shrinkage and thresholding algorithm (FISTA) found in [9].
In order to demonstrate the validity and robustness of our algorithm, we perform
extensive tests on our synthetic light fields, where we have ground truth available, as
well as on real-world data sets from a plenoptic camera. As a by-product, this establishes
again that disparity maps obtained by our proposed method in section 5 have subpixel
accuracy, since this is a necessary requirement for super-resolution to work.
View Interpolation and Superresolution In a first set of experiments, we show
the quality of view interpolation and super-resolution, both with ground truth as well as
estimated disparity. In table 47, we synthesize the center view of a light field with our
algorithm using the remaining views as input, and compare the result to the actual view.
For the down-sampling kernel b, we use a simple box filter of size equal to the downsampling factor, so that it fits exactly on a pixel of the input views. We compute results
both with ground truth disparities to show the maximum theoretical performance of the
algorithm, as well as for the usual real-world case that disparity needs to be estimated.
This estimation is performed using the local method described in section 5.1.2.1, and
requires less than five seconds for all of the views. Synthesizing a single super-resolved
view requires about 15 seconds on an nVidia GTX 580 GPU.
In order to test the quality of super-resolution, we compute the 3 × 3 super-resolved
center view and compare with ground truth. For reference, we also compare the result
of bilinear interpolation (IP) as well as TV-zooming [20] of the center view synthesized
in the first experiment.

(Panels: closeup of center view at low resolution; bilinear upsampling to 4× resolution; TV zooming; super-resolution result from 7×7 input views; original high resolution center view.)
Figure 46: Comparison of the different up-sampling schemes on the light field of a resolution chart.
Input resolution is 512 × 512, which is 4× up-sampled. From left to right: original low resolution input,
bilinear up-sampling, TV zooming [20], our result, the original 1024 × 1024 center view for comparison.
All images shown are closeups.

While the reconstruction with ground truth disparities is very
precise, we can see that in the case of estimated disparity, the result strongly improves
with larger angular resolution due to better disparity estimates (compare figure 30).
Super-resolution is superior to both competing methods. This also emphasizes the subpixel accuracy of the disparity maps, since without accurate matching, super-resolution
would not be possible. Figures 48 and 46 show closeup comparison images of the input
light fields and up-sampled novel views obtained with different strategies. At this zoom
level, it is possible to observe increased sharpness and details in the super-resolved
results. Figure 46 indicates that the proposed scheme also produces the least amount of
artifacts.
Figures 51 and 50 show the results of the same set of experiments for two real-world
scenes captured with the Raytrix plenoptic camera. The plenoptic camera data was
transformed to the standard representation as an array of 9 × 9 views using the method
in section 3. Since no ground truth for the scene is available, the input views were
down-sampled to lower resolution before performing super-resolution and compared
against the original view. We can see that the proposed algorithm makes it possible to accurately
reconstruct both sub-pixel disparity as well as a high-quality super-resolved intermediate
view.
(Table: PSNR values for the data sets Buddha, Mona and Conehead; rows GT and ED at angular resolutions 5×5, 9×9 and 17×17, columns 1×1, 3×3, TV and IP.)
Figure 47: Reconstruction error for the data sets obtained with a ray-tracer. The table shows the
PSNR of the center view without super-resolution, at super-resolution magnification 3 × 3, and for
bilinear interpolation (IP) and TV-Zooming (TV) [20] to 3 × 3 resolution as a comparison. The set
of experiments is run with both ground truth (GT) and estimated disparities (ED). The estimation
error for the disparity map can be found in figure 30. Input image resolution is 384 × 384.
Disparity Refinement. As we have seen in figure 49, the disparity estimate is more
accurate when the angular sampling of the light field is more dense. An idea is therefore to increase angular resolution and improve the disparity estimate by synthesizing
intermediate views.
We first synthesize novel views to increase angular resolution by a factor of 2 and 4.
Figure 49 shows resulting epipolar plane images, which can be seen to be of high
quality with accurate occlusion boundaries. Nevertheless, it is highly interesting that
the quality of the disparity map increases significantly when recomputed with the
super-resolved light field, figure 49. This is a striking result, since one would expect
that the intermediate views reflect the error in the original disparity maps. However,
they actually provide more accuracy than a single disparity map, since they represent a
consensus of all input views. Unfortunately, due to the high computational cost, this is
not a really viable strategy in practice.
(a) Buddha. (b) Mona.
Figure 48: Closeups of the up-sampling results for the light fields generated with a ray tracer.
From left to right: low-resolution center view (not used for reconstruction), high resolution center
view obtained by bilinear interpolation of a low-resolution reconstruction from 24 other views, TVZooming [20], super-resolved reconstruction. The super-resolved result shows increased sharpness
and details.
(Panels: the 5×5 input views, super-resolved to 9×9 and super-resolved to 17×17.)
Figure 49: Left: Up-sampling of epipolar plane images (EPIs). From Top to bottom the five layers
of an epipolar plane image of the input data set with 5×5 views, the super resolved 7×7 and the super
resolved 17 × 17 views are depicted. We generate intermediate views using our method to achieve
angular super-resolution. One can observe the high quality and accurate occlusion boundaries of
the resulting view interpolation. Right: Indeed, they are accurate enough such that using the upsampled EPIs leads to a further improvement in depth estimation accuracy. Here the mean square
errors for all angular resolutions as well as the color coded error distribution of the depth error
before and after super-resolution are shown.
(a) Demo. (b) Motor.
Figure 50: Super-resolution view synthesis using light fields from a plenoptic camera. Scenes were
recorded with a Raytrix camera at a resolution of 962 × 628 and super-resolved by a factor of 3 × 3.
The light field contains 9 × 9 views. From left to right: low-resolution center view (not used for
reconstruction), high resolution center view obtained by bilinear interpolation of a low-resolution
reconstruction from 24 other views, TV-Zooming [20], super-resolved reconstruction. One can find
additional detail, for example the diagonal stripes in the Euro note, which were not visible before.
Figure 51: Reconstruction error for light fields captured with the Raytrix plenoptic camera. The
table shows PSNR for the reconstructed input view at original resolution as well as 3 × 3 superresolution and 3 × 3 interpolation (IP) and TV-Zooming (TV) [20] for comparison.
6.2
Rayspace Segmentation
Here we present the first variational framework for multi-label segmentation on the ray
space of 4D light fields. For traditional segmentation of single images, features need
to be extracted from the 2D projection of a three-dimensional scene. The associated
loss of geometry information can cause severe problems, for example if different objects
have a very similar visual appearance. In this section, we show that using a light field
instead of an image not only enables to train classifiers which can overcome many of
these problems, but also provides an optimal data structure for label optimization by
implicitly providing scene geometry information. Thus it is possible to consistently
optimize label assignment over all views simultaneously.
Recent developments in light field acquisition systems [14, 64, 69, 72] strengthen the prediction that we might soon enter an age of light field photography [57]. Since compared
to a single image, light fields increase the content captured of a scene by directional
information, they require an adaptation of established algorithms in image processing
and computer vision as well as the development of completely novel techniques. Here, we
develop methods for training classifiers on features of a light field, and for consistently
optimizing label assignments to rays in a global variational framework. The ray space
of the light field is considered four-dimensional, parametrized by the two points of
intersection of a ray with two parallel planes, so that the light field can be considered as
a collection of planar views, see figures 6 and 5.
Due to this planar sampling, 3D points are projected onto lines in cross-sections of the
light field called epipolar-plane images (section 2.3). In recent works, it was shown that
robust disparity reconstruction is possible by analyzing this line structure [11, 16, 27, 100]
(see also section 5). In contrast to traditional stereo matching, no correspondence search
is required, and floating-point precision disparity data can be reconstructed at a very
small cost.
From the point of view of segmentation, this means that in light fields, we have access to
more than the color of a pixel and information about the neighboring image texture. Additionally, we can assume that disparity is readily available as a feature. Disparity turns
out to be highly effective for increasing the prediction quality of a classifier. As long as
the inter-class variety of imaged objects is high and the intra-class variation is low, state
of the art classifiers can easily discriminate different objects. However, separating for example background and foreground leaves (example in figure 52) poses a more difficult task.
In general, there is no easy way to alleviate issues like this using only single images.
However, for a classifier which also has geometry based features available, similar looking
objects are readily distinguishable if their geometric features are separable.
In the following, we will show that light fields are ideally suited for image segmentation.
One reason is that geometry is an inherent characteristic of a light field, and thus we
can use disparity as a very helpful additional feature.

(Panels: input user scribbles; single view labeling; ray space labeling.)
Figure 52: Multi-label segmentation with light field features and disparity-consistent regularization
across ray space leads to results which are superior to single-view labeling.

While this has already been
realized in related work on e.g. multi-view co-segmentation [50] or segmentation with
depth or motion cues, which are in many aspects similar to disparity [31, 92], light
fields also provide an ideal structure for a variational framework which readily allows
consistent labelling across all views, and thus increases the accuracy of label assignments
dramatically.
6.2.1
Regularization on Ray Space
In segmentation problems, when one wants to label rays according to e.g. the visible
object class, the unknown function on ray space ultimately reflects a property of scene
points. In consequence, all the rays which view the same scene point have to be assigned
the same function value. Equivalent to this is to demand that the function must be
consistent with the structure on the epipolar plane images. In particular, except at depth
discontinuities, the value of such a function is not allowed to change in the direction of
the epipolar lines, which are induced by the disparity field.
The above considerations give rise to a regularizer Jλµ (U ) for vector-valued functions U :
R → Rn on ray space. It can be written as the sum of contributions for the regularizers
on all epipolar plane images as well as all the views,
Jλµ(U) = µ Jxs(U) + µ Jyt(U) + λ Jst(U),
with  Jxs(U) = ∫ Jρ(Ux∗,s∗) d(x∗, s∗),
      Jyt(U) = ∫ Jρ(Uy∗,t∗) d(y∗, t∗),
and   Jst(U) = ∫ JV(Us∗,t∗) d(s∗, t∗),
(58)
where the anisotropic regularizers Jρ act on 2D epipolar plane images, and are defined
such that they encourage smoothing in the direction of the epipolar lines. This way, they
enforce consistency of the function U with the epipolar plane image structure. For a
detailed definition, we refer to our related work [39]. The spatial regularizer JV encodes
the label transition costs, as we will explore in more detail in the next section. Finally,
the constants λ > 0 and µ > 0 are user-defined and adjust the amount of regularization
on the separate views and epipolar plane images, respectively.
6.2.2
Optimal Label Assignment on Ray Space
In this section, we introduce a new variational labeling framework on ray spaces. Its
design is based on the representation of labels with indicator functions [22, 53, 111],
which leads to a convex optimization problem. We can use the efficient optimization
framework presented in [39] to obtain a globally optimal solution to the convex problem,
however, as usual we need to project back to indicator functions and only end up within
a (usually small) posterior bound of the optimum.
The Variational Multi-Label Problem. Let Γ be the (discrete) set of labels, then
to each label γ ∈ Γ we assign a binary function uγ : R → {0, 1} which takes the value 1
if and only if a ray is assigned the label γ. Since the assignment must be unique, the set
of indicator functions must satisfy the simplex constraint
Σ_{γ∈Γ} uγ = 1.
(59)
Arbitrary spatially varying label cost functions cγ can be defined, which penalize the
assignment of γ to a ray R ∈ R with the cost cγ (R) ≥ 0.
Let U be the vector of all indicator functions. To regularize U , we choose Jλµ defined
in equation 58. This implies that the labelling is encouraged to be consistent with the
epipolar plane structure of the light field to be labelled. The spatial regularizer JV needs
to enforce the label transition costs. For the remainder of this work, we choose a simple
weighted Potts penalizer [110]
JV(Us∗,t∗) := (1/2) Σ_{γ∈Γ} ∫_Ω g |(Duγ)s∗,t∗| d(x, y),
(60)
where g is a spatially varying transition cost. Since the total variation of a binary
function equals the length of the interface between the zero and one level set due to the
co-area formula [32], the factor 1/2 leads to the desired penalization.
While we use the weighted Potts model in this work, the overall framework is by no
means limited to it. Rather, we can use any of the more sophisticated regularizers
proposed in the literature [22, 53], for example truncated linear penalization, Euclidean
label distances, Huber TV or the Mumford-Shah regularizer. An overview as well as
further specializations tailored to vector-valued label spaces can be found in [94].
The space of binary functions over which one needs to optimize is not convex, since
convex combinations of binary functions are usually not binary. We resort to a convex
relaxation, which with the above conventions can now be written as
argmin_{U∈C} { Jλµ(U) + Σ_{γ∈Γ} ∫_R cγ uγ d(x, y, s, t) },
(61)
where C is the convex set of functions U = (uγ : R → [0, 1])γ∈Γ which satisfy the simplex
constraint equation 59. After optimization, the solution of equation 61 needs to be
projected back onto the space of binary functions. This means that we usually do not
achieve the global optimum of equation 61, but can only compute a posterior bound for
how far we are from the optimal solution. An exception is the two-label case, where we
indeed achieve global optimality via thresholding, since the anisotropic total variation
also satisfies a co-area formula [111].
Optimization. Note that according to equation 58, the full regularizer Jλµ which
is defined on 4D ray space decomposes into a sum of 2D regularizers on the epipolar
plane images and individual views, respectively. While solving a single saddle point
problem for the full regularizer would require too much memory, it is feasible to iteratively compute independent descent steps for the data term and regularizer components.
The overall algorithm is detailed in [39]. Aside from the data term, the main difference
here is the simplex constraint set for the primal variable U . We enforce it with Lagrange
multipliers in the proximity operators of the regularizer components, which can be easily
integrated into the primal-dual algorithm [21]. An overview of the algorithm adapted to
the problem in equation 61 can be found in figure 53.
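To make the decomposition concrete, the following schematic Python loop mirrors the structure of figure 53; the functions prox_epi and prox_spatial are hypothetical placeholders for the proximity operators of the 2D regularizers in [39] (which are assumed to also re-project onto the simplex), so this is a sketch of the control flow only, not of the actual GPU solver.

import numpy as np

def ray_space_labeling(U, c, tau=0.1, lam=1.0, mu=1.0, iters=20,
                       prox_epi=None, prox_spatial=None):
    """Schematic iteration for the problem in equation 61 (control flow only).

    U : indicator functions of shape (labels, s, t, y, x), initialized point-wise
    c : unary costs of the same shape
    prox_epi / prox_spatial : placeholders for the 2D proximity operators, which
    are assumed to also enforce the simplex constraint of equation 59.
    """
    L, S, T, Y, X = U.shape
    for _ in range(iters):
        U = U - tau * c                                     # data term descent
        if prox_epi is not None:
            for s in range(S):                              # slices with (x*, s*) fixed
                for x in range(X):
                    U[:, s, :, :, x] = prox_epi(U[:, s, :, :, x], tau * mu)
            for t in range(T):                              # slices with (y*, t*) fixed
                for y in range(Y):
                    U[:, :, t, y, :] = prox_epi(U[:, :, t, y, :], tau * mu)
        if prox_spatial is not None:
            for s in range(S):                              # individual views (s*, t*)
                for t in range(T):
                    U[:, s, t] = prox_spatial(U[:, s, t], tau * lam)
    return U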
On our system equipped with an nVidia GTX 580 GPU, optimization takes about 1.5
seconds per label in Γ and per million rays in R, i.e. about 5 minutes for our rendered
data sets if the result for all views is desired. If only the result for one single view
(i.e. the center one) is required, computation can be restricted to view points located
in a cross with that specific view at the center. The result will usually be very close
to the optimization over the complete ray space. While this compromise forfeits some information in the data, it leads to significant speed-ups; for our rendered data sets, the run time drops to about 30 seconds.
6.2.3 Local Class Probabilities
We calculate the unary potentials cγ in equation 61 from the negative log-likelihoods of
the local class probabilities,
$$c_\gamma(R) = -\log p\bigl(\gamma \mid v(R)\bigr). \qquad (62)$$
To solve the multi-label problem in equation 61 on ray space, we initialize the
unknown vector-valued function U such that the indicator function for the optimal
point-wise label is set to one, and zero otherwise. Then we iterate
• data term descent: u_γ ← u_γ − τ c_γ for all γ ∈ Γ,
• EPI regularizer descent:
  U_{x∗s∗} ← prox_{τµJ_ρ}(U_{x∗s∗}) for all (x∗, s∗),
  U_{y∗t∗} ← prox_{τµJ_ρ}(U_{y∗t∗}) for all (y∗, t∗),
• spatial regularizer descent:
  U_{s∗t∗} ← prox_{τλJ_V}(U_{s∗t∗}) for all (s∗, t∗).
The proximity operators prox_J compute subgradient descent steps for the respective
2D regularizer and enforce the simplex constraint in equation 59 for U. The possible
step size τ depends on the data term scale; in our experiments, τ = 0.1 led to reliable
convergence within about 20 iterations.
Figure 53: Algorithm for the general multi-label problem in equation 61.
By solving equation 61 with these unary costs, we thus obtain the maximum a-posteriori (MAP) solution [30] for the label assignment. The local class probabilities p(γ | v(R)) ∈ [0, 1] for the label γ, conditioned on a local feature vector v(R) ∈ ℝ^{|F|} for each ray R ∈ R, are
obtained by training a classifier on a user-provided partial labelling of the center view.
As features, we use a combination of color, the Laplace operator of the view, the intensity standard deviation in a local neighbourhood, the eigenvalues of the Hessian, and the disparity, each
computed on several scales. While our framework allows the use of arbitrary classifiers,
we specialize in this thesis to a Random Forest [18]. These are becoming increasingly
popular in image processing due to their wide applicability [26] and the robustness
with regard to their hyper-parameters. Random Forests make use of bagging to reduce
variance and avoid over-fitting. A decision forest is built from a number n of trees,
which are each trained from a random subset of the available training samples. In
addition to bagging, extra randomness is injected into the trees by testing only a subset
of m < |F| different features for their optimal split in each split node. The above internal random forest parameters were fixed to m = √|F| and n = 71 in our experiments.
Each individual tree is now built by partitioning the set of training samples recursively
into smaller subsets, until the subsets become either class-pure or smaller than a given
minimal split node size. The partitioning of the samples is achieved by performing
a line search over all possible splits along a number of different feature axes for the
optimal Gini-impurity of the resulting partitions, and repeating this process for the
child partitions recursively. In each node, the chosen feature and the split value of that
                                            Classifier
Features used                          IMG    IMG-D   IMG-GT
RGB value                               X       X       X
Intensity standard deviation
  (in local neighbourhood)              X       X       X
Eigenvalues of Hessian                  X       X       X
Laplace operator                        X       X       X
Estimated disparity                             X
Ground truth disparity                                  X

Figure 54: Combination of features used for the experiments in this
paper. The individual scales of the features were determined via a grid
search to find optimal parameters for each dataset individually.
feature are stored. After building a single tree, the class distribution of the samples
in each leaf node is stored and used at prediction time to obtain the conditional class
probability of samples that arrive at that particular leaf node. The leaf node with which
a prediction sample is associated is determined by comparing each node's split value for
the split feature with the feature vector entry of a sample. Depending on whether the
sample value is smaller (larger) than the node value, the sample is passed to the left
(right) child of the split node, until a leaf node is reached.
Finally, the ensemble of decision tree classifiers is used to calculate the local class
probability of unlabeled pixels by averaging their votes. In our experiments, we achieved
total run-times for training and prediction between one and five minutes, depending on
the size of the light field and the number of labels. However, we did not yet parallelize
the local predictions, which is easily possible and would make computation much more
efficient.
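As a rough sketch of this local stage (using scikit-learn's RandomForestClassifier as a stand-in for our implementation; the clipping constant and the array layout are assumptions of the example), the class probabilities and the unary costs of equation 62 for a single view could be computed as follows.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def local_unaries(features, scribble_labels, n_trees=71):
    """Train on user scribbles of one view and return unary costs c_gamma.

    features        : array (H, W, F) of per-pixel feature vectors v(R)
    scribble_labels : array (H, W) with label indices for scribbled pixels, -1 elsewhere
    """
    H, W, F = features.shape
    X = features.reshape(-1, F)
    y = scribble_labels.reshape(-1)
    mask = y >= 0

    # bagging plus sqrt(|F|) candidate features per split, as described above
    forest = RandomForestClassifier(n_estimators=n_trees, max_features="sqrt")
    forest.fit(X[mask], y[mask])

    # local class probabilities p(gamma | v(R)) for every pixel of the view
    proba = forest.predict_proba(X)                   # shape (H*W, num_labels)
    proba = np.clip(proba, 1e-6, 1.0)                 # avoid log(0); an assumption
    c = -np.log(proba)                                # equation 62
    return c.reshape(H, W, -1).transpose(2, 0, 1)     # (num_labels, H, W)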
6.2.4 Experiments
In this section, we present the results of our multi-label segmentation framework on a
variety of different data sets. To explore the full potential of our approach, we use computer-graphics-generated light fields rendered with the open-source software Blender [77],
which provides complete ground truth for depth values and labels. In addition, we show
that the approach yields very good results on real world data obtained with a plenoptic
camera and a gantry, respectively. A subset of the views in the real-world data sets was
manually labeled in order to provide ground truth to quantify the results.
There are two main benefits of labeling in light fields. First, we demonstrate the
usefulness of disparity as an additional feature for training a classifier, and second, we
show the improvements from the consistent variational multi-label optimization on ray
space.
Figure 55: Depth estimated using the method in section 5 and spatial regularizer weight computed
according to equation 63 for the light field view shown in figure 52.
Disparity as a Feature. The first step of the proposed segmentation framework does
not differ from single image segmentation using a random forest. The user selects an
arbitrary view from the light field, adds scribbles for the different labels, and chooses
suitable features as well as the scales on which the features should be calculated. The
classifier is then trained on this single view and, in a second step, used to compute local
class probabilities for all views of the entire light field.
Beforehand, we tested variations of common features for interactive image segmentation on our data sets to find a suitable combination which yields good results on single images. The optimal training parameters were determined using a grid
results on single images. The optimal training parameters were determined using a grid
search over the minimum split node size as well as the feature combinations and their
scales for each data set individually. The number of different scales we used for each
feature was fixed to four. This way, we can guarantee optimal results of the random
forest classifier for all data sets and feature combinations, which ensures a meaningful
assessment of the effects of our new ray space features.
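For concreteness, the single-view part of the feature stack in figure 54 could be assembled at one scale roughly as follows (a sketch based on scipy.ndimage; the neighbourhood size, the smoothing and the exact derivative operators are assumptions of this example, and the actual implementation may differ).

import numpy as np
from scipy import ndimage

def view_features(rgb, disparity=None, sigma=1.0):
    """Stack per-pixel features for one view at a single scale sigma.

    rgb : (H, W, 3) color image; disparity : optional (H, W) map.
    Returns an (H, W, F) feature array.
    """
    gray = rgb.mean(axis=2)
    smoothed = ndimage.gaussian_filter(gray, sigma)

    feats = [rgb[..., 0], rgb[..., 1], rgb[..., 2]]

    # intensity standard deviation in a local neighbourhood
    mean = ndimage.uniform_filter(gray, size=5)
    mean_sq = ndimage.uniform_filter(gray ** 2, size=5)
    feats.append(np.sqrt(np.maximum(mean_sq - mean ** 2, 0.0)))

    # Laplace operator of the (smoothed) view
    feats.append(ndimage.laplace(smoothed))

    # eigenvalues of the 2x2 Hessian from Gaussian second derivatives
    Hxx = ndimage.gaussian_filter(gray, sigma, order=(0, 2))
    Hyy = ndimage.gaussian_filter(gray, sigma, order=(2, 0))
    Hxy = ndimage.gaussian_filter(gray, sigma, order=(1, 1))
    trace, det = Hxx + Hyy, Hxx * Hyy - Hxy ** 2
    disc = np.sqrt(np.maximum(trace ** 2 / 4.0 - det, 0.0))
    feats += [trace / 2.0 + disc, trace / 2.0 - disc]

    if disparity is not None:
        feats.append(disparity)
    return np.stack(feats, axis=-1)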
Throughout the remainder of this section, we use the three different sets of features
detailed in figure 54. The classifier IMG uses only classical single-view features, while
IMG-D and IMG-GT employ in addition estimated and ground truth disparity, respectively, the latter of course only if available. Estimated disparity maps were obtained
using our method in section 5 and are overall of very good quality, see figure 55. The
achieved accuracy and the boundary recall for purely point-wise classification using the
three classifiers above are listed in the table in figure 56. Sample segmentations for
our data sets can be viewed in figure 58. It is obvious that the features extracted from
the light field improve the quality of a local classifier significantly for difficult problem
instances.
                              IMG             IMG-D           IMG-GT
Data set                   acc     br      acc     br      acc     br
synthetic data sets
Buddha                     93.5    6.4     96.7   39.6     98.6   43.1
Garden                     95.1   54.8     96.7   51.1     96.9   53.3
Papillon 1                 98.6   59.3     98.3   57.4     99.0   78.9
Papillon 2                 90.8   16.7     96.5   33.1     99.1   73.0
Horses 1                   93.2   13.4     94.3   34.9     98.3   48.7
Horses 2                   94.6   15.9     95.3   36.8     98.5   50.9
StillLife 1                98.6   36.3     98.7   41.2     98.9   45.3
StillLife 2                97.8   25.4     98.3   36.1     98.5   39.1
real-world data sets
UCSD [113]                 95.8    8.9     97.0   11.2       -      -
Plenoptic 1 [72]           93.7    3.5     94.5    4.4       -      -
Plenoptic 2 [72]           91.0    6.6     96.1    8.5       -      -
Figure 56: Comparison of local labeling accuracy (acc) and boundary recall (br) for all datasets. The table shows percentages of correctly labeled pixels and boundary pixels, respectively, for point-wise optimal results of the three classifiers trained on the features detailed in figure 54. Disparity for IMG-D is estimated using the method described in section 5.1.2.1. Ground truth disparity is used for IMG-GT to determine the maximum possible quality of the proposed method. It is obvious that in scenes like Buddha, Papillon 2, Horses 2 or StillLife 2, where the user tries to separate objects with similar or even identical appearance, the ray-space-based features lead to a large benefit in the segmentation results.
Global Optimization. In the second set of experiments, we employ our ray space
optimization framework on the results from the local classifier. The unary potentials
in equation 61 are initialized with the negative log-likelihoods of the local class probabilities according to equation 62, while the spatial regularization weight g is set to

$$g = \max\!\left(0,\; 1 - |\nabla I|^2 - H(I) - |\nabla \rho|^2\right), \qquad (63)$$
where I denotes the respective single view image, H the Harris corner detector [42],
and ρ the disparity field. This way, we combine the response from three different types
of edge detectors. Experiments showed that the sum of the two different edge signals
for the gray value image I leads to more robust boundary weights. For all of the data
sets, training classifiers with light field features and optimizing over ray space leads
to significantly improved results compared to single view multi-labeling (see figures 57
and 58). The effectiveness of light field segmentation is revealed in particular on data
sets which have highly ambiguous texture and color between classes. In the light field
Buddha, for example, it becomes possible to segment a column from a background wall
having the same texture. In the scene Papillon 2, we demonstrate that it is possible
to separate foreground from background leaves. Similarly, in StillLife 2 we are able to
correctly segment foreground from background raspberries. The data set Horses 2 also
represents a typical case for problems only solvable using the proposed approach. Here,
we perform a labeling of identical objects in the scene with different label classes.
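As an illustration of the boundary weight in equation 63 (a sketch only: the Harris parameters, the Sobel derivatives and any rescaling of the three edge responses are assumptions of this example), g could be computed per view as follows.

import numpy as np
from scipy import ndimage

def boundary_weight(I, rho, k=0.04, sigma=1.0):
    """Rough sketch of the edge weight g in equation 63.

    I   : grayscale view, rho : disparity map (both (H, W) float arrays).
    The Harris response H(I) is computed from the smoothed structure tensor.
    """
    Ix = ndimage.sobel(I, axis=1)
    Iy = ndimage.sobel(I, axis=0)
    # structure tensor entries, smoothed with a Gaussian window
    Jxx = ndimage.gaussian_filter(Ix * Ix, sigma)
    Jyy = ndimage.gaussian_filter(Iy * Iy, sigma)
    Jxy = ndimage.gaussian_filter(Ix * Iy, sigma)
    harris = Jxx * Jyy - Jxy ** 2 - k * (Jxx + Jyy) ** 2
    grad_I2 = Ix ** 2 + Iy ** 2
    rx = ndimage.sobel(rho, axis=1)
    ry = ndimage.sobel(rho, axis=0)
    grad_rho2 = rx ** 2 + ry ** 2
    # combine the three edge signals; in practice each response would be
    # rescaled to a comparable range before the subtraction
    return np.maximum(0.0, 1.0 - grad_I2 - harris - grad_rho2)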
Figure 57: Relative improvements by global optimization. All numbers are in percent. The quantities in the columns acc indicate the percentage of correctly labeled pixels. The columns imp denote the relative improvement of the optimized compared to the respective raw result from the local classifier in figure 56. To be more specific, if acc_p is the previous and acc_n the new accuracy, then the column imp contains the number (acc_n − acc_p)/(1 − acc_p), i.e. the percentage of previously erroneous pixels which were corrected by optimization. Optimal smoothing parameters λ, µ were determined using a grid search over the parameter space. We also compare our ray space optimization framework (RS) to single view optimization (SV), which can be achieved by setting the amount of EPI regularization µ to zero. Note that for every single classifier and data set, RS optimization achieves a better result than SV optimization. The last two columns indicate the relative accuracy of ray space optimization and the indicated ray space classifier versus single view optimization and single view features, computed the same way as the imp columns. In particular, they demonstrate the overall improvement which is realized with the proposed method.
Figure 58: Segmentation results for a number of ray-traced and real-world light fields. GT stands for ground truth, SV for single view and RS for ray space. The numbers in square brackets refer to the corresponding figures. The first two columns on the left show the center view with user scribbles and ground truth labels. The two middle columns compare classifier results for the local single view and light field features denoted on top. Since the focus of this work is segmentation rather than depth reconstruction, here we show results for ground truth depth where available to compare to the optimal possible results from light field data. Finally, the two rightmost columns compare the final results after single view and ray space optimization, respectively. In particular for difficult cases, the proposed method is significantly superior.
7 Conclusion
In this work, novel methods for the analysis of 4D light fields were presented. We showed
that a specific parametrization, the so-called Lumigraph, is well suited for an orientation
based analysis. The Lumigraph can be described as a dense collection of pinhole views
captured on a planar regular grid of camera positions. As a consequence, 3D scene points are mapped onto lines in the so-called epipolar plane images. We discussed different
devices and techniques to capture light fields as well as the effort necessary to represent
the resulting raw data as a Lumigraph or as epipolar plane images respectively.
In chapter 3, we saw that raw data from a Focused Plenoptic Camera does not provide
immediate access to epipolar plane images. A method was proposed to render all possible
all-in-focus views from the raw data, which is the desired Lumigraph parametrization.
To avoid a pixel-wise depth estimation within the micro-lens images, we minimized the
gradients at neighboring micro-image patches to render views without plenoptic artifacts.
In chapter 4 we discussed the acquisition techniques relevant for this work in detail. To
generate light fields of best quality we used a high-end consumer camera in combination
with a precise xy-stepper motor. This so-called gantry is ideal for capturing very dense light fields with baselines down to 1 mm. The disadvantage is that only static scenes can
be captured. Together with light fields generated using computer graphics, providing
full ground truth data, a benchmark database containing over a dozen simulated and
real world light fields was published during this work (www.lightfield-analysis.net).
In chapter 5, we proposed fast and robust methods, based on an orientation analysis of
epipolar plane images, to compute depth range data. The single orientation analysis
introduced makes use of the structure tensor to analyze epipolar plane images. The structure tensor evaluates first-order derivatives to locally estimate structure and orientation in an image. If the appearance of a 3D point does not depend on the viewpoint, it is mapped onto a line in an epipolar plane image. This approach is therefore restricted to the Lambertian assumption. If reflections or transparencies are present, overlaid line patterns arise in epipolar space which the structure tensor cannot handle. An extension to multi-orientation patterns, making use of a higher-order structure
tensor, was proposed. We showed that this multi-orientation analysis leads to much
more robust depth estimation where reflections or semi-transparent materials are present.
In chapter 6, we discussed two applications of the orientation-based depth estimation.
We proposed an angular and spatial super-resolution algorithm based on an energy
minimization framework as well as a framework for optimal label assignment on ray
space for object segmentation. Both methods show the potential of light fields for image
processing and computer vision tasks. The super-resolution framework can be seen as evidence of the high quality of the depth maps computed using the orientation analysis, since this method requires disparity estimates of sub-pixel accuracy to work properly. In the case of object segmentation, the benefits are quite obvious. Due to the depth information inherently available within light fields, object segmentation becomes much more robust when applied to light fields. We used a standard random forest classifier in this work to predict object labels. In contrast to predictions on single 2D images, we were able to distinguish objects of different classes but similar appearance.
8 Outlook
Possible extensions of the work presented could be the following:
The proposed orientation analysis in this work is still separated into single-orientation and double-orientation models, and in particular the double-orientation model needs the outcome of the single-orientation analysis to interpret the resulting channels. This should be unified in a more problem-specific manner: the single-orientation model is already included in the second-order structure tensor, and a more advanced evaluation of all tensor channels at once would lead to more robust results.
The orientation analysis as described in this work handles a 4D light field as separate horizontal and vertical 3D light fields, merging the outcomes in a final step by choosing, pixel-wise, the more reliable disparity estimate as the final result. From a computational efficiency point of view this makes sense, since an evaluation of the 4D data as a whole is quite expensive, but on the other hand the method described in this work does not make use of all available information.
Next steps planned for future research are an evaluation of the depth estimation accuracy
on real scenes and extensions of the orientation analysis to light fields varying over time.
Further developments in both directions are planned, investigating light fields of dynamic scenes
as well as light fields of static scenes under varying illumination conditions.
List of Publications
Generating EPI Representations of 4D Light Fields with a Single Lens
Focused Plenoptic Camera
S. Wanner, J. Fehr, B. Jähne
7th International Symposium on Visual Computing, Las Vegas, Sept. 24-26, 2011
Globally Consistent Depth Labeling of 4D Light Fields
S. Wanner, B. Goldlücke
CVPR’12, Providence, Rhode Island, June 16-21, 2012
Spatial and Angular Variational Super-resolution of 4D Light Fields
S. Wanner, B. Goldlücke
ECCV’12, Florence, Italy, October 7-13, 2012
Variational Light Field Analysis for Disparity Estimation and Super-Resolution
S. Wanner, B. Goldlücke
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013
Globally Consistent Multi-Label Assignment on the Ray Space of 4D
Light Fields
S. Wanner, C. Straehle, B. Goldlücke
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013
The Variational Structure of Disparity and Regularization of 4D Light
Fields
B. Goldlücke, S. Wanner
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013
Datasets and Benchmarks for Densely Sampled 4D Light Fields
S. Wanner, S. Meister, B. Goldlücke
Vision, Modelling and Visualization (VMV), 2013
Reconstructing Reflective and Transparent Surfaces from Epipolar Plane
Images
S. Wanner and B. Goldlücke
German Conference on Pattern Recognition (GCPR), 2013
Bibliography
[1] Aach, T., Mota, C., Stuke, I., Muehlich, M., and Barth, E. (2006). Analysis of
superimposed oriented patterns. IEEE Transactions on Image Processing, 15(12),
3690–3700.
[2] Adelson, E. and Bergen, J. (1991). The plenoptic function and the elements of early
vision. Computational models of visual processing, 1.
[3] Adelson, E. H. and Wang, J. Y. (1992). Single lens stereo with a plenoptic camera.
IEEE transactions on pattern analysis and machine intelligence, 14(2), 99–106.
[4] Alldrin, N., Zickler, T., and Kriegman, D. (2008). Photometric Stereo With Non-Parametric and Spatially-Varying Reflectance. In Proc. International Conference on
Computer Vision and Pattern Recognition.
[5] Ashdown, I. (1993). Near-field photometry: A new approach. Journal of the Illuminating Engineering Society, 22, 163–163.
[6] Baker, H. H. (1989). Building surfaces of evolution: The weaving wall. International
Journal of Computer Vision, 3(1), 51–71.
[7] Baker, H. H. and Bolles, R. C. (1989). Generalizing epipolar-plane image analysis on
the spatiotemporal surface. International Journal of Computer Vision, 3(1), 33–49.
[8] Baker, S. and Kanade, T. (2002). Limits on Super-Resolution and How to Break
Them. IEEE Trans. on Pattern Analysis and Machine Intelligence, 24(9), 1167–1183.
[9] Beck, A. and Teboulle, M. (2009). Fast Iterative Shrinkage-Thresholding Algorithm
for Linear Inverse Problems. SIAM J. Imaging Sciences, 2, 183–202.
[10] Benton, S. A. (1983). Survey of holographic stereograms. In 26th Annual Technical
Symposium, pages 15–19. International Society for Optics and Photonics.
[11] Berent, J. and Dragotti, P. (2006). Segmentation of epipolar-plane image volumes
with occlusion and disocclusion competition. In IEEE 8th Workshop on Multimedia
Signal Processing, pages 182–185.
[12] Bigün, J. and Granlund, G. H. (1987). Optimal orientation detection of linear
symmetry. In Proc. International Conference on Computer Vision, pages 433–438.
[13] Bishop, T. and Favaro, P. (2011). Full-resolution depth map estimation from an
aliased plenoptic light field. Computer Vision–ACCV 2010 , pages 186–200.
[14] Bishop, T. and Favaro, P. (2012). The Light Field Camera: Extended Depth of
Field, Aliasing, and Superresolution. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 34(5), 972–986.
[15] Bishop, T. E. and Favaro, P. (2009). Plenoptic depth estimation from multiple
aliased views. In Computer Vision Workshops (ICCV Workshops), 2009 IEEE 12th
International Conference on, pages 1622–1629. IEEE.
[16] Bolles, R., Baker, H., and Marimont, D. (1987). Epipolar-plane image analysis: An
approach to determining structure from motion. International Journal of Computer
Vision, 1(1), 7–55.
[17] Bradski, G. (2000). The OpenCV Library. Dr. Dobb’s Journal of Software Tools.
[18] Breiman, L. (2001). Random forests. Machine learning, 45(1), 5–32.
[19] Chai, J.-X., Tong, X., Chan, S.-C., and Shum, H.-Y. (2000). Plenoptic sampling.
Proc. SIGGRAPH , pages 307–318.
[20] Chambolle, A. (2004). An algorithm for total variation minimization and applications. Journal of Mathematical Imaging and Vision, 20(1-2), 89–97.
[21] Chambolle, A. and Pock, T. (2011). A First-Order Primal-Dual Algorithm for
Convex Problems with Applications to Imaging. J. Math. Imaging Vis., 40(1),
120–145.
[22] Chambolle, A., Cremers, D., and Pock, T. (2008). A Convex Approach for Computing Minimal Partitions. Technical Report TR-2008-05, Dept. of Computer Science,
University of Bonn.
[23] Chang, C.-F., Bishop, G., and Lastra, A. (1999). LDI tree: A hierarchical representation for image-based rendering. In Proceedings of the 26th annual conference on Computer graphics and interactive techniques, pages 291–298. ACM Press/Addison-Wesley Publishing Co.
[24] Chen, S. E. and Williams, L. (1993). View interpolation for image synthesis. In
Proceedings of the 20th annual conference on Computer graphics and interactive
techniques, pages 279–288. ACM.
[25] Chunev, G., Lumsdaine, A., and Georgiev, T. (2011). Plenoptic rendering with
interactive performance using gpus. In SPIE Electronic Imaging.
[26] Criminisi, A. (2011). Decision forests: A unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning. Foundations and Trends in Computer Graphics and Vision, 7(2-3), 81–227.
[27] Criminisi, A., Kang, S., Swaminathan, R., Szeliski, R., and Anandan, P. (2005).
Extracting layers and analyzing their specular properties using epipolar-plane-image
analysis. Computer vision and image understanding, 97(1), 51–85.
[28] Dansereau, D. G., Pizarro, O., and Williams, S. B. (2013). Decoding, calibration
and rectification for lenselet-based plenoptic cameras.
[29] Davis, J., Yang, R., and Wang, L. (2005). BRDF Invariant Stereo using Light
Transport Constancy. In Proc. International Conference on Computer Vision.
[30] DeGroot, M. H. (2005). Optimal statistical decisions, volume 82. Wiley-Interscience.
[31] Esedoglu, S. and March, R. (2003). Segmentation with Depth but Without Detecting
Junctions. Journal of Mathematical Imaging and Vision, 18(1), 7–15.
[32] Federer, H. (1969). Geometric measure theory. Springer-Verlag New York Inc.,
New York.
[33] Georgiev, T. and Lumsdaine, A. (2010a). Focused plenoptic camera and rendering.
Journal of Electronic Imaging, 19(2), 021106–021106.
[34] Georgiev, T. and Lumsdaine, A. (2010b). Reducing plenoptic camera artifacts. In
Computer Graphics Forum. Wiley Online Library.
[35] Geys, I., Koninckx, T. P., and Gool, L. V. (2004). Fast interpolated cameras by
combining a GPU based plane sweep with a max-flow regularisation algorithm. In
3DPVT , pages 534–541.
[36] Gokturk, S. B., Yalcin, H., and Bamji, C. (2004). A time-of-flight depth sensor - system description, issues and solutions. In Computer Vision and Pattern Recognition
Workshop, 2004. CVPRW’04. Conference on, pages 35–35. IEEE.
[37] Goldluecke, B. (2013). cocolib - a library for continuous convex optimization.
http://cocolib.net.
[38] Goldluecke, B. and Cremers, D. (2009). Superresolution Texture Maps for Multiview
Reconstruction. In Proc. International Conference on Computer Vision.
[39] Goldluecke, B. and Wanner, S. (2013). The Variational Structure of Disparity and
Regularization of 4D Light Fields. In Proc. International Conference on Computer
Vision and Pattern Recognition.
[40] Goldman, D., Curless, B., Hertzmann, A., and Seitz, S. (2010). Shape and SpatiallyVarying BRDFs from Photometric Stereo. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 32(6), 1060–1071.
[41] Gortler, S., Grzeszczuk, R., Szeliski, R., and Cohen, M. (1996). The Lumigraph.
In Proc. SIGGRAPH , pages 43–54.
[42] Harris, C. and Stephens, M. (1988). A combined corner and edge detector. In Alvey
vision conference, volume 15, page 50. Manchester, UK.
[43] Hirschmuller, H. and Scharstein, D. (2007). Evaluation of cost functions for stereo
matching. In Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE
Conference on, pages 1–8. IEEE.
[44] Ives, F. (1903). Patent us 725,567.
[45] Jaehne, B. (2005). Digitale Bildverarbeitung. Springer DE.
[46] Jin, H., Soatto, S., and Yezzi, A. (2005). Multi-View Stereo Reconstruction of
Dense Shape and Complex Appearance. International Journal of Computer Vision,
63(3), 175–189.
[47] Joshi, N., Matusik, W., and Avidan, S. (2006). Natural video matting using camera
arrays. In ACM Transactions on Graphics (TOG), volume 25, pages 779–786. ACM.
[48] Kang, S. B. (1997). A survey of image-based rendering techniques. Digital, Cambridge Research Laboratory.
[49] Koethe, U. (2012). The VIGRA computer vision library, version 1.9.0. http://hci.iwr.uni-heidelberg.de/vigra/.
[50] Kowdle, A., Sinha, S., and Szeliski, R. (2012). Multiple View Object Cosegmentation
using Appearance and Stereo Cues. In Proc. European Conference on Computer
Vision.
[51] Kubota, A., Aizawa, K., and Chen, T. (2007). Reconstructing Dense Light Field
From Array of Multifocus Images for Novel View Synthesis. IEEE Transactions on
Image Processing, 16(1), 269–279.
[52] Lazaros, N., Sirakoulis, G. C., and Gasteratos, A. (2008). Review of stereo vision
algorithms: from software to hardware. International Journal of Optomechatronics,
2(4), 435–462.
[53] Lellmann, J., Becker, F., and Schnörr, C. (2009). Convex Optimization for Multi-Class Image Labeling with a Novel Family of Total Variation Based Regularizers. In
IEEE International Conference on Computer Vision (ICCV).
[54] Lengyel, J. (1998). The convergence of graphics and vision. Computer , 31(7),
46–53.
[55] Levin, A. and Weiss, Y. (2007). User Assisted Separation of Reflections from a
Single Image Using a Sparsity Prior. IEEE Transactions on Pattern Analysis and
Machine Intelligence.
[56] Levin, A., Zomet, A., and Weiss, Y. (2004). Separating Reflections from a Single
Image Using Local Features. In Proc. International Conference on Computer Vision
and Pattern Recognition.
[57] Levoy, M. (2006). Light fields and computational imaging. Computer , 39(8), 46–55.
[58] Levoy, M. and Hanrahan, P. (1996a). Light field rendering. In Proceedings of the
23rd annual conference on Computer graphics and interactive techniques, pages 31–42.
ACM.
[59] Levoy, M. and Hanrahan, P. (1996b). Light field rendering. In Proc. SIGGRAPH ,
pages 31–42.
[60] Levoy, M., Ng, R., Adams, A., Footer, M., and Horowitz, M. (2006). Light field
microscopy. ACM Transactions on Graphics (TOG), 25(3), 924–934.
[61] Lippmann, G. (1908). Epreuves reversibles donnant la sensation du relief. J. Phys.
Theor. Appl., 7(1), 821–825.
[62] Lumsdaine, A. and Georgiev, T. (2008). Full resolution lightfield rendering. Technical report.
[63] Lumsdaine, A. and Georgiev, T. (2009a). The focused plenoptic camera. In In
Proc. IEEE ICCP , pages 1–8.
[64] Lumsdaine, A. and Georgiev, T. (2009b). The Focused Plenoptic Camera. In In
Proc. IEEE International Conference on Computational Photography, pages 1–8.
[65] Mark, W. R., McMillan, L., and Bishop, G. (1997). Post-rendering 3d warping. In
Proceedings of the 1997 symposium on Interactive 3D graphics, pages 7–ff. ACM.
[66] Marwah, K., Wetzstein, G., Bando, Y., and Raskar, R. (2013). Compressive Light
Field Photography using Overcomplete Dictionaries and Optimized Projections. ACM
Trans. Graph. (Proc. SIGGRAPH), 32(4), 1–11.
[67] Matoušek, M., Werner, T., and Hlavác, V. (2001). Accurate correspondences from
epipolar plane images. In Proc. Computer Vision Winter Workshop, pages 181–189.
[68] Meister, S. (2014). On Creating Reference Data for Performance Analysis in Image
Processing. Ph.D. thesis, University of Heidelberg.
[69] Ng, R. (2006). Digital Light Field Photography. Ph.D. thesis, Stanford University.
Note: thesis led to commercial light field camera, see also www.lytro.com.
[70] Ng, R., Levoy, M., Brédif, M., Duval, G., Horowitz, M., and Hanrahan, P. (2005).
Light field photography with a hand-held plenoptic camera. Technical Report CSTR
2005-02, Stanford University.
[71] Nickolls, J., Buck, I., Garland, M., and Skadron, K. (2008). Scalable parallel
programming with cuda. Queue, 6(2), 40–53.
[72] Perwass, C. and Wietzke, L. (2010). The next generation of photography, www.raytrix.de.
[73] Perwass, C. and Wietzke, L. (2012). Single lens 3d-camera with extended depth-of-field. In SPIE Electronic Imaging 2012, pages 22–26.
[74] Pock, T., Cremers, D., Bischof, H., and Chambolle, A. (2010). Global Solutions of
Variational Models with Convex Regularization. SIAM Journal on Imaging Sciences.
[75] Ponce, J. and Forsyth, D. (2011). Computer vision: a modern approach.
[76] Protter, M. and Elad, M. (2009). Super-Resolution With Probabilistic Motion
Estimation. IEEE Transactions on Image Processing, 18(8), 1899–1904.
[77] Roosendaal, T. (1998). Blender, http://www.blender.org.
[78] Rudin, L. I., Osher, S., and Fatemi, E. (1992). Nonlinear total variation based
noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1), 259–268.
[79] Ruiters, R. and Klein, R. (2009). Heightfield and spatially varying BRDF Reconstruction for Materials with Interreflections. Computer Graphics Forum (Proc.
Eurographics), 28(2), 513–522.
[80] Sard, A. (1942). The measure of the critical values of differentiable maps. Bull.
Amer. Math. Soc, 48(12), 883–890.
[81] Scharr, H. (2000). Optimale Operatoren in der digitalen Bildverarbeitung.
[82] Scharstein, D. and Pal, C. (2007). Learning conditional random fields for stereo. In
Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on,
pages 1–8. IEEE.
[83] Scharstein, D. and Szeliski, R. (2002). A taxonomy and evaluation of dense twoframe stereo correspondence algorithms. International Journal of Computer Vision,
47, 7–42.
[84] Scharstein, D. and Szeliski, R. (2003). High-accuracy stereo depth maps using
structured light. In Computer Vision and Pattern Recognition, 2003. Proceedings.
2003 IEEE Computer Society Conference on, volume 1, pages I–195. IEEE.
[85] Seitz, S. M. and Dyer, C. R. (1997). View morphing: Uniquely predicting scene
appearance from basis images. In Proc. Image Understanding Workshop, pages
881–887.
[86] Shade, J., Gortler, S., He, L.-w., and Szeliski, R. (1998). Layered depth images.
In Proceedings of the 25th annual conference on Computer graphics and interactive
techniques, pages 231–242. ACM.
[87] Shan, J. and Toth, C. K. (2008). Topographic laser ranging and scanning: principles
and processing. CRC Press.
[88] Shum, H. and Kang, S. B. (2000). Review of image-based rendering techniques. In
VCIP , pages 2–13. Citeseer.
[89] Shum, H.-Y. and He, L.-W. (1999). Rendering with concentric mosaics. In
Proceedings of the 26th annual conference on Computer graphics and interactive
techniques, pages 299–306. ACM Press/Addison-Wesley Publishing Co.
[90] Sinha, S., Kopf, J., Goesele, M., Scharstein, D., and Szeliski, R. (2012). Image-Based
Rendering for Scenes with Reflections. ACM Transactions on Graphics (Proc. SIGGRAPH), 31(4), 100:1–100:10.
[91] Smith, O. (1961). Eigenvalues of a symmetric 3x3 matrix. Communications of the
ACM , 4(4), 168.
[92] Stein, A., Hoiem, D., and Hebert, M. (2007). Learning to Find Object Boundaries
Using Motion Cues. In Proc. International Conference on Computer Vision.
[93] Strekalovskiy, E. and Cremers, D. (2011). Generalized Ordering Constraints for
Multilabel Optimization. In Proc. International Conference on Computer Vision.
[94] Strekalovskiy, E., Goldluecke, B., and Cremers, D. (2011). Tight Convex Relaxations
for Vector-Valued Labeling Problems. In Proc. International Conference on Computer
Vision.
[95] The HDF Group, http://www.hdfgroup.org/HDF5 (2000-2010). Hierarchical data
format version 5.
[96] Tsin, Y., Kang, S., and Szeliski, R. (2006). Stereo Matching with Linear Superposition of Layers. IEEE Transactions on Pattern Analysis and Machine Intelligence,
28(2), 290–301.
[97] Vaish, V., Wilburn, B., Joshi, N., and Levoy, M. (2004). Using plane + parallax
for calibrating dense camera arrays. In Computer Vision and Pattern Recognition,
2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on,
volume 1, pages I–2. IEEE.
[98] van Rossum, G. and Drake (eds), F. (2001). Python Reference Manual. PythonLabs, Virginia, USA. Available at http://www.python.org.
[99] Wang, Z. and Bovik, A. (2009). Mean Squared Error: Love it or Leave it? IEEE
Signal Processing Magazine, 26(1), 98–117.
[100] Wanner, S. and Goldluecke, B. (2012a). Globally consistent depth labeling of
4D light fields. In Proc. International Conference on Computer Vision and Pattern
Recognition, pages 41–48.
[101] Wanner, S. and Goldluecke, B. (2012b). Spatial and angular variational super-resolution of 4D light fields. In Proc. European Conference on Computer Vision.
[102] Wanner, S. and Goldluecke, B. (2013a). Reconstructing reflective and transparent
surfaces from epipolar plane images. In Pattern Recognition, pages 1–10. Springer.
[103] Wanner, S. and Goldluecke, B. (2013b). Variational Light Field Analysis for
Disparity Estimation and Super-Resolution. IEEE Transactions on Pattern Analysis
and Machine Intelligence.
[104] Wanner, S., Fehr, J., and Jähne, B. (2011). Generating EPI representations of
4D Light fields with a single lens focused plenoptic camera. Advances in Visual
Computing, pages 90–101.
[105] Wanner, S., Meister, S., and Goldluecke, B. (2013a). Datasets and benchmarks for
densely sampled 4d light fields. In Vision, Modeling & Visualization, pages 225–226.
The Eurographics Association.
[106] Wanner, S., Straehle, C., and Goldluecke, B. (2013b). Globally Consistent Multi-Label Assignment on the Ray Space of 4D Light Fields. In Proc. International
Conference on Computer Vision and Pattern Recognition.
[107] Wetzstein, G., Ihrke, I., Lanman, D., and Heidrich, W. (2011). Computational
plenoptic imaging. In Computer Graphics Forum, volume 30, pages 2397–2426. Wiley
Online Library.
[108] Wilburn, B., Joshi, N., Vaish, V., Talvala, E.-V., Antunez, E., Barth, A., Adams,
A., Horowitz, M., and Levoy, M. (2005a). High performance imaging using large
camera arrays. ACM Trans. Graph., 24, 765–776.
[109] Wilburn, B., Joshi, N., Vaish, V., Talvala, E.-V., Antunez, E., Barth, A., Adams,
A., Horowitz, M., and Levoy, M. (2005b). High performance imaging using large
camera arrays. ACM Transactions on Graphics, 24, 765–776.
[110] Zach, C., Gallup, D., Frahm, J.-M., and Niethammer, M. (2008). Fast Global
Labeling for Real-Time Stereo Using Multiple Plane Sweeps. In Vision, Modeling and
Visualization Workshop VMV 2008 .
[111] Zach, C., Niethammer, M., and Frahm, J.-M. (2009). Continuous Maximal Flows
and Wulff Shapes: Application to MRFs. In Proc. International Conference on
Computer Vision and Pattern Recognition.
[112] Zickler, T., Belhumeur, P., and Kriegman, D. (2002). Helmholtz Stereopsis:
Exploiting Reciprocity for Surface Reconstruction. International Journal of Computer
Vision, 49(2–3), 215–227.
[113] Zwicker, M., Matusik, W., Durand, F., Pfister, H., and Forlines, C. (2006).
Antialiasing for automultiscopic 3D displays. In ACM Transactions on Graphics
(Proc. SIGGRAPH), page 107. ACM.