COMPRESSIVE IMAGING FOR DIFFERENCE IMAGE FORMATION AND WIDE-FIELD-OF-VIEW TARGET TRACKING
by
Shikhar
Copyright © Shikhar 2010
A Dissertation Submitted to the Faculty of the
DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING
In Partial Fulfillment of the Requirements
For the Degree of
DOCTOR OF PHILOSOPHY
In the Graduate College
THE UNIVERSITY OF ARIZONA
2010
THE UNIVERSITY OF ARIZONA
GRADUATE COLLEGE
As members of the Dissertation Committee, we certify that we have read the dissertation
prepared by Shikhar
entitled Compressive Imaging for Difference Image Formation and Wide-Field-of-View Target Tracking
and recommend that it be accepted as fulfilling the dissertation requirement for the
Degree of Doctor of Philosophy.
Date: September 07, 2010
Dr. Nathan A. Goodman
Date: September 07, 2010
Dr. Michael Gehm
Date: September 07, 2010
Dr. Bane Vasić
Date: September 07, 2010
Date: September 07, 2010
Final approval and acceptance of this dissertation is contingent upon the candidate’s
submission of the final copies of the dissertation to the Graduate College.
I hereby certify that I have read this dissertation prepared under my direction and
recommend that it be accepted as fulfilling the dissertation requirement.
Date: September 07, 2010
Dissertation Director: Dr. Nathan A. Goodman
STATEMENT BY AUTHOR
This dissertation has been submitted in partial fulfillment of requirements for an
advanced degree at the University of Arizona and is deposited in the University
Library to be made available to borrowers under rules of the Library.
Brief quotations from this dissertation are allowable without special permission,
provided that accurate acknowledgment of source is made. Requests for permission
for extended quotation from or reproduction of this manuscript in whole or in part
may be granted by the copyright holder.
SIGNED: Shikhar
ACKNOWLEDGEMENTS
I am immensely grateful to my advisor, Dr. Nathan Goodman, for being an exceptional mentor. His guidance, patience and support have greatly helped me in my
research work. Discussions and talks with him on varied research topics have taught
me how to approach new research problems. For that, I am beholden to him. My
warm thanks to Hyoung-soo Kim, Junhyeong Bae and Ric Romero, with whom I
shared our lab over the course of my PhD. The lab environment they created was
a joy to work in. We shared many experiences, and also a great many laughs over
the idiosyncrasies of the research world. It has been my pleasure to be their friend
and colleague.
A very special thanks to Dr. Mark Neifeld for his helpful insights and suggestions
regarding my research work. They greatly helped me in achieving my research
objectives. It was a tremendous learning experience to work with him, and I am
thankful to him for his guidance and help. Sincere thanks also to Dr. Michael
Gehm and Dr. Bane Vasić for taking time away from their busy schedules to be on
my committee. They both have always been available whenever I needed their help
in research or other matters. I am very grateful.
Finally, I would like to thank my friends and family, especially my Dad and my
sister, for their love, encouragement and support. This dissertation would not have
been possible without them.
DEDICATION
... to Papa and Sansu ...
TABLE OF CONTENTS
LIST OF FIGURES
ABSTRACT

CHAPTER 1  INTRODUCTION
  1.1  Compressive Imaging
  1.2  Compressed Sensing
  1.3  Our Contribution
    1.3.1  Optical Multiplexing For Superposition Space Tracking
    1.3.2  Feature-Specific Difference Imaging
  1.4  Dissertation Outline

CHAPTER 2  OPTICAL ARCHITECTURES
  2.1  Optical Multiplexing
  2.2  Compressive Feature Specific Imagers

CHAPTER 3  SUPERPOSITION SPACE TRACKING
  3.1  Sub-FOV Superposition and Encoding
    3.1.1  2-D Spatial, Rotational and Magnification Encodings
  3.2  Decoding: Proof of Concept
    3.2.1  Correlation-based Tracking
    3.2.2  Decoding Procedure for a Single Target
    3.2.3  Decoding Procedure for Multiple Targets
    3.2.4  Decoding via Missing Ghosts

CHAPTER 4  SUPERPOSITION SPACE TRACKING RESULTS
  4.1  Simulation
  4.2  Average Decoding Time and Area Coverage Efficiency (η)
  4.3  Probability of Decoding Error
    4.3.1  Experimental Results

CHAPTER 5  FEATURE-SPECIFIC DIFFERENCE IMAGING
  5.1  Linear Reconstruction
    5.1.1  Data Model
    5.1.2  Indirect Image Reconstruction
    5.1.3  Direct Difference Image Estimation
    5.1.4  DDIE: Noise Absent
    5.1.5  DDIE: Noise Present
    5.1.6  Sensing Matrices
    5.1.7  Multi-step DDIE and LFGDIE
  5.2  Non-linear Reconstruction

CHAPTER 6  FEATURE-SPECIFIC DIFFERENCE IMAGING RESULTS
  6.1  ℓ2-based Difference Image Estimation
  6.2  ℓ1-based Difference Image Estimation

CHAPTER 7  CONCLUSION AND FUTURE WORK

APPENDIX A  DERIVATIONS FOR NOISE ABSENT CASE
APPENDIX B  LFGDIE AND DDIE ESTIMATION OPERATORS
APPENDIX C  WATERFILLING SOLUTION

REFERENCES
LIST OF FIGURES

1.1  Conventional and FS imagers
1.2  Optical multiplexing
1.3  Difference image
2.1  Beamsplitter and mirror based optical system
2.2  Binary combiner
2.3  Sequential feature specific imager
2.4  Parallel feature specific imager
2.5  Photon sharing feature specific imager
3.1  Object, superposition and hypothesis spaces
3.2  Sub-FOV overlap and separation distance
3.3  Encoding schemes
3.4  Ambiguous target decoding scenarios
3.5  Error in target decoding
3.6  Decoding via missing ghosts
4.1  Examples of valid target patterns
4.2  Decoding 4 targets
4.2  Decoding 4 targets contd.
4.2  Decoding 4 targets contd.
4.3  Plot of decoding time as a function of the area coverage efficiency η
4.4  Plot of decoding error vs. area coverage efficiency η
4.5  Experimental data performance
5.1  Direct and indirect reconstruction
5.2  Stochastic Tunneling
5.3  Multi-step direct difference image estimation
6.1  ℓ2-based estimated difference image
6.2  RMSE vs. SNR plots for 8 × 8 block size
6.3  RMSE vs. M for SNR = 20 dB for 8 × 8 block size
6.4  Single step vs. multi-step DDIE
6.5  RMSE vs. SNR plots for LFGDIE method
6.6  RMSE vs. SNR performance plots for block size 16 × 16
6.7  Optimal sensing matrix performance
6.8  ℓ1-based estimated difference image
6.9  RMSE vs. SNR curves for 8 × 8 blocks, for the TV method
6.10 ℓ1-based WDPC sensing matrix RMSE vs. SNR curves
6.11 RMSE vs. M per block, for 8 × 8 blocks, for the TV method
6.12 TV method performance for different sensing matrix
ABSTRACT
Use of imaging systems for performing various situational awareness tasks in military and commercial settings has a long history. There is increasing recognition,
however, that a much better job can be done by developing non-traditional optical systems that exploit the task-specific system aspects within the imager itself. In
some cases, a direct consequence of this approach can be real-time data compression
along with increased measurement fidelity of the task-specific features. In others,
compression can potentially allow us to perform high-level tasks such as direct tracking using the compressed measurements without reconstructing the scene of interest.
In this dissertation we present novel advancements in feature-specific (FS) imagers
for large field-of-view surveillance, and estimation of temporal object-scene changes
utilizing the compressive imaging paradigm. We develop these two ideas in parallel.
In the first case we show a feature-specific (FS) imager that optically multiplexes
multiple, encoded sub-fields of view onto a common focal plane. Sub-field encoding
enables target tracking by creating a unique connection between target characteristics in superposition space and the target’s true position in real space. This is
accomplished without reconstructing a conventional image of the large field of view.
System performance is evaluated in terms of two criteria: average decoding time
and probability of decoding error. We study these performance criteria as a function of resolution in the encoding scheme and signal-to-noise ratio. We also include
simulation and experimental results demonstrating our novel tracking method. In
the second case we present a FS imager for estimating temporal changes in the object scene by quantifying these changes through a sequence of difference
images. The difference images are estimated by taking compressive measurements
of the scene. Our goals are twofold. First, to design the optimal sensing matrix
for taking compressive measurements. In scenarios where such sensing matrices are
not tractable, we consider plausible candidate sensing matrices that either use the
available a priori information or are non-adaptive. Second, we develop closed-form
and iterative techniques for estimating the difference images. We present results to
show the efficacy of these techniques and discuss the advantages of each.
CHAPTER 1
INTRODUCTION
The history of optics is long and eventful, going back to the Assyrians in seventh-century B.C. Mesopotamia [1]. In the intervening centuries the development of optics went
through two major advancements. The first was the development of man-made optical elements such as lenses to enhance human vision. The second was the recording
of image formation on a photochemical film. Both these advancements have had
immense impact on various disciplines such as medicine (mammography), astronomy (telescope) and photography to name a few, and on modern life in general.
We are now in the era of the third major advancement, the replacement of photochemical films with opto-electronic sensors, and the ubiquitous use of digital signal
processing [2].
Until recently, the imaging and the signal processing components of the optical
system were treated as separate. The goal of the imaging components was to image
the object scene of interest. Towards this end imaging components were optimized
using optical methods to improve resolution, reduce aberrations and in general obtain a good quality image of the object scene. Signal and image processing methods,
along with higher level computer vision techniques, were then used to remove effects
of sensors and residual optical aberrations to produce an improved image, and to
extract relevant image features to accomplish an analysis or situational awareness
task at hand. To be sure, this is still a major area of research. However, Hounsfield's development of the X-ray tomographic scanner in 1972, and the development of nuclear magnetic resonance spectroscopic techniques in the latter half of the twentieth century – which eventually led to the invention of magnetic resonance imaging (MRI) during the 1970s – have influenced a growing trend of optical design informed
by digital signal processing ideas. More and more imaging and signal processing
components of the system are being jointly designed to exploit the extra degrees of
freedom for better performance [2], [3]. This is the realm of computational imaging.
The key to computational imaging is the discrete representation of data, which
is inherent in the use of opto-electronic sensors. This discrete analytical approach
allows us to think of computational imaging as mapping discretized object space
data to some measurement space. Depending on the nature of the mapping we
get three broad areas in computational imaging [2]: 1. Isomorphic or one-to-one mapping, resulting in the measurement space being an image of the object space. 2. Dimension-preserving mapping, such as the Fourier transform, where the dimensions of the space in which the measurements are embedded and the dimensions of the native object space are the same. 3. Discrete mapping, where the object data is linearly or non-linearly projected onto some measurement space, with no assumptions about the underlying embedded space. It is in the context of this last category that we define compressive imaging.
1.1 Compressive Imaging
Compressive imaging involves linearly mapping a higher-dimensional object space to
a lower-dimensional measurement space, leading to real-time compression of data (we make the notion of compression precise in Section 2.2).
Furthermore, the mapping is based on some task-specific metric and designed to
improve measurement fidelity and system performance. This approach can be contrasted with the traditional approach of conventional imaging where the goal is
to obtain a high-fidelity picture and then extract relevant information as a postprocessing step. Figure 1.1 illustrates the contrast between the two approaches.
Figure 1.1(a) shows the conventional imaging paradigm where a traditional optical
system images the object scene of interest onto a film or an optical sensor. (In
keeping with the discrete representation we assume it to be a photodetector.) The
image is then processed for the task at hand. For example, in face recognition,
independent features (computed using the independent component analysis technique) can be extracted from a set of training images, imaged using
a traditional camera. The point of note here is that imaging and computational
image processing are implemented as two distinct processes cascaded together. A considerable amount of work has been done to separately improve the performance of each
of these two stages. This, however, has now begun to change. The idea depicted
in Fig. 1.1(b) is to replace the conventional imager with the feature-specific (FS)
imager, which instead of imaging the object scene, measures features of the object
scene specific to a particular task at hand. Taking the face recognition example,
the FS imager directly measures the independent facial features. By controlling the
number of features measured as a function of the performance of the face recognition
task at hand, it turns out that we need to measure fewer features than the native
dimensionality of the object scene, resulting in a real-time compression of data [4].
This directly results in reduced sensor costs. The second, and far more surprising, benefit is the improved measurement fidelity (specifically, the detector-noise-limited fidelity) and, consequently, improved face
recognition performance, due to the incidence of the same number of object scene
photons as in the conventional scenario on a smaller set of photo-detectors.
The history of FS imager-based compressive imaging is quite recent with its beginnings lying in this past decade. The work by Neifeld and Shankar [5] was the
first formalization of this idea. The work itself was motivated by earlier work in
computational imaging related to hardware-software co-design [3], development of
information metrics [6], [7], [8], [9], and non-traditional [10] and novel imaging systems [11], [12], [13]. Compressive imaging since then has been used for designing a
face recognition optical system [4], and FS imagers using spatial structured light.

Figure 1.1: Conventional and FS imagers: (a) conventional imager, (b) FS imager.

FS imaging with structured illumination can give a 40% improvement in measurement fidelity, while at the same time achieving compression by a factor of 400 [14]. A
number of compressive imaging architectures have also been proposed with their
advantages and disadvantages studied in detail [15]. The key to this historical development, as must have been deduced by the reader, has been the utilization of
knowledge-enhanced measurements, that is, incorporating a priori knowledge to
design compressive FS imaging systems. For example, if there is a need for data
compression, then the measured features can be wavelet features. On the other
hand, if data encoding/decoding is required then Hadamard features can be used.
A closely related approach to FS compressive imaging, which has gained wide usage, stresses making non-adaptive compressive scene measurements. This approach is referred to as compressed sensing (CS). The thrust of CS is slightly different from that of FS compressive imaging. Here the focus is on designing a universal imager that
does not utilize any a priori information about the scene, except for requiring the
scene be sparse in some representation basis. One of its major successes has been
the development of a single pixel camera [16]. CS ideas have also been used in various other fields such as statistics [17], coding theory [18] and computer science [19].
In the context of compressive imaging, however, CS has mainly focussed on image
reconstruction. FS compressive imaging, on the other hand, as will be shown in
this work, can be directly employed to perform situational awareness tasks without
reconstructing the scene. Secondly, though CS performs better for a task without
available a priori information, its performance is sub-optimal (as will be shown in
this work) when such information is available, which is generally the case in many
practical applications. CS, however, has an elegant mathematical rigour to it, drawing its ideas from mathematical analysis and algebra. For this reason, and because of its
closeness to FS compressive imaging, we give a brief overview of CS in the following
section. The work presented in this manuscript, however, will focus on our novel
contributions to FS imager-based compressive imaging.
1.2 Compressed Sensing
CS was introduced through a series of papers by Candès, Romberg and Tao [20], [21],
[18], [22], [23] and Donoho [24], with many contributions from research in the field
of sparse approximations [25], [26], [27], [28], [29], [30]. The main thesis of CS is
that the conventional Nyquist sampling theory can be improved upon. Nyquist
sampling requires the number of Fourier samples to be exactly the same as the
number of pixels in an image for exact reconstruction. CS, however, says that
under certain conditions involving sparsity and incoherence the number of samples
required to reconstruct the image exactly with a high probability is much smaller
than the image resolution. In a general setting CS can be defined as a new data
acquisition paradigm which says that if the scene to be imaged is sparse in a certain
representation basis, then taking compressive measurements of this scene using a
measurement basis that is incoherent with the representation basis, allows us to
reconstruct the scene to an arbitrary degree of accuracy.
To make the CS idea concrete, let us consider a signal x ∈ RN , which we sense
(or measure) through a set of vectors φm ∈ RN to get measurements
ym = ⟨φm, x⟩,  m = 1, . . . , M.    (1.1)
For a conventional imager, φm will comprise the standard Euclidean basis with
M = N yielding a traditional image of the object scene. On the other hand, if the
φm ’s are, say, the discrete Fourier basis then we take frequency measurements of
the object as in MRI. Using matrix notation we can write (1.1) as
y = Φx,    (1.2)
where φm ’s comprise the rows of the sensing matrix Φ. Given the set of measurements y, the goal is to reconstruct the signal x. CS is interested in solving
this problem when the system is under-determined (M ≪ N), i.e., the number of measurements is much less than the native dimensionality of the signal. Using data
acquisition terminology, we take under-sampled measurements of the signal.
In general, we know from linear algebra that this problem is ill-posed and has
infinitely many solutions. However, if we know that the signal x is sparse in some
basis, then Candès, Romberg and Tao [20] showed that it is almost always possible
to exactly reconstruct x by solving the following convex optimization problem,
minimize ||x||1 subject to y = Φx.    (1.3)
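As a concrete illustration of (1.2) and (1.3) (not part of the dissertation; the sizes N, M, K and the Gaussian sensing matrix are arbitrary choices), the following minimal Python sketch recasts the ℓ1 program as a linear program and recovers a sparse signal from M ≪ N measurements using SciPy.

```python
# Illustrative sketch: recover a K-sparse x from M << N random measurements
# by recasting (1.3) as a linear program.  Sizes are arbitrary demo choices.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
N, M, K = 128, 40, 5

# K-sparse signal and a Gaussian sensing matrix Phi
x = np.zeros(N)
x[rng.choice(N, K, replace=False)] = rng.standard_normal(K)
Phi = rng.standard_normal((M, N)) / np.sqrt(M)
y = Phi @ x                                  # compressive measurements, as in (1.2)

# minimize ||x||_1 s.t. Phi x = y, written as an LP over variables [x; t]
# with |x_i| <= t_i enforced by the two inequality blocks below.
c = np.concatenate([np.zeros(N), np.ones(N)])
A_ub = np.block([[np.eye(N), -np.eye(N)],
                 [-np.eye(N), -np.eye(N)]])
b_ub = np.zeros(2 * N)
A_eq = np.hstack([Phi, np.zeros((M, N))])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=y,
              bounds=[(None, None)] * N + [(0, None)] * N, method="highs")
assert res.success
x_hat = res.x[:N]
print("recovery error:", np.linalg.norm(x_hat - x))
```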
Sparsity: A vector z is said to be K-sparse if the number of non-zero entries is
exactly K. Sparsity has been exploited in various fields for signal estimation and
data compression. For example, the shrinkage algorithm by Donoho [31] for signal
estimation exploits the idea of sparsity, and JPEG2000 based transform coding uses
sparsity of natural images to achieve scalable compression [32], [33]. Most of these
uses, however, are in the post-processing stage for better signal estimation and data
modeling. Compressed sensing goes a step further: if it is known that the signal is sparse in some basis, then this knowledge has a direct impact on data acquisition itself.
Let us denote by matrix Ψ the representation basis in which the signal x has
a sparse representation. Then we can express the analysis and synthesis equations, respectively, as

α = Ψ†x,    (1.4)
x = Ψα.    (1.5)
For ease of exposition we have assumed Ψ to be an orthonormal basis, and consequently, the inverse of Ψ is its adjoint, Ψ†. (The extension to an overcomplete basis can be achieved by considering the generalization of the orthonormal basis of the vector space to a frame of the vector space.) We can now re-write (1.2) as
y = ΦΨα    (1.6)
  = Φ′α,    (1.7)
where Φ′ = ΦΨ. The convex program can therefore be expressed as
minimize ||α||1 subject to y = Φ′α.    (1.8)
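The relations (1.4)–(1.7) can be checked numerically; the short sketch below uses an orthonormal DCT matrix as a stand-in for Ψ (an illustrative assumption, not a choice made in this work) and verifies that y = Φx = Φ′α.

```python
# Sketch of (1.4)-(1.7): an orthonormal DCT basis stands in for Psi; sizes
# and basis choice are illustrative assumptions.
import numpy as np
from scipy.fft import dct

rng = np.random.default_rng(1)
N, M, K = 64, 20, 4

Psi = dct(np.eye(N), norm="ortho", axis=0)    # orthonormal DCT matrix (columns are basis vectors)

alpha = np.zeros(N)                           # K-sparse coefficient vector
alpha[rng.choice(N, K, replace=False)] = rng.standard_normal(K)
x = Psi @ alpha                               # synthesis, (1.5)
assert np.allclose(Psi.T @ x, alpha)          # analysis with Psi^dagger = Psi^T, (1.4)

Phi = rng.standard_normal((M, N)) / np.sqrt(M)
Phi_prime = Phi @ Psi                         # effective sensing matrix, (1.7)
y = Phi @ x
assert np.allclose(y, Phi_prime @ alpha)      # (1.6): y = Phi Psi alpha
```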
Assuming α is K-sparse, Candès and Tao showed that a matrix Φ′ from which α can be accurately estimated should satisfy the restricted isometry property (RIP) [21]. The RIP says that if we choose at random at most K columns from Φ′, then these columns are approximately orthonormal to each other. This
allows us to define the K-restricted isometry constant δK . Using this constant we
can then define the condition under which we can exactly recover x. Specifically,
let T ⊂ {1, 2, . . . , N}, and let |T|, the cardinality of T, be less than or equal to K.
Let us also denote by Φ′T the matrix Φ′ with all columns not indexed in T replaced
by zero column vectors. Then the K-restricted isometry constant δK is the smallest
constant such that
(1 − δK)||a||²ℓ2 ≤ ||Φ′T a||²ℓ2 ≤ (1 + δK)||a||²ℓ2.    (1.9)
Using this property, Candès and Tao proved [18] the following theorem.
Theorem 1.2.1 Assume that α is K-sparse and suppose that δ2K + θK,2K < 1. Then the solution α∗ to (1.8) is exact, i.e., α∗ = α.
Here θK,2K is called the K, 2K-restricted orthogonality constant. If we define two
disjoint index sets T ⊂ {1, 2, . . . , N} and U ⊂ {1, 2, . . . , N} such that |T| ≤ K and
|U| ≤ 2K, and K + 2K ≤ N, then θK,2K is the cosine of the smallest angle between
the two subspaces spanned by columns of Φ′T and Φ′U .
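Since δK cannot be computed exactly without examining all column subsets, a Monte Carlo check such as the sketch below (illustrative only; sizes and the Gaussian matrix are arbitrary) gives an empirical lower bound on δK by sampling random supports T and bounding ||Φ′T a||²/||a||² through the singular values of Φ′T.

```python
# Empirical illustration of (1.9): for random column subsets T with |T| = K,
# the extreme singular values of Phi'_T bound the ratio ||Phi'_T a||^2 / ||a||^2.
# This samples a few subsets only, so it is a sanity check, not a certificate
# of delta_K (which would require checking every subset).
import numpy as np

rng = np.random.default_rng(2)
N, M, K, trials = 256, 80, 8, 2000

Phi_prime = rng.standard_normal((M, N)) / np.sqrt(M)   # columns have unit norm in expectation

lo, hi = np.inf, 0.0
for _ in range(trials):
    T = rng.choice(N, K, replace=False)
    s = np.linalg.svd(Phi_prime[:, T], compute_uv=False)
    lo, hi = min(lo, s[-1] ** 2), max(hi, s[0] ** 2)

# The true delta_K is at least this large; (1.9) holds with this delta on the sampled subsets.
delta_est = max(1.0 - lo, hi - 1.0)
print(f"empirical delta over {trials} random supports of size {K}: {delta_est:.3f}")
```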
Theorem 1.2.1 states that it is possible to recover sufficiently sparse signals from
highly undersampled measurements. It is important to note, however, that exact K-sparsity is a strict condition rarely satisfied by real-world signals of interest. A
more realistic scenario is to consider compressible signals. Compressible signals can
be thought of as signals that are not strictly supported on a sparse set, but are instead
concentrated on a sparse set with most of the signal content within or near this set.
Therefore, we can think of compressible signals as nearly sparse. Furthermore, the
measurements are imperfect due to the presence of noise. This is accounted for by
re-writing the convex program as
minimize ||α||ℓ1 subject to ||y − Φ′α||ℓ2 ≤ ε,    (1.10)
where ε is the noise deviation. For this realistic scenario, Candès, Romberg and
Tao proved that it was possible to stably recover the original signal with error (in
ℓ2 -norm) no worse than if we knew the largest K terms of the signal corrupted by
noise [22].
Theorem 1.2.2 Suppose that α is an arbitrary vector in RN and let αK be the
truncated vector corresponding to the K largest values of α (in absolute value).
Under the hypothesis δ3K + 3δ4K < 2, the solution α∗ to (1.10) obeys

||α∗ − α||ℓ2 ≤ C1,K · ε + C2,K · ||α − αK||ℓ1/√K.    (1.11)
For reasonable values of δ4K the constants in (1.11) are well behaved; e.g. C1,K ≈
12.04 and C2,K ≈ 8.77 for δ4K = 1/5.
It was further shown that the largest number of entries K that can be recovered
are related to the number of required measurements through the mutual coherence
between Φ and Ψ. Specifically, M is upper bounded by C · (1/µ) · K/(log N)⁴, where µ, the mutual coherence, is defined as √N maxi,j |⟨φi, ψj⟩|. Hence, the larger the in-
coherence between Φ and Ψ, the larger K can be. Seen another way, for a fixed
K, the larger µ is, the smaller the number of measurements required. The goal
therefore is to design a sensing matrix that satisfies the restricted isometry property for signals with large K. Although research efforts are ongoing, there is still
no standardized methodology for designing such matrices. It has, however, been
proven that random measurements (Gaussian and binary) do satisfy the RIP with
overwhelming probability. The use of random measurements has allowed CS to
develop non-adaptive signal acquisition protocols for applications in different areas
such as optical sensing [16], MRI [34] and high-resolution radar imaging [35].
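The following snippet (an illustrative aside, with basis choices assumed rather than taken from this work) computes µ = √N maxi,j |⟨φi, ψj⟩| for the classic spike/DCT pair, a spike/random pair, and a fully coherent pair, showing the range from µ ≈ √2 up to µ = √N.

```python
# Mutual coherence mu = sqrt(N) * max_{i,j} |<phi_i, psi_j>| for a few
# measurement/representation pairs.  Basis choices are illustrative assumptions.
import numpy as np
from scipy.fft import dct

def mutual_coherence(Phi, Psi):
    # rows of Phi (measurement vectors) against columns of Psi (representation basis)
    N = Phi.shape[1]
    return np.sqrt(N) * np.max(np.abs(Phi @ Psi))

N = 64
spikes = np.eye(N)                               # rows are delta functions (pixel measurements)
Psi_dct = dct(np.eye(N), norm="ortho", axis=0)   # orthonormal DCT columns
Q, _ = np.linalg.qr(np.random.default_rng(3).standard_normal((N, N)))  # random orthonormal basis

print("spike / DCT    :", mutual_coherence(spikes, Psi_dct))    # ~sqrt(2): highly incoherent
print("spike / random :", mutual_coherence(spikes, Q))          # ~sqrt(2 log N)
print("DCT   / DCT    :", mutual_coherence(Psi_dct.T, Psi_dct)) # = sqrt(N): maximally coherent
```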
The CS approach to compressive imaging overlaps with the FS approach in that both undersample the signal of interest, leading to real-time data
compression. However, as mentioned earlier, unlike CS which employs non-adaptive
random measurements, FS imager-based compressive imaging focusses on making
knowledge-enhanced measurements, where a priori information about the signal of
interest is used to design the sensing matrix. There are on-going research efforts to
incorporate prior knowledge in the CS paradigm [36], [37].
1.3 Our Contribution
In this dissertation we introduce novel advancements in large field-of-view surveillance, and estimation of temporal object scene changes utilizing the compressive
imaging paradigm. Both these ideas fall under the FS imager-based compressive
imaging paradigm. As we show, in the former case the FS imager is an optical
multiplexer that performs real-time object scene encoding. In the latter case we
study various FS imagers and discuss their reconstruction-based performance. In
both cases, however, we achieve real-time compression.
We will develop these two ideas in parallel. For the sake of providing a simple categorization, we associate the former work with the “analysis” task, while
the latter is a “reconstruction” task. By analysis we mean extracting relevant information, in this case, surveillance information, without reconstructing the scene
from the non-traditional compressive object scene encodings. Reconstruction, on
the other hand, self-evidently implies faithfully estimating the temporal changes in
the scene from the non-traditional compressive object scene measurements. The
conceptual thread that unites these two approaches is that by employing task-based
compressive measurements we can achieve the desired goal, and in fact, better it
with reduced overhead in terms of sensor and optics costs. We now introduce
these two cases.
1.3.1 Optical Multiplexing For Superposition Space Tracking
There are numerous applications that require visible and/or infrared surveillance
over a large field of view (FOV). The requirement for large FOV frequently arises
in the context of security and/or situational awareness applications in both military
and commercial areas. A common challenge associated with large FOV concerns the
high cost of the required imagers. Imager cost can be classified into two categories:
sensor costs and lens/optics costs. Sensor costs such as focal plane array (FPA) yield
(i.e., related to pixel count), electrical power dissipation, transmission bandwidth
requirements (e.g., for remote wireless applications), etc. all increase with FOV.
Some of these scalings are quite severe, with process yield for example increasing
exponentially with FOV. Optics costs such as size (i.e., total track), complexity (e.g.,
number of elements and/or aspheres), and mass also increase nonlinearly with FOV.
However, these costs are somewhat more difficult to quantify without undertaking
detailed optical designs. To illustrate the point, consider two commercial lenses from Canon: the Canon EF 14mm f/2.8L II USM and the Canon EF 50mm
f/1.8 II lenses. The former is a wide-angle lens (FOV = 114 degrees) while the latter
is a standard-angle lens. The wide-angle lens requires more optical elements and a
sophisticated design to maintain resolution over the field of view. As a result, the
wide-angle lens uses 14 optical elements and weighs 645 grams, while the standard-angle lens uses 6 optical elements and weighs 130 grams.
We note that the various costs associated with a conventional approach to wide-FOV imaging often prohibit deployment of such imagers on platforms of interest. For example, the high mass cost, together with the electrical power and bandwidth costs of conventional wide-field imagers, restricts their application on many UAV
(unmanned aerial vehicle) platforms. Therefore, the motivation of this work is to
reduce the various costs of wide-FOV surveillance and thus make it feasible for more
widespread deployment.
One typical solution to this problem involves the use of narrow-FOV pan/tilt
cameras whose mechanical components often come with the associated costs of increased size, weight, and power consumption. The pan/tilt solution also sacrifices
continuous coverage in exchange for reduced optical complexity. In most traditional
approaches the goal in such problems is to reconstruct the scene of interest. Contrary
to traditional approaches, we propose a class of task-specific multiplexed imagers
to collect encoded data in a lower-dimensional measurement space we call superposition space and develop a decoding algorithm that tracks targets directly in this
superposition space. We discuss the multiplexed imagers in the next chapter. For
now, we assume that we have this ability and briefly explain the basic idea behind
superposition space tracking.
We begin by treating the large FOV as a set of small sub-FOVs (disjoint subregions of the large FOV). All sub-FOVs are simultaneously imaged onto a common
focal plane array (FPA) using a multiplexed imager. The multiple lens system of
Fig. 1.2 is a schematic depiction of this operation. Lens shutters can also be used to
control whether individual sub-FOVs are turned on, though for clarity the shutters
are not drawn in Fig. 1.2. The measurement therefore is a superposition of certain
sub-FOVs. A key feature of our work is applying a unique encoding for each sub-FOV, which facilitates target tracking directly on the measured superimposed data.
Potential encoding schemes include (a) spatial shift, (b) rotation, (c) magnification,
and/or combinations of these. The superposition of the sub-FOVs leads to real-time
data compression, while the encoding strategy ensures that the target information
can be decoded at the post-processing stage.
Figure 1.2: Multiple-lens camera capable of performing both pan/tilt and multiplexed operations.
In spatial-shift encoding, each sub-FOV is assigned a shift such that it overlaps
adjacent sub-FOVs by a specified, unique amount. These spatial shifts can be one
dimensional (1-D) or two dimensional (2-D). In the 1-D case, the large FOV is sub-
divided along one dimension into smaller sub-FOVs. In the 2-D case, as illustrated
in Fig 3.3(a), the full FOV is sub-divided into two orthogonal directions. Therefore,
the 2-D case can be thought of as two separable 1-D cases with decoding information
shared between the two. Rotational encoding (Fig 3.3(b)) assigns different rotations
to each sub-FOV such that a target undergoes an angular shift in the superimposed
image when it crosses between two sub-FOVs. The rotational difference between any
two adjacent sub-FOVs must be unique. In a similar manner, as shown in Fig 3.3(c),
magnification encoding assigns a unique magnification to each sub-FOV such that
changes in the target’s apparent size can be used to properly determine its location.
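As a rough illustration of the superposition operation (a toy 1-D sketch under assumed sizes and shifts, not the encoding design developed in Chapter 3), the snippet below splits a wide 1-D scene into Ns sub-FOVs, applies a distinct shift to each, and sums them onto a common focal plane; the distinct shifts are what later make decoding possible.

```python
# Toy 1-D spatial-shift encoding sketch.  The distinct shift increments only
# loosely mimic the unique-overlap requirement; the actual encoding design is
# developed in Chapter 3.  All sizes here are arbitrary.
import numpy as np

rng = np.random.default_rng(4)
Ns, W_sub = 8, 100                        # number of sub-FOVs, pixels per sub-FOV
shifts = np.cumsum(np.arange(Ns))         # distinct shifts: 0, 1, 3, 6, ...
W_fpa = W_sub + shifts.max()              # focal plane just wide enough for all shifts

wide_fov = np.zeros(Ns * W_sub)
target_pos = rng.integers(0, Ns * W_sub)  # a single point target somewhere in the wide FOV
wide_fov[target_pos] = 1.0

superposition = np.zeros(W_fpa)
for s in range(Ns):
    sub = wide_fov[s * W_sub:(s + 1) * W_sub]            # s-th sub-FOV
    superposition[shifts[s]:shifts[s] + W_sub] += sub     # encode (shift), then superimpose

# The target lands at (position within its sub-FOV) + (that sub-FOV's shift).
print("target in sub-FOV", target_pos // W_sub,
      "appears at focal-plane pixel", int(np.argmax(superposition)))
```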
These encoding methods are discussed in Chapter 3. The detailed emphasis is, however, on the 1-D spatial shift case due to its relative ease of implementation, easier
visualization and proof-of-concept explanation, and its straightforward extension to
the 2-D case. Without loss of generality we will consider 1-D horizontal spatial shift
encoding. (Note that, 1-D vertical spatial shift encoding is an equivalent case.)
The decoding process refers to the algorithm applied to determine a target’s
true location in object space (the large FOV). The decoding process is also discussed in Chapter 3. It is noteworthy that the encoding of the sub-FOVs results
in a compressed image which is not a traditional picture at all, but has embedded
information that is unravelled by the decoding algorithm without reconstructing the
large FOV object scene. This deliberate distortion of FPA data to enhance post-
processing performance capacity is a novel attribute of FS compressive imaging, and
computational imaging in general [2].
1.3.2 Feature-Specific Difference Imaging
Subtracting successive image frames to accomplish a given task has many applications ranging from watermarking [38], [39] and material inspection [40] to video
compression [41], background subtraction [42], bio-medicine [43], astronomy [44] and
change detection in remote sensing [45]. In all these applications the idea of differencing is applied in two different ways. In the first, and the most common class of
methods, the difference image has a temporal connotation, where the difference is taken
between the object scene at two time instants to compute interframe change that
can then be qualitatively or quantitatively analyzed depending on the task at hand.
For example, interframe difference images are quantitatively classified for video compression [41], while in remote sensing multi-spectral images of the same geographical
scene are acquired at two different dates and then subtracted to qualitatively and
quantitatively assess changes in land coverage [45].
In the second class of problems, difference images are not used to capture change
over time, but instead are used to identify objects or phenomena in an imaged scene.
In some applications in astronomy for example, difference images are computed by
subtracting a reference image from test images, and are then used to detect galactic
microlensing [44]. Here the idea is to detect a faint phenomenon instead of analyzing
or exploiting temporal changes in the scene over time.
In the above mentioned applications the scenes are usually imaged using conventional optics. In cases involving storage and transmission of the collected raw data,
compression is generally performed as a post-processing step. As a case in point,
many commercial and professional grade video cameras use some form of built-in
transform coding to compress the image to a manageable size. In fact, as mentioned
above, conventionally acquired images are used for performing video compression
by exploiting interframe differences between them. In this second part of the dissertation we introduce different ways to estimate relative temporal changes in the
object scene (sequence of interframe difference images) from compressive measurements informed by a priori scene information. This work belongs to the first class
of methods where we estimate the interframe difference image between the object
scenes at two consecutive time instants. (Figure 1.3 gives an example of a difference
image.) The novelty lies in the proposed ℓ2 - and ℓ1 -based estimation techniques and
in the feature design. Usually, compressive measurements are associated with nonlinear ℓ1 - or ℓ0 -based reconstruction techniques, and have been successfully used as
such in the imaging context. Why then consider linear ℓ2 -based estimation? Because
it is a linear estimation method that provides an efficient closed-form solution for
the difference image estimation operator. In the rare case that the Bayesian general
linear model assumption [46] holds, the linear estimation operator is also optimal in
the ℓ2 sense. We also show that ℓ2 -based estimation allows us to directly estimate
the difference image without first reconstructing the object scene at the consecutive
time instants. We show that an immediate consequence of direct estimation is the
ability to exploit the spatio-temporal cross-correlation between the object scene at
the consecutive time instants. We further generalize the definition of difference image to include the difference between the object scene at nonadjacent time instants
and show how successive compressive measurements over the corresponding time
interval can be used to directly estimate this generalized difference image.
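A minimal sketch of why direct estimation is possible: because the measurements are linear, the difference of two measurement vectors depends only on the difference image, ∆y = Φ∆x. The ridge-regularized least-squares step below is only a stand-in illustration under assumed sizes and a random Φ; it is not the DDIE operator derived in Chapter 5.

```python
# Illustration only: the measurement-domain difference dy = Phi @ dx depends
# only on the difference image, so dx can be estimated without reconstructing
# either frame.  The ridge estimate is a generic stand-in, not this work's
# estimator; sizes, noise level, and Phi are arbitrary assumptions.
import numpy as np

rng = np.random.default_rng(5)
N, M = 64, 24                                  # e.g., an 8x8 block and M features

x_k = rng.random(N)
dx = np.zeros(N)
dx[rng.choice(N, 3, replace=False)] = 0.5      # a sparse change between frames
x_k1 = x_k + dx

Phi = rng.standard_normal((M, N)) / np.sqrt(M)
sigma = 0.01
y_k = Phi @ x_k + sigma * rng.standard_normal(M)
y_k1 = Phi @ x_k1 + sigma * rng.standard_normal(M)

dy = y_k1 - y_k                                # measurement-domain difference
lam = 1e-2                                     # ridge (Tikhonov) regularization weight
dx_hat = np.linalg.solve(Phi.T @ Phi + lam * np.eye(N), Phi.T @ dy)
print("relative error:", np.linalg.norm(dx_hat - dx) / np.linalg.norm(dx))
```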
Compressive measurements are taken by optically projecting the object scene
onto some measurement basis. For the noiseless case we find this optimal measurement basis, but in the presence of noise, an optimal measurement basis is not
mathematically tractable. Therefore, we evaluate the performance of some possible
candidate measurement bases including the optimal solution obtained through a
numerical search technique.
Although ℓ2 -based difference image estimation offers many advantages for practical use, and as we will see gives good results, it makes assumptions about object
scene stationarity. This assumption is not strictly true. We therefore, also consider
ℓ1 -based estimation which does not make any such assumptions and consequently
has superior performance. It is important to note, however, that in this work we
Figure 1.3: (a) Object scene xk at time instant tk, (b) object scene xk+1 at time instant tk+1, and (c) the resulting truth difference image ∆x = xk+1 − xk.
demonstrate that the approximate ℓ2 -based estimation methods do in fact perform
well enough to suffice for many practical scenarios.
The ℓ1 -based estimation method works by exploiting the sparsity of the difference image. We set up the ℓ1 -based estimation problem as a linear inverse problem
with three different regularizers representing three different points of view of the
estimation problem. The first regularizing condition simply imposes the sparsity
constraint on the difference image. We then consider a modified form of total variation (TV) regularizer. When we make the reasonable assumption that the image
is a function of bounded variation (BV), TV is a natural measure used to capture
edge discontinuities, an important feature in difference images. Lastly, we look at
overcomplete representations by considering a modified form of basis pursuit denoising (BPDN). We experimentally show that by using either available a priori
information or that learned from training data, we get better performance than the
CS paradigm of a sparsifying dictionary incoherently coupled with a random sensing
matrix (Gaussian or Bernoulli/Rademacher matrices).
1.4 Dissertation Outline
In Chapter 2 we describe the proposed candidate optical architectures for multiplexed imagers and discuss some FS imagers for difference image estimation. Chapters 3 and 4 are devoted to the tracking problem. Chapter 3 explains the encoding
methodology and the decoding algorithm, while results showing the efficacy of this
system design are given in Chapter 4. Chapter 5 details the linear and non-linear
difference image estimation techniques along with the different measurement kernels,
while in Chapter 6 we present the results analysing the efficacy of these estimation
techniques. In Chapter 7 we discuss the conclusions of our work and also outline
some future challenges.
CHAPTER 2
OPTICAL ARCHITECTURES
In this chapter we introduce architectures for optical spatial multiplexing for target
tracking. We also briefly discuss feature specific imagers [15] which inform our work
on difference image estimation.
2.1 Optical Multiplexing
Multiplexing in imaging has a long history. For the past three decades multiplexing
has been employed in high energy (x-ray and γ-ray) astronomy for reconstructing
high-resolution images of the sky. The lack of ability to focus high-energy radiation above 10 keV [47] has led to the development of spatial multiplexing in the form of coded-aperture imaging for achieving pinhole-camera resolution without the attendant loss
in photon count. The coding is done by masking some photodetectors on the FPA
while allowing light to the others. This masking pattern on the FPA is referred to as
the coded aperture or the coded mask. This coded aperture results in the formation
of a non-traditional image which is then reconstructed using linear or non-linear
reconstruction techniques such as simple inversion, cross-correlation [48], photon-tagging [49], Wiener filtering [50], [51] and the maximum entropy method [52], [53],
among others. The key to the success of these reconstruction techniques is the design
of the coded apertures. Special mention is reserved for cyclic coded masks known as
modified uniformly redundant arrays (MURAs) [54], which have been recognized as the
optimal masks for coded aperture imaging. Coded aperture imaging has also been
used in the visible spectrum for image reconstruction. Our approach to multiplexing is significantly different from coded aperture imaging in that our goal is not image reconstruction but target tracking. Towards this end, we combine optical design with algorithmic development to achieve tracking while altogether avoiding computationally intensive image reconstruction. Our second goal is to simultaneously reduce the
system cost in terms of sensor and optics requirements while tracking the targets in
a large FOV.
Previously our colleagues, Stenner et al., have reported on the capabilities of a
novel multiplexed camera that employs multiple lenses imaging onto a common focal
plane [55]. A schematic depiction of this multi-aperture camera was shown in Fig. 1.2
in Chapter 1. Note that each lens can have a dedicated shutter (not shown). In this
camera each lens images a separate region (i.e., a sub-FOV) within the full FOV. By
appropriate choice of shutter configurations, various modes of operation are possible.
In one such mode all shutters are opened in sequence (one at a time), enabling
an emulation of pan/tilt operation. Another mode of operation allows multiple
shutters to be open at a time. This mode employs group testing concepts in order
to measure various superpositions of sub-FOVs and invert the resulting multiplexed
measurements to obtain a high-quality reconstructed image [56], [57], [58], [59], [60].
The third mode allows all the shutters to be open at the same time resulting in
superposition of all sub-FOVs onto the common FPA. We, however, want to encode
the sub-FOVs before superimposing them so that relevant target information can
be decoded at the post-processing stage. Towards this end we propose two optical
implementations.
Figure 2.1: Optical setup capable of performing multiplexed operations with spatial shift encoding for (a) Ns = 2 and (b) Ns = 8. (See [61].)
The first implementation is shown in Fig. 2.1. In this setup we employ beamsplitter and mirror combinations to form the superposition measurements. This
configuration reduces the lens count and avoids the image-plane tilt associated with
the configuration shown in Fig. 1.2. Figure 2.1(a) shows a setup for two sub-FOVs
consisting of a beamsplitter and a movable mirror, which serves as a building block
for a larger system. The optical field from the left sub-FOV (fov1) is reflected by the mirror followed by the beamsplitter, and then is merged with the optical field from the right sub-FOV (fov2) that has passed through the beamsplitter. The rotation of the mirror around the z-axis results in the translation of fov1 along the
x-direction in superposition space, providing a means to control the overlap between
fov1 and fov2. Figure 2.1(b) shows an assembly of such building blocks, which can
superimpose eight sub-FOVs. This implementation will serve as the basis for the
experimental results presented later in Chapter 4.
The second implementation shown in Fig. 2.2 further refines the idea proposed
in the above implementation by using a binary combiner in a logarithmic sequence
arrangement. If we consider the same eight sub-FOVs as in Fig. 2.1(b), then this
design allows us to access all eight sub-FOVs, using three stages of single-plate
beamsplitter/mirror pairs at each stage, and three shutters placed on the mirrors.
The shutters can be opened and closed in a binary sequence from 000 (all closed)
to 111 (all open, superposition operation) to multiplex the eight sub-FOVs onto
the camera. Although the effective aperture of the plate beamsplitter and mirror
combination increases at each stage, there is an overall reduction in complexity
in comparison to the optical implementation shown in Fig. 2.1(b). For a general
scenario, when the large angular FOV is φ radians and each angular sub-FOV is given
by β radians, the number of stages in the binary combiner is given by S = ⌈log2 φ/β⌉
and the front end effective aperture Ae required to avoid vignetting is approximately
given by
Ae = 2A(1 − tan(φ/4))^(−1) (1 − β)^(−S),    (2.1)
where A is the aperture of the camera. To obtain this relation between effective
front end aperture and the large angular FOV we fix the plate beamsplitter to be at
an angle of π/4 with respect to the vertical axis and adjust the mirror to the desired
angle depending on the value of φ. If we define the angle of the mirror from the
vertical to be γ, then for a given φ, γ = (π − φ)/4. This system gives us the ability
to employ a camera whose angular FOV is smaller than the large angular FOV by a
factor of 2^S. We have discussed the optical scheme in a 1-D setting, but because the
horizontal and vertical directions are separable, extension to 2-D is straightforward.
Figure 2.2 illustrates this design concept with a specific example. The angles
are shown in degrees for this example. The implementation is designed for eight
sub-FOVs, each having an angular range of β = 7.5◦ . For simplicity, the sub-FOVs
are assumed to be non-overlapping, resulting in an angular FOV (φ) of 60◦ . The
first stage folds 0◦ to −30◦ angular range onto the 0◦ to 30◦ angular range. In the
second stage the resulting 30◦ angular range is further halved to a 15◦ range using
a plate beamsplitter and a mirror, which are at angles of 45◦ and 52.5◦ from the
vertical axis. The third stage again halves the 15◦ range to 7.5◦ which is the range
of a single sub-FOV. For the third stage the plate beamsplitter and mirror angles
are 45◦ and 48.75◦, respectively. If the sub-FOVs are overlapped, then the angles of
the mirrors in each stage can be adjusted to implement the given overlap.
Figure 2.2: Binary combiner in a log sequence arrangement for multiplexing 8 sub-FOVs, each with an angular range of 7.5◦. (See [61].)
As the angular FOV at the end of this three-stage binary combiner is reduced
by a factor of 2^3 to 7.5◦, we can use a simple lens at the end of this optical setup.
We consider the TECHSPEC MgF2 coated achromatic doublet with a diameter of
12.5mm and a focal length of 35mm. Given A = 12.5mm, using (2.1) we calculate
the front end effective aperture Ae of the beamsplitter and mirror combination to
be 5.2cm. Since the optical system does 2-D imaging, we have the same three-stage binary combiner in the other dimension with the same effective front end
aperture. In total therefore, we have six plate beamsplitter and mirror combinations.
Assuming, for simplicity, that the beamsplitter and the mirror equally share the
effective aperture, the lengths of the plate beamsplitter and the mirror are given by
(5.2/2)√2 cm and (5.2/2)(2/√3) cm. The factors of √2 and 2/√3 follow from the
plate beamsplitter and the mirror being at angles of 45◦ and 30◦ respectively from
the vertical axis. Assuming a square size for both, the Stage 1 dimensions of the plate beamsplitter are 3.7cm × 3.7cm and those of the mirror are 3cm × 3cm. Also
assuming the thickness of the optical glass to be 2mm, we calculate that in Stage 1
the total volume of glass used by the two pairs (corresponding to 2-D imaging) of
plate beamsplitter and mirror combination is 9.1cm3 . Doing similar calculations for
Stages 2 and 3 gives the volume of glass used to be 6.8cm3 and 4cm3 , respectively.
If we take the density of the optical glass to be 2.5gm/cm3 , the total mass of the
log combiner is 49.75gm. The mass of the achromatic doublet is less than 5gm and
hence the weight of the optical system is about 54.75gm. If we were to directly use
a wide-angle lens to capture an angular FOV of 60◦ , then a potential candidate is
Canon’s EF 35mm f/1.4L USM lens. It has an angular FOV of 63◦ , but its mass is
580 gm and it uses 11 optical elements. Therefore, we see savings of about a factor
of 10 for our proposed multiplexed imager and also reduced optical complexity as
we are using a simple, easy-to-design binary combiner and an achromatic doublet
as opposed to an 11-element wide-angle lens.
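The numbers in this example can be checked directly; the sketch below reproduces them under the stated assumptions (φ and β in radians in (2.1), 2 mm glass, density 2.5 g/cm³), taking the Stage 2 and Stage 3 glass volumes from the text above.

```python
# Numerical check of the worked binary-combiner example, under the assumptions
# stated in the text (A = 12.5 mm, phi = 60 deg, beta = 7.5 deg, 2 mm glass,
# glass density 2.5 g/cm^3, doublet mass ~5 g).
import numpy as np

phi = np.deg2rad(60.0)          # full angular FOV
beta = np.deg2rad(7.5)          # angular sub-FOV
A = 1.25                        # camera aperture, cm

S = int(np.ceil(np.log2(phi / beta)))                     # number of stages -> 3
Ae = 2 * A / (1 - np.tan(phi / 4)) * (1 - beta) ** (-S)   # (2.1) -> ~5.2 cm

plate = (Ae / 2) * np.sqrt(2)          # plate beamsplitter side, ~3.7 cm
mirror = (Ae / 2) * 2 / np.sqrt(3)     # mirror side, ~3.0 cm
vol_stage1 = 2 * (plate**2 + mirror**2) * 0.2   # two pairs (2-D imaging), 2 mm glass -> ~9.1 cm^3

volumes = [vol_stage1, 6.8, 4.0]       # Stage 2 and 3 volumes taken from the text
mass = 2.5 * sum(volumes) + 5.0        # glass mass plus the doublet -> ~55 g
print(f"S = {S}, Ae = {Ae:.2f} cm, Stage-1 glass = {vol_stage1:.1f} cm^3, total mass = {mass:.1f} g")
```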
Two practical issues that arise in designing optical systems involving beamsplitters are vignetting and transmission loss. Vignetting arises when there is nonuniformity in the amount of light that passes through the optical system for each of
the points in the FOV. This resulting non-uniformity at the periphery of the superposition data has the potential of increasing false negatives which in turn can lead
to errors in properly locating the targets. To overcome this potential problem in the
setup shown in Fig. 2.1, the size of each beamsplitter should be large enough to ensure that the angular range subtended by that beamsplitter at the camera is a strict
upper bound on the angular range of the corresponding sub-FOV. Field stops are
then used to restrict the angular range of the beamsplitter to that of the sub-FOV.
Specifically, for the setup shown in Fig. 2.1 and used in chapter 3, each sub-FOV
is 1.9◦ horizontally and 1.3◦ vertically while the angular range of the beamsplitter
is approximately 3◦ . As a result, vignetting is avoided. For the binary combiner
shown in Fig. 2.2 vignetting is not an issue because (2.1) is derived from a vignetting
analysis of the binary combiner, to give an effective aperture Ae that does not block
light from any point in the large angular FOV.
The second issue has to do with a decrease in throughput due to optical “combination loss” when the light passes through the various stages of the beamsplitters.
Specifically, for the eight sub-FOV multiplexed imager shown in Figure 2.1(b), the optical transmission decreases by a factor of (0.5)(0.5)(0.5) = 0.125. This throughput cost, however, is no worse than that of a narrow-field imager in a pan-tilt mode, which is commonly used to achieve a wide FOV [62], [63]. For our current example the dwell time
of a corresponding narrow-field imager on each sub-FOV will be 1/Ns = 1/8 = 0.125
time units. Since the photon count is directly proportional to the dwell time, we
have the same photon efficiency for both the multiplexed and wide-field imagers.
This result can be extended to a general case of Ns sub-FOVs where the photon
efficiency of both the multiplexed and wide-field imagers is reduced by a factor of
Ns . We acknowledge, however, that for the proposed beam-splitter and mirror-based
multiplexed imagers we have not taken the component losses, e.g. scattering at the
mirror, into account. We assume these losses to be small in comparison to the photon efficiency. Unlike the beamsplitter and mirror based multiplexed imagers, the
multiple-lens-based imager shown in Fig. 1.2 overcomes the disadvantage of loss in
photon efficiency at the expense of a higher optical mass cost.
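The throughput comparison reduces to the following arithmetic (a trivial check, with Ns = 8 and three beamsplitter stages as in the example above; component losses are ignored, as in the text):

```python
# Per-sub-FOV photon efficiency: three 50/50 beamsplitter stages versus a
# pan/tilt imager that dwells on each sub-FOV for 1/Ns of the frame time.
Ns = 8
stages = 3                                   # log2(Ns) beamsplitter stages
multiplexed_transmission = 0.5 ** stages     # 0.125
pan_tilt_dwell_fraction = 1 / Ns             # 0.125
print(multiplexed_transmission, pan_tilt_dwell_fraction)
```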
The imagers we have proposed are used to simultaneously encode and compress
(through superposition of the sub-FOVs) the data. The subsequent algorithm then
performs target tracking by decoding the relevant information from this compressed
superposition data. In the next chapter we discuss in detail the encoding and compression, and the decoding algorithm.
2.2 Compressive Feature Specific Imagers
Feature-specific imagers take scene measurements by linearly projecting the object
space onto a measurement space using knowledge-enhanced measurement kernels.
Inherent in the projection and the attendant real-time compression is the discretization of the object space with respect to a certain system resolution of the FS imager.
The system resolution includes the resolution of both optical system and the sensor.
We represent this resolution by ∆r . Let the object scene be H by W unit distance.
Then the pixel dimension of the scene is h = H/∆r by w = W/∆r . Mathematically
the discretized scene is generally expressed in a vector form with dimension hw × 1.
Setting N = hw, we think of the scene as an N ×1 vector. Therefore N would be the
dimension of the sensor array if conventional imaging were being used. Henceforth,
when we talk about reduced dimensionality of compressive measurements it will be
with respect to this maximal conventional imaging dimension. Thus, if, say, N = 100 and we take M = 5 measurements, then the compression factor is N/M = 20.
For the discrete representation, the measurement kernel is represented by a measurement matrix (or sensing matrix) Φ. Each row φi , i = 1, . . . , M of the sensing
matrix is a basis vector of the compressed measurement space onto which the object scene x is optically projected. This results in the measurements yi = φi x, i = 1, . . . , M, or
y = Φx,    (2.2)

where y is the M × 1 measurement vector, Φ is the M × N sensing matrix, and x is the N × 1 object scene vector.
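For concreteness, the dimensional bookkeeping of (2.2) with the N = 100, M = 5 example quoted above looks as follows (the block size and the random non-negative Φ are illustrative assumptions only):

```python
# Dimensional bookkeeping for (2.2): an h x w block gives N = h*w pixels,
# and M projective features give a compression factor of N/M.
import numpy as np

h, w = 10, 10                     # pixel dimensions of the (block of the) scene
N = h * w                         # native dimensionality, 100
M = 5                             # number of features

rng = np.random.default_rng(6)
x = rng.random((N, 1))            # N x 1 vectorized object scene
Phi = rng.random((M, N))          # M x N sensing matrix (non-negative, as for incoherent optics)
y = Phi @ x                       # M x 1 feature measurements

print("compression factor N/M =", N // M)   # -> 20
```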
There are three basic FS imagers for taking these compressive measurements:
sequential, parallel and photon-sharing. The sequential FS imager, schematically
illustrated in Fig. 2.3, consists of a single optical aperture along with an adaptive optic mask and a single photodetector. It takes one measurement in a single
time-step. The adaptive optical mask changes its transmittance at each time step
according to the projection φi being measured. The transmittance-weighted object
scene incident irradiance is then spatially integrated by the photodetector. Additive
white Gaussian noise (AWGN) is then added to the output of the photodetector to
give a single measurement yi . The process is repeated for different projections over
M time-steps to obtain M measurements. The parallel FS imager, shown in Fig. 2.4
sub-divides the total optical aperture into M sub-apertures, each with its own fixed
optical mask and photodetector. It consequently takes all the measurements in the
same time interval.
The choice between sequential and parallel architectures has a direct impact
on measurement noise. Let us assume that the total optical aperture diameter
for both sequential and parallel architectures is D. Let us also assume that the
total time to take the M measurements is T . Then the time per measurement
for sequential FS imager is T /M, and the sub-aperture diameter for parallel FS
Figure 2.3: Sequential feature specific imager. (See [15].)
Figure 2.4: Parallel feature specific imager. (See [15].)
imager is D/√M . Consequently, for both architectures the photon count scales as T D²/M.
There is, however, a bandwidth advantage associated with the parallel architecture.
By bandwidth, we mean the inverse of the measurement time per projection. For
sequential architecture it is M/T , while for parallel architecture it is 1/T , a constant.
Thus, in the case of a sequential FS imager, the bandwidth scales with M and therefore so does the noise: if we let the AWGN variance be σ², then the noise variance of the sequential imager scales as σ²M/T. The disadvantage of the parallel architecture is its hardware cost, as it requires M photodetectors.
The third architecture is the photon-sharing architecture schematically depicted
in Fig. 2.5. It is essentially a pipeline architecture with M stages, each stage measuring one projective measurement of the incident object scene irradiance. Each
stage uses a spatial polarization modulator and a polarization beamsplitter. The
spatial polarization modulator rotates the incident light according to the projection
being measured, and the polarization beamsplitter directs the orthogonal component of the polarization generated by the modulator to be spatially integrated at
the photodetector, while allowing the rest of the incident irradiance to pass on to the next stage. The main advantage of the photon-sharing architecture is its photon efficiency: it does not use optical masks, with their absorptive losses, and hence discards no useful photons. At the same time it is able to
make all the projective measurements in time T . Consequently, like the parallel imager, this architecture has the bandwidth advantage. Its disadvantage is its relative
complexity and increased hardware requirements in terms of optical elements and
photodetectors. For details on the working of these optical architectures we refer the reader to the work of our colleagues, Neifeld et al. [5], [15].
The FS architectures discussed above perform incoherent imaging. Incoherent
imaging systems are linear in intensity, thus limiting the sensing matrix entries to
positive values. The optimal, mathematically derived sensing matrix, however, can
have negative entries. To bridge the gap between practice and theory we need a way
to physically implement the bipolar entries of the sensing matrix without violating
the positivity requirement. One such method is dual-rail signalling consisting of
two complementary arms. One arm implements Φ+ (the positive entries of Φ are
kept while the negative ones are set to zero) to get y+ = Φ+ x, and the second
arm implements Φ− (the absolute values of the negative entries are kept while the
positive ones are set to zero) to get y− = Φ− x. The resulting measurement y is
then given by y = y+ − y− = Φ+ x − Φ− x = (Φ+ − Φ− )x = Φx. More flexibility
can be added to this basic setup as discussed in [5].
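A minimal numerical sketch of this dual-rail decomposition (the matrix and scene values below are arbitrary examples, not optimized kernels):

import numpy as np

Phi = np.array([[0.4, -0.2],
                [-0.1, 0.3]])          # example bipolar sensing matrix
x = np.array([1.0, 2.0])               # example object scene vector

Phi_plus = np.clip(Phi, 0, None)       # positive entries kept, negatives set to zero
Phi_minus = np.clip(-Phi, 0, None)     # absolute values of the negative entries

y = Phi_plus @ x - Phi_minus @ x       # y = y+ - y- = (Phi+ - Phi-) x = Phi x
assert np.allclose(y, Phi @ x)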
Another constraint on the sensing matrix comes from the passive nature of the
optical architecture. An imager cannot increase the number of photons collected by the photodetector. In other words, the total number of photons leaving the optical system cannot exceed the number entering it. This condition manifests itself as the photon count constraint, which requires that the maximum absolute column sum of Φ (i.e., the induced 1-norm of Φ) be at most one. To ensure this constraint is met, the sensing matrix is normalized by its maximum column sum. Let c = maxj Σi |Φij |. Then the sensing matrix satisfying the photon count constraint is given by Φ/c.
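The normalization can be sketched as follows (the random Φ is assumed purely for illustration):

import numpy as np

rng = np.random.default_rng(1)
Phi = rng.standard_normal((5, 20))      # example bipolar sensing matrix

c = np.abs(Phi).sum(axis=0).max()       # maximum absolute column sum (induced 1-norm)
Phi_feasible = Phi / c                  # satisfies the photon count constraint

assert np.abs(Phi_feasible).sum(axis=0).max() <= 1.0 + 1e-12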
From now on we assume that we have the capability to optically implement the
Figure 2.5: Photon sharing feature specific imager. (See [15].)
sensing matrix satisfying the photon count constraint.
In Chapter 5 we detail the potential projective measurement matrices
that can be used in these optical architectures.
CHAPTER 3
SUPERPOSITION SPACE TRACKING
We do not require the production of a traditional image for the purpose of target
tracking. The efficacy of a target tracking system is determined only by the
accuracy of the estimated target positions. In this chapter we detail our encoding
techniques and the decoding algorithm for continuously tracking targets moving in
a large FOV (object scene) of interest, without incurring the computational and
primary memory (random access memory) cost of reconstructing the entire scene.1
This task-specific approach to target tracking also results in real-time data compression leading to direct savings in sensor and secondary data storage (hard disk)
costs apart from the optical cost advantages discussed in Chapter 2.
We show that by dividing the large FOV into sub-FOVs that are optically encoded and multiplexed, we are able to successfully localize targets without scene reconstruction. We outline the possible encoding strategies, briefly discussing each one. We focus mainly, however, on 1-D spatial shift encoding for ease of exposition of the proof of concept. We then detail the decoding algorithm.
1. This cost can be, for example, due to the inversion of the correlation matrix or the execution of an iterative estimation technique.
We show how the decoding algorithm can be used for single- as well as multiple-target decoding, even for scenarios where the multiple targets are identical to each other and morphological discrimination criteria cannot be used.
We begin this chapter by introducing the object, superposition and hypothesis
spaces. These three spaces form the key to the development of the encoding and
the decoding methodology.
3.1 Sub-FOV Superposition and Encoding
In a manner identical to FS imagers introduced in Section 2.2, we assume the large
FOV to be H distance units in the vertical dimension by W distance units in the
horizontal dimension. To best illustrate the proof-of-concept of our approach, we
consider the 1-D horizontal spatial shift encoding. We next define the following
three domains or spaces we will work with: object space, superposition space, and
hypothesis space.
The object space is the full area on the ground that is actually observed by
the sensor. For the sake of clarity, we begin by letting the sub-FOVs be uncoded.
Uncoded sub-FOVs are obtained when the H × W area of the large FOV is simply
divided into Ns adjacent sub-FOVs, without overlap, as shown in Fig. 3.1(a). Each
of the resulting sub-FOVs is H × Wf ov in size where Wf ov = W/Ns . Using an
optically multiplexed imager, we image each of these sub-FOVs onto a common
FPA. Assuming, as in Section 2.2, that the object space resolution of the optical
system is ∆r distance units in each direction, the dimensionality (in pixels) of a
single sub-FOV is then given by H/∆r pixels by Wf ov /∆r pixels.
Optical multiplexing of all sub-FOVs onto an FPA the size of a single sub-FOV
corresponds to superposition of all the sub-FOVs. This measured image comprises
what we call the superposition space. Note that the superposition of all sub-FOVs
onto a single FPA provides data compression - H/∆r by W/∆r pixels are imaged
using an H/∆r by Wf ov /∆r-pixel FPA. Specifically, the compression ratio for the
uncoded case being discussed is Ns . Measuring only the superposition, however,
introduces ambiguity into the target tracking process. Consider a single target moving through the first sub-FOV in the uncoded object space as shown in Fig. 3.1(a).
The corresponding superposition space looks like Fig. 3.1(b). Based on the encoding used (in this case none) and the size of a single sub-FOV we decode possible
locations of the target in object space. We call this new space the hypothesis space.
(See Fig. 3.1(c).) The hypothesis space is not a reconstruction of the object space
- it is a visualization of the decoding logic.
The single target detected in the superposition space of Fig. 3.1(b) does not
provide information about the true sub-FOV where the target is located. Therefore,
there is ambiguity in the hypothesis space, and we hypothesize that the target could
be in any of the Ns sub-FOVs. In fact, for this uncoded case, it is not possible to
Figure 3.1: (a) The area of interest (large FOV) for tracking targets along with the
reference coordinate system. The large FOV is sub-divided into 4 non-overlapping
sub-FOVs in the x-direction. In this un-coded case the object space and the FOV are
the same. (b) Superposition space. (c) Hypothesis space: From the superposition
space it is not possible to tell which sub-FOV the target belongs to. This ambiguity
in target location leads to the hypothesis that the target could be in any of the 4
sub-FOVs. (See [61].)
correctly decode the target location based on measurements in superposition space.
To overcome this ambiguity, we need a distinguishing characteristic that appears in
the superposition space yet identifies a target’s true location in the object space.
This trait can be provided by encoding the sub-FOVs with a spatial shift in the
object space. Instead of defining non-overlapping sub-FOVs, we allow for some
overlap between adjacent sub-FOVs in the object space as seen in Fig. 3.2(a). In
this shift-encoded system, when a target passes through an area of overlap between
two or more sub-FOVs, instead of a single target being present in superposition
space there are multiple copies of the original target as shown in Fig. 3.2(b). We
refer to these multiple copies as ghosts of the target. The presence of these ghosts
allows the target location to be decoded as long as the pairwise overlap between
adjacent sub-FOVs is unique.
The overlaps can be designed in different ways. They can be random and unique
such that no two overlaps are the same, or they can be integer multiples of a fundamental shift. Also, these integer multiples need not be successive, but can be
randomly picked from the available set. The simplest design, however, is to let the
overlaps be successive multiples of a fundamental shift, which we call the shift resolution δ. Given that there are Ns sub-FOVs, we define the unique overlaps to be the
overlap set O = {0, δ, 2δ, . . . , (Ns − 2)δ}. We can now construct a 1-D spatial-shift
encoded object space as follows:
1. Start with the uncoded sub-FOVs.
2. Let the first two (from the left) sub-FOVs be in the same position (non-overlapped) as in the uncoded case. We label these sub-FOVs as f ov0 and f ov1 , respectively;
3. Shift the ith sub-FOV (f ovi ), i ∈ {2, . . . , Ns − 1}, to the left such that it overlaps with the (i − 1)th sub-FOV by Oi−1 . Depending on the size of the overlap, it is possible that the shift causes portions of f ovi to overlap with more than one sub-FOV. (Figure 3.4 is an example.) One condition that must be satisfied is that f ovi cannot completely overlap f ovi−1 . This condition implies a restriction on the shift resolution δ, which can range from zero (uncoded) to a maximum of δmax = Wf ov /(Ns − 1). (A sketch of this construction is given below.)
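A minimal sketch of the construction above, computing the left edge of each shift-encoded sub-FOV in object space; the helper name and the values of Ns , Wf ov and δ are illustrative assumptions.

def subfov_left_edges(Ns, W_fov, overlaps):
    # overlaps[i-1] is O_{i-1}, the overlap between fov_i and fov_{i-1};
    # O_0 = 0 so that fov_0 and fov_1 remain non-overlapped.
    edges = [0]
    for i in range(1, Ns):
        edges.append(edges[i - 1] + W_fov - overlaps[i - 1])
    return edges

Ns, W_fov, delta = 4, 64, 9                      # example values
O = [k * delta for k in range(Ns - 1)]           # O = {0, delta, ..., (Ns - 2)*delta}
print(subfov_left_edges(Ns, W_fov, O))           # [0, 64, 119, 165]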
As shown in Section 3.2, the resulting encoded object space enables the target
location to be decoded. The disadvantage is that the object space now covers a
smaller area than the uncoded case. The object space is largest when δ = 0, which
corresponds to an uncoded case with target decoding ambiguity. The object space is
smallest (covers the least area) when δ = δmax . Between these two extremes, there
is a compromise between area coverage and the smallest shift resolution that must
be detected in the decoding procedure.
To quantify the reduction in area coverage, we define area coverage efficiency
(η) to be the ratio between the area of the encoded object space and the uncoded
object space. The shift resolution also affects the compression ratio (r), which is
defined as the ratio of the area of the encoded object space to the area of a single
Figure 3.2: (a) A portion of an encoded object space with the target in the region
of overlap between the two adjacent sub-FOVs along with its distance from the
two boundaries of the overlap. The loss in area coverage due to encoding is also
shown. (b) Superposition space with two target copies or ghosts. The one to the left
corresponds to f ovi and the one to the right corresponds to f ovi−1 . Also depicted
is the relation between the distance measures ℓ1 and ℓ2 in the object space, and the
separation distance between the target ghosts in the superposition space. (See [61].)
sub-FOV. The area coverage efficiency and the compression ratio are given by
η = 1 − α (Ns − 2)/(2Ns ),    (3.1)
r = Ns − α (Ns − 2)/2,    (3.2)
where α = δ/δmax lies between 0 and 1. Small α results in better area coverage
and higher compression ratio but smaller shift resolution. In addition, if we de-
fine a boundary as a line in object space where there is a change in the sub-FOV
overlap structure, then small α also results in a lower boundary density, which
can adversely affect the average time required to properly decode a target. The
opposite characteristics are true for values of α near unity. We quantify trade-offs
between area coverage, decoding accuracy, and average decoding time in more detail
in Chapter 4.
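For a quick numerical check of (3.1) and (3.2), a minimal sketch (the function name and example values are ours, chosen only for illustration):

def coverage_and_compression(Ns, alpha):
    # Area coverage efficiency (3.1) and compression ratio (3.2)
    # for 1-D spatial shift encoding, with alpha = delta/delta_max.
    eta = 1.0 - alpha * (Ns - 2) / (2.0 * Ns)
    r = Ns - alpha * (Ns - 2) / 2.0
    return eta, r

print(coverage_and_compression(Ns=8, alpha=1.0))   # (0.625, 5.0)
print(coverage_and_compression(Ns=8, alpha=0.0))   # (1.0, 8.0): the uncoded case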
Figure 3.3: Schematic representation of encoding schemes: (a) spatial shift (two-dimensional case shown), (b) rotation, and (c) magnification. (See [61].)
3.1.1 2-D Spatial, Rotational and Magnification Encodings
Although 1-D spatial shift encoding is the main focus of this work, we now take
time to briefly describe other potential encoding schemes illustrated in Fig. 3.3. As
mentioned in Section 1.3.2, 2-D spatial shift encoding can be thought of as two
separable 1-D spatial shift encodings. Specifically, instead of sub-dividing the large
FOV into smaller sub-FOVs in only the horizontal x-direction, we also sub-divide
in the vertical y-direction. Again starting with the uncoded case, if the number of
sub-FOVs in the two directions are Nx and Ny respectively, the size of each resulting
sub-FOV is Hf ov × Wf ov where Hf ov = H/Ny and Wf ov = W/Nx . Defining the shift
resolutions in the two dimensions as δx and δy , the overlap sets are then given by
Ox = {0, δx , 2δx , . . . , (Nx −2)δx } and Oy = {0, δy , 2δy , . . . , (Ny −2)δy }. The encoding
procedure described above for a 1-D system can be separably applied to the subFOVs in both the x- and y-directions to give the 2-D spatially encoded object space.
In this 2-D encoding scheme, each sub-FOV is characterized by a unique pairwise
combination of horizontal overlap from Ox and vertical overlap from Oy . The area
coverage efficiency and the compression ratio are given by
η = (1 − αx (Nx − 2)/(2Nx ))(1 − αy (Ny − 2)/(2Ny )),    (3.3)
r = (Nx − αx (Nx − 2)/2)(Ny − αy (Ny − 2)/2),    (3.4)
where αx = δx /(δx )max , αy = δy /(δy )max , and (δx )max and (δy )max are the maximum
allowable shifts in the two dimensions.
In the case of rotational encoding the objective is to define unique rotational differences between any two sub-FOVs.
The simplest way to do this is
to define a fundamental angular resolution (δang ) and let all the rotational dif-
ferences be integer multiples of (δang ). The resulting rotational difference set is
R = {0, δang , 2δang , . . . , (Ns − 1)δang }. One must be careful, however, to note that R
is a set of rotational differences, i.e., the difference between the absolute rotations
of any two sub-FOVs. The absolute rotation assigned to the ith sub-FOV is then
Σ_{j=0}^{i} Rj , i = 0, 1, . . . , Ns − 1. Furthermore, since rotational encoding is periodic ev-
ery 360◦ , it is logical to upper bound the maximum absolute rotation by 360◦ . This
bound results in a maximum angular resolution of δmax = 2 · 360◦ /(Ns (Ns − 1)).
Rotational encoding, like spatial shift encoding, can be applied to sub-FOVs in either
or both x- and y- directions. We call rotational encoding 1-D when the large FOV
is sub-divided in either x- or y- direction and rotational encoding is then applied
to the resulting sub-FOVs. 2-D rotational encoding refers to the case where rotational encoding is applied to the sub-FOVs resulting from the sub-division of the
large FOV in both the directions. Unlike 2-D spatial shift encoding, however, 2-D
rotational encoding is not separable. It requires that the rotational difference between any two sub-FOVs, regardless of the dimensions they lie along, be unique. In
2-D spatial shift encoding on the other hand, overlaps have to be unique only with
respect to one direction. As a result, even when Ox = Oy = O and Nx = Ny = Ns , the resulting 2-D spatial shift encoding is valid because each sub-FOV will still have a unique overlap pair (Oi , Oj ), i, j ∈ {0, 1, · · · , Ns − 1}, associated with it.
Magnification encoding assigns unique magnification factors to each sub-FOV.
The magnification factors depend on the optical architecture and the size of the
area of interest. 2-D magnification encoding, like 2-D spatial shift encoding, is
separable as long as we can separably control the magnification factors along the
two directions. We can now define sets of unique magnification factors Mx and
My along the x- and y-directions respectively. This results in an encoding scheme
similar to the 2-D spatial shift encoding. As a result, in a manner analogous to
2-D spatial shift encoding, even when Mx = My = M and Nx = Ny = Ns , the
magnification encoding scheme is valid because each sub-FOV will still have a unique
combination (Mi , Mj ), i, j ∈ 0, 1, · · · , Ns − 1 of horizontal and vertical magnification
factors applied to it. Finally, we note that it may be possible to obtain greater area
coverage for the same FPA by combining several encoding methods.
3.2 Decoding: Proof of Concept
Properly encoded spatial shifts enable decoding of the true target location whenever
the target crosses a boundary into a new overlap region, and possibly sooner. Depending on the sub-FOV shift structure, target replicas or ghosts can appear only
at certain fixed locations; that is, the distances in superposition space between any set of ghosts corresponding to the same target form a unique pattern because
each sub-FOV overlap is unique. From this unique pattern, we can identify the
sub-FOVs involved and uniquely localize the target in hypothesis space. Once the
target is uniquely located in hypothesis space, we have decoded the target’s position
in object space.
In Sub-Sections 3.2.2 and 3.2.3 we explain and demonstrate the decoding procedure, first for a single target and then for multiple targets. We begin, though, with a brief discussion of the correlation-based tracking employed in this work.
3.2.1 Correlation-based Tracking
In the simulations and performance analyses we use a Kalman filter to track target
ghosts in the superposition space. Our Kalman tracker has a length-four state
vector, the four states being the x- and y-coordinates, and the x- and y-direction
velocities of detected target ghosts. We represent this state vector as

s = \begin{bmatrix} x \\ y \\ v_x \\ v_y \end{bmatrix}.    (3.5)
Kalman tracking involves two basic steps:
1. Time update: Propagating forward, in time, the current (time t) state and
error covariance estimates to obtain their a priori estimates at time t + 1.
2. Measurement update: Updating the a priori estimates with the measurement
data to obtain the a posteriori estimates.
To perform these two steps, we first build the state and the measurement models.
State model : The state model is a recursive model which predicts the current state
from the previous state. Here, we model the velocity of the target ghost as
vx (t) = vx (t − 1) + nvx (t),
(3.6)
vy (t) = vy (t − 1) + nvy (t).
(3.7)
Thus the velocity is constant except for perturbations (changes in speed) arising from various factors such as a bend in the road, wind, etc.
These perturbations in the x- and y- directions are modelled as noise terms nvx and
nvy respectively. Corresponding to this velocity model, we define the coordinate
model to be
x(t) = x(t − 1) + vx (t − 1)δt,    (3.8)
y(t) = y(t − 1) + vy (t − 1)δt,    (3.9)
where δt represents the incremental time-step. By defining the coordinates as a
function of velocity we have removed the need to incorporate any noise terms in the
coordinate model. Therefore our state model can now be written as
s(t) = As(t − 1) + n(t),    (3.10)

where

s(t) = \begin{bmatrix} x(t) \\ y(t) \\ v_x(t) \\ v_y(t) \end{bmatrix}, \quad
A = \begin{bmatrix} 1 & 0 & \delta t & 0 \\ 0 & 1 & 0 & \delta t \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}, \quad \text{and} \quad
n(t) = \begin{bmatrix} 0 \\ 0 \\ n_{v_x}(t) \\ n_{v_y}(t) \end{bmatrix}.
Associated with the noise perturbation n(t) is the process noise covariance matrix
Q = E[n(t)nT (t)] which is dynamically updated depending on the acceleration of
the ghost target. “Process” refers to the tracking of the target ghosts.
Measurement Model : The measurement model is a data collection model used here
to model correlation-based measurements of the x- and y- coordinates of the target
ghosts. Consequently, we write the measurement model as
z(t) = Hs(t) + ξ(t),    (3.11)

where ξ(t) is the measurement noise and

H = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix}.

Associated with the measurement noise is the measurement covariance matrix R = E[ξ(t)ξ^T (t)], which we assume to be constant and pre-compute off-line.
Based on the state and measurement models, we define the a priori and a posteriori errors respectively as
e− (t) = ŝ(t) − ŝ(t− ),    (3.12)
e+ (t) = ŝ(t) − ŝ(t+ ),    (3.13)

and the corresponding a priori and a posteriori error covariance matrices as

M(t− ) = E[e− (t)e− (t)^T ],    (3.14)
M(t+ ) = E[e+ (t)e+ (t)^T ].    (3.15)
The a priori state estimate ŝ(t− ) is the estimate resulting from the time update,
while the a posteriori state estimate ŝ(t+ ) is the estimate after the measurement
update. Note, ŝ(t) is the estimate of the state at time t.
Our goal is to minimize the a posteriori error covariance. Towards this end, we
write the state estimate ŝ(t) as
ŝ(t) = ŝ(t− ) + K(z(t) − Hŝ(t− )),
(3.16)
where K is the Kalman gain and (z(t) − Hŝ(t− )) are the measurement innovations.
The Kalman gain K that minimizes the a posteriori error at current time t is given
by
K(t) = M(t− )H^T (HM(t− )H^T + R)^{−1}.    (3.17)
This Kalman gain is used in the measurement update to compute the current error
covariance matrix M(t) from M(t− ) by
M(t) = (I − K(t)H)M(t− ),
(3.18)
and also plugged back into (3.16) to estimate the current state ŝ(t). In the time update step, M(t) and ŝ(t) are forward propagated to obtain the a priori estimates M((t + 1)− )
and ŝ((t + 1)− ):

ŝ((t + 1)− ) = Aŝ(t),    (3.19)
M((t + 1)− ) = AM(t)A^T + Q,    (3.20)
and thus we implement the recursive Kalman tracker. For each target ghost a
separate Kalman tracker is initialized. Once we have decoded the target using the
state vectors of the target ghost tracks and the decoding algorithm described in
Section 3.2.2, we replace the Kalman trackers of the target ghosts with a single
Kalman tracker for the target.
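A compact sketch of the per-ghost Kalman tracker defined by (3.5) through (3.20); the noise variances below are placeholder values, and the dynamic update of Q based on the ghost's acceleration is omitted for brevity.

import numpy as np

def make_models(dt, q_var, r_var):
    # State-transition, measurement, and covariance matrices of Section 3.2.1.
    A = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1,  0],
                  [0, 0, 0,  1]], dtype=float)
    H = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0]], dtype=float)
    Q = np.diag([0.0, 0.0, q_var, q_var])   # process noise enters only the velocities
    R = r_var * np.eye(2)                   # measurement noise covariance, pre-computed
    return A, H, Q, R

def kalman_step(s_hat, M, z, A, H, Q, R):
    # Time update (a priori estimates), (3.19)-(3.20).
    s_prior = A @ s_hat
    M_prior = A @ M @ A.T + Q
    # Measurement update, (3.16)-(3.18).
    K = M_prior @ H.T @ np.linalg.inv(H @ M_prior @ H.T + R)
    s_post = s_prior + K @ (z - H @ s_prior)
    M_post = (np.eye(4) - K @ H) @ M_prior
    return s_post, M_post

# Example: track a ghost moving at roughly one pixel per time step in x.
A, H, Q, R = make_models(dt=1.0, q_var=0.01, r_var=0.25)
s_hat, M = np.zeros(4), 10.0 * np.eye(4)
rng = np.random.default_rng(0)
for t in range(1, 30):
    z = np.array([t, 0.0]) + rng.normal(0.0, 0.5, size=2)   # noisy position measurement
    s_hat, M = kalman_step(s_hat, M, z, A, H, Q, R)
print(s_hat)   # the estimated x-velocity should be close to 1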
We mentioned above that the measurements we use in the Kalman tracker are
correlation-based. In fact, correlation performs two specific tasks: (1) locating the
target ghost positions in the superposition space which are the measurement inputs to the Kalman filter, and (2) separating weak target ghosts from noise and
background clutter. If the target ghosts have a strong signal-to-noise ratio (SNR),
then the two steps can be performed using simple template matching. On the other
hand, if the signal strength is weak, template matching does not suffice. This is
important on two counts. The first is the multiplex disadvantage [64]: photon noise
from different multiplexed sub-FOVs contributes to the noise of each target ghost
reducing its SNR. This is especially true for targets lying in regions of no adjacent
sub-FOV overlap. In this case photon noise from all Ns object-space sub-FOVs in-
creases the noise floor that affects the target ghost SNR in the superposition space.
The effect is somewhat mitigated for targets lying in regions of adjacent sub-FOV
overlap. For example, consider the case where the target lies in a region where three
sub-FOVs overlap. Here, although noise from the Ns sub-FOVs is still present, its
effect is reduced due to target photons incident on the FPA from three multiplexed
overlapping sub-FOVs.
The second is that in our proposed technique superposition of the sub-FOVs
results in a reduction of the target ghost’s dynamic range. This reduction also
follows the same trend as the multiplex disadvantage. Specifically let us consider
Ns object-space sub-FOVs, each with a dynamic range of [0, R]. Let the target be
present in only one sub-FOV and in a region that does not overlap with adjacent sub-FOVs. When we superimpose these sub-FOVs, the dynamic range of the resulting
superposition space, neglecting saturation, is [0, Ns R] while the dynamic range of
the target is still [0, R]. Therefore, the target strength compared to the background
is suppressed by a factor of Ns . If the target is in a region of overlap of M sub-FOVs,
M < Ns , then this factor is reduced to Ns /M.
The above analysis shows that there is a trade-off between the target ghost
signal strength and dynamic range, the number of sub-FOVs (Ns ), and the size of
the object space. (For a given Ns , the object space size is related to the number of
overlaps M and the shift resolution δ.) This necessitates that our system be able
to deal with the presence of weak targets. We consider a two-fold strategy: first, use a background subtraction technique to remove the background; second, use correlation filters to locate the target ghosts. We briefly look at each of these.
Background subtraction is an intuitive and well-studied paradigm for reducing background clutter and thereby increasing target SNR and dynamic range. In
almost all real life cases, the background is non-static and consequently, many statistical background subtraction techniques have been proposed and studied. In one
such class of methods, depending on the complexity of the background, the background image pixels are modelled as having Gaussian probability density functions
(pdfs) [65] or as mixture of Gaussians (MoG) [66], [67]. If the parametric Gaussian
mixtures do not embody enough complexity then kernel based methods have also
been proposed in the literature [68]. All these methods fall under the category of
non-predictive methods. Predictive methods, on the other hand, generally employ
Kalman tracking-based techniques to characterize the state of each pixel for background estimation [69]. Recently, Jodoin et al. [70] proposed a novel spatial approach to background subtraction which works under the assumption of ergodicity: the temporal distribution observed at a pixel corresponds to the statistical distribution
observed spatially around that same pixel. They model a pixel using unimodal and
multimodal pdfs and train this model over a single frame. This method allows us to
estimate the background from a single image frame which results in a faster algo-
rithm that requires less memory. Since background subtraction is not the main focus of this work, we adopt a simple, albeit slightly crude, averaging method to estimate the background. To do this we observe the scene of interest for a certain length of time to acquire a large number of encoded and superimposed sub-FOV images, which we then average to estimate the background of the superposition space. The challenge here is the requirement of a time sequence of superposition space image frames with no target motion, which is generally not possible in real scenarios. If, however, we can ensure that the number of targets is relatively small, say by imaging at a certain time of day, then this averaging procedure does work.
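The averaging-based background estimate just described amounts to the following sketch (the frame stack and its dimensions are assumed for illustration):

import numpy as np

def subtract_background(frames):
    # frames: n_frames x rows x cols stack of superposition-space images.
    # The background is estimated as the temporal average and subtracted from each frame.
    background = frames.mean(axis=0)
    return frames - background, background

# Synthetic example: a static ramp background plus a few sparse bright "targets".
rng = np.random.default_rng(2)
bg = np.tile(np.linspace(0.0, 1.0, 64), (64, 1))
frames = np.stack([bg + (rng.random((64, 64)) > 0.999) for _ in range(50)])
residual, est_bg = subtract_background(frames)
print(np.abs(est_bg - bg).max())   # small when the targets are sparse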
Background subtraction removes background clutter, but not necessarily the
noise present in the measured data. To further increase measurement robustness
against this residual noise we employ advanced correlation filters. Correlation filters, because of their shift invariance and distortion tolerance, have
been successfully employed in radar signal processing and image analysis for pattern
recognition [71], [72], [73]. Here we adapt minimum-variance synthetic discriminant
functions (MVSDF) [74] to detect2 target ghost locations in the presence of residual
noise. The basic idea behind an MVSDF filter is simple: design a filter based on a
2. The MVSDF filter is also used for classification. We, however, treat differentiating between two targets as a target decoding problem and use MVSDF exclusively for detecting target ghosts and identifying their locations.
certain number of training images, which when matched to a test image outputs a
specified value with minimum residual noise variance. This method can be thought
of as a generalization of template matching to the case where the test image need not be one of the training images. Such test images are treated by the MVSDF as noise-corrupted versions of the training images. The training images here refer to rough templates of the various targets of interest. Since noise robustness is built into the MVSDF, the templates need not be as exact as those required in classical template
matching. Let us denote the P training images, scanned into p-length vectors, by
gi , i = 1, . . . , P (p > P ). We can collect these images in a matrix G. Let us also
define the MVSDF filter as h.
In the absence of noise we want
hT gi = 1, i = 1, . . . , P (p > P ).
(3.21)
In the presence of noise, however, there will be additional contribution from hT n
where n is the noise vector. Our goal, therefore, is to find an h such that the output
variance due to noise, hT Cn h is minimized while still satisfying (3.21). (Here Cn
is the noise covariance matrix.) This results in a constrained optimization problem
whose Lagrangian is given by
L = h^T Cn h − Σ_{i=1}^{P} λi (h^T gi − 1).    (3.22)
On differentiating L with respect to h and equating the result to zero, the solution
in the matrix form is given by
hopt = Cn^{−1} Gm,    (3.23)

where m = [λ1 , · · · , λP ]^T . Noting that (3.21) can be written as G^T h = 1, where 1 is a P -length vector of ones, it is easy to see that

m = (G^T Cn^{−1} G)^{−1} 1.    (3.24)
For the inverse to exist, we need Cn to be positive definite and G^T G to be full rank. For our simulation we assume the noise to be i.i.d. Gaussian, resulting in the noise covariance matrix being an identity matrix scaled by the noise variance. We
empirically found that using different training images to construct G is sufficient
to ensure that GT G is full rank. Once we have obtained the MVSDF filter we use
it as a mask to correlate with the superposition space image to identify the target
locations.
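A direct numerical sketch of (3.23) and (3.24) under the i.i.d. Gaussian noise assumption used in our simulations; the training templates below are random placeholders.

import numpy as np

def mvsdf_filter(G, noise_var):
    # G is p x P with one training image (scanned to a p-vector) per column.
    # For i.i.d. Gaussian noise, Cn = noise_var * I.
    p, P = G.shape
    Cn_inv = np.eye(p) / noise_var
    m = np.linalg.solve(G.T @ Cn_inv @ G, np.ones(P))   # (3.24)
    return Cn_inv @ G @ m                               # (3.23)

rng = np.random.default_rng(3)
G = rng.random((256, 4))        # 4 rough 16 x 16 templates scanned to length-256 vectors
h = mvsdf_filter(G, noise_var=0.1)
print(G.T @ h)                  # a vector of ones, as required by (3.21)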
3.2.2 Decoding Procedure for a Single Target
Consider a target in object space that enters the region of overlap Oi between two
sub-FOVs, f ovi−1 and f ovi , as shown in Fig. 3.2(a). The corresponding superposition space looks like Fig. 3.2(b). Under the assumption of a single target, the
presence of two ghosts in the superposition space indicates that the target has en-
tered a region where two sub-FOVs overlap. To find the two sub-FOVs creating
this overlap, we first measure the distance between the two ghosts in superposition
space. Let this distance be d. Based on knowledge of the shift encoding, we then
calculate the set of all possible separations between two ghosts of a single target.
We label this set S and call it the separation set. Recalling that the set of adjacent sub-FOV overlaps is the overlap set O, the elements of the separation set are
Si = Wf ov − Oi , i = 0, 1, . . . , Ns − 2. The set S can be computed once in advance
and stored for future reference.
We can now define the set T2 ⊆ S such that it contains only those elements
of S which are realizable in the spatial shift encoding scheme. It is important to
note that not all elements of S result in a valid element of T2 . It is possible that
the region of overlap between two sub-FOVs lies within (i.e., is a sub-region of) the region of overlap between the two mentioned sub-FOVs and one or more
other sub-FOVs. In such a case a target present in the two sub-FOV overlap region
will always produce more than two ghosts in the superposition space. (Target 3 in
Fig. 3.4(b) is an example of this scenario.) The subscript ‘2’ in T2 indicates that
the target is in the region of overlap between two sub-FOVs only (as opposed to
regions covered by more than two sub-FOVs).
We now look at the basic principle for decoding a target in the region of overlap
between two sub-FOVs. In Fig. 3.2(a), the distances ℓ1 and ℓ2 are the distances
from the edges of the two overlapping sub-FOVs. In Fig. 3.2(b), we see that these
distances are the same as the distances from the two ghosts to the edge of the
superposition space. Since ℓ1 + ℓ2 is equal to the overlap between the sub-FOVs,
ℓ1 +ℓ2 is an element of O. If the measured separation d corresponds to the j th element
of T2 , then we can decode f ovi−1 as the j th sub-FOV and f ovi as the (j + 1)th sub-FOV. Finally, our a priori knowledge of the sub-FOV locations in object space
along with the position of the corresponding target ghosts in superposition space
can now be used to decode the target’s x-coordinate in hypothesis space. Because
we have used 1-D spatial shift encoding, the y-coordinate of the target is the same in
superposition space, hypothesis space, and object space. We have now completely
decoded the target location.
The above example considers two overlapping sub-FOVs. We can extend the
decoding procedure to the case where a single target enters a region where more
than two sub-FOVs overlap. Such a scenario can arise, for example, when the
overlaps Oi−1 and Oi are such that f ovi+1 not only overlaps with f ovi , but also
with f ovi−1 as in Fig. 3.4 where sub-FOVs f ov2 , f ov3 and f ov4 overlap. In such
cases the number of target ghosts appearing in superposition space will be equal
to the number of overlapping sub-FOVs covering the target. In general, assuming
this number to be M, we first calculate the sequence of pair-wise distances, from
left to right, between the M target ghosts in superposition space. This sequence
Figure 3.4: Three targets in the object space with the same velocities and y-coordinates, depicting two scenarios both of which result in the same superposition space data. (a) Scenario 1: Two targets in the object space, with Target 1 in the non-overlapping region of f ov1 and Target 2 in the region of overlap between f ov1 and f ov2 . (b) Scenario 2: A single target (Target 3) in the overlap between the sub-FOVs f ov2 , f ov3 and f ov4 . (c) The same superposition space arising from the two scenarios. (i) Ghost 1 and Ghost 3 in the superposition space are ghosts of Target 2 in the object space, while Ghost 2 in the superposition space corresponds to Target 1 in the object space. (ii) Ghosts 1 through 3 in the superposition space are ghosts of Target 3 in the object space. (See [61].)
is referred to as target pattern dM . We then compare the sequence to the allowed
length-M target patterns in superposition space, which again are known because the
shift encoding structure is known. The set of allowed length-M target patterns is
denoted by TM . The matching pattern determines the proper set of M sub-FOVs,
from which the target’s position in hypothesis space can be fully decoded as in the
case above for two overlapping sub-FOVs.
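The pattern-matching logic described above can be sketched as follows; the tolerance value and helper names are our own illustrative choices, and for brevity the sketch matches against the full separation set S rather than the restricted set T2.

def separation_set(W_fov, overlaps):
    # S_i = W_fov - O_i for each adjacent-sub-FOV overlap O_i.
    return [W_fov - O for O in overlaps]

def decode_two_ghosts(d, W_fov, overlaps, tol=1.0):
    # Match a measured ghost separation d against the allowed separations and
    # return the indices (j, j+1) of the two overlapping sub-FOVs, or None.
    for j, S_j in enumerate(separation_set(W_fov, overlaps)):
        if abs(d - S_j) <= tol:
            return (j, j + 1)
    return None                  # no valid pattern: leave the target undecoded

# Example with the Ns = 8 parameters used in Chapter 4 (W_fov = 64, delta = 9).
O = [0, 9, 18, 36, 27, 45, 54]
print(decode_two_ghosts(d=46.3, W_fov=64, overlaps=O))   # (2, 3): an overlap of 18 pixels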
3.2.3 Decoding Procedure for Multiple Targets
When our above-mentioned two-fold strategy is able to associate target ghosts with
the correct target for multiple targets, we can simply apply the decoding procedure
for a single target to all the targets individually. On the other hand, when we have
scenarios where the targets have (1) identical or only subtle shape differences, or (2)
such a weak signal strength that only target detection is possible, we need a way to
associate the target ghosts with the correct targets. In such scenarios where direct
associations are not possible, we need a procedure for decoding multiple targets. The
proposed procedure is essentially the same as for a single target except for a pre-decoding step where ghosts in superposition space belonging to the same target are
associated with each other. The procedure involves the following indirect three-fold
strategy (stated here specifically with respect to 1-D shift encoding):
1. Group all detected targets in superposition space according to their y-coordinate values. Since the system has 1-D spatial shift encoding in the x-direction, ghosts belonging to the same target must have the same y-position. However, it is possible for two different targets to also have the same y-coordinate. Therefore;
2. Compare the estimated velocities of potential targets in each group. If multiple
velocities are detected, it is assumed that multiple targets are present, and
the group is sub-divided. This step follows from the observation that ghosts
belonging to the same target must have the same (2-D) velocity. Members
of each target group now have the same velocity and the same y-coordinate.
Finally;
3. Begin the decoding process by comparing allowed target patterns to the target patterns of the groups determined in the first two steps. The allowed target patterns are the sets Ti , i = 2, 3, . . . , K, where K is the maximum number of sub-FOVs that overlap. We begin with the highest order target patterns (TK ) and work down to the lowest order target patterns (T2 ). When a pattern is detected, the target position is decoded. If an allowed target pattern is not detected, the targets in the group are assumed to reside in regions of object space without overlapping sub-FOVs. (A sketch of the grouping step is given below.)
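A minimal sketch of the grouping in steps 1 and 2 above; the ghost-track representation and the tolerances are assumptions made for illustration.

def group_ghosts(tracks, y_tol=2.0, v_tol=0.5):
    # Group ghost tracks that share a y-coordinate and a 2-D velocity estimate.
    # Each track is a dict with keys 'x', 'y', 'vx', 'vy' taken from its Kalman state.
    groups = []
    for trk in tracks:
        for grp in groups:
            ref = grp[0]
            if (abs(trk['y'] - ref['y']) <= y_tol and
                    abs(trk['vx'] - ref['vx']) <= v_tol and
                    abs(trk['vy'] - ref['vy']) <= v_tol):
                grp.append(trk)
                break
        else:
            groups.append([trk])       # no match found: start a new group
    return groups

tracks = [{'x': 10, 'y': 30, 'vx': 1.0, 'vy': 0.0},
          {'x': 56, 'y': 30, 'vx': 1.0, 'vy': 0.0},   # same y and velocity: same target
          {'x': 40, 'y': 30, 'vx': 2.5, 'vy': 0.0}]   # same y, different velocity
print([len(g) for g in group_ghosts(tracks)])         # [2, 1]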
When performed in order, these steps usually enable decoding of the locations of
multiple targets. Under certain conditions, however, correct decoding is not possible.
Figure 3.4 shows two scenarios, the first (Fig. 3.4(a)) involving two targets as was
explained above and the second (Fig. 3.4(b)) involving a single target in a region
with three overlapping sub-FOVs. On the rare occasions when this occurs and
the targets involved happen to have the same y-coordinate and estimated velocity,
it is not possible to decode which scenario is the true scenario (Fig. 3.4(c)) and,
according to the decoding rules above, the higher order shift pattern will be decoded.
(In Fig. 3.4, scenario #2 will be decoded). We illustrate this general case in Fig. 3.5
through a snapshot of four frames from an example movie. Each frame shows
the object space across the top, the superposition space in the middle, and the
hypothesis space across the bottom. Object space represents the “truth” while the
superposition space represents the actual measurement data. The hypothesis space
visualizes how the decoding logic works in real time. We stress that the hypothesis
space is not a reconstruction of the object space, but is simply a visualization of the
decoding logic. The “truth” background has been added to the hypothesis space
simply to provide visual perspective to the viewer. The first two frames in Fig. 3.5
show two targets with the same (2-D) velocity and the same y-coordinates moving
through the object space. By unfortunate coincidence the targets happen to have a
horizontal separation which is an element of set T2 . As a result, the two targets are
decoded as a single target, and their decoded position jumps around. Eventually, as
shown in the last two frames in Fig. 3.5, the velocities of the two targets change, the
superposition space ghosts are properly grouped, and the two targets are correctly
decoded. We would like to remind the reader that here we are assuming the targets
have either identical shape or have such subtle differences in shape that association
based on shape is not possible or reliable. We will continue to make this assumption
throughout this and the next chapter.
3.2.4 Decoding via Missing Ghosts
Thus far, we have described how the difference, or shift, between target ghost positions can be used to properly decode target location by uniquely identifying the
overlap region that must have produced the shift. However, additional decoding
logic is available to the sensor in the form of missing ghosts. Although this additional logic may not be able to uniquely decode the target, it reduces the number of
target locations in hypothesis space. To explain this principle, consider the following
scenario.
For clarity, we focus on a single target moving through the object space. We also
assume that Ns = 4 with the sub-FOVs encoded according to 1-D shifts belonging to
the overlap set O = {0, δ, 2δ}. The target is in f ov0 and is moving towards f ov1 in
the object space as illustrated in Fig. 3.6(a). Since there is zero overlap between f ov0
and f ov1 , superposition space has a single target as seen in Fig. 3.6(b). Based on
the superposition space measurement and the decoding strategy explained above,
the target cannot be completely decoded. We can only hypothesize a potential
target location in each of the four sub-FOVs. Therefore, hypothesis space looks
like Fig. 3.6(c). The local x-coordinate of each hypothesized target in its respective
sub-FOV is the same. However, we can apply our knowledge of overlaps to rule
(a) Target decoding error
(b) Decoded position jumps around
(c) Targets decoded
(d) Targets remain decoded
Figure 3.5: Four frames from an example movie showing error in decoding two
targets with the same velocity, the same y-coordinate, and with a separation which
is an element of set T2 . Correct decoding is possible only when the targets begin
to differ in their velocities. In (a) the targets are incorrectly decoded as a single
target. The incorrectly decoded target jumps to a different localized position in (b).
Change in relative velocity allows the correct decoding of the targets in (c). Part
(d) shows that targets remain successfully decoded.
out f ov2 . This sub-FOV cannot be allowed because if the target were truly at this
position, it would imply that the target resides in an overlapping region between
f ov2 and f ov3 . The absence of a second ghost in superposition space tells us this is
not true. Therefore, the target can only be in f ov0 , f ov1 or f ov3 , and we have
Figure 3.6: (a) Encoded object space with 4 sub-FOVs. The target is in f ov0 . The local f ov0 x-coordinate lies between Wf ov − O2 and Wf ov . (b) The superposition space with a single target, indicating that the target is in a non-overlapping sub-FOV region. (c) Hypothesis space with 4 potential targets. The third hypothesized target, however, lies in a sub-FOV overlap which, if true, would produce ghosts in the superposition space. The absence of ghosts rules out this hypothesized target as a potential true target.
reduced the target hypotheses from 4 to 3. If the target continues to move toward
f ov2 , additional sub-FOVs will be ruled out by the same logic. In general, missing ghosts can be used to rule out anywhere from one to all incorrect sub-FOVs
depending on the target location and encoding structure.
CHAPTER 4
SUPERPOSITION SPACE TRACKING RESULTS
To demonstrate and quantify the efficacy of the proposed 1-D spatial shift encoding
scheme, we present results generated from both simulated data and a laboratory
experiment. The simulated data results include an example simulation that details
the various facets of the decoding procedure. The results also present system performance curves based on two criteria: average decoding time and probability of
decoding error. They include a detailed discussion of the performance trade-offs
between average decoding time, probability of decoding error and area coverage efficiency. We conclude the chapter with results from a laboratory experiment showing
the success of the system in practical scenarios.
4.1 Simulation
We simulated an object space with Ns = 8, 1-D spatial-shift encoded sub-FOVs.
The size of each sub-FOV was 64∆r distance units by 64∆r distance units, where
∆r , the object space resolution, was assumed to be finer than the size of the targets
of interest. The corresponding sub-FOV dimensionality (in pixels) is 64 by 64.
Figure 4.1: Examples of valid target patterns. (See [61].)
We first defined the overlap set as O = {0, 1δ, 2δ, 4δ, 3δ, 5δ, 6δ}, where δ was chosen as δ = δmax = floor(Wf ov /(Ns − 1)) = floor(64/7) = 9 pixels. The reason for
this choice of δ was that large δ increases the boundary density in the superposition
space, which increases the total overlapped area. As a result, the number of target
ghosts in superposition space that need to be tracked increases. If the decoder and
tracker can handle a large number of targets in superposition space, we can gain confidence that the decoding procedure is robust. Note that the overlaps in the set O
are not monotonically increasing, which shows that the order of the overlaps in 1-D
shift encoding is arbitrary. Based on this overlap set, the separation set is computed
to be S = {64, 55, 46, 28, 37, 19, 10}. The allowed target patterns are then given
by T2 = {64, 55, 46, 28, 37, 10}, T3 = {{37, 28}, {19, 37}, {10, 19}} and T4 = {{10, 19, 37}}. The sets of overlapping sub-FOVs corresponding to these patterns are F2 = {{0, 1}, {1, 2}, {2, 3}, {3, 4}, {4, 5}, {6, 7}}, F3 = {{3, 4, 5}, {4, 5, 6}, {5, 6, 7}} and F4 = {{4, 5, 6, 7}}. It is important to point out that {5, 6} is absent from the set F2 . This does not imply that the two sub-FOVs do not overlap, which they do, but instead means that the region of overlap between f ov5 and f ov6 is also overlapped by a third or even a fourth sub-FOV, as shown in the sets F3 and F4 respectively. In Fig. 4.1 we illustrate the allowed target patterns with a couple of examples.
We simulated a scenario by populating the object space with four identically
shaped targets appearing at random locations, with random velocities, at random
times, and lasting for random durations. We allowed the starting target locations
to be anywhere in the object space with equal probability. The velocities were
uniformly distributed between 0 and 3∆r distance units per time step to ensure that the target movement looked realistic. The start and stop times were uniformly
distributed between 0 and 100 time steps. Figure 4.2 shows a series of frames of
one such example simulation which best illustrates all the facets of our decoding
procedure. We explain the figure in the next paragraph. Our algorithm is able to
handle many more targets, including identical targets, than the four chosen here, but
using only a few targets allows the reader to easily follow the decoding logic shown
in Fig. 4.2. The identical shapes for the targets have been chosen to emphasize the
(a) Red target appears
(b) Red target moving
(c) White target appears
(d) White target ambiguity reduced via
missing ghost logic
(e) Red target ambiguity reduced via (f) Red and white targets decoded via
missing ghost logic
missing ghost logic
Figure 4.2: Decoding of 4 targets: Each frame depicts the encoded object space
(δ = δmax ) at the top, the corresponding measured superposition space (with static
background subtracted out) in the middle, and the resulting hypothesis space at the
bottom. Parts (a) and (b) show the red target. In part (c) the white target appears
whose ambiguity is then reduced via “missing ghosts” logic in part (d). Red target
ambiguity is also reduced in part (e), and both red and white targets are decoded
via “missing ghosts” logic in part (f).
(g) White target ambiguity increases
(h) Blue target appears in the region of
overlap
(i) Blue target decoded immediately
(j) Red target ambiguity increases
Figure 4.2: Decoding of 4 targets contd.: Part (g) shows the white target ambiguity
increasing. In part (h) the blue target appears, and is almost immediately
decoded in part (i). Part (j) shows an increase in ambiguity of the red target.
(k) Green target decoded
(l) White target decoded
(m) Red target decoded
(n) All four targets are successfully decoded
Figure 4.2: Decoding of 4 targets contd.: Parts (k), (l), (m) successively show
the green, white and red targets being decoded. In part (n) all targets have been
successfully decoded.
robustness of the algorithm to lack of morphological distinctions between targets.
The series of frames in Fig. 4.2 show four identically shaped and color-coded
targets in the object space. Color coding allows easy target discrimination for the
reader. The first target in the object space is the red target which is followed by
a white target. Both these targets are decoded via missing ghosts logic as they
travel through the object space. This can be seen in the hypothesis space where the
ambiguity in the locations of these two targets is completely removed even before
the targets reach a boundary. The green target appears next and is followed by
the blue target. Since the blue target appears in a region of overlap of f ov1 and
f ov2 , it is almost immediately decoded. The green target is decoded when it reaches
the region of overlap between f ov1 and f ov2 . Thus all four targets are successfully
decoded. In Figs. 4.2(g) and 4.2(j), however, we see that the white and red target
ambiguity respectively increases. This is primarily due to the high speed of the
targets while crossing the boundary between the non-overlapping sub-FOVs f ov0
and f ov1 , which did not give the Kalman tracker enough time to decode the target
ghosts that appeared in the superposition space when the two targets were on the
boundary. This scenario has been included to depict a real-world scenario and is
elaborated upon in Section 4.2. The two targets are decoded when they reach a
region of sub-FOV overlap.
Figure 4.2 shows us that the decoding time of targets differs depending on the
target location and the target velocity. Also, because of possible errors in measurement it is possible that we incorrectly decode a target. This is especially true for
targets with very low SNR. Therefore, decoding time and measurement errors can
affect system performance. We next consider two metrics useful in quantifying this
system performance. One metric considers the effect of shift resolution on average
decoding time and the other considers the probability of incorrect decoding. We
investigate these metrics in the following sub-sections.
4.2 Average Decoding Time and Area Coverage Efficiency (η)
A small shift resolution (small α) implies that the degree of overlap between adjacent
sub-FOVs is small. One potential disadvantage of small shift resolution is that the
average time it takes to decode a target (decoding time) increases.
The problem can arise as follows: when a new target ghost appears in superposition space, velocity estimates are not instantaneously available. Therefore, it
takes some time to determine if the new ghost should be associated with an existing group, or if it is due to a new target altogether. To get velocity information
we must wait a few time steps while the Kalman tracker updates the state vector
and obtains a stable velocity estimate. This waiting period can present a problem,
especially when the target is in a sub-FOV with a small overlap. If the target has a
high x-velocity, it is possible that the target may not stay in the overlap region for
enough time for the Kalman tracker to ascertain the target velocity. This results
in increased decoding time. (This is what happened to the white and red targets
in our example simulation.) Moreover, for systems with small shift resolution, the
distance between two different overlap regions is relatively large (equivalent to lower
boundary density), which again tends to increase decoding time.
Figure 4.3: Plot of decoding time as a function of the area coverage efficiency η. The
plot shows the error bars representing the ±1 standard deviation of the decoding
time from the mean. (See [61].)
A large shift resolution (α close to 1), on the other hand, suffers less from the
above-mentioned disadvantages, but has smaller area coverage due to larger overlaps. Hence, shift resolution controls the trade-off between decoding time and area
coverage efficiency. A small shift resolution provides larger area coverage, but larger
decoding times. A large shift resolution provides smaller area coverage, but shorter
decoding times. We quantify this result by plotting the decoding time as a function of the area coverage efficiency in Fig. 4.3. The plot also shows the error bar
representing ±1 standard deviation of the decoding time from the mean. The plot
was computed by averaging the decoding times of 300 targets which passed through
the object space in batches of size uniformly distributed between 5 and 10. In
each batch the targets appeared at random times with random velocities and at
random locations, and for random durations in a manner identical to the targets
in the movie in Fig. 4.2. The plot shows an approximately linear relationship between the two metrics. This is expected because displacement and time are linearly
related through velocity. An increase in overlap due to a larger shift resolution for
the same target with the same target velocity decreases the amount of time it takes
for the target to reach the regions of overlap because these regions are now wider
and cover more area. The reduced decoding time is directly related to this reduced
travel time to reach the regions of overlap. The reasoning is the same when using
“missing ghosts” logic to decode targets. The larger the overlaps, the faster we can
reduce the ambiguity between hypothesized targets, and the smaller the decoding
time. It is important to note here that complete decoding using “missing ghosts”
plays a less prominent role in affecting the decoding time because it can completely
decode only those targets which are in the first sub-FOV from the left. For all other
cases the absence of ghosts reduces, but does not completely eliminate, the number of hypothesized target locations and, as a result, does not affect the decoding time.
The approximate linear relationship between decoding time and area coverage
efficiency leads us to the interesting observation that there is no optimal shift resolution. For any value of the shift resolution we either improve decoding time and reduce area coverage efficiency, or vice versa. This is true for both 1-D
and 2-D spatial shift encoding/decoding strategies. The random or monotonically
increasing ordering in the overlap set O also has no implication on this linear relationship. It seems possible, however, that by combining spatial shift encoding with
rotation and/or magnification encodings we can design an optimal encoding strategy. This is a possibility because by combining different encoding strategies we can
ensure that if one encoding scheme is weak at one spatial location we can design the
other one to be stronger. By defining the decoding strength of an encoding strategy,
for a given sub-FOV, as the average time to decode the target location for the target
that first appears in that sub-FOV, we can quantify the decoding strength for each
sub-FOV and encoding scheme. Then using basic graph theory, we can simply trace
the optimal path1 through the nodes - where each node is specified by the decoding
strength of a sub-FOV and an encoding scheme - that gives us the shortest overall
decoding time. This approach can be a possible starting point for designing optimal
optical multiplexing schemes in the future.
^1 This is not the travelling salesman problem, which is NP-complete.
4.3 Probability of Decoding Error
We next consider the probability of decoding error, which is defined as the probability that the target pattern decoded from superposition space measurements is
an incorrect pattern. In the presence of noise and other distortion, the estimated
position of a target ghost in superposition space will be subject to error. Therefore,
the difference between two ghost positions, which is the criterion for target decoding
in a shift-encoded system, will also be subject to error. These errors can lead to
the wrong pattern being detected, which will cause the target to be decoded to the
wrong location. Furthermore, as the shift resolution δ is decreased, more fidelity in
estimating target shifts is required.
Figure 4.4 shows the results from a simplified calculation of decoding error for a
single target present in a region of overlap between M sub-FOVs. We first consider
overlaps between two sub-FOVs and then extend the result to overlaps between three
and four sub-FOVs. We assume, as we did for the simulation example in Section 4.1,
that the imaging system superimposes 8 sub-FOVs onto a single FPA. The width
of each sub-FOV is 500∆r distance units, where ∆r is the object space resolution.
We assume that the error in estimating the position of a ghost in superposition
space is Gaussian distributed with variance determined by the Cramer-Rao Bound
(CRB) [46] applicable to this problem. The CRB is

var[x̂] ≥ 1 / (SNR · Brms),   (4.1)
where SNR is proportional to the target intensity and Brms is the root-mean-square (rms) bandwidth of the target's intensity profile in the encoded dimension. Brms is given by

Brms = [∫_{−Wt/2}^{Wt/2} (dt(x)/dx)^2 dx] / [∫_{−Wt/2}^{Wt/2} t^2(x) dx]   (4.2)
     = [∫_{−∞}^{∞} (2πF)^2 |T(F)|^2 dF] / [∫_{−∞}^{∞} |T(F)|^2 dF],   (4.3)

where t(x) is the target's intensity profile and T(F) is its Fourier transform. We have used a symmetric triangular intensity pattern, which has a closed-form rms bandwidth of 12/Wt^2 [pixels^{-2}], where Wt is the target width in pixels.
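As a rough illustration of how (4.1) enters the calculation, the following Python sketch (the function name and arguments are ours, not from the dissertation) evaluates the CRB-based standard deviation of a ghost-position estimate for the symmetric triangular target, whose rms bandwidth is Brms = 12/Wt^2:

from math import sqrt

def crb_position_std(snr_db, wt_px):
    """CRB-based std. dev. (in pixels) of a ghost-position estimate, eq. (4.1),
    for a symmetric triangular target of width wt_px pixels."""
    snr = 10.0 ** (snr_db / 10.0)        # linear SNR
    b_rms = 12.0 / wt_px ** 2            # rms bandwidth of the triangular profile
    return sqrt(1.0 / (snr * b_rms))     # square root of the CRB variance

# e.g. crb_position_std(10.0, 10) gives the position uncertainty at 10 dB SNR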
We first consider the case of a single target present in a region overlapped by two
sub-FOVs (M = 2). The target pattern for this case consists of a single distance
between two ghosts. If the position errors on the two ghosts are independent, then
the variance of the distance estimate is twice the variance in (4.1). We assume that
the target is decoded only if the measured overlap matches an allowed overlap from
the overlap set O to within some prescribed tolerance ε. (Note that the distance between the two ghosts is related to the overlap through Si = Wfov − Oi, i = 0, 1, . . . , Ns − 2. See Section 3.2 for more details.) If the measured overlap is not
within ε of a valid overlap, then the target remains undecoded. For example, if the true overlap is mδ, a decoding error is made only if the measured overlap d̂ satisfies (m + k)δ − ε ≤ d̂ ≤ (m + k)δ + ε, where k is a non-zero integer and (m + k)δ
is a valid overlap. The probability of decoding error can now be computed by
integrating the Gaussian error probability distribution over the error region. This
error probability is conditioned on mδ being the true overlap. Therefore, to calculate
the total probability of decoding error we have to know the a priori probability of
the true overlap being mδ. We assume that this probability is uniformly distributed.
The value of ε, in general, is dependent on the structure of the overlap set O. We, however, have chosen the overlaps to be multiples of the shift resolution δ. As a result all the overlap values are equally spaced from the adjacent ones. We can therefore let ε be some fixed tolerance value less than or equal to δ/2, where δ, the shift resolution, is the separation between two successive overlaps. When ε = δ/2 we always make a decoding decision. If ε < δ/2, there are cases where the measured overlap does not lie within the tolerance limit of any element of O and we let the target remain undecoded. In contrast, if the overlap set contains random but unique overlaps, the tolerance ε is a function of O. For instance, the ε tolerance value for overlaps with a large spacing between them will be different from the ε tolerance value for the overlaps that are closely spaced, especially for the case where we always
make a decoding decision. The tolerance value, therefore, will have to be adjusted
according to the overlaps.
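The calculation described above can be sketched as follows for M = 2. This is a simplified Python illustration, not the dissertation's code: it assumes the measured overlap is Gaussian about the true overlap with twice the variance of (4.1), and that a decoding error occurs whenever the measurement falls within ±ε of a different valid overlap.

from math import sqrt, erf

def normal_cdf(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))   # standard normal CDF

def p_decode_error_m2(snr_db, wt_px, eps, valid_overlaps, true_overlap):
    """Probability of decoding error for a single target in a two-fold overlap."""
    snr = 10.0 ** (snr_db / 10.0)
    var_x = 1.0 / (snr * (12.0 / wt_px ** 2))   # CRB variance (4.1), triangular target
    sigma_d = sqrt(2.0 * var_x)                 # the overlap is a difference of two positions
    p_err = 0.0
    for o in valid_overlaps:
        if o == true_overlap:
            continue                            # landing near the true overlap is not an error
        p_err += (normal_cdf((o + eps - true_overlap) / sigma_d)
                  - normal_cdf((o - eps - true_overlap) / sigma_d))
    return p_err

Averaging p_decode_error_m2 over a uniformly distributed true overlap gives the total error probability discussed above.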
Figure 4.4: Plot of decoding error (log scale) versus area coverage efficiency η for SNR = 0, 5, 10 dB and M = 2, 3, 4, where M is the number of sub-FOVs that overlap. (See [61].)
We can extend the above result to the general case where M sub-FOVs overlap.
In our example we can have a maximum of M = 4. For the case of M = 3, instead of measuring a single overlap d̂, we measure two overlaps d̂1 and d̂2 resulting from the pair-wise distances between three target ghosts in the superposition space. Therefore, we now have a 2-D Gaussian error probability distribution. The probability of decoding error is calculated by integrating this 2-D distribution over the region given by (m1 + k1)δ − ε ≤ d̂1 ≤ (m1 + k1)δ + ε and (m2 + k2)δ − ε ≤ d̂2 ≤ (m2 + k2)δ + ε. Here m1δ and m2δ are the true overlaps, and k1 and k2 are non-zero shifts such that (m1 + k1)δ and (m2 + k2)δ are valid overlaps. We again assume that the probability of
the true overlaps being m1δ and m2δ is uniformly distributed. Extension to M = 4, where we have a 3-D Gaussian error probability distribution, is straightforward.
Figure 4.4 shows the probability of decoding error versus area coverage efficiency
for ε = δ/2, Wt = 10∆r distance units, and different values of SNR and M. As the
shift resolution decreases, area coverage efficiency increases, but so does probability
of decoding error. Thus, the choice of shift resolution is a compromise between the
area coverage and the probability of incorrectly decoding the target location. We
also observe that for fixed SNR, as we increase M the decoding error decreases.
We therefore conclude that longer target patterns make the decoding process more
robust and less prone to decoding errors.
4.3.1 Experimental Results
To illustrate how 1-D spatial shift encoding of multiple sub-FOVs can be performed in real-world applications we conducted an experiment using the optical
setup proposed in Fig. 2.1(b). (The experiment was performed at Duke University
in collaboration with Dr. David Brady.) The object space used for the experiment was an aerial map of the Duke University campus, and a laser pointer was
moved across it during the video acquisition to simulate a single moving target. The
object space was 24-mm high and 162-mm wide, and was imaged using the multiplexer in Fig. 2.1(b) onto a commercial video camera (SONY, DCR-SR42). By
adjusting the tilts of the mirrors shown in Fig. 2.1(b), we obtained an overlap set
O = {3, 7, 11, 20, 14, 24, 28} × 1mm which deviated slightly from the ideal scenario
of {0, 5, 10, 20, 15, 25, 30} × 1mm. In building the setup, care was taken to make the
path lengths travelled by light from each sub-FOV close to equal. However, there
was a slight difference in path lengths which resulted in varying magnification of
some sub-FOVs. Therefore, the size of each sub-FOV was not uniform: Wfov = {35, 34, 33, 35, 33, 32, 33, 34} × 1 mm and Hfov = {24, 23, 22, 23, 22, 22, 22, 23} × 1 mm, where the ith elements of Wfov and Hfov are the width and height of fov_i, respectively.
Figure 4.5 shows three frames illustrating the efficacy of our algorithm using
this experimental setup. Each frame shows the measured superposition space along
with the corresponding hypothesis space. Using the decoding logic discussed in
Section 3.2 we are able to decode the moving target as it enters the region of overlap.
The figure shows how the “missing ghosts” logic reduces ambiguity about the target’s
true location in the hypothesis space. The small deviations of the overlaps from
their true values do not affect the performance because all the overlaps are still
unique. Uniqueness of overlaps is both the necessary and sufficient condition for the
applicability of our decoding strategy. Also, the slight variations in magnification
of the sub-FOVs do not affect the decoding performance.
Figure 4.5: Experimental data frames showing successful decoding of a target moving through the object space. Part (a) shows the undecoded target whose ambiguity
is reduced via “missing ghosts” logic in part (b). Part (c) shows the correctly decoded target.
CHAPTER 5
FEATURE-SPECIFIC DIFFERENCE IMAGING
In this chapter we present FS compressive imagers to estimate the temporal changes
in the object scene of interest. The temporal changes are modelled using difference
images. Where possible, we design the optimal sensing matrix for the FS imagers.
In cases where sensing matrix design is not tractable, we consider plausible candidate sensing matrices that use the available a priori information. We also consider
non-adaptive sensing matrices and compare their performance to the knowledge enhanced ones. In conjunction with sensing matrix design, we also develop closed-form
and iterative techniques for estimating the difference images. We specifically look
at ℓ2 - and ℓ1 -based methods. We show that ℓ2 -based techniques can directly estimate the difference image from the measurements without first reconstructing the
object scene. This direct estimation exploits the spatial and temporal correlations
between the object scene at two consecutive time instants. We further develop a
method to estimate a generalized difference image from multiple measurements and
use it to estimate the sequence of difference images. For ℓ1 -based estimation we consider modified forms of the total variation (TV) method and basis pursuit denoising
(BPDN). We also look at a third method that directly exploits the sparsity of the
difference image.
5.1 Linear Reconstruction
5.1.1 Data Model
Let x1 and x2 be the object scene at two consecutive time instants t1 and t2 respectively. Following the explanation in Section 2.2, the object scene at both time
instants is assumed to be discretized and is represented as a vector of size N × 1.
Let us also define Φ1 and Φ2 to be the two corresponding optical sensing matrices of
size M × N. The M rows imply taking M measurements of the object scene. Thus,
the sensing matrices can be thought of as M × N projection matrices that project
the scene from an N-dimensional space to an M-dimensional subspace. Using these
sensing matrices we take measurements of the scene at the two time instants. The
data model is given by
y1 = Φ1 x1 + n1 ,
(5.1)
y2 = Φ2 x2 + n2 ,
(5.2)
where n1 and n2 represent zero-mean sensor AWGN with variance σ^2, and y1 and y2 are the respective measurements of the scene at the first two
consecutive time instants. Our goal is to estimate the difference image ∆x1 given
measurements y1 and y2 by finding the estimation operator that minimizes the ℓ2 -
norm of the error between the truth difference image and the estimated difference
image. If we take a Bayesian approach to the linear model in (5.1) and (5.2), then,
minimizing the ℓ2 -norm is the same as minimizing the Bayesian mean squared error
(BMSE). The Bayesian assumption allows us to represent the scene as a stochastic process and as a consequence, allows us to incorporate the (spatial) auto- and
(spatio-temporal) cross- correlation information between the scene at the two time
instants in the estimation operator. Using estimation theory terminology, we call the
estimation operator the linear minimum mean squared error (LMMSE) estimation
operator.
5.1.2 Indirect Image Reconstruction
Before we present our method, we briefly discuss a possible approach - in line with
classical use of MMSE operators - for estimating difference images. We call this
method intermediate image reconstruction (IIR). As illustrated in Fig. 5.1(a), it involves reconstructing each object scene separately from its respective measurements
and then subtracting these intermediate stage reconstructions to get the estimated
difference image. The measurements can be compressive or non-compressive. The
latter approach includes the conventional case of imaging the whole scene (the sensing matrix is an identity matrix) and hence avoids the reconstruction step. The
disadvantage, of course, is related to more data collection, and not being able to optimally account for noise in the collected data. There is also the important issue of
the cost of optics required to image the scene with ∆r resolution for large scenes.
We therefore focus on reconstruction using compressive measurements.
Reconstructing the object scene at both time instants means that (5.1) and
(5.2) can be separated into two stand-alone problems. We define the reconstructed
intermediate stage object scene for the two time instants as
x̂1 = F1 y1 , x̂2 = F2 y2
(5.3)
where Fi , i = 1, 2 are the linear reconstruction operators. For each i we separately
minimize the BMSE
J(Fi) = E[ ||xi − x̂i||_{ℓ2}^2 ]   (5.4)
with respect to Fi .
The resulting reconstruction operators F1 and F2 are given by the well-known MMSE equation

Fi = (Rx^{-1} + Φi^T Rn^{-1} Φi)^{-1} Φi^T Rn^{-1},  i = 1, 2,   (5.5)
where Rx is the auto-correlation matrix of the object scene and Rn is the noise
covariance matrix. We always assume that we have already subtracted off the mean
from the scene. If this is not the case we can trivially modify (5.5) to account for
the mean. If we make the additional assumption that the first two moments completely describe the scene statistics, then (5.5) will be the optimal solution. These
assumptions, however, are rarely true in practice. Despite this restriction, as shown
in Chapter 6, it turns out that LMMSE operators are good and computationally
efficient estimation operators.
Given the reconstruction operators, the indirectly estimated difference image is
given by
∆x̂1 = x̂2 − x̂1 = F2 y2 − F1 y1 .
(5.6)
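A minimal Python sketch of the IIR approach of (5.5) and (5.6) is given below; the correlation, noise-covariance and sensing matrices are assumed to be available, and the function name is ours.

import numpy as np

def lmmse_operator(Phi, Rx, Rn):
    """F = (Rx^-1 + Phi^T Rn^-1 Phi)^-1 Phi^T Rn^-1, eq. (5.5)."""
    Rx_inv = np.linalg.inv(Rx)
    Rn_inv = np.linalg.inv(Rn)
    return np.linalg.solve(Rx_inv + Phi.T @ Rn_inv @ Phi, Phi.T @ Rn_inv)

# Indirect difference-image estimate, eq. (5.6):
# dx_hat = lmmse_operator(Phi2, Rx, Rn) @ y2 - lmmse_operator(Phi1, Rx, Rn) @ y1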
5.1.3 Direct Difference Image Estimation
The intermediate object scene reconstruction step in the IIR method is unnecessary. If we remove that step by reconstructing the difference image directly from the measurements y1 and y2, we can better estimate the truth
difference image. The reason is that we can now not only incorporate the spatial
correlation between the pixels (auto-correlation of the scene), but also the temporal
correlation (cross-correlation between the scene at the two time instants). We define
the estimated difference image as
∆x̂1 = W1 y1 + W2 y2 ,
(5.7)
where W1 and W2 are the jointly optimized estimation operators. We call this the
direct difference image estimation (DDIE) technique. It is visualized in Fig. 5.1(b).
We start by looking at the simplest case of no sensor noise.
Figure 5.1: (a) Intermediate image reconstruction, and (b) direct difference image
estimation.
5.1.4 DDIE: Noise Absent
Our DDIE approach makes an initial assumption of perfect knowledge of the scene
at the instant we start taking measurements. From a system’s perspective this is a
reasonable assumption to make. For example, the initial knowledge can be obtained
from a sensor that has been on the scene for a long period of time. Therefore, for
time instant t1 we assume perfect knowledge of the scene (Φ1 is an identity matrix)
and from t2 onward we begin taking compressive measurements of the scene. We
can now re-write the data model (5.1) and (5.2) as:
y1 = Ix1 , y2 = Φ2 x2 .
(5.8)
The BMSE we have to minimize is
J(W1, W2) = E[ ||∆x − ∆x̂||_{ℓ2}^2 ].   (5.9)
Differentiating (5.9) with respect to W1 and W2 and equating the two derivatives
to zero, we get the jointly optimized estimation operators to be (see Appendix A)
W1 = [(R21 − R11) − Rδ Φ2^T (Φ2 Rδ Φ2^T)^{-1} Φ2 R21] R11^{-1},   (5.10)

W2 = Rδ Φ2^T (Φ2 Rδ Φ2^T)^{-1},   (5.11)

where Rδ = R22 − R12^T R11^{-1} R12, R21 = R12^T is the (spatio-temporal) cross-correlation matrix between the scene at two consecutive time steps, and R11 and R22 are the (spatial) auto-correlation matrices of the scene at the two time instants.
Now that we know the reconstruction operators, it is possible to find the optimal
sensing matrix (in the ℓ2 sense). In fact, it is given by (see Appendix A)
Φ2 = [X 0] Q^T,   (5.12)
where Q is the matrix of the eigenvectors of Rδ and X is any rank-M orthonormal
matrix. This is an expected result, as finding a sensing matrix that minimizes the
mean-squared error between the truth and the estimated difference images in the
absence of noise is analogous to finding a matrix that maximizes the projection
variance of the object scene, and this is the principal component solution. Looking
at (5.12) we see that [X 0] picks out the first M eigenvectors of Rδ to form an
M-dimensional subspace where the projection variance is maximized. Since X is
a rank-M orthonormal matrix, we get a rotated M-dimensional subspace. As a
simple case the orthonormal matrix can be an identity matrix in which case the
eigenvectors are the principal components. It is interesting to note that the optimal
sensing matrix solution involves the eigenvectors of Rδ . Rδ can be interpreted in the
following way: given the spatial auto-correlation of scene 1 and the spatio-temporal
cross-correlation between scene 1 and 2, Rδ contains the extra information we get
from the spatial auto-correlation of scene 2. The optimal sensing matrix selects the
directions that maximize this information.
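A short Python sketch of (5.12), under the assumption that R11, R12 and R22 are known and that X is taken to be the M × M identity, is:

import numpy as np

def optimal_sensing_matrix(R11, R12, R22, M):
    """Noise-free optimal Phi2 of (5.12): project onto the leading M eigenvectors of R_delta."""
    R_delta = R22 - R12.T @ np.linalg.solve(R11, R12)   # R22 - R12^T R11^-1 R12
    _, Q = np.linalg.eigh(R_delta)                       # eigenvectors, ascending eigenvalues
    Q = Q[:, ::-1]                                       # reorder to descending eigenvalues
    N = R11.shape[0]
    X = np.eye(M)                                        # any rank-M orthonormal matrix works
    return np.hstack([X, np.zeros((M, N - M))]) @ Q.T    # Phi2 = [X 0] Q^T

Any other rank-M orthonormal X simply rotates the same M-dimensional measurement subspace.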
5.1.5 DDIE: Noise Present
In the presence of noise the resulting optimal sensing operators must be modified.
Due to noise, the correlation information is affected and so is Rδ . The optimal
LMMSE estimation operators in the presence of noise are given by (see Appendix B)
W1 = ((R21 − R11) − (Rα − Rβ) Φ2^T (Φ2 Rα Φ2^T + Rn2)^{-1} Φ2 R21)(R11 + Rn1)^{-1},   (5.13)

W2 = (Rα − Rβ) Φ2^T (Φ2 Rα Φ2^T + Rn2)^{-1},   (5.14)

where

Rα = R22 − R21 (R11 + Rn1)^{-1} R12,   (5.15)

Rβ = R12 − R11 (R11 + Rn1)^{-1} R12.   (5.16)
Here, the no-noise-case Rδ is modified to Rα − Rβ . The matrix Rβ reflects the
loss in correlation information in the presence of noise. If the noise were zero, then
Rβ would go to zero and Rα − Rβ would be identical to Rδ . In the presence of
noise though, there is a reduction in the available correlation information and this
reduction is quantified by Rβ .
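The operators (5.13)-(5.16) translate directly into the following Python sketch (illustrative names; the correlation and noise-covariance matrices are assumed given):

import numpy as np

def ddie_operators(Phi2, R11, R12, R22, Rn1, Rn2):
    """Return (W1, W2) of (5.13)-(5.14) using R_alpha and R_beta of (5.15)-(5.16)."""
    R21 = R12.T
    A = np.linalg.solve(R11 + Rn1, R12)                   # (R11 + Rn1)^-1 R12
    R_alpha = R22 - R21 @ A                               # (5.15)
    R_beta = R12 - R11 @ A                                # (5.16)
    S = Phi2 @ R_alpha @ Phi2.T + Rn2
    W2 = (R_alpha - R_beta) @ Phi2.T @ np.linalg.inv(S)   # (5.14)
    W1 = ((R21 - R11) - W2 @ Phi2 @ R21) @ np.linalg.inv(R11 + Rn1)   # (5.13)
    return W1, W2

# Difference-image estimate, eq. (5.7): dx_hat = W1 @ y1 + W2 @ y2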
In the presence of noise, finding an optimal sensing matrix is mathematically
intractable. As a consequence we look at a few plausible candidate sensing matrices.
5.1.6 Sensing Matrices
PCA: We start by looking at two kinds of principal components (PC). For the first
case we let the rows of the sensing matrix Φ be the eigenvectors of the spatial
auto-correlation matrix. To a small extent, this is similar to the solution for the
no-noise case in (5.12) if we were to let X be an M × M identity matrix. However,
this case only considers the spatial correlation and ignores the temporal correlation.
To utilize the temporal correlation information, we also consider the difference
principal components (DPC). DPC are the principal components of the difference image. We compute them from the spatio-temporal correlation matrix of the difference images, defined as R∆x1 = E[(x2 − x1)(x2 − x1)^T] = 2R11 − R12 − R21 (taking R22 = R11).
Since R∆x1 is a symmetric matrix, its spectral factorization will give us the
difference principal components. There is a two-fold advantage to DPC. Firstly,
they implicitly use both spatial and temporal correlation information. Secondly, as
we are trying to reconstruct the difference images and not the object scene itself,
the principal components of the difference image are more suitable than PC.
PCA waterfilling: PCA is a sub-optimal solution in the presence of noise as
it does not adjust the energies (eigenvalues) of the eigenvectors with changing
SNR. We remedy this by considering weighted PCA which redistributes the total
available energy among the different eigenvectors while accounting for noise. This
re-distribution is achieved by maximizing the mutual information I(x; y) between
the scene x and the measurement y, assuming they are respectively N × 1 and
M × 1 random vectors. We briefly discuss the sub-optimality of PCA and then give
the weighted solution.
Let Rx be the correlation matrix of scene x. Then Rx = UΛUT gives the eigen
decomposition of Rx , where Λ is a diagonal matrix with the eigenvalues (λi , i =
1, ..., N) in decreasing order and columns of U are the corresponding eigenvectors.
Now let noise be added and let the noise covariance matrix be given by σ 2 I. Note
that the eigenvectors in U are also the eigenvectors for the noise because
(σ 2 I)U = U(σ 2 I).
(5.17)
As a result, in the presence of noise the eigenvalues are given by Λ + σ 2 I. Therefore,
the presence of noise simply adds the noise variance to all the eigenvalues without
adapting the eigenspectrum to the given SNR.
The PC sensing matrix Φpc = (U(1:N, 1:M))^T makes a measurement y which lies in the subspace spanned by the first M eigenvectors. We now consider the modified sensing matrix Φwpc = Dw (U(1:N, 1:M))^T, where Dw = diag(w1, . . . , wM).
We are still in the subspace spanned by the first M eigenvectors, but now the diagonal elements wi , i = 1, . . . , M control the weighting given to each eigenvector. We
first maximize I(x; y) given an unknown but fixed Φwpc , and then use it to compute
the weights wi , i = 1, . . . , M. On maximizing the mutual information over the input
distribution of x, we get (see Appendix C)
I(x; y) =
ΣM
i=1
log
wi2 λi + Pn
Pn
,
(5.18)
where the maximizing input distribution is a multi-variate Gaussian. We know that
a real-world object scene is not normally distributed, but nevertheless, we show in
Chapter 6 that we still get marked improvement over PC. In the ideal scenario of
the scene being normally distributed this solution will be optimal. We assume the
logarithm base to be 2.
To find the optimal weights wi, i = 1, . . . , M, we differentiate (5.18) with respect to wi^2 under the constraint Σ_{i=1}^{M} wi^2 = E, where E is the total energy in the object scene. Using Lagrange multipliers, the objective function can be written as

J(w1^2, . . . , wM^2) = Σ_{i=1}^{M} log( (wi^2 λi + Pn) / Pn ) − ζ( Σ_{i=1}^{M} wi^2 − E )
                     = Σ_{i=1}^{M} [ log( (wi^2 λi + Pn) / Pn ) − ζ( wi^2 − E/M ) ].   (5.19)
From (5.19) we see that we can differentiate each summand individually to get

wi^2 = 1/ζ − Pn/λi.   (5.20)

Since wi^2 cannot be negative we rewrite (5.20) as

wi^2 = ( 1/ζ − Pn/λi )^+,   (5.21)

and we choose the value of ζ such that Σ_{i=1}^{M} ( 1/ζ − Pn/λi )^+ = E. From (5.21) we see that the weights assigned to the different eigenvectors are a function of λi/Pn and not just λi. Equation (5.21) says: put the available energy where λi/Pn is large. This is the waterfilling solution [75]. We perform waterfilling for both PC and DPC, resulting in sensing matrices WPC and WDPC, respectively.
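A small Python sketch of the waterfilling weights in (5.21) is given below; the bisection on the water level 1/ζ is our illustrative choice for finding ζ.

import numpy as np

def waterfill_weights(eigvals, noise_power, total_energy, iters=100):
    """w_i^2 = (1/zeta - Pn/lambda_i)^+ with sum_i w_i^2 = E, eq. (5.21)."""
    lam = np.asarray(eigvals, dtype=float)
    lo, hi = 0.0, 1e12                       # bracket for the water level 1/zeta
    for _ in range(iters):
        level = 0.5 * (lo + hi)
        w2 = np.maximum(level - noise_power / lam, 0.0)
        if w2.sum() > total_energy:
            hi = level                       # too much energy allocated: lower the level
        else:
            lo = level
    return np.maximum(0.5 * (lo + hi) - noise_power / lam, 0.0)

# WPC/WDPC sketch: Phi = np.diag(np.sqrt(w2)) @ U[:, :M].T, with U from PC or DPC.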
Optimal solution: Since it is not mathematically tractable to find an optimal
sensing matrix in the presence of noise, we also numerically search for the
optimal solution. We perform this search using stochastic tunneling [76]. For
multi-dimensional surfaces with multiple minima – in our case the mean squared
error surface as a function of the sensing matrix – it is easy to get trapped in a
local minimum without reaching the global minimum. Stochastic tunneling (ST) is
a numerical technique that overcomes this disadvantage by applying a non-linear
transformation to the error surface.
Stochastic tunneling is a generalization of a Monte Carlo technique called simulated annealing (SA) for finding the global minimum of a multi-dimensional potential
energy surface. In SA the search for the global minimum is done by simulating a
dynamical process of a particle rolling on the potential energy surface. The energy
of the particle is controlled by the temperature and the cooling rate parameters. By
developing an optimal cooling schedule (decreasing the temperature in controlled
steps) the goal is for the rolling particle to successfully reach the global minimum
without getting trapped in any of the local minima. The choice of a good cooling
schedule, however, is a difficult one, especially for energy surfaces with a rough terrain, and consequently, SA is known to suffer from the so-called “freezing”
problem. It refers to the reduced probability of the particle escaping a local minimum due to decreasing temperature. If, however, the particle is given enough energy
not to get trapped in any local minima, then the probability of the particle overshooting the global minimum increases. In fact, with the increased temperature the
chance of the particle stopping in any minima local or global becomes equiprobable,
and the ability of the particle to resolve different energy levels diminishes. In [76],
the authors developed ST to avoid these extremes, and find the global minimum of
a multi-dimensional cost function.
Let us label our cost function as J(Φ2 ). Unlike (5.4), the cost function dependence has been explicitly written in terms of the sensing matrix Φ2 . The non-linear
transform of J(Φ2 ) is then given by
J_ST(Φ2) = 1 − e^{−γ(J(Φ2) − J0)},   (5.22)
where J0 is the lowest cost function value yet encountered by the particle, and γ > 0
is the parameter that controls the non-linearity rate at which cost function values
greater than J0 are suppressed. To understand (5.22) better, consider Fig. 5.2. This
figure, which is in the same vein as the example given in [76], illustrates the effect of
non-linear transformation on a simplified 1-D cost function. Figure 5.2(a) depicts
the cost function with multiple local minima and a global minimum. The energy
barrier between two adjacent local minima is higher than the energy differential
between the two minima. In ST, instead of jumping over the barrier, the particle
“tunnels” through it. This tunneling is achieved by the non-linear transformation.
If the lowest cost function value encountered by the particle is J0 , then the nonlinear transformation collapses all values greater than J0 to the interval (0, 1) while
preserving the locations of the minima. This is illustrated in Fig. 5.2(b) for J0 = 0.
Thus the particle tunnels through towards the global minimum, and the optimal
sensing matrix, by adjusting the value of J0 at each iteration and thus removing
high energy barriers.
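The following toy Python sketch illustrates how the transformed cost (5.22) can drive a Metropolis-style search. It is a scalar example with illustrative parameters (γ, an inverse temperature β, and the step size), not the dissertation's implementation for sensing matrices.

import numpy as np

def stochastic_tunneling(cost, x0, gamma=1.0, beta=5.0, step=0.1, iters=5000, rng=None):
    """Minimize cost(x) by Metropolis sampling on J_ST = 1 - exp(-gamma*(J - J0)), eq. (5.22)."""
    rng = np.random.default_rng() if rng is None else rng
    x, J = x0, cost(x0)
    x_best, J0 = x, J                              # J0: lowest cost encountered so far
    for _ in range(iters):
        x_new = x + step * rng.standard_normal(np.shape(x))
        J_new = cost(x_new)
        f_old = 1.0 - np.exp(-gamma * (J - J0))    # transformed cost at the current point
        f_new = 1.0 - np.exp(-gamma * (J_new - J0))
        if rng.random() < np.exp(-beta * (f_new - f_old)):   # Metropolis acceptance
            x, J = x_new, J_new
            if J < J0:
                x_best, J0 = x, J                  # lower the reference level J0
    return x_best, J0

# e.g. stochastic_tunneling(lambda x: (x**2 - 1.0)**2 + 0.3*np.sin(8*x), x0=2.0)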
Figure 5.2: Given the current position of the particle (the current cost function
value J0 ), the figure shows the effect of the non-linear mapping. All energy levels
higher than the current particle position are mapped to the interval (0,1) thereby
getting rid of irrelevant and high energy surface features, while the lower energy
minima locations are preserved.
5.1.7 Multi-step DDIE and Lth Frame Generalized Difference Image Estimation
As depicted in Fig. 5.3(a) we assume knowledge of the scene at the first time
instant and from then on take measurements of the scene.
Our model allows us to use a different sensing matrix Φ at every successive time instant. For simplicity, however, we assume Φk = Φ, k > 1. Consequently, we have the sequence of measurements {y1, y2, y3, . . . , yL, . . .} = {x1, Φx2, Φx3, . . . , ΦxL, . . .}.
Until now we have looked at the DDIE method for estimating the difference image between the object scene at the first two time instants. We now extend the DDIE method to estimate the sequence of difference images {∆x̂1, ∆x̂2, . . . , ∆x̂L−1, . . .} from the sequence of measurements {y1, y2, y3, . . . , yL, . . .}. We call this strategy multi-step
DDIE. We also present a different approach to estimating the difference image sequence by jointly using measurements taken over multiple time instants.
The DDIE method assumes knowledge of scene #1 and takes measurements of
scene #2 to estimate the difference image. Therefore, in multi-step DDIE, given
measurements of the object scenes at tk and tk+1 , we need knowledge of the scene at
tk . Multi-step DDIE acquires this knowledge by propagating forward the knowledge
of the scene at t1 . (See Fig. 5.3(a).) The forward propagation is done using the
Figure 5.3: Multi-step DDIE: (a) Perfect knowledge of the scene is assumed at time
instant t1 and measurements are made from t2 on. Difference image is estimated by
propagating forward the object scene knowledge. This propagation is indicated by
curved arrows from left to right. (b) At each pair of consecutive time instants tk and
tk+1 , (5.23) is implemented to estimate the difference image ∆xk and propagate the
scene knowledge.
recursive equation
∆x̂k = recon(x̂k , yk+1), k > 1
= recon((x̂k−1 + ∆x̂k−1 ), yk+1 ), k > 1.
(5.23)
where recon refers to the DDIE method discussed in Section 5.1.5. For k = 1, we replace
x̂1 with x1 because we assume perfect knowledge of the scene at t1 . Equation (5.23)
takes the estimate of the scene at tk and measurements at tk+1 and estimates the
difference image ∆x̂k . It then propagates forward the knowledge of the scene at
tk by computing the estimate x̂k+1 = x̂k + ∆x̂k at tk+1 . This is illustrated in
Fig. 5.3(b).
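A compact Python sketch of the multi-step recursion (5.23) is shown below; recon stands for any DDIE step, e.g. one built from the operators of (5.13)-(5.14).

def multistep_ddie(x1, measurements, recon):
    """measurements = [y2, y3, ...]; returns the estimated difference images [dx1, dx2, ...]."""
    x_hat = x1                          # perfect knowledge of the scene at t1
    diffs = []
    for y_next in measurements:
        dx_hat = recon(x_hat, y_next)   # DDIE step between t_k and t_{k+1}
        diffs.append(dx_hat)
        x_hat = x_hat + dx_hat          # propagate the scene knowledge forward
    return diffs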
We refer to our second approach as Lth frame generalized difference image estimation (LFGDIE). Given {y1 , y2 , y3 , . . . , yL }, LFGDIE directly estimates the generalized difference image ∆x1L between the object scene at t1 and tL .
To obtain the LFGDIE estimation operators we first define the LFGDIE data
model,
y1 = Ix1 + n1 ,
(5.24)
y2 = Φx2 + n2 ,
(5.25)
···
yL = ΦxL + nL ,
(5.26)
where I is the identity sensing matrix symbolizing complete knowledge of the initial
scene. Note that this model is an extension of (5.1) and (5.2) because here we
consider multiple measurements. The estimated difference image is then given by
∆x̂1L = Σ_{i=1}^{L} W′i yi,   (5.27)
where W′ i , i = 1, . . . , L are the estimation operators. Re-writing (5.27) in matrix
form we have
∆x̂1L = [W′1  W′p  W′L][y1^T  yp^T  yL^T]^T,   (5.28)

where W′p = [W′2 W′3 · · · W′L−1] and yp = [y2^T y3^T · · · yL−1^T]^T. Let us also define xp = [x2^T x3^T · · · xL−1^T]^T, R11 = E[x1 x1^T], RL1 = E[xL x1^T] = R1L^T, Rp1 = E[xp x1^T] = R1p^T, RpL = E[xp xL^T] = RLp^T, ΦL = Φ and Φp = Ip ⊗ Φ, where Ip is a p × p identity matrix (p = L − 2).
Minimizing the BMSE between ∆x1L and ∆x̂1L by differentiating it with respect
to W′1 , W′p and W′L , and equating the derivatives to zero, the reconstruction
operators turn out to be (see Appendix B)
W′1 = (RL1 − R11 − W′p Φp Rp1 − W′L ΦL RL1)(R11 + Rn1)^{-1},   (5.29)

W′p = (Rαp + RαL ΦL^T (ΦL RLγL ΦL^T + RnL)^{-1} ΦL Rβp) Φp^T (Φp RΦL Φp^T + Rnp)^{-1},   (5.30)

W′L = (RαL + W′p Φp RβL) ΦL^T (ΦL RLγL ΦL^T + RnL)^{-1},   (5.31)

where

RαL = RLL − R1L − (RL1 − R11)(R11 + Rn1)^{-1} R1L,   (5.32)
Rαp = RLp − R1p − (RL1 − R11)(R11 + Rn1)^{-1} R1p,   (5.33)
RβL = Rp1 (R11 + Rn1)^{-1} R1L − RpL,   (5.34)
Rβp = RL1 (R11 + Rn1)^{-1} R1p − RLp,   (5.35)
RγL = RL1 (R11 + Rn1)^{-1} R1L,   (5.36)
Rγp = Rp1 (R11 + Rn1)^{-1} R1p,   (5.37)
RLγL = RLL − RγL,   (5.38)
RΦL = Rpp − Rγp − RβL ΦL^T (ΦL RLγL ΦL^T + RnL)^{-1} ΦL Rβp.   (5.39)
It is interesting to note that when L = 2, that is p = 0, RLγL = Rα , RαL = Rα −Rβ ,
W′p disappears, W′1 = W1 and W′L = W2. Therefore, the LFGDIE method reduces to the multi-step DDIE method. It is when L > 2 that we see the benefit of employing the LFGDIE method. To see this, let x1 and xL be the object scenes at time instants t1 and tL. Then the generalized difference image is given by
∆x1L = xL − x1.   (5.40)

Re-writing (5.40) we get

∆x1L = (xL − xL−1) + (xL−1 − xL−2) + · · · + (x3 − x2) + (x2 − x1) = Σ_{i=1}^{L−1} ∆xi,   (5.41)
where the right-hand side is a pairwise sum of difference images of the scene at
successive time instants. The estimate of the generalized difference image ∆x1L is ∆x̂1L = Σ_{i=1}^{L−1} ∆x̂i. Therefore, estimation of the LFGDIE operator requires joint estimation of all the successive pairwise difference images ∆xi. This joint estimation exploits the spatial and temporal cross-correlations between the scene at all the L time steps, as is manifested in equations (5.29) through (5.39). The multi-step DDIE method, on the other hand, estimates Σ_{i=1}^{L−1} ∆x̂i, which exploits only pairwise cross-correlation between the scene at two successive time steps. This ability of
the LFGDIE method to perform joint estimation leads to its superior performance over
multi-step DDIE, as we show in Chapter 6.
5.2 Non-linear Reconstruction
The advantage of linear ℓ2 -based estimation lies in its ability to provide closed form
linear estimation operators that minimize the mean-squared error over an entire
ensemble of object scenes. Difference images, however, are sparse and ℓ2 -based
difference image estimation does not exploit this characteristic. We therefore extend
our study to look at ℓ1 -based estimation of the difference images. We are motivated
by a few reasons, each of which looks at the problem from a different perspective.
Firstly, as mentioned above, difference images are sparsely represented in pixel space
(a finite-dimensional Euclidean space), and exploiting this sparsity for difference
image estimation seems to be a natural extension of the image restoration problem.
Secondly, modelling optical images as functions of bounded variation (BV) has been
successfully used in image denoising and restoration. The use of an ℓ1 -based total
variation (TV) measure [77] in this context has been shown to accurately estimate
edge features, which are important components in difference images. Thirdly, signal
decomposition using overcomplete dictionaries gives sparse signal representations
with respect to atoms of these dictionaries. It has been shown that basis pursuit
(BP) gives an optimal (in ℓ1 sense) solution for this signal decomposition problem
[78]. These three approaches fit well into the ℓ1 -based estimation of difference image.
We consider the ℓ1 -based difference image estimation problem as a linear inverse
problem. The linearity comes from the forward data model being defined through a
linear transform D (applied to the input s in the presence of noise)
y = Ds + n.
(5.42)
The goal then of the linear inverse problem is to estimate s given the noisy measurements y. The model in (5.42) is typical of ℓ1 -based reconstruction problems.
However, the model we have in (5.1) and (5.2) is not of the same form as (5.42).
Consequently, we re-write (5.1) and (5.2) as
y = ΦD x + n,   (5.43)

where

ΦD = [I 0; 0 Φ],  x = [x1; x2],  y = [y1; y2],  n = [n1; n2].   (5.44)
Incorporating a sparse representation of x with respect to a sparsifying dictionary
Ψ in (5.43) we have
y = ΦD Ψs + n
(5.45)
where s is a sparse representation of x, i.e. x = Ψs. (Comparing (5.42) with (5.45),
we have D = ΦD Ψ.) The solution ŝ, and therefore x̂, to the linear inverse problem
is then given by solving the optimization problem
arg min_s  ||y − ΦD Ψs||_{ℓ2}^2 + ξR(s).   (5.46)
Here, the ℓ2 term is the fitness term controlling how much of a fit the solution is to
the measured data, while R(s) is the regularizer term controlling how much the solution
meets the desired constraint. We define R(s) to be an ℓ1 convex regularizer, its form
being decided by the three points of view we are considering. The weighting factor
ξ is the regularization parameter.
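As one concrete (and standard) way of solving a problem of the form (5.46) when R(s) = ||s||_{ℓ1}, the following Python sketch applies the iterative soft-thresholding algorithm (ISTA); it is an illustration, not necessarily the solver used for the results in Chapter 6. Here D plays the role of ΦDΨ.

import numpy as np

def ista_l1(y, D, xi, iters=200):
    """Minimize ||y - D s||_2^2 + xi*||s||_1 by iterative soft thresholding."""
    step = 1.0 / (2.0 * np.linalg.norm(D, 2) ** 2)   # 1 / Lipschitz constant of the gradient
    s = np.zeros(D.shape[1])
    for _ in range(iters):
        grad = 2.0 * D.T @ (D @ s - y)               # gradient of the l2 fitness term
        z = s - step * grad
        s = np.sign(z) * np.maximum(np.abs(z) - step * xi, 0.0)   # soft threshold (prox of l1)
    return s

The regularizers built on the difference image, introduced next, require a different proximal step, but the overall fitness-plus-regularizer structure is the same.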
From the first point of view, difference images are sparse in the pixel basis.
The pixel basis can be thought of as the standard basis for a finite-dimensional
Euclidean space, where the dimension is that of the object scene. Consequently, Ψ
is the identity matrix and s = x. But to maximize sparsity, we define the regularizer
R to be a function of ∆x1 instead of s as follows
R(∆x1 ) = ||x2 − x1 ||ℓ1 = ||∆x1 ||ℓ1 .
(5.47)
It is evident that this ℓ1 regularizer enforces the sparsity constraint on the difference
image, by favoring values closer to zero. Note that the regularizer R(s) = ||s||ℓ1 does
not maximize the sparsity of the difference image, but just minimizes the ℓ1 -norm
of s, and consequently, is not optimal.
For the TV restoration problem we again consider Ψ to be an identity matrix
because in this formulation, the function space of bounded variation is defined to
be on a discrete finite support. The regularizer R for the TV problem is usually
defined to be either
R_iso(x) = Σ_i √( (∆i^h x)^2 + (∆i^v x)^2 ),   or   (5.48)

R_niso(x) = Σ_i ( |∆i^h x| + |∆i^v x| ),   (5.49)
where R_iso(x) and R_niso(x) are the isotropic and non-isotropic discrete TV regularizers, respectively. The ∆i^h and ∆i^v operators are, respectively, the first-order horizontal
and vertical difference operators. Instead of imposing the TV condition on s (= x)
however, as in (5.47), we impose it on the difference image ∆x1 = x2 − x1 . Now
the regularizer is defined as either
R_iso(∆x1) = Σ_i √( (∆i^h ∆x1)^2 + (∆i^v ∆x1)^2 ),   or   (5.50)

R_niso(∆x1) = Σ_i ( |∆i^h ∆x1| + |∆i^v ∆x1| ).   (5.51)
It is easy to see that a bounded total variation assumption for the object scene x1
and x2 results in x2 − x1 also having bounded variation. Therefore, this modified
form does not violate any TV condition.
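For reference, the non-isotropic discrete TV value of (5.51) for a difference image stored as a 2-D array can be computed as in the short sketch below (illustrative only; solving (5.46) with this regularizer is a separate matter).

import numpy as np

def tv_niso(dx):
    """Non-isotropic discrete TV of a 2-D difference image, eq. (5.51)."""
    dh = np.diff(dx, axis=1)          # first-order horizontal differences
    dv = np.diff(dx, axis=0)          # first-order vertical differences
    return np.abs(dh).sum() + np.abs(dv).sum()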
For the sake of completeness we also look at difference image estimation using an
overcomplete sparsifying dictionary Ψ. Specifically, we consider the dictionary to
be the symmetric biorthogonal wavelet transform (the Cohen-Daubechies-Feauveau 9/7 wavelet transform [79]) and the regularizer to be R(s) = ||s||_{ℓ1}. This results in the familiar basis pursuit denoising (BPDN) model, which
decomposes the signal as a superposition of the atoms of Ψ such that the ℓ1 norm of s is the smallest of all possible decompositions over the dictionary. The setup here, though,
is slightly different from traditional BPDN in that we include the sensing matrix
ΦD in the model. In traditional BPDN the sensing matrix is an identity matrix.
This modification does not affect the fundamental problem. The classical BPDN
problem can be thought of as finding the regularized denoised sparse representation
of the object scene from the noisy version of the scene. The modified BPDN on the
other hand, finds the regularized denoised sparse representation of the object scene
from measurements of the scene using the sensing matrix ΦD .
Notice that by applying (5.45) we use s to estimate x = [x1^T x2^T]^T and not ∆x1.
Therefore, because of the system constraint, there is the additional step of computing
∆x1 from the estimated x. Estimating s, however, allows us to take advantage of the
correlation between the object scene at the two time instants. By solving (5.46), with
R(s) = ||s||ℓ1 , we compute a joint estimate of x1 and x2 in the form of s. Note that,
although x̂ is separable into x̂1 and x̂2 , s is not. Similarly, in (5.43) we jointly exploit
the scene at the two time instants by considering regularization terms (5.47), (5.50)
and (5.51) that are functions of the difference image. Regularizer (5.47) sparsifies
the difference image, while regularizers (5.50) and (5.51) minimize the total variation
of the difference image. In fact, by acting directly on the difference image, (5.43)
exploits the correlation between the scene at the two time instants more strongly
than (5.45).
Extension to estimating sequences of difference images follows directly from
multi-step DDIE and more specifically (5.23). We will use the ℓ1 -based techniques
in the multi-step setting. In Chapter 6 we present the performance of these three
approaches.
Learning sensing and sparsifying matrices
Given the above approaches to ℓ1 -based difference image estimation, one question
that naturally arises is whether we can learn the optimal Φ and Ψ from available
training data. Carvajalino and Sapiro [37] proposed a very interesting scheme to
simultaneously learn ΦD and Ψ using training data. The sparsifying dictionary is
assumed to be overcomplete. This assumption makes it difficult for us to apply it
to our current context. Consider the forward model for computing the difference
image from the object scene at two consecutive time instants. Let the time instants
be t1 and t2 . Then we have
∆x1 = x2 − x1 = [−I I][x1^T x2^T]^T = [−I I]x.   (5.52)
Here [−I I] is an N × 2N matrix with both x1 and x2 being N × 1 vectors. We
know that ∆x1 is sparse. Therefore ideally, we would like to find Ψ such that,
when x = Ψθ, then θ = ∆x̂1. This, however, is not possible using the algorithm
proposed by Carvajalino et al. because Ψ will be a 2N × N matrix which is not an
overcomplete dictionary. Our model has a unique characteristic that the forward
representation is overcomplete while the one in the other direction is not. This
is unlike most compressive sensing signal models where overcompleteness of the
dictionary in this other direction is exploited to achieve sparsity. We can of course
compute the pseudo-inverse of [−I I], but, that is an ℓ2 solution. We therefore let
ΦD be as defined in (5.44).
CHAPTER 6
FEATURE-SPECIFIC DIFFERENCE IMAGING RESULTS
We now present our results for ℓ1 - and ℓ2 -based difference image estimation methods.
We evaluate the performance using measured video imagery of an urban intersection
(object scene; see Fig. 1.3) as the input into a simulation that models compressive
measurements. We use a Panasonic PV-GS500 video camcorder to image the object
scene. The reason we use conventionally imaged data as the truth data and simulate the compressive optical imaging system is to achieve flexibility in accurately
implementing different sensing matrices Φ. This flexibility is required to analyze
performance of our proposed sensing matrices in estimating sequence of difference
images based on ℓ1 - and ℓ2 -norms. On the other hand, we now have to computationally do what a compressive imaging system will do optically. Consequently, instead
of considering the entire 480 × 720 object scene, we reduce the problem by looking
at the scene in 8 × 8, 16 × 16 and 32 × 32 blocks. The blocks are stitched together
to reconstruct the difference image.
To compute the spatial and temporal correlations, we use a training set comprising 6000 frames of the object scene. From each 480 ×720 frame, we chose at random
30 blocks of sizes 8×8, 16×16 and 32×32 to give us 180,000 samples to compute the
spatial auto-correlation matrix. To obtain the spatio-temporal cross-correlation matrix between object scenes at consecutive time instants, we select pairs of successive
frames, and select at random 30 pairs of 8 × 8, 16 × 16 and 32 × 32 blocks from each
frame pair, to again give us approximately 180,000 sample pairs. A pair of blocks
consists of two blocks each drawn from the same region of the two consecutive image frames. Similarly, we extend the spatio-temporal correlations over longer time
lengths for LFGDIE method where, instead of considering two successive frames we
consider multiple consecutive frames. Once computed, these correlation matrices
are stored for use in the testing stage.
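The sample-average estimation of the auto- and cross-correlation matrices described above can be sketched as follows. This assumes grayscale frames stored as 2-D arrays; the block size and the number of blocks per frame mirror the description, but the function itself is illustrative rather than the exact training code.

import numpy as np

def estimate_correlations(frames, block=8, blocks_per_frame=30, rng=None):
    """Estimate R11 (spatial auto-corr.) and R12 (spatio-temporal cross-corr.)
    from co-located random blocks of consecutive frame pairs."""
    rng = np.random.default_rng() if rng is None else rng
    d = block * block
    R11, R12, n = np.zeros((d, d)), np.zeros((d, d)), 0
    for f1, f2 in zip(frames[:-1], frames[1:]):          # consecutive frame pairs
        H, W = f1.shape
        for _ in range(blocks_per_frame):
            r = rng.integers(0, H - block + 1)
            c = rng.integers(0, W - block + 1)
            b1 = f1[r:r + block, c:c + block].reshape(-1, 1)
            b2 = f2[r:r + block, c:c + block].reshape(-1, 1)
            R11 += b1 @ b1.T                              # E[x1 x1^T] sample average
            R12 += b1 @ b2.T                              # E[x1 x2^T] sample average
            n += 1
    return R11 / n, R12 / n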
The testing data comprises 6000 frames. These frames are sub-divided into
100 groups, each comprising the object scene at 60 consecutive time instants. The
testing set includes diverse cases of multiple targets moving at different speeds and
directions. All performance plots (RMSE vs. SNR and RMSE vs. number of
measurements per block (M)) in the following analysis have been averaged over the
60 time steps and the 100 groups.
As discussed in Chapter 5 there is no mathematically tractable optimal (ℓ2 ) sensing matrix. Presently, we consider some possible choices for Φ, which were discussed
in Chapter 5. To remind the reader, they are: principal components (PC), difference
principal components (DPC), waterfilled principal components (WPC), waterfilled
difference principal components (WDPC), numerically computed optimal sensing
Figure 6.1: Estimated difference image for SNR = 10 dB and M = 5 using (a)
WDPC, (b) Optimal, and (c) Truth difference image
matrix (Optimal) and Gaussian random sensing matrix (GRP). The GRP sensing matrix represents a set of fixed (asymptotically) basis projections that do not use any a priori information about the scene. The entries of the GRP matrix are Gaussian
distributed with a mean of zero and a variance of one. We also consider the identity
matrix (Conventional) to mimic the conventional imager. We use the conventional
imager for baseline performance comparison. The conventional imager always images the whole scene, that is, it always makes 480 × 720 = 345600 measurements
per frame.
6.1 ℓ2 -based Difference Image Estimation
Figure 6.1 gives an example of an estimated difference image from a sequence of
difference images using multi-step DDIE. The block size is 8 × 8, the SNR value is
10 dB and the number of measurements per block (M) is 5. For the N = 480 × 720 = 345600-dimensional object scene, M = 5 translates to 27000 measurements and a
compression of measured data by 92.2%. The illustrated example has been computed
with WDPC and optimal sensing matrices. We see that the performance of the
optimal Φ is visually close to that of the truth difference image. More importantly,
WDPC also estimates the difference image with good results. Note that it is much
easier and computationally efficient to compute WDPC than to numerically find the
optimal sensing matrix.
We quantify these results by plotting the root mean squared error (RMSE) as a
function of SNR. We define RMSE as
RMSE = √( Σ_i (∆xi − ∆x̂i)^2 / Σ_i (∆xi)^2 ).   (6.1)
The normalization is done using the truth difference image ∆x.
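In code, the normalized RMSE of (6.1) is simply (a sketch, with the truth and estimated difference images supplied as arrays):

import numpy as np

def rmse(dx_true, dx_est):
    """RMSE normalized by the energy of the truth difference image, eq. (6.1)."""
    return np.sqrt(np.sum((dx_true - dx_est) ** 2) / np.sum(dx_true ** 2))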
Figure 6.2: RMSE vs. SNR plots for 8 × 8 block size. (a) M = 1, and (b) M = 5. M is the number of measurements per block.

Figure 6.2 plots the multi-step DDIE method's RMSE performance as a function
of SNR for M = 1, 5, and for 8 × 8 block size. We see an improvement in quantitative performance with more measurements, with RMSE significantly decreasing
for M = 5. Figure 6.3, however, shows that this is not true generally for increasing
measurements. With increasing M the performance first improves and then can
degrade. This behaviour is a direct result of enforcing the photon count constraint.
For small M any additional measurement adds more information. But, because
the total number of photons is fixed, as the number of measurements increases, the additional information per measurement goes down. Eventually, the additive noise
overwhelms the incremental information and we see a degradation in performance
for larger M. This is true for PC, DPC and WPC sensing matrices. However, both
WDPC and optimal sensing matrices avoid the degradation in performance with increasing measurements because they are optimized for a given SNR. Measurements
are considered until they improve performance. Once information per measurement
begins to be drowned out by noise, the additional measurements are ignored.
Figure 6.3: RMSE vs. number of measurements M for SNR = 20 dB for 8 × 8 block
size.
There seems to be a certain discrepancy between the example estimates in
Fig. 6.1 and the plots in Fig. 6.2(b). Even though the estimated difference image looks good visually, the plots for M = 5 have a relatively high RMSE. This discrepancy is because ℓ2 -minimization minimizes the mean-squared error over an
ensemble and does not explicitly enforce sparsity. Consequently, there are small deviations (from the true pixel values) spread out over the whole estimated difference
image. These small deviations from the truth, when quantified against the sparse
truth difference image, bias the RMSE.
From Fig. 6.2 we can make a few observations about the efficacy of the various
sensing matrices. As expected, waterfilling does improve performance of both PCs
and DPCs by weighting the projections according to noise statistics. This is more
evident in Fig. 6.2(b) where we take five measurements per block. Numerically
searching for the optimal sensing matrix further improves upon WPC and WDPC.
However, the advantage the latter have over a numerical search is that they are
much simpler to compute for changing SNR. Numerical search involves stochastic
tunneling over an M × N multivariate surface. The waterfilling solution, on the other
hand, linearly scales with M. Therefore, searching for the optimal solution is reasonable only if improvement in performance outweighs the increased computational
cost. From Fig. 6.1 we see that this is not the case. For the sake of completeness we
have also plotted the RMSE performance of Gaussian random projections. We are
aware that there is no theoretical framework for non-adaptive GRP to outperform
sensing matrices exploiting a priori information. This is experimentally validated
in the plots, which show GRP to be performing the worst. Lastly, for ℓ2 -based estimation the conventional imager outperforms all the sensing matrices. This is to be
expected as minimizing the mean-squared error alone does not give an advantage to
compressive measurements over a conventional imager. We need an additional constraint, which for our case turns out to be sparsity. We show in the next sub-section
that we can actually beat the conventional imager performance when we enforce
sparsity via non-linear estimation. We stress, however, that as seen in Fig. 6.1, the
qualitative performance of WDPC is visually close to that of the truth difference
image. In fact, from a practical perspective, it can be used to provide a good input
to a tracker.
Multi-step DDIE, defined in (5.23), forms a closed loop between the estimate
of the scene and the difference image. As a result, there will be degradation in
performance over time. To grade the performance of multi-step DDIE, we consider
a clairvoyant scenario for estimating the sequence of difference images. We will refer
to this special case as single-step DDIE. Assuming we are estimating the difference
image of the scene between tk and tk+1 , single-step DDIE always assumes perfect
knowledge of the scene at tk ,
∆x̂k = recon(xk , yk+1 ).
(6.2)
Obviously single-step DDIE is practically infeasible. However, it does bound the
performance of multi-step DDIE and as such allows us to evaluate the efficacy of
the multi-step strategy. Figure 6.4(a) shows the performance comparison between
the two for 60 time steps for M = 5 and SNR = 20 dB. Since single-step DDIE
assumes perfect knowledge at every stage, its RMSE as a function of time is nearly
constant. The RMSE for multi-step DDIE is the same as single-step DDIE at t1 .
With passing time, however, the performance of multi-step DDIE degrades. But
as can be seen, the rate of degradation is slow showing that multi-step DDIE is
temporally robust. In Fig. 6.4(b) we plot the maximum divergence of multi-step
DDIE from the ideal single-step DDIE for changing SNR. Maximum divergence is
the maximum value by which the multi-step method diverges from the single-step
method over the 60 time-steps. The resulting plot is a line with a small slope
indicating multi-step DDIE does not diverge significantly for changing SNR.
Figure 6.4: (a) Performance comparison between single-step and multi-step DDIE
methods for M = 5 and SNR = 20 dB using WDPC. Changing RMSE is compared
over time. (b) Maximum divergence between single-step and multi-step DDIE methods for varying SNR.
We now look at the performance of the LFGDIE method for estimating a sequence of difference images. We claimed that LFGDIE will perform better than multi-step DDIE as it is able to exploit temporal correlation between all time instants.
Figure 6.5 shows that this is indeed the case. RMSE performance has improved
compared to Fig. 6.2(b). As expected, however, the trends are still the same. WDPC
still outperforms all other candidate sensing matrices with the exception of the
numerically searched optimal sensing matrix.
Figure 6.5: RMSE vs. SNR plots for LFGDIE method. Block size is 8 × 8 and
M = 5.
Until now we have considered a block size of 8 × 8. In Fig. 6.6 we graph the RMSE performance as a function of SNR for 16 × 16 block size. We see that for M = 1 there
is an improvement in performance over 8 × 8 block size. In fact, the performance we
get for 16 × 16 block with M = 1 is similar to the performance of 8 × 8 block with
M = 5. Notice that M = 1, for 16×16 block size, implies 1350 measurements for the
whole scene which translates to less than 0.4% measurements in comparison to the
conventional imager. Thus, we see that for the larger block size we have improved
performance with simultaneously higher compression ratio. Figure 6.7, however,
shows that the amount of improvement in performance we get by going from block
size 8 × 8 to 16 × 16 is reduced when we go from block size 16 × 16 to 32 × 32. This
happens because any stationarity that holds for small block sizes of 8 × 8 begins
to break for larger block sizes. As a result the spatial structure represented by the
sample auto- and cross-correlation matrices obtained from training data does not
completely represent the true correlation. The LFGDIE method also has the same
trends as multi-step DDIE. ℓ1 -based estimation on the other hand does not depend
on a stationarity assumption. Its goal is to find the best estimate based on the
measured data that enforces the sparsity constraint. In the following sub-section we
discuss the performance of ℓ1 -based estimation methods.
Figure 6.6: RMSE vs. SNR performance plots for block size 16 × 16, for M = 1.
6.2 ℓ1 -based Difference Image Estimation
The advantage of using the ℓ2 -norm is that we get precise linear estimation operators,
which allow for quick and easy computation of the estimate of the difference image.
The disadvantage, however, has to do with the inability to exploit the sparsity of a difference image. Solving the convex optimization problem of (5.46) allows us to overcome this disadvantage.

Figure 6.7: Optimal sensing matrix based performance comparison between the three block sizes 8 × 8, 16 × 16 and 32 × 32 for (a) M = 1, and (b) M = 5.
Here we consider the PC, DPC, WPC, WDPC and GRP sensing matrices. All
abbreviations are the same as before. Gaussian sensing matrices (GRP) have been
suggested in the theory of compressed sensing for measuring data because they are incoherent with all representation bases. As a result they have become nearly universal
in applications of compressed sensing for being able to reconstruct signals of interest
without prior knowledge of the signal structure.
Figure 6.8 shows examples of estimated difference images for 8 × 8 block size, using the three ℓ1 -based methods discussed in Section 5.2. Visually all three methods are effective, although the TV method performs better than the other two. Both isotropic
and non-isotropic TV regularizers have similar performance. All the results shown
here are for the non-isotropic TV regularizer.

Figure 6.8: Estimated difference image for SNR = 10 dB and M = 5 for (a) sparsity enforced difference image, (b) TV method, and (c) BPDN method.

In Fig. 6.9, we plot the RMSE performance
of the TV method as a function of SNR. The block size is 8×8, and M = 1, 5, 10, 20.
For M = 1, 5 all sensing matrices have similar performance. In fact, at low SNR all
sensing matrices perform better than the conventional imager. Thus, unlike ℓ2 -based
estimation, ℓ1 -based methods are better able to utilize the concentration of energy
into a few measurements. This is surprisingly true even for GRP sensing matrix. At
138
low SNR there is a higher premium on the available energy and as a consequence, a
small number of random measurements perform better than the conventional imager
where the small energy is spread over all the N = 64 measurements. For M = 10, 20
the curves for different sensing matrices begin to separate. Yet the performance of
PC and DPC sensing matrices is similar. This is because ℓ1 -based estimation does
not directly estimate the difference image, but instead, jointly estimates the scene
at the two time instants. As a result, we cannot take advantage of the difference
image form which ℓ2 -based estimation afforded us. We see that waterfilling improves
upon both PC and DPC, but due to the same reason, the performance of WPC and
WDPC is similar too. The WPC and WDPC sensing matrices give the best RMSE
performance.
In Fig. 6.10 we compare the performance of the three ℓ1-based methods by looking at RMSE vs. SNR curves for the WDPC sensing matrix. Among the three ℓ1 strategies, the TV method performs better than the sparsity enforced difference image method and BPDN. The improvement, though, is small, especially relative to the sparsity enforced difference image method. By minimizing the gradient of the difference image, the TV method is best able to capture the changes in intensity across the difference image. On the other hand, difference images are themselves sparse, and hence enforcing sparsity also gives good results. The BPDN method has a higher RMSE than the other two strategies. This is to be expected: although BPDN also minimizes the ℓ1-norm with respect to a sparsifying basis, it does not directly operate on the difference image as the regularization terms of the other two methods do. Instead, the BPDN method computes the joint sparse representation s of the object scene at the two time instants, which results in reduced performance. But, as corroborated in Fig. 6.8, the degradation in performance is small.
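As an illustration of the sparsity enforced difference image idea (a simplified sketch, not the solver used for (5.46)), the snippet below estimates a pixel-domain-sparse difference image directly from the difference of the measurement vectors via ISTA, under the added assumption that the same sensing matrix Φ is used at both time instants so that y2 − y1 = Φ Δx.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_difference_estimate(Phi, y1, y2, lam=0.05, n_iter=500):
    """ISTA for  min_d  0.5*||(y2 - y1) - Phi d||_2^2 + lam*||d||_1,
    i.e. a pixel-domain sparsity-enforced estimate of the difference image d."""
    dy = y2 - y1
    d = np.zeros(Phi.shape[1])
    step = 1.0 / np.linalg.norm(Phi, 2) ** 2   # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = Phi.T @ (Phi @ d - dy)          # gradient of the data-fit term
        d = soft_threshold(d - step * grad, step * lam)
    return d
```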
Figure 6.9: RMSE vs. SNR curves for 8 × 8 blocks, for the TV method. Panels: (a) M = 1, (b) M = 5, (c) M = 10, (d) M = 20; curves: PC, DPC, WPC, WDPC, GRP, and Conventional.
Plotting RMSE as a function of M shows that we are able to beat the performance of the conventional imager. This is illustrated in Fig. 6.11. At low SNR and for small numbers of measurements we see a wide gap in performance between all sensing matrices and the conventional imager. Note that the conventional imager always makes 720 × 480 = 345600 measurements. As the number of measurements increases, the performance of all the sensing matrices begins to degrade. The rate of this degradation is a function of SNR and slows down as SNR increases. When the SNR is high there is no advantage to be gained from better utilizing the energy, since there is enough energy for all measurements, and the conventional imager performs best. Finally, we note that sensing matrices using a priori information perform better than GRP. Looking at the RMSE vs. SNR performance of GRP, we see that it has the largest error at all SNRs and for any number of measurements. At the same time, the RMSE vs. M plots show that GRP performance degrades fastest among all sensing matrices.
Increasing the block size leads to improved performance for a smaller number of measurements as a fraction of the total. Unlike the ℓ2-based methods, here we do not suffer from the stationarity assumption. In Fig. 6.12 we plot the performance of the TV method using the WDPC sensing matrix for 8 × 8 and 32 × 32 block sizes. We see that the rate of performance degradation is significantly reduced for the 32 × 32 block size. The improved performance is mainly because the sparsity condition holds better for larger block sizes.
Figure 6.10: WDPC sensing matrix RMSE vs. SNR curves for the three ℓ1-based difference image estimation methods (TV, BPDN, and sparsity enforced difference image). Block size 8 × 8 and M = 1.
If the block size is very small, then even a sparse image might not be sparse within that block. With increasing block size, however, the blocks themselves become sparse, and as a result we get better performance.
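A quick way to see this effect on data is to tabulate the per-block fraction of significant pixels in a difference image, as in the sketch below (an illustrative diagnostic, not part of the estimation methods; the threshold value is arbitrary). With small blocks, a block that contains a change can have a large nonzero fraction relative to its size, whereas the same localized change occupies only a small fraction of a larger block.

```python
import numpy as np

def per_block_nonzero_fraction(diff_image, block, thresh=1e-3):
    """Fraction of pixels whose magnitude exceeds `thresh` in each
    non-overlapping block x block patch of a difference image."""
    H, W = diff_image.shape
    fractions = []
    for r in range(0, H, block):
        for c in range(0, W, block):
            patch = diff_image[r:r + block, c:c + block]
            fractions.append(np.mean(np.abs(patch) > thresh))
    return np.asarray(fractions)

# The worst-case (maximum) per-block fraction is what limits l1 recovery of a
# block; for spatially localized changes it is typically larger for 8 x 8
# blocks than for 32 x 32 blocks.
```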
This trend of improved performance with increasing block size augurs well for a future practical implementation of our proposed strategies. A practical system based on the FS imagers discussed in Section 2.2, with the capability to handle different sensing matrices, could optically measure the entire scene at very high speed. An optical system would not face the limitations we encounter when simulating measurements of a large scene. Subdividing images into smaller blocks has served as an important tool for studying the performance and efficacy of the different estimation methods, which has been the primary goal of this work.
Figure 6.11: RMSE vs. number of measurements per block, for 8 × 8 blocks, for the TV method. Panels: (a) SNR = −20 dB, (b) SNR = −10 dB, (c) SNR = 0 dB, (d) SNR = 40 dB; curves: PC, DPC, WPC, WDPC, GRP, and Conventional.
Figure 6.12: Performance comparison of the ℓ1-based TV method using 8 × 8 and 32 × 32 blocks. (RMSE vs. number of measurements per block at SNR = −20 dB; curves: WDPC and DPC for each block size, and Conventional.)
CHAPTER 7
CONCLUSION AND FUTURE WORK
In this dissertation we presented novel advances in feature-specific (FS) imaging for large field-of-view surveillance and for the estimation of temporal object scene changes within the compressive imaging paradigm.
We first presented a novel technique to continuously track targets in a large FOV without conventional image reconstruction. The method is based on optical multiplexing of encoded sub-FOVs to create superposition-space data that can be used to decode target positions in object space. We proposed a class of low-complexity multiplexed imagers to perform the optical encoding and showed that they can be light, inexpensive, and simple in design compared to wide-FOV conventional imagers. We discussed different encoding schemes based on spatial shifts, rotations, and magnification, with special emphasis on 1-D spatial shift encoding. We showed, based on both simulation and experimental data, that the proposed method does indeed localize targets in object space and provides continuous target tracking capability. We also studied the trade-offs among area coverage efficiency, compression ratio, and decoding time, as well as the decoding error as a function of shift resolution and SNR. This study of trade-offs raised the important problem of finding optimal encodings, which needs further investigation. We briefly touched on this in Section 4.2, where we showed that the approximately linear relationship between decoding time and area coverage efficiency precludes an optimal overlap in the spatial-encoding scheme. However, it might be possible to develop a hybrid encoding scheme employing a combination of spatial, rotational, and magnification encodings; a possible approach was outlined in Section 4.2. Other approaches may also be worth investigating.
In the second part of the dissertation we applied the FS paradigm to perform difference imaging. We presented various FS sensing matrices for the noise-absent and noise-present cases. These sensing matrices were used to take compressive measurements of the object scene. In conjunction with this compressive measurement scheme, we presented ℓ2- and ℓ1-based techniques for estimating a sequence of difference images from the sequence of compressive measurements. We also presented qualitative and quantitative results attesting that both techniques successfully estimate the difference images within the FS compressive imaging paradigm. Each technique has its advantage: ℓ2-based techniques give closed-form expressions for the linear estimation operators that are easy to compute, whereas ℓ1-based methods exploit the natural sparsity of the difference image. Within the ℓ2-based techniques we looked at the multi-step DDIE and LFGDIE methods to directly reconstruct the difference image from the compressive measurements. The ℓ2-based techniques' use of spatio-temporal correlation matrices requires the assumption of wide-sense stationarity, which seldom holds in practice, with the possible exception of texture images. Despite this, there is considerable literature on using second-order statistics to perform various image processing and imaging tasks [80], [81], [82], [83], [4], and we have empirically shown that our proposed techniques yield good performance. Methods have also been suggested for transforming non-stationary images to exhibit stationary characteristics [84], [85]. Within the compressive imaging paradigm, incorporating non-stationarity into the FS imager would require a sensing-matrix update model that evolves with the temporal object scene. Developing such a model, however, requires a method to associate the object scene with the measurements. Because of the structure that the sensing matrix possesses, such an association remains a subject of potential research work.
For the ℓ1-based estimation problem, we looked at three different approaches to the linear inverse problem and compared their performance. We found that the modified TV method performs best, although the method that enforces the sparsity condition performs only slightly worse. Lastly, we found that the WDPC sensing matrix had the lowest RMSE for both the ℓ2- and ℓ1-based methods, although for the latter WPC did equally well. All sensing matrices that utilized a priori information performed better than the non-adaptive GRP sensing matrix. In fact, from a practical perspective, depending on the SNR and the number of measurements that can be taken, any one of them can be used to provide a decent input to a tracker.
APPENDIX A
DERIVATIONS FOR NOISE ABSENT CASE
In this appendix we derive the difference image estimation operators and the ℓ2-optimal sensing matrix for the “noise absent” case.
Let x1 and x2 be the object scene at two consecutive time instants. Let us also define the true difference image as

Δx = x2 − x1.   (A.1)

The signal model for the “noise absent” case is given by

y1 = x1,   (A.2)
y2 = Φ2 x2,   (A.3)

where Φ2 is the sensing matrix we want to design. Based on the signal model, the estimate of the difference image is given by

Δx̂ = W1 y1 + W2 y2,   (A.4)

where W1 and W2 are the difference image estimation operators.
As the first step, we express the estimation operators in terms of the sensing matrix. We then use those expressions to compute the ℓ2-optimal sensing matrix.
Toward that end, we minimize the BMSE

J(W1, W2) = E[ ||Δx − Δx̂||²_ℓ2 ],   (A.5)

re-written as

J(W1, W2) = E[ tr((Δx − Δx̂)(Δx − Δx̂)^T) ].   (A.6)

On substituting (A.4) in (A.6), differentiating the result with respect to W1 and W2, and equating both derivatives to zero, we get

W1 = (R21 − R11 − W2 Φ2 R21) R11^{-1},   (A.7)

W2 = (R22 − R12 − W1 R12) Φ2^T (Φ2 R22 Φ2^T)^{-1},   (A.8)

where R11 = E[x1 x1^T], R12 = E[x1 x2^T] = R21^T and R22 = E[x2 x2^T]. On further simplification we can write the estimation operators as

W1 = (I − Rδ Φ2^T (Φ2 Rδ Φ2^T)^{-1} Φ2) R21 R11^{-1} − I,   (A.9)

W2 = Rδ Φ2^T (Φ2 Rδ Φ2^T)^{-1},   (A.10)

where I is an N × N identity matrix and Rδ = R22 − R21 R11^{-1} R12. Substituting (A.9), (A.10) and (A.4) in (A.6) we get

J(W1, W2) = tr( (Rδ Φ2^T (Φ2 Rδ Φ2^T)^{-1} Φ2 − I) Rδ (Rδ Φ2^T (Φ2 Rδ Φ2^T)^{-1} Φ2)^T ).   (A.11)

Differentiating (A.11) with respect to Φ2 and equating the derivative to zero, we get

Φ2 Rδ^2 Φ2^T (Φ2 Rδ Φ2^T)^{-1} Φ2 = Φ2 Rδ.   (A.12)
For the noise absent case, the ℓ2-optimal matrix decomposition of Rδ is given by its eigendecomposition Rδ = Q D Q^T, where D is the diagonal matrix of eigenvalues arranged in descending order and the corresponding eigenvectors form the columns of Q. Substituting this eigendecomposition in (A.12) and denoting Φ2 Q by Z, we get

Z D² Z^T (Z D Z^T)^{-1} Φ2 = Z D Q^T,
Z D² Z^T (Z D Z^T)^{-1} Φ2 Q = Z D.   (∵ Q is a unitary matrix)   (A.13)

Denoting Z D² Z^T (Z D Z^T)^{-1} by S, we re-write (A.13) as

S Z = Z D.   (A.14)

Note that S is an M × M matrix and therefore has only M eigenvectors. Consequently, N − M columns of Z must be zero vectors. Let us denote Z by [X 0], where X is an M × M matrix and 0 is an M × (N − M) matrix. Let us also write D as the block-diagonal matrix diag(D_M, D_{N−M}), where D_M and D_{N−M} are M × M and (N − M) × (N − M) diagonal matrices containing the first M and the remaining N − M eigenvalues, respectively. If we assume that Φ2 has full row rank M and that X is an invertible matrix, then (A.13) is satisfied. Therefore, using Z = Φ2 Q, the sensing matrix is given by Φ2 = [X 0] Q^T.
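A minimal NumPy sketch of this result is given below, assuming the correlation matrices R11, R12, R22 are available (for example, estimated from training data) and choosing X = I_M for concreteness; it builds Φ2 = [X 0] Q^T from the eigendecomposition of Rδ and then forms W1 and W2 according to (A.9) and (A.10). The function name is illustrative.

```python
import numpy as np

def noise_absent_design(R11, R12, R22, M):
    """l2-optimal sensing matrix Phi2 = [X 0] Q^T (with X = I_M) and the
    estimation operators W1, W2 of (A.9)-(A.10) for the noise absent case."""
    N = R11.shape[0]
    R21 = R12.T
    R_delta = R22 - R21 @ np.linalg.solve(R11, R12)
    vals, Q = np.linalg.eigh(R_delta)                 # ascending eigenvalues
    Q = Q[:, np.argsort(vals)[::-1]]                  # reorder to descending
    Phi2 = np.hstack([np.eye(M), np.zeros((M, N - M))]) @ Q.T
    W2 = R_delta @ Phi2.T @ np.linalg.inv(Phi2 @ R_delta @ Phi2.T)
    W1 = (np.eye(N) - W2 @ Phi2) @ R21 @ np.linalg.inv(R11) - np.eye(N)
    return Phi2, W1, W2
```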
APPENDIX B
LFGDIE AND DDIE ESTIMATION OPERATORS
We first derive the difference image estimation operators for the LFGDIE method, and then obtain the estimation operators for the multi-step DDIE method as a special case of the LFGDIE method.
The BMSE between Δx1L and Δx̂1L is

J(W′1, W′p, W′L) = E[ ||Δx1L − Δx̂1L||²_ℓ2 ],   (B.1)

where Δx1L = xL − x1 and Δx̂1L = [W′1 W′p W′L][y1^T yp^T yL^T]^T as shown in (5.27). We then re-write (B.1) as

J(W′1, W′p, W′L) = E[ tr((Δx1L − Δx̂1L)(Δx1L − Δx̂1L)^T) ]   (B.2)
= E[ tr( Δx1L Δx1L^T − Δx1L Δx̂1L^T − Δx̂1L Δx1L^T + Δx̂1L Δx̂1L^T ) ].   (B.3)
Note that, with the exception of the first term in (B.3), the last three terms depend on W′1, W′p and W′L. Their explicit dependence is respectively given by

Term 1: E[tr(Δx1L Δx̂1L^T)] = tr( (RL1 − R11) W′1^T + (RLp − R1p) Φp^T W′p^T + (RLL − R1L) ΦL^T W′L^T ),   (B.4)

Term 2: E[tr(Δx̂1L Δx1L^T)] = tr( W′1 (R1L − R11) + W′p Φp (RpL − Rp1) + W′L ΦL (RLL − R1L) )   (B.5)
= Term 1,   (∵ tr(G^T) = tr(G))   (B.6)

Term 3: E[tr(Δx̂1L Δx̂1L^T)] = tr( W′1 R11 W′1^T + W′1 R1p Φp^T W′p^T + W′1 R1L ΦL^T W′L^T + W′1 Rn1 W′1^T
+ W′p Φp Rp1 W′1^T + W′p Φp Rpp Φp^T W′p^T + W′p Φp RpL ΦL^T W′L^T + W′p Rnp W′p^T
+ W′L ΦL RL1 W′1^T + W′L ΦL RLp Φp^T W′p^T + W′L ΦL RLL ΦL^T W′L^T + W′L RnL W′L^T ).   (B.7)

Substituting these three terms in (B.3), differentiating with respect to W′1, W′p and W′L, and setting the derivatives equal to zero, we get

W′1 = (RL1 − R11 − W′p Φp Rp1 − W′L ΦL RL1)(R11 + Rn1)^{-1},   (B.8)

W′p = (RLp − R1p − W′1 R1p − W′L ΦL RLp) Φp^T (Φp Rpp Φp^T + Rnp)^{-1},   (B.9)

W′L = (RLL − R1L − W′1 R1L − W′p Φp RpL) ΦL^T (ΦL RLL ΦL^T + RnL)^{-1}.   (B.10)
On simplifying (B.8), (B.9) and (B.10) we get the equations (5.29) through (5.39).
To get the expressions (5.13) through (5.16) for the multi-step DDIE method we simply set L = 2, resulting in p = 0, RLγL = Rα, RαL = Rα − Rβ, W′p = 0, W′1 = W1 and W′L = W2. Note that with p = 0 all correlation matrices with p in the subscript go to 0.
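Because (B.8)–(B.10) are coupled, one simple and purely illustrative way to evaluate the operators numerically, short of the closed-form simplification (5.29)–(5.39), is a Gauss–Seidel style fixed-point iteration over the three update equations, as sketched below. The correlation matrices are assumed to be available from training data, the function name and dictionary keys are hypothetical, and convergence should be checked in practice.

```python
import numpy as np

def lfgdie_operators(R, Phi_p, Phi_L, Rn1, Rnp, RnL, n_iter=100):
    """Iteratively evaluate the coupled equations (B.8)-(B.10).
    `R` maps subscript strings ('11', 'L1', '1p', 'Lp', 'pp', 'p1', 'pL',
    '1L', 'LL') to the corresponding correlation matrices."""
    N = R['11'].shape[0]
    W1 = np.zeros((N, N))
    Wp = np.zeros((N, Phi_p.shape[0]))
    WL = np.zeros((N, Phi_L.shape[0]))
    for _ in range(n_iter):
        W1 = (R['L1'] - R['11'] - Wp @ Phi_p @ R['p1'] - WL @ Phi_L @ R['L1']) \
             @ np.linalg.inv(R['11'] + Rn1)
        Wp = (R['Lp'] - R['1p'] - W1 @ R['1p'] - WL @ Phi_L @ R['Lp']) \
             @ Phi_p.T @ np.linalg.inv(Phi_p @ R['pp'] @ Phi_p.T + Rnp)
        WL = (R['LL'] - R['1L'] - W1 @ R['1L'] - Wp @ Phi_p @ R['pL']) \
             @ Phi_L.T @ np.linalg.inv(Phi_L @ R['LL'] @ Phi_L.T + RnL)
    return W1, Wp, WL
```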
APPENDIX C
WATERFILLING SOLUTION
We first calculate the maximum mutual information between x and y given the
sensing matrix Φwpc , and then use it to compute the weights wi , i = 1, . . . , M.
I(x; y) = h(y) − h(y|x),   (elements of x and y take on a continuum of values)
= h(y) − h(n),
= log( (2πe)^M |Ry| / |Rn| ),   (log is w.r.t. base 2)   (C.1)
= log (2πe)^M + log( |Ry| / |Rn| ),   (C.2)

where h represents differential entropy and | · | denotes the matrix determinant. Note that, for a given covariance, entropy is maximized when y is multi-variate Gaussian, which is achieved for a multi-variate Gaussian input. (We already know that the noise is AWGN.) In our case the object scene is not Gaussian, so we obtain a sub-optimal solution. Yet, as shown in Chapter 6, we do see the benefit of adapting the principal components according to the given SNR. Here we complete the proof assuming the optimal scenario of a multi-variate Gaussian input.
From (C.2), the problem of maximizing the mutual information is reduced to maximizing log(|Ry|/|Rn|). We have

log( |Ry| / |Rn| )
= log( |Φwpc Rx Φwpc^T + λn IM| / |λn IM| ),   (λn is the noise variance σn²)
= log( |Dw (U(1:N, 1:M))^T Rx (U(1:N, 1:M)) Dw^T + λn IM| / |λn IM| ),   (∵ Φwpc = Dw (U(1:N, 1:M))^T)
= log( |Dw (U(1:N, 1:M))^T U Λ U^T (U(1:N, 1:M)) Dw + λn IM| / |λn IM| ),   (∵ Rx = U Λ U^T, Dw^T = Dw)
= log( |Dw I(M,N) Λ I(N,M) Dw + λn IM| / |λn IM| ),   (I(M,N) = [IM | 0(M,N−M)])
= log( |Dw ΛM Dw + λn IM| / |λn IM| ),   (ΛM is a diagonal matrix with the first M eigenvalues)
= log Π_{i=1}^{M} ( (wi² λi + λn) / λn ),
= Σ_{i=1}^{M} log( (wi² λi + λn) / λn ).   (C.3)
Our goal has thus been reduced to maximizing (C.3), subject to the constraint

Σ_{i=1}^{M} wi² = Σ_{i=1}^{N} √λi = E,

which says that the norm of the weights must not exceed the total energy available.¹ (¹ We perform non-coherent imaging.) Using Lagrange multipliers, our objective is written as
J(w1², . . . , wM²) = Σ_{i=1}^{M} log( (wi² λi + λn) / λn ) − ζ( Σ_{i=1}^{M} wi² − E )
= Σ_{i=1}^{M} [ log( (wi² λi + λn) / λn ) − ζ( wi² − E/M ) ].   (C.4)
From (C.4) we see that we can maximize each summation term individually.
To maximize the ith term (i = 1, . . . , M), we differentiate J(wi²) = log( (wi² λi + λn) / λn ) − ζ( wi² − E/M ) with respect to wi² to get

wi² = 1/ζ − λn/λi.   (C.5)

Note that wi² cannot be negative. Therefore, we can rewrite (C.5) as

wi² = max[ 0, 1/ζ − λn/λi ],   (C.6)

or

wi² = [ 1/ζ − λn/λi ]⁺.   (C.7)

We choose the value of ζ such that Σ_{i=1}^{M} [ 1/ζ − λn/λi ]⁺ = E.
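Numerically, ζ can be found by a one-dimensional search on the water level 1/ζ, since the total allocated energy is monotone in 1/ζ. The sketch below is an illustrative implementation of (C.6)–(C.7) using bisection, with an arbitrary small example at the end.

```python
import numpy as np

def waterfilling_weights(eigvals, noise_var, E, tol=1e-9):
    """Compute w_i^2 = [1/zeta - lambda_n/lambda_i]^+ as in (C.6)-(C.7),
    choosing zeta by bisection on the water level 1/zeta so that the
    weights sum to the available energy E."""
    lam = np.asarray(eigvals, dtype=float)       # first M eigenvalues lambda_i
    lo, hi = 0.0, E + np.max(noise_var / lam)    # bracket for 1/zeta
    while hi - lo > tol:
        level = 0.5 * (lo + hi)                  # candidate water level 1/zeta
        if np.maximum(0.0, level - noise_var / lam).sum() > E:
            hi = level
        else:
            lo = level
    return np.maximum(0.0, 0.5 * (lo + hi) - noise_var / lam)

# Small example: the energy concentrates on the strongest eigenvalues when the
# noise is large relative to the eigenvalues, and spreads out otherwise.
w2 = waterfilling_weights(eigvals=[5.0, 2.0, 1.0, 0.5], noise_var=1.0, E=4.0)
```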
REFERENCES
[1] D. Whitehouse, “World's oldest telescope?” [Online]. Available: http://news.bbc.co.uk/2/hi/science/nature/380186.stm
[2] D. J. Brady, Optical Imaging and Spectroscopy. John Wiley & Sons, 2009.
[3] S. Tucker, W. T. Cathey, and J. Edward Dowski, “Extended depth of
field and aberration control for inexpensive digital microscope systems,”
Opt. Express, vol. 4, no. 11, pp. 467–474, 1999. [Online]. Available:
http://www.opticsexpress.org/abstract.cfm?URI=oe-4-11-467
[4] H. S. Pal, D. Ganotra, and M. A. Neifeld, “Face recognition by using
feature-specific imaging,” Appl. Opt., vol. 44, no. 18, pp. 3784–3794, 2005.
[Online]. Available: http://ao.osa.org/abstract.cfm?URI=ao-44-18-3784
[5] M. A. Neifeld and P. Shankar, “Feature-specific imaging,” Appl.
Opt., vol. 42, no. 17, pp. 3379–3389, 2003. [Online]. Available:
http://ao.osa.org/abstract.cfm?URI=ao-42-17-3379
[6] S. Prasad, “Information capacity of a seeing-limited imaging system,” Optics
Communications, vol. 177, no. 1-6, pp. 119 – 134, 2000.
[7] E. Clarkson and H. H. Barrett, “Approximations to ideal-observer performance
on signal-detection tasks,” Appl. Opt., vol. 39, no. 11, pp. 1783–1793, 2000.
[Online]. Available: http://ao.osa.org/abstract.cfm?URI=ao-39-11-1783
[8] W.-C. Chou, M. A. Neifeld, and R. Xuan, “Information-based optical design
for binary-valued imagery,” Appl. Opt., vol. 39, no. 11, pp. 1731–1742, 2000.
[Online]. Available: http://ao.osa.org/abstract.cfm?URI=ao-39-11-1731
[9] A. Ashok and M. Neifeld, “Information-based analysis of simple incoherent
imaging systems,” Opt. Express, vol. 11, no. 18, pp. 2153–2162, 2003. [Online].
Available: http://www.opticsexpress.org/abstract.cfm?URI=oe-11-18-2153
[10] M. P. Christensen, G. W. Euliss, M. J. McFadden, K. M. Coyle, P. Milojkovic,
M. W. Haney, J. van der Gracht, and R. A. Athale, “Active-eyes: an adaptive
pixel-by-pixel image-segmentation sensor architecture for high-dynamic-range
hyperspectral imaging,” Appl. Opt., vol. 41, no. 29, pp. 6093–6103, 2002.
[Online]. Available: http://ao.osa.org/abstract.cfm?URI=ao-41-29-6093
[11] D. L. Marks, R. Stack, A. J. Johnson, D. J. Brady, and D. C.
Munson, “Cone-beam tomography with a digital camera,” Appl.
Opt., vol. 40, no. 11, pp. 1795–1805, 2001. [Online]. Available:
http://ao.osa.org/abstract.cfm?URI=ao-40-11-1795
[12] Z. Liu, M. Centurion, G. Panotopoulos, J. Hong, and D. Psaltis, “Holographic
recording of fast events on a ccd camera,” Opt. Lett., vol. 27, no. 1, pp. 22–24,
2002. [Online]. Available: http://ol.osa.org/abstract.cfm?URI=ol-27-1-22
[13] P. Potuluri, M. Fetterman, and D. Brady, “High depth of field
microscopic imaging using an interferometric camera,” Opt. Express, vol. 8, no. 11, pp. 624–630, 2001. [Online]. Available:
http://www.opticsexpress.org/abstract.cfm?URI=oe-8-11-624
[14] P. K. Baheti and M. A. Neifeld, “Feature-specific structured imaging,”
Appl. Opt., vol. 45, no. 28, pp. 7382–7391, 2006. [Online]. Available:
http://ao.osa.org/abstract.cfm?URI=ao-45-28-7382
[15] M. A. Neifeld and J. Ke, “Optical architectures for compressive imaging,” Applied Optics, vol. 46, pp. 5293–5303, 2007.
[16] M. B. Wakin, J. N. Laska, M. F. Duarte, D. Baron, S. Sarvotham, D. Takhar,
K. F. Kelly, and R. G. Baraniuk, “An architecture for compressive imaging,”
in IEEE International Conference on Image Processing, 2006, pp. 1273–1276.
[17] E. J. Candès and T. Tao, “The Dantzig selector: statistical estimation when p is much larger than n,” Annals of Statistics, vol. 35, pp. 2313–2351, 2005. [Online]. Available: http://www-stat.stanford.edu/∼candes/papers/DantzigSelector.pdf
[18] ——, “Decoding by linear programming,” IEEE Transactions on Information
Theory, vol. 51, no. 12, pp. 4203–4215, 2005. [Online]. Available:
http://doi.ieeecomputersociety.org/10.1109/TIT.2005.858979
[19] A. C. Gilbert, S. Muthukrishnan, and M. Strauss, “Improved time bounds for
near-optimal sparse fourier representations,” in in Proc. SPIE Wavelets XI,
M. Papadakis, A. F. Laine, and M. A. Unser, Eds., vol. 5914, 2003.
[20] E. J. Candès, J. K. Romberg, and T. Tao, “Robust uncertainty principles: exact
signal reconstruction from highly incomplete frequency information,” IEEE
Transactions on Information Theory, vol. 52, no. 2, pp. 489–509, 2006. [Online].
Available: http://doi.ieeecomputersociety.org/10.1109/TIT.2005.862083
[21] E. J. Candès and T. Tao, “Near-optimal signal recovery from random
projections: Universal encoding strategies?”
IEEE Transactions on
Information Theory, vol. 52, no. 12, pp. 5406–5425, 2006. [Online]. Available:
http://doi.ieeecomputersociety.org/10.1109/TIT.2006.885507
[22] E. J. Candès, J. K. Romberg, and T. Tao, “Signal recovery from incomplete
and inaccurate measurements,” Comm. Pure Appl. Math., vol. 59, no. 8, pp.
1207–1223, 2005.
[23] E. J. Candès and J. K. Romberg, “Quantitative robust uncertainty principles
and optimally sparse decompositions,” Foundations of Computational
Mathematics, vol. 6, no. 2, pp. 227–254, 2006. [Online]. Available:
http://dx.doi.org/10.1007/s10208-004-0162-x
[24] D. L. Donoho, “Compressed sensing,” Sept. 2004.
[25] D. L. Donoho and X. Huo, “Uncertainty principles and ideal atomic decomposition,” IEEE Transactions on Information Theory, vol. 47, no. 7, pp. 2845–2862,
2001.
[26] D. L. Donoho and M. Elad, “Optimally sparse representation in general
(nonorthogonal) dictionaries via ℓ1 -minimization,” Proc. Natl. Acad. Sci. USA,
vol. 100, pp. 2197–2202, 2003.
[27] M. Elad and A. M. Bruckstein, “A generalized uncertainty principle and sparse
representation in pairs of bases,” IEEE Transactions on Information Theory,
vol. 48, no. 9, pp. 2558–2567, 2002.
[28] R. Gribonval and M. Nielsen, “Sparse representations in unions of bases,” IEEE
Transactions on Information Theory, vol. 49, no. 12, pp. 3320–3325, 2003.
[29] A. Feuer and A. Nemirovski, “On sparse representation in pairs of bases,”
Information Theory, IEEE Transactions on, vol. 49, no. 6, pp. 1579 – 1581,
june 2003.
[30] J. Fuchs, “On sparse representations in arbitrary redundant bases,” Information
Theory, IEEE Transactions on, vol. 50, no. 6, pp. 1341 – 1344, june 2004.
[31] D. L. Donoho and I. M. Johnstone, “Minimax estimation via wavelet shrinkage,” Ann. Statist., vol. 26, no. 3, pp. 879–921, 1998.
[32] A. Zandi, J. D. Allen, E. L. Schwartz, and M. Boliek, “CREW: Compression with reversible embedded wavelets,” in Proc. of IEEE Data Compression Conference, March 1995, pp. 212–221.
[33] M. Boliek, M. Gormish, E. L. Schwartz, and A. F. Keith, “Decoding compression with reversible embedded wavelets (CREW) codestreams,” Journal of
Electronic Imaging, vol. 7, no. 3, pp. 402–409, 1998.
[34] M. Lustig, D. L. Donoho, and J. M. Pauly, “Sparse MRI: The application of
compressed sensing for rapid MR imaging,” Magnetic Resonance in Medicine,
2007.
[35] M. Herman and T. Strohmer, “High-resolution radar via compressed sensing,”
Signal Processing, IEEE Transactions on, vol. 57, no. 6, pp. 2275 –2284, june
2009.
[36] M. Elad, “Optimized projections for compressed sensing,” IEEE Trans. Signal
Process., vol. 55, no. 12, pp. 5695–5702, 2007.
[37] J. M. D. Carvajalino and G. Sapiro, “Learning to sense sparse signals: simultaneous sensing matrix and sparsifying dictionary optimization,” IEEE Trans.
Image Process., vol. 18, no. 7, pp. 1395–1408, 2009.
[38] D. W. Xue and Z. M. Lu, “Difference image watermarking based reversible image authentication with tampering localization capability,” International Journal of Computer Sciences and Engineering Systems, vol. 2, pp. 219–226, 2008.
[39] S. K. Lee, Y. H. Suh, and Y. S. Ho, “Reversible image authentication based
on watermarking,” in IEEE International Conference on Multimedia & Expo,
2006, pp. 1321–1324.
[40] P. B. Heffernan and R. A. Robb, “Difference image reconstruction from a few
projections for nondestructive materials inspection,” Applied Optics, vol. 24,
pp. 4105–4110, 1985.
[41] S. G. Kong, “Classification of interframe difference image blocks for video compression,” in Proceedings of SPIE, vol. 4668, 2002, pp. 29–37.
[42] V. Cevher, A. Sankaranarayanan, M. F. Duarte, D. Reddy, R. G. Baraniuk,
and R. Chellappa, “Compressive sensing for background subtraction,” in Proc.
10th European Conf. Comp. Vision, 2008, pp. 155–168.
[43] D. Hahn, V. Daum, J. Hornegger, W. Bautz, and T. Kuwert, “Difference imaging of inter- and intra-ictal spect images for the localization of seizure onset in
epilepsy,” in IEEE Nuclear Science Symposium Conference Record, 2007, pp.
4331–4335.
[44] C. Alcock, R. A. Allsman, D. Alves, T. S. Axelrod, A. C. Becker, D. P. Bennett, K. H. Cook, A. J. Drake, K. C. Freeman, K. Griest, M. J. Lehner, S. L.
Marshall, D. Minniti, B. A. Peterson, M. R. Pratt, P. J. Quinn, C. W. Stubbs,
W. Sutherland, A. Tomaney, T. Vandehei, and D. L. Welch, “Difference image
analysis of galactic microlensing. i. data analysis,” Astrophys. J., vol. 521, pp.
602–612, 1999.
[45] L. Bruzzone and D. F. Prieto, “Automatic analysis of the difference image for
unsupervised change detection,” IEEE Trans. Geosci. Remote Sens., vol. 8, pp.
1171–1182, 2000.
[46] S. M. Kay, Fundamentals of Statistical Processing, Volume I: Estimation Theory. Prentice Hall, 1993.
[47] J. I. ’t Zand, Ph.D. dissertation, University of Utrecht, 1992.
[48] E. E. Fenimore and T. M. Cannon, “Coded aperture imaging with uniformly
redundant arrays,” Appl. Opt., vol. 17, no. 3, pp. 337–347, 1978. [Online].
Available: http://ao.osa.org/abstract.cfm?URI=ao-17-3-337
[49] E. E. Fenimore, “Time-resolved and energy-resolved coded aperture images
with ura tagging,” Appl. Opt., vol. 26, no. 14, pp. 2760–2769, 1987. [Online].
Available: http://ao.osa.org/abstract.cfm?URI=ao-26-14-2760
[50] M. Sims, M. Turner, and R. Willingale, “Wide field x-ray camera,” Space Science Instrumentation, vol. 5, pp. 109–127, 1980.
[51] R. Willingale, M. Sims, and M. Turner, “Advanced deconvolution techniques
for coded aperture imaging,” Nucl. Instr. Methods Phys. Res., vol. 60, p. 221,
1984.
[52] B. R. Frieden and J. J. Burke, “Restoring with maximum entropy, ii:
Superresolution of photographs of diffraction-blurred impulses,” J. Opt.
Soc. Am., vol. 62, no. 10, pp. 1202–1210, 1972. [Online]. Available:
http://www.opticsinfobase.org/abstract.cfm?URI=josa-62-10-1202
[53] G. Daniell, “Image restoration and processing methods,” Nucl. Instr. Methods
Phys. Res., vol. 67, p. 221, 1984.
[54] S. R. Gottesman and E. E. Fenimore, “New family of binary arrays for coded
aperture imaging,” Appl. Opt., vol. 28, no. 20, pp. 4344–4352, 1989. [Online].
Available: http://ao.osa.org/abstract.cfm?URI=ao-28-20-4344
[55] M. D. Stenner, P. Shankar, and M. A. Neifeld, “Wide-field feature-specific
imaging,” Frontiers in Optics, 2007.
[56] D. J. Brady, “Micro-optics and megapixels,” Optics and Photonics News,
vol. 17, pp. 24–29, 2006.
[57] D. Du and F. Hwang, Combinatorial group testing and its applications, ser.
Series on Applied Mathematics. World Scientific, 2000, vol. 12.
[58] C. M. Brown, “Multiplex imaging with random arrays,” Ph.D. dissertation,
Univ. of Chicago, 2000.
[59] D. J. Brady and M. E. Gehm, “Compressive imaging spectrometers using coded
apertures,” Proc. SPIE, vol. 6246, 2006.
[60] R. H. Dicke, “Scatter-hole cameras for x-rays and gamma rays,” Astrophys J.,
vol. 153, pp. L101–L106, 1968.
[61] S. Uttam, N. A. Goodman, M. A. Neifeld, C. Kim, R. John, J. Kim, and
D. Brady, “Optically multiplexed imaging with superposition space tracking,”
Opt. Express, vol. 17, no. 3, pp. 1691–1713, 2009. [Online]. Available:
http://www.opticsexpress.org/abstract.cfm?URI=oe-17-3-1691
[62] A. Biswas, P. Guha, A. Mukerjee, and K. S. Venkatesh, “Intrusion detection
and tracking with pan-tilt cameras,” IET Intl. Conf. VIE 06, pp. 565–571,
2006.
[63] A. W. Senior, A. Hampapur, and M. Lu, “Acquiring multi-scale images by pan-tilt-zoom control and automatic multi-camera calibration,”
WACV/MOTION’05, pp. 433–438, 2005.
[64] E. Voigtman and J. D. Winefordner, “The multiplex disadvantage and excess
low-frequency noise,” Appl. Spectrosc., vol. 41, no. 7, pp. 1182–1184, 1987.
[Online]. Available: http://as.osa.org/abstract.cfm?URI=as-41-7-1182
[65] C. R. Wren, A. Azarbayejani, T. J. Darrell, and A. P. Pentland, “Pfinder: Realtime tracking of the human body,” IEEE Trans. Pattern Anal. Mach. Intell,
vol. 19, pp. 780–785, 1997.
[66] N. Friedman and S. J. Russell, “Image segmentation in video sequences:a probabilistic approach,” Proc. Uncertainty Artif. Intell. Conf., pp. 175–181, 1997.
[67] C. Stauffer and E. L. Grimson, “Learning patterns of activity using real-time
tracking,” IEEE Trans. Pattern Anal. Mach. Intell, vol. 22, pp. 747–757, 2000.
[68] A. Mittal and N. Paragios, “Motion-based background subtraction using adaptive kernel density estimation,” Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 302–309, 2004.
[69] K. P. Karmann and A. Brandt, Moving Object Recognition Using an Adaptive
Background Memory, V. Cappellini, Ed. Elsevier, 1990.
[70] P. Jodoin, M. Mignotte, and J. Konrad, “Statistical background subtraction
using spatial cues,” IEEE Circuits Syst. Video Technol., vol. 17, pp. 1758–1763,
2007.
[71] R. Singh, “Advanced correlation filters for multi-class synthetic aperture radar
detection and classification,” Master’s thesis, Carnegie Mellon University, 2002.
[72] M. Alkanhal and B. V. K. V. Kumar, “Polynomial distance classifier correlation
filter for pattern recognition,” Appl. Opt., vol. 42, pp. 4688–4708, 2003.
[73] B. V. K. V. Kumar, D. W. Carlson, and A. Mahalanobis, “Optimal trade-off
synthetic discriminant function filters for arbitrary devices,” Optics Letters,
vol. 19, pp. 1556–1558, 1994.
[74] B. V. K. V. Kumar, “Minimum-variance synthetic discriminant functions,” J.
Opt. Soc. Am. A, vol. 3, pp. 1579–1584, 1986.
[75] T. M. Cover and J. A. Thomas, Elements of Information Theory. John Wiley
& Sons Inc., 2006.
[76] W. Wenzel and K. Hamacher, “A stochastic tunneling approach for global minimization,” Phys. Rev. Lett., vol. 82, pp. 3003–3007, 1999.
[77] S. Osher, L. Rudin, and E. Fatemi, “Nonlinear total variation based noise removal algorithms,” Physica D, vol. 60, pp. 259–268, 1992.
[78] S. S. Chen, D. L. Donoho, and M. A. Saunders, “Atomic decomposition by
basis pursuit,” SIAM Review, vol. 43, pp. 129–159, 2001.
[79] I. Daubechies, Ten Lectures on Wavelets. SIAM, 1992.
[80] L. Malagón-Borja and O. Fuentes, “An object detection system using image
reconstruction with pca,” Computer and Robot Vision, Canadian Conference,
pp. 2–8, 2005.
[81] M. Turk and A. Pentland, “Face recognition using eigenfaces,” IEEE Conf. on
Computer Vision and Pattern Recognition, pp. 586–591, 1991.
[82] A. Pentland, B. Moghaddam, and T. Starner, “View-based and modular
eigenspaces for face recognition,” MIT, Tech. Rep., 1994.
[83] H. Moon and P. Phillips, “Computational and performance aspects of pca-based
face recognition algorithms,” Perception, vol. 10, pp. 303–321, 2001.
[84] A. Hillery and R. Chin, “Restoration of images with nonstationary mean and autocorrelation,” in ICASSP-88, vol. 2, Apr 1988, pp. 1008–1011.
[85] R. N. Strickland, “Transforming images into block stationary behavior,” Appl.
Opt., vol. 22, no. 10, pp. 1462–1473, 1983.