COMPRESSIVE IMAGING FOR DIFFERENCE IMAGE FORMATION AND WIDE-FIELD-OF-VIEW TARGET TRACKING

by Shikhar

Copyright © Shikhar 2010

A Dissertation Submitted to the Faculty of the DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING In Partial Fulfillment of the Requirements For the Degree of DOCTOR OF PHILOSOPHY In the Graduate College, THE UNIVERSITY OF ARIZONA, 2010

THE UNIVERSITY OF ARIZONA GRADUATE COLLEGE

As members of the Dissertation Committee, we certify that we have read the dissertation prepared by Shikhar entitled "Compressive Imaging for Difference Image Formation and Wide-Field-of-View Target Tracking" and recommend that it be accepted as fulfilling the dissertation requirement for the Degree of Doctor of Philosophy.

Dr. Nathan A. Goodman, Date: September 07, 2010
Dr. Michael Gehm, Date: September 07, 2010
Dr. Bane Vasić, Date: September 07, 2010

Final approval and acceptance of this dissertation is contingent upon the candidate's submission of the final copies of the dissertation to the Graduate College.

I hereby certify that I have read this dissertation prepared under my direction and recommend that it be accepted as fulfilling the dissertation requirement.

Dissertation Director: Dr. Nathan A. Goodman, Date: September 07, 2010

STATEMENT BY AUTHOR

This dissertation has been submitted in partial fulfillment of requirements for an advanced degree at the University of Arizona and is deposited in the University Library to be made available to borrowers under rules of the Library. Brief quotations from this dissertation are allowable without special permission, provided that accurate acknowledgment of source is made. Requests for permission for extended quotation from or reproduction of this manuscript in whole or in part may be granted by the copyright holder.

SIGNED: Shikhar

ACKNOWLEDGEMENTS

I am immensely grateful to my advisor, Dr. Nathan Goodman, for being an exceptional mentor.
His guidance, patience and support have greatly helped me in my research work. Discussions and talks with him on varied research topics have taught me how to approach new research problems. For that, I am beholden to him. My warm thanks to Hyoung-soo Kim, Junhyeong Bae and Ric Romero, with whom I shared our lab over the course of my PhD. The lab environment they created was a joy to work in. We shared many experiences, and also a great many laughs over the idiosyncrasies of the research world. It has been my pleasure to be their friend and colleague. A very special thanks to Dr. Mark Neifeld for his helpful insights and suggestions regarding my research work. They greatly helped me in achieving my research objectives. It was a tremendous learning experience to work with him, and I am thankful to him for his guidance and help. Sincere thanks also to Dr. Michael Gehm and Dr. Bane Vasić for taking time away from their busy schedules to be on my committee. They both have always been available whenever I needed their help in research or other matters. I am very grateful. Finally, I would like to thank my friends and family, especially my Dad and my sister, for their love, encouragement and support. This dissertation would not have been possible without them.

DEDICATION

... to Papa and Sansu ...

TABLE OF CONTENTS

LIST OF FIGURES
ABSTRACT
CHAPTER 1  INTRODUCTION
  1.1 Compressive Imaging
  1.2 Compressed Sensing
  1.3 Our Contribution
    1.3.1 Optical Multiplexing For Superposition Space Tracking
    1.3.2 Feature-Specific Difference Imaging
  1.4 Dissertation Outline
CHAPTER 2  OPTICAL ARCHITECTURES
  2.1 Optical Multiplexing
  2.2 Compressive Feature Specific Imagers
CHAPTER 3  SUPERPOSITION SPACE TRACKING
  3.1 Sub-FOV Superposition and Encoding
    3.1.1 2-D Spatial, Rotational and Magnification Encodings
  3.2 Decoding: Proof of Concept
    3.2.1 Correlation-based Tracking
    3.2.2 Decoding Procedure for a Single Target
    3.2.3 Decoding Procedure for Multiple Targets
    3.2.4 Decoding via Missing Ghosts
CHAPTER 4  SUPERPOSITION SPACE TRACKING RESULTS
  4.1 Simulation
  4.2 Average Decoding Time and Area Coverage Efficiency (η)
  4.3 Probability of Decoding Error
    4.3.1 Experimental Results
CHAPTER 5  FEATURE-SPECIFIC DIFFERENCE IMAGING
  5.1 Linear Reconstruction
    5.1.1 Data Model
    5.1.2 Indirect Image Reconstruction
    5.1.3 Direct Difference Image Estimation
    5.1.4 DDIE: Noise Absent
    5.1.5 DDIE: Noise Present
    5.1.6 Sensing Matrices
    5.1.7 Multi-step DDIE and LFGDIE
  5.2 Non-linear Reconstruction
CHAPTER 6  FEATURE-SPECIFIC DIFFERENCE IMAGING RESULTS
  6.1 ℓ2-based Difference Image Estimation
  6.2 ℓ1-based Difference Image Estimation
CHAPTER 7  CONCLUSION AND FUTURE WORK
APPENDIX A  DERIVATIONS FOR NOISE ABSENT CASE
APPENDIX B  LFGDIE AND DDIE ESTIMATION OPERATORS
APPENDIX C  WATERFILLING SOLUTION
REFERENCES

LIST OF FIGURES

1.1 Conventional and FS imagers
1.2 Optical multiplexing
1.3 Difference image
2.1 Beamsplitter and mirror based optical system
2.2 Binary combiner
2.3 Sequential feature specific imager
2.4 Parallel feature specific imager
2.5 Photon sharing feature specific imager
3.1 Object, superposition and hypothesis spaces
3.2 Sub-FOV overlap and separation distance
3.3 Encoding schemes
3.4 Ambiguous target decoding scenarios
3.5 Error in target decoding
3.6 Decoding via missing ghosts
4.1 Examples of valid target patterns
4.2 Decoding 4 targets (continued over three parts)
4.3 Plot of decoding time as a function of the area coverage efficiency η
4.4 Plot of decoding error vs. area coverage efficiency η
4.5 Experimental data performance
5.1 Direct and indirect reconstruction
5.2 Stochastic Tunneling
5.3 Multi-step direct difference image estimation
6.1 ℓ2-based estimated difference image
6.2 RMSE vs. SNR plots for 8 × 8 block size
6.3 RMSE vs. M for SNR = 20 dB for 8 × 8 block size
6.4 Single step vs. multi-step DDIE
6.5 RMSE vs. SNR plots for LFGDIE method
6.6 RMSE vs. SNR performance plots for block size 16 × 16
6.7 Optimal sensing matrix performance
6.8 ℓ1-based estimated difference image
6.9 RMSE vs. SNR curves for 8 × 8 blocks, for the TV method
6.10 ℓ1-based WDPC sensing matrix RMSE vs. SNR curves
6.11 RMSE vs. M per block, for 8 × 8 blocks, for the TV method
6.12 TV method performance for different sensing matrices

ABSTRACT

Use of imaging systems for performing various situational awareness tasks in military and commercial settings has a long history. There is increasing recognition, however, that a much better job can be done by developing non-traditional optical systems that exploit the task-specific system aspects within the imager itself.
In some cases, a direct consequence of this approach can be real-time data compression along with increased measurement fidelity of the task-specific features. In others, compression can potentially allow us to perform high-level tasks, such as direct tracking, using the compressed measurements without reconstructing the scene of interest. In this dissertation we present novel advancements in feature-specific (FS) imagers for large field-of-view surveillance, and in the estimation of temporal object-scene changes, utilizing the compressive imaging paradigm. We develop these two ideas in parallel. In the first case we show an FS imager that optically multiplexes multiple, encoded sub-fields of view onto a common focal plane. Sub-field encoding enables target tracking by creating a unique connection between target characteristics in superposition space and the target's true position in real space. This is accomplished without reconstructing a conventional image of the large field of view. System performance is evaluated in terms of two criteria: average decoding time and probability of decoding error. We study these performance criteria as a function of resolution in the encoding scheme and signal-to-noise ratio. We also include simulation and experimental results demonstrating our novel tracking method. In the second case we present an FS imager for estimating temporal changes in the object scene over time by quantifying these changes through a sequence of difference images. The difference images are estimated from compressive measurements of the scene. Our goals are twofold. First, we design the optimal sensing matrix for taking compressive measurements. In scenarios where such sensing matrices are not tractable, we consider plausible candidate sensing matrices that either use the available a priori information or are non-adaptive. Second, we develop closed-form and iterative techniques for estimating the difference images.
We present results to show the efficacy of these techniques and discuss the advantages of each.

CHAPTER 1

INTRODUCTION

The history of optics is long and eventful, going back to the Assyrians in seventh-century B.C. Mesopotamia [1]. In the intervening centuries the development of optics went through two major advancements. The first was the development of man-made optical elements, such as lenses, to enhance human vision. The second was the recording of image formation on photochemical film. Both these advancements have had immense impact on various disciplines, such as medicine (mammography), astronomy (the telescope) and photography, to name a few, and on modern life in general. We are now in the era of the third major advancement: the replacement of photochemical films with opto-electronic sensors, and the ubiquitous use of digital signal processing [2]. Until recently, the imaging and the signal processing components of an optical system were treated as separate. The goal of the imaging components was to image the object scene of interest. Towards this end, imaging components were optimized using optical methods to improve resolution, reduce aberrations and, in general, obtain a good-quality image of the object scene. Signal and image processing methods, along with higher-level computer vision techniques, were then used to remove the effects of sensors and residual optical aberrations to produce an improved image, and to extract relevant image features to accomplish the analysis or situational awareness task at hand. To be sure, this is still a major area of research. However, Hounsfield's development of the X-ray tomographic scanner in 1972, and the development of nuclear magnetic resonance spectroscopic techniques in the latter half of the twentieth century, which eventually led to the invention of magnetic resonance imaging (MRI) during the 1970s, have influenced a growing trend of optical design informed by digital signal processing ideas.
More and more, the imaging and signal processing components of the system are being jointly designed to exploit the extra degrees of freedom for better performance [2], [3]. This is the realm of computational imaging. The key to computational imaging is the discrete representation of data, which is inherent in the use of opto-electronic sensors. This discrete analytical approach allows us to think of computational imaging as mapping discretized object-space data to some measurement space. Depending on the nature of the mapping we get three broad areas in computational imaging [2]:

1. Isomorphic or one-to-one mapping, resulting in the measurement space being an image of the object space.

2. Dimension-preserving mapping, such as the Fourier transform, where the dimensions of the space in which the measurements are embedded and the dimensions of the native object space are the same.

3. Discrete mapping, where the object data is linearly or non-linearly projected onto some measurement space, with no assumptions about the underlying embedded space.

It is in the context of this last category that we define compressive imaging.

1.1 Compressive Imaging

Compressive imaging involves linearly mapping a higher-dimensional object space to a lower-dimensional measurement space, leading to real-time compression of data. (We make the notion of compression precise in Section 2.2.) Furthermore, the mapping is based on some task-specific metric and designed to improve measurement fidelity and system performance. This approach can be contrasted with the traditional approach of conventional imaging, where the goal is to obtain a high-fidelity picture and then extract relevant information as a post-processing step. Figure 1.1 illustrates the contrast between the two approaches. Figure 1.1(a) shows the conventional imaging paradigm, where a traditional optical system images the object scene of interest onto a film or an optical sensor. (In keeping with the discrete representation we assume it to be a photodetector.) The image is then processed for the task at hand. For example, in face recognition, independent features (computed using the independent component analysis technique) can be extracted from a set of training images acquired with a traditional camera. The point of note here is that imaging and computational image processing are implemented as two distinct processes cascaded together. Considerable work has been done to separately improve the performance of each of these two stages. This, however, has now begun to change.

The idea depicted in Fig. 1.1(b) is to replace the conventional imager with a feature-specific (FS) imager, which, instead of imaging the object scene, measures features of the object scene specific to the particular task at hand. Taking the face recognition example, the FS imager directly measures the independent facial features. By controlling the number of features measured as a function of the performance of the face recognition task, it turns out that we need to measure fewer features than the native dimensionality of the object scene, resulting in real-time compression of data [4]. This directly results in reduced sensor costs. The second, and far more surprising, benefit is improved measurement fidelity (in the detector-noise-limited sense), and consequently improved face recognition performance, due to the incidence of the same number of object-scene photons as in the conventional scenario on a smaller set of photodetectors.

The history of FS imager-based compressive imaging is quite recent, with its beginnings lying in this past decade. The work by Neifeld and Shankar [5] was the first formalization of this idea. The work itself was motivated by earlier work in computational imaging related to hardware-software co-design [3], the development of information metrics [6], [7], [8], [9], and non-traditional [10] and novel imaging systems [11], [12], [13].

[Figure 1.1: Conventional and FS imagers. (a) Conventional imager. (b) FS imager.]

Compressive imaging has since been used for designing a face recognition optical system [4], and for FS imagers using spatially structured light. FS imaging with structured illumination can give a 40% improvement in measurement fidelity, while at the same time achieving compression by a factor of 400 [14]. A number of compressive imaging architectures have also been proposed, with their advantages and disadvantages studied in detail [15]. The key to this historical development, as the reader may have deduced, has been the utilization of knowledge-enhanced measurements, that is, incorporating a priori knowledge to design compressive FS imaging systems. For example, if there is a need for data compression, then the measured features can be wavelet features. On the other hand, if data encoding/decoding is required, then Hadamard features can be used.

A closely related approach to FS compressive imaging, which has gained wide usage, emphasizes making non-adaptive compressive scene measurements. This approach is referred to as compressed sensing (CS). The thrust of CS is slightly different from that of FS compressive imaging. Here the focus is on designing a universal imager that does not utilize any a priori information about the scene, except for requiring that the scene be sparse in some representation basis. One of its major successes has been the development of a single-pixel camera [16]. CS ideas have also been used in various other fields such as statistics [17], coding theory [18] and computer science [19]. In the context of compressive imaging, however, CS has mainly focused on image reconstruction. FS compressive imaging, on the other hand, as will be shown in this work, can be directly employed to perform situational awareness tasks without reconstructing the scene.
Second, though CS performs better for a task when no a priori information is available, its performance is sub-optimal (as will be shown in this work) when such information is available, which is the case in many practical applications. CS, however, has an elegant mathematical rigour to it, drawing its ideas from mathematical analysis and algebra. For this reason, and because of its closeness to FS compressive imaging, we give a brief overview of CS in the following section. The work presented in this manuscript, however, will focus on our novel contributions to FS imager-based compressive imaging.

1.2 Compressed Sensing

CS was introduced through a series of papers by Candès, Romberg and Tao [20], [21], [18], [22], [23] and Donoho [24], with many contributions from research in the field of sparse approximations [25], [26], [27], [28], [29], [30]. The main thesis of CS is that conventional Nyquist sampling theory can be improved upon. Nyquist sampling requires the number of Fourier samples to be exactly the same as the number of pixels in an image for exact reconstruction. CS, however, says that under certain conditions involving sparsity and incoherence, the number of samples required to reconstruct the image exactly with high probability is much smaller than the image resolution. In a general setting, CS can be defined as a new data acquisition paradigm which says that if the scene to be imaged is sparse in a certain representation basis, then taking compressive measurements of this scene using a measurement basis that is incoherent with the representation basis allows us to reconstruct the scene to an arbitrary degree of accuracy.

To make the CS idea concrete, let us consider a signal x ∈ R^N, which we sense (or measure) through a set of vectors φ_m ∈ R^N to get measurements

    y_m = ⟨φ_m, x⟩,  m = 1, . . . , M.    (1.1)

For a conventional imager, the φ_m comprise the standard Euclidean basis with M = N, yielding a traditional image of the object scene.
On the other hand, if the φ_m are, say, the discrete Fourier basis, then we take frequency measurements of the object, as in MRI. Using matrix notation we can write (1.1) as

    y = Φx,    (1.2)

where the φ_m comprise the rows of the sensing matrix Φ. Given the set of measurements y, the goal is to reconstruct the signal x. CS is interested in solving this problem when the system is under-determined (M ≪ N), i.e., the number of measurements is much less than the native dimensionality of the signal. Using data acquisition terminology, we take under-sampled measurements of the signal. In general, we know from linear algebra that this problem is ill-posed and has infinitely many solutions. However, if we know that the signal x is sparse in some basis, then Candès, Romberg and Tao [20] showed that it is almost always possible to exactly reconstruct x by solving the following convex optimization problem:

    minimize ||x||_ℓ1  subject to  y = Φx.    (1.3)

Sparsity: A vector z is said to be K-sparse if the number of its non-zero entries is exactly K. Sparsity has been exploited in various fields for signal estimation and data compression. For example, the shrinkage algorithm by Donoho [31] for signal estimation exploits the idea of sparsity, and JPEG2000-based transform coding uses the sparsity of natural images to achieve scalable compression [32], [33]. Most of these uses, however, are in the post-processing stage for better signal estimation and data modeling. Compressed sensing goes a step further: if it is known that the signal is sparse in some basis, then this has a direct impact on data acquisition itself. Let us denote by Ψ the matrix of the representation basis in which the signal x has a sparse representation.
Then we can express the analysis and synthesis equations respectively as

    α = Ψ†x    (1.4)
    x = Ψα.    (1.5)

For ease of exposition we have assumed Ψ to be an orthonormal basis; consequently, the inverse of Ψ is its adjoint. (The extension to an overcomplete basis can be achieved by generalizing from an orthonormal basis of the vector space to a frame of the vector space.) We can now re-write (1.2) as

    y = ΦΨα    (1.6)
      = Φ′α,    (1.7)

where Φ′ = ΦΨ. The convex program can therefore be expressed as

    minimize ||α||_ℓ1  subject to  y = Φ′α.    (1.8)

Assuming α is K-sparse, Candès and Tao showed that a matrix Φ′ from which α can be accurately estimated should satisfy the restricted isometry property (RIP) [21]. The RIP says that if we choose any K or fewer columns from Φ′, then these columns are approximately orthonormal to each other. This allows us to define the K-restricted isometry constant δ_K. Using this constant we can then state the condition under which we can exactly recover x. Specifically, let T ⊂ {1, 2, . . . , N}, and let |T|, the cardinality of T, be less than or equal to K. Let us also denote by Φ′_T the matrix Φ′ with all columns not indexed in T replaced by zero column vectors. Then the K-restricted isometry constant δ_K is the smallest constant such that

    (1 − δ_K)||a||²_ℓ2 ≤ ||Φ′_T a||²_ℓ2 ≤ (1 + δ_K)||a||²_ℓ2.    (1.9)

Using this property, Candès and Tao proved [18] the following theorem.

Theorem 1.2.1. Assume that α is K-sparse and suppose that δ_2K + θ_K,2K < 1. Then the solution α* to (1.8) is exact, i.e., α* = α.

Here θ_K,2K is called the K,2K-restricted orthogonality constant. If we define two disjoint index sets T ⊂ {1, 2, . . . , N} and U ⊂ {1, 2, . . . , N} such that |T| ≤ K and |U| ≤ 2K, with K + 2K ≤ N, then θ_K,2K is the cosine of the smallest angle between the two subspaces spanned by the columns of Φ′_T and Φ′_U. Theorem 1.2.1 states that it is possible to recover sufficiently sparse signals from highly undersampled measurements.
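As a concrete illustration of recovery from undersampled measurements, the equality-constrained ℓ1 program (1.8) can be recast as a linear program over the positive and negative parts of the unknown vector. The following is a minimal sketch, not the dissertation's code; the dimensions, sparsity level, and Gaussian sensing matrix are illustrative choices:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
N, M, K = 64, 32, 4                      # signal length, measurements, sparsity

# K-sparse test signal
x = np.zeros(N)
x[rng.choice(N, size=K, replace=False)] = rng.standard_normal(K)

# random Gaussian sensing matrix (rows play the role of the phi_m's)
Phi = rng.standard_normal((M, N)) / np.sqrt(M)
y = Phi @ x

# min ||x||_1 s.t. y = Phi x, written as an LP over x = u - v with u, v >= 0
c = np.ones(2 * N)
A_eq = np.hstack([Phi, -Phi])
res = linprog(c, A_eq=A_eq, b_eq=y, bounds=[(0, None)] * (2 * N))
x_hat = res.x[:N] - res.x[N:]

print(np.linalg.norm(x_hat - x))         # tiny: recovery from M = N/2 samples
```

The split x = u − v makes the ℓ1 objective linear (Σu + Σv), which is the standard basis-pursuit reformulation; with M = 8K Gaussian measurements the recovery here is essentially exact.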
It is important to note, however, that exact K-sparsity is a strict condition rarely satisfied by real-world signals of interest. A more realistic scenario is to consider compressible signals. Compressible signals can be thought of as signals which are not strictly supported on a sparse set, but are instead concentrated on a sparse set, with most of the signal content within or near this set. Therefore, we can think of compressible signals as nearly sparse. Furthermore, the measurements are imperfect due to the presence of noise. This is accounted for by re-writing the convex program as

    minimize ||α||_ℓ1  subject to  ||y − Φ′α||_ℓ2 ≤ ε,    (1.10)

where ε is the noise deviation. For this realistic scenario, Candès, Romberg and Tao proved that it is possible to stably recover the original signal with error (in the ℓ2-norm) no worse than if we knew the largest K terms of the signal corrupted by noise [22].

Theorem 1.2.2. Suppose that α is an arbitrary vector in R^N and let α_K be the truncated vector corresponding to the K largest values of α (in absolute value). Under the hypothesis δ_3K + 3δ_4K < 2, the solution α* to (1.10) obeys

    ||α* − α||_ℓ2 ≤ C_1,K · ε + C_2,K · ||α − α_K||_ℓ1 / √K.    (1.11)

For reasonable values of δ_4K the constants in (1.11) are well behaved; e.g., C_1,K ≈ 12.04 and C_2,K ≈ 8.77 for δ_4K = 1/5.

It was further shown that the largest number of entries K that can be recovered is related to the number of required measurements through the mutual coherence between Φ and Ψ. Specifically, the number of measurements M must be on the order of C · µ² · K · (log N)⁴, where µ, the mutual coherence, is defined as √N max_i,j |⟨φ_i, ψ_j⟩|. Hence, the larger the incoherence between Φ and Ψ (i.e., the smaller µ is), the larger K can be for a given number of measurements. Seen another way, for a fixed K, the smaller µ is, the smaller the number of measurements required. The goal therefore is to design a sensing matrix that satisfies the restricted isometry property for signals with large K.
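The mutual coherence above is easy to compute for a pair of orthonormal bases. A small sketch (the basis pair is an illustrative choice, not tied to any imager in this work): the spike (identity) basis paired with the discrete Fourier basis achieves the minimum µ = 1, while a basis paired with itself achieves the maximum µ = √N:

```python
import numpy as np

def mutual_coherence(Phi, Psi):
    """mu(Phi, Psi) = sqrt(N) * max_{i,j} |<phi_i, psi_j>| for two
    orthonormal bases whose elements are the columns of Phi and Psi."""
    N = Phi.shape[0]
    return np.sqrt(N) * np.max(np.abs(Phi.conj().T @ Psi))

N = 16
spikes = np.eye(N)                                     # canonical (spike) basis
n = np.arange(N)
dft = np.exp(-2j * np.pi * np.outer(n, n) / N) / np.sqrt(N)  # unitary DFT basis

print(round(mutual_coherence(spikes, dft), 6))     # 1.0 (maximal incoherence)
print(round(mutual_coherence(spikes, spikes), 6))  # 4.0 (= sqrt(16), maximal coherence)
```

Every DFT column has entries of magnitude 1/√N against the spikes, which is why this pair sits at the favorable extreme of the coherence range [1, √N].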
Although research efforts are ongoing, there is still no standardized methodology for designing such matrices. It has, however, been proven that random measurements (Gaussian and binary) satisfy the RIP with overwhelming probability. The use of random measurements has allowed CS to develop non-adaptive signal acquisition protocols for applications in areas such as optical sensing [16], MRI [34] and high-resolution radar imaging [35]. The CS approach to compressive imaging overlaps with the FS approach in that both undersample the signal of interest, leading to real-time data compression. However, as mentioned earlier, unlike CS, which employs non-adaptive random measurements, FS imager-based compressive imaging focuses on making knowledge-enhanced measurements, where a priori information about the signal of interest is used to design the sensing matrix. There are ongoing research efforts to incorporate prior knowledge into the CS paradigm [36], [37].

1.3 Our Contribution

In this dissertation we introduce novel advancements in large field-of-view surveillance, and in the estimation of temporal object-scene changes, utilizing the compressive imaging paradigm. Both these ideas fall under the FS imager-based compressive imaging paradigm. As we show, in the former case the FS imager is an optical multiplexer that performs real-time object-scene encoding. In the latter case we study various FS imagers and discuss their reconstruction-based performance. In both cases, however, we achieve real-time compression. We will develop these two ideas in parallel. For the sake of providing a simple categorization, we associate the former work with an "analysis" task, while the latter is a "reconstruction" task. By analysis we mean extracting relevant information, in this case surveillance information, from the non-traditional compressive object-scene encodings without reconstructing the scene.
Reconstruction, on the other hand, implies faithfully estimating the temporal changes in the scene from the non-traditional compressive object-scene measurements. The conceptual thread that unites these two approaches is that by employing task-based compressive measurements we can achieve the desired goal, and in fact better it, with reduced overhead in terms of sensor and optical costs. We now introduce these two cases.

1.3.1 Optical Multiplexing For Superposition Space Tracking

There are numerous applications that require visible and/or infrared surveillance over a large field of view (FOV). The requirement for a large FOV frequently arises in the context of security and/or situational awareness applications in both military and commercial areas. A common challenge associated with a large FOV concerns the high cost of the required imagers. Imager cost can be classified into two categories: sensor costs and lens/optics costs. Sensor costs such as focal plane array (FPA) yield (i.e., related to pixel count), electrical power dissipation, transmission bandwidth requirements (e.g., for remote wireless applications), etc. all increase with FOV. Some of these scalings are quite severe, with the cost related to process yield, for example, increasing exponentially with FOV. Optics costs such as size (i.e., total track), complexity (e.g., number of elements and/or aspheres), and mass also increase nonlinearly with FOV, although these costs are somewhat more difficult to quantify without undertaking detailed optical designs. To illustrate the point, consider two commercial lenses from Canon: the Canon EF 14mm f/2.8L II USM and the Canon EF 50mm f/1.8 II. The former is a wide-angle lens (FOV = 114 degrees) while the latter is a standard-angle lens. The wide-angle lens requires more optical elements and a sophisticated design to maintain resolution over the field of view.
As a result, the wide-angle lens uses 14 optical elements and weighs 645 grams, while the standard-angle lens uses 6 optical elements and weighs 130 grams.

We note that the various costs associated with a conventional approach to wide-FOV imaging often prohibit deployment of such imagers on platforms of interest. For example, the high mass together with the electrical power and bandwidth costs of conventional wide-field imagers restrict their application on many UAV (unmanned aerial vehicle) platforms. Therefore, the motivation of this work is to reduce the various costs of wide-FOV surveillance and thus make it feasible for more widespread deployment. One typical solution to this problem involves the use of narrow-FOV pan/tilt cameras, whose mechanical components often come with the associated costs of increased size, weight, and power consumption. The pan/tilt solution also sacrifices continuous coverage in exchange for reduced optical complexity. In most traditional approaches the goal in such problems is to reconstruct the scene of interest. Contrary to traditional approaches, we propose a class of task-specific multiplexed imagers to collect encoded data in a lower-dimensional measurement space we call superposition space, and we develop a decoding algorithm that tracks targets directly in this superposition space. We discuss the multiplexed imagers in the next chapter. For now, we assume that we have this ability and briefly explain the basic idea behind superposition space tracking.

We begin by treating the large FOV as a set of small sub-FOVs (disjoint subregions of the large FOV). All sub-FOVs are simultaneously imaged onto a common focal plane array (FPA) using a multiplexed imager. The multiple-lens system of Fig. 1.2 is a schematic depiction of this operation. Lens shutters can also be used to control whether individual sub-FOVs are turned on, though for clarity the shutters are not drawn in Fig. 1.2.
The measurement therefore is a superposition of certain sub-FOVs. A key feature of our work is applying a unique encoding to each sub-FOV, which facilitates target tracking directly on the measured superimposed data. Potential encoding schemes include (a) spatial shift, (b) rotation, (c) magnification, and/or combinations of these. The superposition of the sub-FOVs leads to real-time data compression, while the encoding strategy ensures that the target information can be decoded at the post-processing stage.

Figure 1.2: Multiple-lens camera capable of performing both pan/tilt and multiplexed operations.

In spatial-shift encoding, each sub-FOV is assigned a shift such that it overlaps adjacent sub-FOVs by a specified, unique amount. These spatial shifts can be one-dimensional (1-D) or two-dimensional (2-D). In the 1-D case, the large FOV is sub-divided along one dimension into smaller sub-FOVs. In the 2-D case, as illustrated in Fig. 3.3(a), the full FOV is sub-divided along two orthogonal directions. Therefore, the 2-D case can be thought of as two separable 1-D cases with decoding information shared between the two. Rotational encoding (Fig. 3.3(b)) assigns different rotations to each sub-FOV such that a target undergoes an angular shift in the superimposed image when it crosses between two sub-FOVs. The rotational difference between any two adjacent sub-FOVs must be unique. In a similar manner, as shown in Fig. 3.3(c), magnification encoding assigns a unique magnification to each sub-FOV such that changes in the target's apparent size can be used to properly determine its location. These encoding methods are discussed in Chapter 3. The detailed emphasis is, however, on the 1-D spatial-shift case due to its relative ease of implementation, easier visualization and proof-of-concept explanation, and its straightforward extension to the 2-D case. Without loss of generality we will consider 1-D horizontal spatial-shift encoding.
(Note that 1-D vertical spatial-shift encoding is an equivalent case.) The decoding process refers to the algorithm applied to determine a target's true location in object space (the large FOV). The decoding process is also discussed in Chapter 3. It is noteworthy that the encoding of the sub-FOVs results in a compressed image which is not a traditional picture at all, but which has embedded information that is unraveled by the decoding algorithm without reconstructing the large-FOV object scene. This deliberate distortion of FPA data to enhance post-processing performance is a novel attribute of FS compressive imaging, and of computational imaging in general [2].

1.3.2 Feature-Specific Difference Imaging

Subtracting successive image frames to accomplish a given task has many applications, ranging from watermarking [38], [39] and material inspection [40] to video compression [41], background subtraction [42], bio-medicine [43], astronomy [44] and change detection in remote sensing [45]. In all these applications the idea of differencing is applied in two different ways. In the first, and most common, class of methods the difference image has a temporal connotation: the difference is taken between the object scene at two time instants to compute interframe change that can then be qualitatively or quantitatively analyzed depending on the task at hand. For example, interframe difference images are quantitatively classified for video compression [41], while in remote sensing multi-spectral images of the same geographical scene are acquired at two different dates and then subtracted to qualitatively and quantitatively assess changes in land coverage [45]. In the second class of problems, difference images are not used to capture change over time, but instead are used to identify objects or phenomena in an imaged scene.
In some applications in astronomy, for example, difference images are computed by subtracting a reference image from test images, and are then used to detect galactic microlensing [44]. Here the idea is to detect a faint phenomenon instead of analyzing or exploiting temporal changes in the scene. In the above-mentioned applications the scenes are usually imaged using conventional optics. In cases involving storage and transmission of the collected raw data, compression is generally performed as a post-processing step. As a case in point, many commercial and professional grade video cameras use some form of built-in transform coding to compress the image to a manageable size. In fact, as mentioned above, conventionally acquired images are used for performing video compression by exploiting interframe differences between them. In this second part of the dissertation we introduce different ways to estimate relative temporal changes in the object scene (a sequence of interframe difference images) from compressive measurements informed by a priori scene information. This work belongs to the first class of methods, where we estimate the interframe difference image between the object scenes at two consecutive time instants. (Figure 1.3 gives an example of a difference image.) The novelty lies in the proposed ℓ2 - and ℓ1 -based estimation techniques and in the feature design. Usually, compressive measurements are associated with nonlinear ℓ1 - or ℓ0 -based reconstruction techniques, and have been successfully used as such in the imaging context. Why then consider linear ℓ2 -based estimation? Because it provides an efficient closed-form solution for the difference image estimation operator. In the rare case that the Bayesian general linear model assumption [46] holds, the linear estimation operator is also optimal in the ℓ2 sense.
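To make the closed-form ℓ2 idea concrete, here is a minimal numerical sketch, not the dissertation's actual operator or parameters: it assumes a hypothetical exponentially correlated scene prior and a first-order autoregressive temporal model, and forms the linear operator W that estimates the difference image directly from the stacked compressive measurements of two frames.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, rho, sigma = 64, 16, 0.9, 0.01   # pixels, measurements, temporal corr., noise std

# Hypothetical spatial prior: exponentially decaying pixel correlation.
idx = np.arange(N)
R = 0.95 ** np.abs(idx[:, None] - idx[None, :])
Phi = rng.standard_normal((M, N)) / np.sqrt(N)      # illustrative sensing matrix

# Joint covariance of [x_k; x_{k+1}] under the AR(1) temporal model.
C = np.block([[R, rho * R], [rho * R, R]])
Phi_s = np.block([[Phi, np.zeros((M, N))], [np.zeros((M, N)), Phi]])

# LMMSE operator estimating dx = x_{k+1} - x_k directly from y = [y_k; y_{k+1}],
# exploiting the spatio-temporal cross-correlation; no scene reconstruction.
R_dxy = np.hstack([(rho - 1) * R @ Phi.T, (1 - rho) * R @ Phi.T])
R_y = Phi_s @ C @ Phi_s.T + sigma**2 * np.eye(2 * M)
W = R_dxy @ np.linalg.inv(R_y)

# Monte-Carlo check: direct estimation beats the trivial zero estimate.
L = np.linalg.cholesky(R)
err = prior = 0.0
for _ in range(200):
    xk = L @ rng.standard_normal(N)
    xk1 = rho * xk + np.sqrt(1 - rho**2) * (L @ rng.standard_normal(N))
    dx = xk1 - xk
    y = Phi_s @ np.concatenate([xk, xk1]) + sigma * rng.standard_normal(2 * M)
    err += np.sum((W @ y - dx) ** 2)
    prior += np.sum(dx ** 2)
print(err < prior)
```

The closed-form W is computed once up front; each new estimate is a single matrix-vector product, which is what makes the ℓ2 approach attractive in practice.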
We also show that ℓ2 -based estimation allows us to directly estimate the difference image without first reconstructing the object scene at the consecutive time instants. We show that an immediate consequence of direct estimation is the ability to exploit the spatio-temporal cross-correlation between the object scene at the consecutive time instants. We further generalize the definition of the difference image to include the difference between the object scene at nonadjacent time instants, and show how successive compressive measurements over the corresponding time interval can be used to directly estimate this generalized difference image. Compressive measurements are taken by optically projecting the object scene onto some measurement basis. For the noiseless case we find the optimal measurement basis, but in the presence of noise an optimal measurement basis is not mathematically tractable. Therefore, we evaluate the performance of several candidate measurement bases, including a numerically optimized solution obtained through a search technique. Although ℓ2 -based difference image estimation offers many advantages for practical use, and as we will see gives good results, it makes assumptions about object scene stationarity that are not strictly true. We therefore also consider ℓ1 -based estimation, which does not make any such assumptions and consequently has superior performance. It is important to note, however, that in this work we demonstrate that the approximate ℓ2 -based estimation methods do in fact perform well enough for many practical scenarios.

Figure 1.3: (a) Object scene xk at time instant tk , (b) object scene xk+1 at time instant tk+1 , and (c) the resulting truth difference image ∆x = xk+1 − xk .

The ℓ1 -based estimation method works by exploiting the sparsity of the difference image.
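As a minimal sketch of how that sparsity can be exploited, the following toy example (hypothetical sizes, sensing matrix, and regularization weight, none of which are the dissertation's) recovers a synthetic sparse difference image from random compressive measurements by iterative soft thresholding (ISTA), a standard solver for the ℓ1 -regularized inverse problem.

```python
import numpy as np

def ista(Phi, y, lam, n_iter=2000):
    """Iterative soft thresholding for min_z 0.5*||y - Phi z||^2 + lam*||z||_1."""
    step = 1.0 / np.linalg.norm(Phi, 2) ** 2                      # 1 / Lipschitz constant
    z = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        g = z - step * (Phi.T @ (Phi @ z - y))                    # gradient step on data term
        z = np.sign(g) * np.maximum(np.abs(g) - step * lam, 0.0)  # soft threshold
    return z

rng = np.random.default_rng(1)
N, M, K = 128, 48, 4                             # pixels, measurements, nonzeros in dx
dx = np.zeros(N)
support = rng.choice(N, K, replace=False)
dx[support] = rng.uniform(1.0, 2.0, K) * rng.choice([-1.0, 1.0], K)

Phi = rng.standard_normal((M, N)) / np.sqrt(M)   # random sensing matrix (CS baseline)
y = Phi @ dx + 0.001 * rng.standard_normal(M)    # noisy compressive measurements

dx_hat = ista(Phi, y, lam=0.05)
print(np.linalg.norm(dx_hat - dx) / np.linalg.norm(dx))   # small relative error
```

Because the difference image has few nonzero pixels, far fewer measurements than pixels (here M = 48 versus N = 128) suffice for an accurate estimate.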
We set up the ℓ1 -based estimation problem as a linear inverse problem with three different regularizers representing three different points of view of the estimation problem. The first regularizing condition simply imposes the sparsity constraint on the difference image. We then consider a modified form of the total variation (TV) regularizer. When we make the reasonable assumption that the image is a function of bounded variation (BV), TV is a natural measure for capturing edge discontinuities, an important feature in difference images. Lastly, we look at overcomplete representations by considering a modified form of basis pursuit denoising (BPDN). We experimentally show that by using either available a priori information or information learned from training data, we get better performance than the CS paradigm of a sparsifying dictionary incoherently coupled with a random sensing matrix (Gaussian or Bernoulli/Rademacher matrices).

1.4 Dissertation Outline

In Chapter 2 we describe the proposed candidate optical architectures for multiplexed imagers and discuss some FS imagers for difference image estimation. Chapters 3 and 4 are devoted to the tracking problem. Chapter 3 explains the encoding methodology and the decoding algorithm, while results showing the efficacy of this system design are given in Chapter 4. Chapter 5 details the linear and non-linear difference image estimation techniques along with the different measurement kernels, while in Chapter 6 we present results analyzing the efficacy of these estimation techniques. In Chapter 7 we discuss the conclusions of our work and outline some future challenges.

CHAPTER 2 OPTICAL ARCHITECTURES

In this chapter we introduce architectures for optical spatial multiplexing for target tracking. We also briefly discuss feature-specific imagers [15], which inform our work on difference image estimation.

2.1 Optical Multiplexing

Multiplexing in imaging has a long history.
For the past three decades multiplexing has been employed in high-energy (x-ray and γ-ray) astronomy for reconstructing high resolution images of the sky. The inability to focus high-energy radiation above 10 keV [47] has led to the development of spatial multiplexing in the form of coded-aperture imaging, which achieves pinhole-camera resolution without the attendant loss in photon count. The coding is done by masking some photodetectors on the FPA while allowing light to reach the others. This masking pattern on the FPA is referred to as the coded aperture or the coded mask. The coded aperture results in the formation of a non-traditional image which is then reconstructed using linear or non-linear reconstruction techniques such as simple inversion, cross-correlation [48], photon-tagging [49], Wiener filtering [50], [51] and the maximum entropy method [52], [53], among others. The key to the success of these reconstruction techniques is the design of the coded apertures. Special mention is reserved for the cyclic coded masks known as modified uniformly redundant arrays (MURA) [54], which have been recognized as the optimal masks for coded aperture imaging. Coded aperture imaging has also been used in the visible spectrum for image reconstruction. Our approach to multiplexing is significantly different from coded aperture imaging in that our goal is not image reconstruction, but target tracking. Towards this end, we combine optical design with algorithmic development to achieve tracking while altogether avoiding computationally intensive image reconstruction. Our second goal is to simultaneously reduce the system cost in terms of sensor and optics requirements while tracking targets in a large FOV. Previously our colleagues, Stenner et al., reported on the capabilities of a novel multiplexed camera that employs multiple lenses imaging onto a common focal plane [55]. A schematic depiction of this multi-aperture camera was shown in Fig. 1.2 in Chapter 1.
Note that each lens can have a dedicated shutter (not shown). In this camera each lens images a separate region (i.e., a sub-FOV) within the full FOV. By appropriate choice of shutter configurations, various modes of operation are possible. In one such mode all shutters are opened in sequence (one at a time), enabling an emulation of pan/tilt operation. Another mode of operation allows multiple shutters to be open at a time. This mode employs group testing concepts in order to measure various superpositions of sub-FOVs and invert the resulting multiplexed measurements to obtain a high-quality reconstructed image [56], [57], [58], [59], [60]. The third mode allows all the shutters to be open at the same time, resulting in superposition of all sub-FOVs onto the common FPA. We, however, want to encode the sub-FOVs before superimposing them so that relevant target information can be decoded at the post-processing stage. Towards this end we propose two optical implementations.

Figure 2.1: Optical setup capable of performing multiplexed operations with spatial-shift encoding for (a) Ns = 2 and (b) Ns = 8. (See [61].)

The first implementation is shown in Fig. 2.1. In this setup we employ beamsplitter and mirror combinations to form the superposition measurements. This configuration reduces the lens count and avoids the image-plane tilt associated with the configuration shown in Fig. 1.2. Figure 2.1(a) shows a setup for two sub-FOVs consisting of a beamsplitter and a movable mirror, which serves as a building block for a larger system. The optical field from the left sub-FOV (fov1) is reflected by the mirror followed by the beamsplitter, and is then merged with the optical field from the right sub-FOV (fov2) that has passed through the beamsplitter.
The rotation of the mirror around the z-axis results in the translation of fov1 along the x-direction in superposition space, providing a means to control the overlap between fov1 and fov2. Figure 2.1(b) shows an assembly of such building blocks, which can superimpose eight sub-FOVs. This implementation will serve as the basis for the experimental results presented later in Chapter 4. The second implementation, shown in Fig. 2.2, further refines this idea by using a binary combiner in a logarithmic sequence arrangement. If we consider the same eight sub-FOVs as in Fig. 2.1(b), then this design allows us to access all eight sub-FOVs using three stages of single-plate beamsplitter/mirror pairs, with three shutters placed on the mirrors. The shutters can be opened and closed in a binary sequence from 000 (all closed) to 111 (all open, superposition operation) to multiplex the eight sub-FOVs onto the camera. Although the effective aperture of the plate beamsplitter and mirror combination increases at each stage, there is an overall reduction in complexity in comparison to the optical implementation shown in Fig. 2.1(b). For a general scenario, when the large angular FOV is φ radians and each angular sub-FOV is β radians, the number of stages in the binary combiner is given by S = ⌈log2 (φ/β)⌉ and the front-end effective aperture Ae required to avoid vignetting is approximately given by

Ae = 2A (1 − tan(φ/4))^−1 (1 − β)^−S ,   (2.1)

where A is the aperture of the camera. To obtain this relation between the effective front-end aperture and the large angular FOV, we fix the plate beamsplitter at an angle of π/4 with respect to the vertical axis and adjust the mirror to the desired angle depending on the value of φ. If we define the angle of the mirror from the vertical to be γ, then for a given φ, γ = (π − φ)/4.
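These geometric relations are easy to check numerically. The sketch below is only a calculator for S, Ae, and γ; plugging in the eight sub-FOV example from the text (φ = 60◦, β = 7.5◦, A = 12.5 mm) recovers S = 3 and Ae ≈ 5.2 cm.

```python
import math

def combiner_geometry(phi_deg, beta_deg, A_mm):
    """Stage count, front-end aperture per Eq. (2.1), and mirror angle."""
    phi, beta = math.radians(phi_deg), math.radians(beta_deg)
    S = math.ceil(math.log2(phi / beta))             # number of combiner stages
    Ae = 2 * A_mm * (1 - math.tan(phi / 4)) ** -1 * (1 - beta) ** -S
    gamma = math.degrees((math.pi - phi) / 4)        # mirror angle from vertical
    return S, Ae, gamma

# Eight sub-FOVs of 7.5 deg each (phi = 60 deg), 12.5 mm camera aperture.
S, Ae, gamma = combiner_geometry(60.0, 7.5, 12.5)
print(S, round(Ae / 10, 2), gamma)   # 3 stages, ~5.2 cm aperture, mirror ~30 deg
```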
This system gives us the ability to employ a camera whose angular FOV is smaller than the large angular FOV by a factor of 2^S. We have discussed the optical scheme in a 1-D setting, but because the horizontal and vertical directions are separable, the extension to 2-D is straightforward. Figure 2.2 illustrates this design concept with a specific example; the angles are shown in degrees. The implementation is designed for eight sub-FOVs, each having an angular range of β = 7.5◦. For simplicity, the sub-FOVs are assumed to be non-overlapping, resulting in an angular FOV (φ) of 60◦. The first stage folds the 0◦ to −30◦ angular range onto the 0◦ to 30◦ angular range. In the second stage the resulting 30◦ angular range is further halved to a 15◦ range using a plate beamsplitter and a mirror, which are at angles of 45◦ and 52.5◦ from the vertical axis. The third stage again halves the 15◦ range to 7.5◦, which is the range of a single sub-FOV. For the third stage the plate beamsplitter and mirror angles are 45◦ and 48.75◦, respectively. If the sub-FOVs are overlapped, then the angles of the mirrors in each stage can be adjusted to implement the given overlap.

Figure 2.2: Binary combiner in a log sequence arrangement for multiplexing 8 sub-FOVs, each with an angular range of 7.5◦. (See [61].)

As the angular FOV at the end of this three-stage binary combiner is reduced by a factor of 2^3 to 7.5◦, we can use a simple lens at the end of this optical setup. We consider the TECHSPEC MgF2-coated achromatic doublet with a diameter of 12.5 mm and a focal length of 35 mm. Given A = 12.5 mm, using (2.1) we calculate the front-end effective aperture Ae of the beamsplitter and mirror combination to be 5.2 cm. Since the optical system performs 2-D imaging, we have the same three-stage binary combiner in the other dimension with the same effective front-end aperture. In total, therefore, we have six plate beamsplitter and mirror combinations.
Assuming, for simplicity, that the beamsplitter and the mirror equally share the effective aperture, the lengths of the plate beamsplitter and the mirror are given by (5.2/2)√2 cm and (5.2/2)(2/√3) cm, respectively. The factors of √2 and 2/√3 follow from the plate beamsplitter and the mirror being at angles of 45◦ and 30◦, respectively, from the vertical axis. Assuming a square shape for both, the front-end stage dimensions of the plate beamsplitter are 3.7 cm × 3.7 cm and those of the mirror are 3 cm × 3 cm. Also assuming the thickness of the optical glass to be 2 mm, we calculate that in Stage 1 the total volume of glass used by the two pairs (corresponding to 2-D imaging) of plate beamsplitter and mirror combinations is 9.1 cm3. Similar calculations for Stages 2 and 3 give glass volumes of 6.8 cm3 and 4 cm3, respectively. If we take the density of the optical glass to be 2.5 g/cm3, the total mass of the log combiner is 49.75 g. The mass of the achromatic doublet is less than 5 g, and hence the weight of the optical system is about 54.75 g. If we were to directly use a wide-angle lens to capture an angular FOV of 60◦, a potential candidate is Canon's EF 35mm f/1.4L USM lens. It has an angular FOV of 63◦, but its mass is 580 g and it uses 11 optical elements. Therefore, we see mass savings of about a factor of 10 for our proposed multiplexed imager, and also reduced optical complexity, as we are using a simple, easy-to-design binary combiner and an achromatic doublet as opposed to an 11-element wide-angle lens. Two practical issues that arise in designing optical systems involving beamsplitters are vignetting and transmission loss. Vignetting arises when there is non-uniformity in the amount of light that passes through the optical system for each of the points in the FOV. The resulting non-uniformity at the periphery of the superposition data has the potential of increasing false negatives, which in turn can lead to errors in properly locating the targets.
To overcome this potential problem in the setup shown in Fig. 2.1, the size of each beamsplitter should be large enough to ensure that the angular range subtended by that beamsplitter at the camera is a strict upper bound on the angular range of the corresponding sub-FOV. Field stops are then used to restrict the angular range of the beamsplitter to that of the sub-FOV. Specifically, for the setup shown in Fig. 2.1 and used in Chapter 3, each sub-FOV is 1.9◦ horizontally and 1.3◦ vertically, while the angular range of the beamsplitter is approximately 3◦. As a result, vignetting is avoided. For the binary combiner shown in Fig. 2.2, vignetting is not an issue because (2.1) is derived from a vignetting analysis of the binary combiner to give an effective aperture Ae that does not block light from any point in the large angular FOV. The second issue has to do with a decrease in throughput due to optical “combination loss” as the light passes through the various stages of beamsplitters. Specifically, for the eight sub-FOV multiplexed imager shown in Figure 2.1(b), the optical transmission decreases by a factor of (0.5)(0.5)(0.5) = 0.125. This throughput cost, however, is no worse than that of a narrow-field imager in a pan/tilt mode, which is commonly used to achieve a wide FOV [62], [63]. For our current example the dwell time of a corresponding narrow-field imager on each sub-FOV will be 1/Ns = 1/8 = 0.125 time units. Since the photon count is directly proportional to the dwell time, we have the same photon efficiency for both the multiplexed and pan/tilt imagers. This result can be extended to the general case of Ns sub-FOVs, where the photon efficiency of both the multiplexed and pan/tilt imagers is reduced by a factor of Ns. We acknowledge, however, that for the proposed beamsplitter and mirror-based multiplexed imagers we have not taken the component losses, e.g., scattering at the mirror, into account.
We assume these losses have a negligible effect on the photon efficiency. Unlike the beamsplitter and mirror based multiplexed imagers, the multiple-lens-based imager shown in Fig. 1.2 overcomes the disadvantage of loss in photon efficiency at the expense of a higher optical mass cost. The imagers we have proposed simultaneously encode and compress (through superposition of the sub-FOVs) the data. The subsequent algorithm then performs target tracking by decoding the relevant information from this compressed superposition data. In the next chapter we discuss in detail the encoding and compression, and the decoding algorithm.

2.2 Compressive Feature Specific Imagers

Feature-specific imagers take scene measurements by linearly projecting the object space onto a measurement space using knowledge-enhanced measurement kernels. Inherent in the projection and the attendant real-time compression is the discretization of the object space with respect to a certain system resolution of the FS imager. The system resolution includes the resolution of both the optical system and the sensor. We represent this resolution by ∆r. Let the object scene be H by W unit distance. Then the pixel dimension of the scene is h = H/∆r by w = W/∆r. Mathematically, the discretized scene is generally expressed in vector form with dimension hw × 1. Setting N = hw, we think of the scene as an N × 1 vector. Therefore N would be the dimension of the sensor array if conventional imaging were being used. Henceforth, when we talk about reduced dimensionality of compressive measurements it will be with respect to this maximal conventional imaging dimension. Thus, if N = 100 and we take M = 5 measurements, the compression factor is N/M = 20. For the discrete representation, the measurement kernel is represented by a measurement matrix (or sensing matrix) Φ. Each row φi , i = 1, . . .
, M of the sensing matrix is a basis vector of the compressed measurement space onto which the object scene x is optically projected. This results in the measurements

yi = φi x, i = 1, . . . , M, or y = Φx,   (2.2)

where y is the M × 1 measurement vector, Φ is the M × N sensing matrix, and x is the N × 1 object scene vector. There are three basic FS imagers for taking these compressive measurements: sequential, parallel and photon-sharing. The sequential FS imager, schematically illustrated in Fig. 2.3, consists of a single optical aperture along with an adaptive optical mask and a single photodetector. It takes one measurement per time-step. The adaptive optical mask changes its transmittance at each time-step according to the projection φi being measured. The transmittance-weighted incident irradiance from the object scene is then spatially integrated by the photodetector. Additive white Gaussian noise (AWGN) is added to the output of the photodetector to give a single measurement yi. The process is repeated for different projections over M time-steps to obtain M measurements.

Figure 2.3: Sequential feature specific imager. (See [15].)

The parallel FS imager, shown in Fig. 2.4, sub-divides the total optical aperture into M sub-apertures, each with its own fixed optical mask and photodetector. It consequently takes all the measurements in the same time interval.

Figure 2.4: Parallel feature specific imager. (See [15].)

The choice between sequential and parallel architectures has a direct impact on measurement noise. Let us assume that the total optical aperture diameter for both architectures is D, and that the total time to take the M measurements is T. Then the time per measurement for the sequential FS imager is T/M, and the sub-aperture diameter for the parallel FS imager is D/√M. Consequently, for both architectures the photon count is TD²/M. There is, however, a bandwidth advantage associated with the parallel architecture.
By bandwidth we mean the inverse of the measurement time per projection. For the sequential architecture it is M/T, while for the parallel architecture it is 1/T, a constant. Thus, in the case of a sequential FS imager the bandwidth scales as a function of M, and therefore so does the noise. If we let the variance of the AWGN be σ², then the noise in the sequential imager scales as σ²M/T. The disadvantage of the parallel architecture is its hardware cost, as it requires M photodetectors. The third architecture is the photon-sharing architecture, schematically depicted in Fig. 2.5. It is essentially a pipeline architecture with M stages, each stage measuring one projective measurement of the incident object scene irradiance. Each stage uses a spatial polarization modulator and a polarization beamsplitter. The spatial polarization modulator rotates the polarization of the incident light according to the projection being measured, and the polarization beamsplitter directs the orthogonal component of the polarization generated by the modulator to be spatially integrated at the photodetector, while allowing the rest of the incident irradiance to pass on to the next stage. The main advantage of the photon-sharing architecture is its photon efficiency. It does not use optical masks with their absorptive losses and hence discards no useful photons. At the same time, it is able to make all the projective measurements in time T. Consequently, like the parallel imager, this architecture has the bandwidth advantage. Its disadvantage is its relative complexity and increased hardware requirements in terms of optical elements and photodetectors. For details on the workings of these optical architectures we refer the reader to the work done by our colleagues, Neifeld et al. [5], [15]. The FS architectures discussed above perform incoherent imaging. Incoherent imaging systems are linear in intensity, thus limiting the sensing matrix entries to nonnegative values.
The optimal, mathematically derived sensing matrix, however, can have negative entries. To bridge the gap between theory and practice we need a way to physically implement the bipolar entries of the sensing matrix without violating the nonnegativity requirement. One such method is dual-rail signaling, consisting of two complementary arms. One arm implements Φ+ (the positive entries of Φ are kept while the negative ones are set to zero) to get y+ = Φ+ x, and the second arm implements Φ− (the absolute values of the negative entries are kept while the positive ones are set to zero) to get y− = Φ− x. The resulting measurement y is then given by y = y+ − y− = Φ+ x − Φ− x = (Φ+ − Φ− )x = Φx. More flexibility can be added to this basic setup, as discussed in [5]. Another constraint on the sensing matrix comes from the passive nature of the optical architecture. An imager cannot increase the number of photons collected by the photodetector; the total number of photons entering the optical system is the same as the number leaving it. This condition manifests itself as the photon count constraint, which says that the maximum absolute column sum of Φ (i.e., the induced 1-norm of Φ) is at most one. To ensure this constraint is met, the sensing matrix has to be normalized by its maximum absolute column sum. Let c = maxj Σi |Φij |. Then the sensing matrix satisfying the photon count constraint is given by Φ = Φ/c. From now on we assume that we have the capability to optically implement a sensing matrix satisfying the photon count constraint.

Figure 2.5: Photon sharing feature specific imager. (See [15].)

In Chapter 5 we detail the potential projective measurement matrices that can be used in these optical architectures.

CHAPTER 3 SUPERPOSITION SPACE TRACKING

We do not require the production of a traditional image for the purpose of target tracking.
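Before proceeding, the dual-rail decomposition and photon-count normalization described above can be sketched numerically (a minimal illustration, not the optical implementation itself; the matrix sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
M, N = 4, 8
Phi = rng.standard_normal((M, N))            # bipolar sensing matrix from theory

# Photon count constraint: induced 1-norm (max absolute column sum) <= 1.
c = np.abs(Phi).sum(axis=0).max()
Phi = Phi / c
assert np.abs(Phi).sum(axis=0).max() <= 1.0 + 1e-12

# Dual-rail decomposition into two nonnegative (physically realizable) arms.
Phi_pos = np.maximum(Phi, 0.0)               # positive entries kept
Phi_neg = np.maximum(-Phi, 0.0)              # magnitudes of negative entries

x = rng.random(N)                            # nonnegative scene intensities
y = Phi_pos @ x - Phi_neg @ x                # y = y+ - y-
assert np.allclose(y, Phi @ x)               # equals the bipolar measurement
print("dual-rail measurement matches bipolar projection")
```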
The efficacy of a target tracking system is determined only by the accuracy of the estimated target positions. In this chapter we detail our encoding techniques and the decoding algorithm for continuously tracking targets moving in a large FOV (object scene) of interest, without incurring the computational and primary memory (random access memory) cost of reconstructing the entire scene.¹ This task-specific approach to target tracking also results in real-time data compression, leading to direct savings in sensor and secondary data storage (hard disk) costs, apart from the optical cost advantages discussed in Chapter 2. We show that by dividing the large FOV into sub-FOVs that are optically encoded and multiplexed, we are able to successfully localize targets without scene reconstruction. We outline the possible encoding strategies, briefly discussing each one. We focus mainly on 1-D spatial-shift encoding due to its ease of exposition for the proof-of-concept. We then detail the decoding algorithm. We show how the decoding algorithm can be used for single- as well as multiple-target decoding. We show this even for scenarios where the multiple targets are identical to each other and morphological discriminating criteria cannot be used. We begin this chapter by introducing the object, superposition and hypothesis spaces. These three spaces are key to the development of the encoding and decoding methodology.

¹ This cost can be, for example, due to the inversion of the correlation matrix, or the execution of an iterative estimation technique.

3.1 Sub-FOV Superposition and Encoding

In a manner identical to the FS imagers introduced in Section 2.2, we assume the large FOV to be H distance units in the vertical dimension by W distance units in the horizontal dimension. To best illustrate the proof-of-concept of our approach, we consider 1-D horizontal spatial-shift encoding.
We next define the three domains, or spaces, we will work with: object space, superposition space, and hypothesis space. The object space is the full area on the ground that is actually observed by the sensor. For the sake of clarity, we begin by letting the sub-FOVs be uncoded. Uncoded sub-FOVs are obtained when the H × W area of the large FOV is simply divided into Ns adjacent sub-FOVs, without overlap, as shown in Fig. 3.1(a). Each of the resulting sub-FOVs is H × Wfov in size, where Wfov = W/Ns. Using an optically multiplexed imager, we image each of these sub-FOVs onto a common FPA. Assuming, as in Section 2.2, that the object space resolution of the optical system is ∆r distance units in each direction, the dimensionality (in pixels) of a single sub-FOV is then H/∆r pixels by Wfov/∆r pixels. Optical multiplexing of all sub-FOVs onto an FPA the size of a single sub-FOV corresponds to superposition of all the sub-FOVs. This measured image comprises what we call the superposition space. Note that the superposition of all sub-FOVs onto a single FPA provides data compression: H/∆r by W/∆r pixels are imaged using an H/∆r by Wfov/∆r pixel FPA. Specifically, the compression ratio for the uncoded case being discussed is Ns. Measuring only the superposition, however, introduces ambiguity into the target tracking process. Consider a single target moving through the first sub-FOV in the uncoded object space, as shown in Fig. 3.1(a). The corresponding superposition space looks like Fig. 3.1(b). Based on the encoding used (in this case none) and the size of a single sub-FOV, we decode possible locations of the target in object space. We call this new space the hypothesis space. (See Fig. 3.1(c).) The hypothesis space is not a reconstruction of the object space; it is a visualization of the decoding logic. The single target detected in the superposition space of Fig.
3.1(b) does not provide information about the true sub-FOV where the target is located. Therefore, there is ambiguity in the hypothesis space, and we hypothesize that the target could be in any of the Ns sub-FOVs. In fact, for this uncoded case, it is not possible to correctly decode the target location based on measurements in superposition space.

Figure 3.1: (a) The area of interest (large FOV) for tracking targets, along with the reference coordinate system. The large FOV is sub-divided into 4 non-overlapping sub-FOVs in the x-direction. In this uncoded case the object space and the FOV are the same. (b) Superposition space. (c) Hypothesis space: from the superposition space it is not possible to tell which sub-FOV the target belongs to. This ambiguity in target location leads to the hypothesis that the target could be in any of the 4 sub-FOVs. (See [61].)

To overcome this ambiguity, we need a distinguishing characteristic that appears in the superposition space yet identifies a target's true location in the object space. This trait can be provided by encoding the sub-FOVs with a spatial shift in the object space. Instead of defining non-overlapping sub-FOVs, we allow for some overlap between adjacent sub-FOVs in the object space, as seen in Fig. 3.2(a). In this shift-encoded system, when a target passes through an area of overlap between two or more sub-FOVs, instead of a single target being present in superposition space there are multiple copies of the original target, as shown in Fig. 3.2(b). We refer to these multiple copies as ghosts of the target. The presence of these ghosts allows the target location to be decoded as long as the pairwise overlap between adjacent sub-FOVs is unique. The overlaps can be designed in different ways. They can be random and unique, such that no two overlaps are the same, or they can be integer multiples of a fundamental shift.
Also, these integer multiples need not be successive, but can be randomly picked from the available set. The simplest design, however, is to let the overlaps be successive multiples of a fundamental shift, which we call the shift resolution δ. Given that there are Ns sub-FOVs, we define the unique overlaps to be the overlap set O = {0, δ, 2δ, . . . , (Ns − 2)δ}. We can now construct a 1-D spatial-shift encoded object space as follows:

1. Start with the uncoded sub-FOVs.
2. Let the first two (from the left) sub-FOVs be in the same position (non-overlapped) as in the uncoded case. We label these sub-FOVs as fov_0 and fov_1 respectively.
3. Shift the ith sub-FOV (fov_i), i = 2, . . . , Ns − 1, to the left such that it overlaps with the (i − 1)th sub-FOV by O_{i−1}.

Depending on the size of the overlap, it is possible that the shift causes portions of fov_i to overlap with more than one sub-FOV. (Figure 3.4 is an example.) One condition that must be satisfied is that fov_i cannot completely overlap fov_{i−1}. This condition implies a restriction on the shift resolution δ. The shift resolution can range from zero (uncoded) to a maximum of δmax = Wfov/(Ns − 1). As shown in Section 3.2, the resulting encoded object space enables the target location to be decoded. The disadvantage is that the object space now covers a smaller area than in the uncoded case. The object space is largest when δ = 0, which corresponds to an uncoded case with target decoding ambiguity. The object space is smallest (covers the least area) when δ = δmax. Between these two extremes, there is a compromise between area coverage and the smallest shift resolution that must be detected in the decoding procedure. To quantify the reduction in area coverage, we define the area coverage efficiency (η) to be the ratio between the area of the encoded object space and the uncoded object space.
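The three construction steps above can be sketched numerically (a minimal sketch; the helper names and parameter values are ours, and any Ns, Wfov and α may be substituted):

```python
# Sketch of the 1-D shift-encoding construction and a numerical check of
# the area coverage efficiency eta (encoded area / uncoded area).

def encoded_layout(ns, w_fov, alpha):
    """Left edge of each sub-FOV after shift encoding.
    Overlaps: O_i = i * delta with delta = alpha * delta_max."""
    delta = alpha * w_fov / (ns - 1)      # delta_max = W_fov / (Ns - 1)
    edges = [0.0, w_fov]                  # fov_0 and fov_1 are unshifted
    for i in range(2, ns):
        # fov_i overlaps fov_{i-1} by O_{i-1} = (i - 1) * delta
        edges.append(edges[-1] + w_fov - (i - 1) * delta)
    return edges

def coverage_efficiency(ns, w_fov, alpha):
    """Encoded object-space width divided by the uncoded width Ns * W_fov."""
    edges = encoded_layout(ns, w_fov, alpha)
    return (edges[-1] + w_fov) / (ns * w_fov)
```

For Ns = 8 and α = 1 the geometric computation gives η = 0.625, matching the closed-form expression (3.1) in the text.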
The shift resolution also affects the compression ratio (r), which is defined as the ratio of the area of the encoded object space to the area of a single sub-FOV.

Figure 3.2: (a) A portion of an encoded object space with the target in the region of overlap between two adjacent sub-FOVs, along with its distance from the two boundaries of the overlap. The loss in area coverage due to encoding is also shown. (b) Superposition space with two target copies or ghosts. The one to the left corresponds to fov_i and the one to the right corresponds to fov_{i−1}. Also depicted is the relation between the distance measures ℓ1 and ℓ2 in the object space and the separation distance between the target ghosts in the superposition space. (See [61].)

The area coverage efficiency and the compression ratio are given by

η = 1 − α(Ns − 2)/(2Ns),   (3.1)
r = Ns − α(Ns − 2)/2,   (3.2)

where α = δ/δmax lies between 0 and 1. Small α results in better area coverage and a higher compression ratio but a smaller shift resolution. In addition, if we define a boundary as a line in object space where there is a change in the sub-FOV overlap structure, then small α also results in a lower boundary density, which can adversely affect the average time required to properly decode a target. The opposite characteristics are true for values of α near unity. We quantify trade-offs between area coverage, decoding accuracy, and average decoding time in more detail in Chapter 4.

Figure 3.3: Schematic representation of encoding schemes: (a) spatial shift (two-dimensional case shown), (b) rotation, and (c) magnification. (See [61].)

3.1.1 2-D Spatial, Rotational and Magnification Encodings

Although 1-D spatial shift encoding is the main focus of this work, we now briefly describe other potential encoding schemes, illustrated in Fig. 3.3. As mentioned in Section 1.3.2, 2-D spatial shift encoding can be thought of as two separable 1-D spatial shift encodings.
Specifically, instead of sub-dividing the large FOV into smaller sub-FOVs in only the horizontal x-direction, we also sub-divide in the vertical y-direction. Again starting with the uncoded case, if the numbers of sub-FOVs in the two directions are Nx and Ny respectively, the size of each resulting sub-FOV is Hfov × Wfov, where Hfov = H/Ny and Wfov = W/Nx. Defining the shift resolutions in the two dimensions as δx and δy, the overlap sets are then given by Ox = {0, δx, 2δx, . . . , (Nx − 2)δx} and Oy = {0, δy, 2δy, . . . , (Ny − 2)δy}. The encoding procedure described above for a 1-D system can be separably applied to the sub-FOVs in both the x- and y-directions to give the 2-D spatially encoded object space. In this 2-D encoding scheme, each sub-FOV is characterized by a unique pairwise combination of horizontal overlap from Ox and vertical overlap from Oy. The area coverage efficiency and the compression ratio are given by

η = (1 − αx(Nx − 2)/(2Nx))(1 − αy(Ny − 2)/(2Ny)),   (3.3)
r = (Nx − αx(Nx − 2)/2)(Ny − αy(Ny − 2)/2),   (3.4)

where αx = δx/(δx)max, αy = δy/(δy)max, and (δx)max and (δy)max are the maximum allowable shifts in the two dimensions. In the case of rotational encoding, the objective is to define unique rotational differences between any two sub-FOVs. The simplest way to do this is to define a fundamental angular resolution δang and let all the rotational differences be integer multiples of δang. The resulting rotational difference set is R = {0, δang, 2δang, . . . , (Ns − 1)δang}. One must be careful, however, to note that R is a set of rotational differences, i.e., differences between the absolute rotations of any two sub-FOVs. The absolute rotation assigned to the ith sub-FOV is then Σ_{j=0}^{i} R_j, i = 0, 1, . . . , Ns − 1. Furthermore, since rotational encoding is periodic every 360°, it is logical to upper bound the maximum absolute rotation by 360°.
This bound results in a maximum angular resolution of δmax = 2 · 360°/(Ns(Ns − 1)). Rotational encoding, like spatial shift encoding, can be applied to sub-FOVs in either or both of the x- and y-directions. We call rotational encoding 1-D when the large FOV is sub-divided in either the x- or the y-direction and rotational encoding is then applied to the resulting sub-FOVs. 2-D rotational encoding refers to the case where rotational encoding is applied to the sub-FOVs resulting from the sub-division of the large FOV in both directions. Unlike 2-D spatial shift encoding, however, 2-D rotational encoding is not separable. It requires that the rotational difference between any two sub-FOVs, regardless of the dimensions they lie along, be unique. In 2-D spatial shift encoding, on the other hand, overlaps have to be unique only with respect to one direction. As a result, even when Ox = Oy = O and Nx = Ny = Ns, the resulting 2-D spatial shift encoding is valid because each sub-FOV will still have a unique overlap combination (O_i, O_j), i, j ∈ {0, 1, . . . , Ns − 1}, associated with it. Magnification encoding assigns a unique magnification factor to each sub-FOV. The magnification factors depend on the optical architecture and the size of the area of interest. 2-D magnification encoding, like 2-D spatial shift encoding, is separable as long as we can separably control the magnification factors along the two directions. We can then define sets of unique magnification factors Mx and My along the x- and y-directions respectively. This results in an encoding scheme similar to 2-D spatial shift encoding. As a result, in a manner analogous to 2-D spatial shift encoding, even when Mx = My = M and Nx = Ny = Ns, the magnification encoding scheme is valid because each sub-FOV will still have a unique combination (M_i, M_j), i, j ∈ {0, 1, . . . , Ns − 1}, of horizontal and vertical magnification factors applied to it.
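The rotational-difference bookkeeping described above is easy to check numerically (a small sketch of ours; the function name and the choice Ns = 5 are illustrative):

```python
# Absolute rotations as cumulative sums of the rotational difference set
# R = {0, d_ang, 2*d_ang, ..., (Ns-1)*d_ang}.

def absolute_rotations(ns, d_ang):
    """Absolute rotation of sub-FOV i is sum of R_0 through R_i."""
    total, rots = 0.0, []
    for j in range(ns):
        total += j * d_ang            # R_j = j * d_ang
        rots.append(total)
    return rots

# With the maximum angular resolution d_max = 2*360/(Ns*(Ns-1)), the last
# sub-FOV's absolute rotation lands exactly on the 360-degree bound.
ns = 5
d_max = 2 * 360.0 / (ns * (ns - 1))   # 36 degrees for Ns = 5
rots = absolute_rotations(ns, d_max)  # [0, 36, 108, 216, 360]
```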
Finally, we note that it may be possible to obtain greater area coverage for the same FPA by combining several encoding methods.

3.2 Decoding: Proof of Concept

Properly encoded spatial shifts enable decoding of the true target location whenever the target crosses a boundary into a new overlap region, and possibly sooner. Depending on the sub-FOV shift structure, target replicas or ghosts can appear only at certain fixed locations; that is, the distances in superposition space between any set of ghosts corresponding to the same target form a unique pattern, because each sub-FOV overlap is unique. From this unique pattern, we can identify the sub-FOVs involved and uniquely localize the target in hypothesis space. Once the target is uniquely located in hypothesis space, we have decoded the target's position in object space. In Sub-Sections 3.2.2 and 3.2.3 we explain and demonstrate the decoding procedure, first for a single target and then for multiple targets. We begin, though, with a brief discussion of the correlation-based tracking employed in this work.

3.2.1 Correlation-based Tracking

In the simulations and performance analyses we use a Kalman filter to track target ghosts in the superposition space. Our Kalman tracker has a length-four state vector, the four states being the x- and y-coordinates, and the x- and y-direction velocities, of the detected target ghosts. We represent this state vector as

s = [x, y, vx, vy]^T.   (3.5)

Kalman tracking involves two basic steps:

1. Time update: propagating forward in time the current (time t) state and error covariance estimates to obtain their a priori estimates at time t + 1.
2. Measurement update: updating the a priori estimates with the measurement data to obtain the a posteriori estimates.

To perform these two steps, we first build the state and the measurement models. State model: The state model is a recursive model which predicts the current state from the previous state.
Here, we model the velocity of the target ghost as

vx(t) = vx(t − 1) + n_vx(t),   (3.6)
vy(t) = vy(t − 1) + n_vy(t).   (3.7)

Thus the velocity is constant except for perturbations (changes in speed) arising from factors such as a bend in the road, wind, etc. These perturbations in the x- and y-directions are modelled as the noise terms n_vx and n_vy respectively. Corresponding to this velocity model, we define the coordinate model to be

x(t) = x(t − 1) + vx(t − 1)δt,   (3.8)
y(t) = y(t − 1) + vy(t − 1)δt,   (3.9)

where δt represents the incremental time-step. By defining the coordinates as a function of velocity we have removed the need to incorporate any noise terms in the coordinate model. Therefore our state model can now be written as

s(t) = As(t − 1) + n(t),   (3.10)

where

s(t) = [x(t), y(t), vx(t), vy(t)]^T,
A = [1 0 δt 0; 0 1 0 δt; 0 0 1 0; 0 0 0 1],
n(t) = [0, 0, n_vx(t), n_vy(t)]^T.

Associated with the noise perturbation n(t) is the process noise covariance matrix Q = E[n(t)n^T(t)], which is dynamically updated depending on the acceleration of the ghost target. ("Process" refers to the tracking of the target ghosts.) Measurement model: The measurement model is a data collection model used here to model correlation-based measurements of the x- and y-coordinates of the target ghosts. Consequently, we write the measurement model as

z(t) = Hs(t) + ξ(t),   (3.11)

where ξ(t) is the measurement noise and H = [1 0 0 0; 0 1 0 0]. Associated with the measurement noise is the measurement covariance matrix R = E[ξ(t)ξ^T(t)], which we assume to be constant and pre-compute off-line. Based on the state and measurement models, we define the a priori and a posteriori errors respectively as

e−(t) = s(t) − ŝ(t−),   (3.12)
e+(t) = s(t) − ŝ(t+),   (3.13)

and the corresponding a priori and a posteriori error covariance matrices as

M(t−) = E[e−(t)e−^T(t)],   (3.14)
M(t+) = E[e+(t)e+^T(t)].
(3.15) The a priori state estimate ŝ(t−) is the estimate resulting from the time update, while the a posteriori state estimate ŝ(t+) is the estimate after the measurement update. (Note that ŝ(t) is the estimate of the state at time t.) Our goal is to minimize the a posteriori error covariance. Towards this end, we write the state estimate ŝ(t) as

ŝ(t) = ŝ(t−) + K(z(t) − Hŝ(t−)),   (3.16)

where K is the Kalman gain and z(t) − Hŝ(t−) is the measurement innovation. The Kalman gain K that minimizes the a posteriori error at the current time t is given by

K(t) = M(t−)H^T (HM(t−)H^T + R)^{−1}.   (3.17)

This Kalman gain is used in the measurement update to compute the current error covariance matrix M(t) from M(t−) by

M(t) = (I − K(t)H)M(t−),   (3.18)

and is also substituted back into (3.16) to estimate the current state ŝ(t). In the time update step, M(t) and ŝ(t) are propagated forward to obtain the a priori estimates M((t + 1)−) and ŝ((t + 1)−):

ŝ((t + 1)−) = Aŝ(t),   (3.19)
M((t + 1)−) = AM(t)A^T + Q,   (3.20)

and thus we implement the recursive Kalman tracker. For each target ghost a separate Kalman tracker is initialized. Once we have decoded the target using the state vectors of the target ghost tracks and the decoding algorithm described in Section 3.2.2, we replace the Kalman trackers of the target ghosts with a single Kalman tracker for the target. We mentioned above that the measurements we use in the Kalman tracker are correlation-based. In fact, correlation performs two specific tasks: (1) locating the target ghost positions in the superposition space, which are the measurement inputs to the Kalman filter, and (2) separating weak target ghosts from noise and background clutter. If the target ghosts have a strong signal-to-noise ratio (SNR), then the two steps can be performed using simple template matching. On the other hand, if the signal strength is weak, template matching does not suffice. This is important on two counts.
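As an aside, the recursions (3.16)–(3.20) can be collected into a per-ghost tracker sketch (pure Python, our own illustration; the constant diagonal Q and R are simplifying assumptions, whereas the text updates Q dynamically from the ghost's acceleration):

```python
# Per-ghost Kalman tracker sketch built from the state model (3.10), the
# measurement model (3.11), and the updates (3.16)-(3.20).

def mat_mul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def mat_add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def mat_sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def transpose(A):
    return [list(col) for col in zip(*A)]

def eye(n, scale=1.0):
    return [[scale if i == j else 0.0 for j in range(n)] for i in range(n)]

def inv2(S):
    # Closed-form inverse of the 2x2 innovation covariance H M H^T + R.
    (a, b), (c, d) = S
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

class GhostTracker:
    def __init__(self, x, y, vx, vy, dt=1.0, q=0.01, r=0.1):
        self.A = [[1.0, 0.0, dt, 0.0],
                  [0.0, 1.0, 0.0, dt],
                  [0.0, 0.0, 1.0, 0.0],
                  [0.0, 0.0, 0.0, 1.0]]
        self.H = [[1.0, 0.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0]]
        self.s = [[x], [y], [vx], [vy]]   # state estimate
        self.M = eye(4)                   # error covariance
        self.Q = eye(4, q)                # process noise covariance (assumed constant)
        self.R = eye(2, r)                # measurement noise covariance

    def time_update(self):
        # a priori estimates: s(t+1|t) = A s(t), M(t+1|t) = A M A^T + Q
        self.s = mat_mul(self.A, self.s)
        self.M = mat_add(mat_mul(mat_mul(self.A, self.M), transpose(self.A)), self.Q)

    def measurement_update(self, zx, zy):
        Ht = transpose(self.H)
        S = mat_add(mat_mul(mat_mul(self.H, self.M), Ht), self.R)
        K = mat_mul(mat_mul(self.M, Ht), inv2(S))        # Kalman gain (3.17)
        innov = mat_sub([[zx], [zy]], mat_mul(self.H, self.s))
        self.s = mat_add(self.s, mat_mul(K, innov))      # state update (3.16)
        self.M = mat_mul(mat_sub(eye(4), mat_mul(K, self.H)), self.M)  # (3.18)
```

Fed noiseless position measurements of a constant-velocity ghost, the velocity estimate converges toward the true velocity within a couple dozen frames.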
The first is the multiplex disadvantage [64]: photon noise from different multiplexed sub-FOVs contributes to the noise of each target ghost, reducing its SNR. This is especially true for targets lying in regions of no adjacent sub-FOV overlap. In this case photon noise from all Ns object-space sub-FOVs increases the noise floor that affects the target ghost SNR in the superposition space. The effect is somewhat mitigated for targets lying in regions of adjacent sub-FOV overlap. For example, consider the case where the target lies in a region where three sub-FOVs overlap. Here, although noise from the Ns sub-FOVs is still present, its effect is reduced because target photons are incident on the FPA from three multiplexed overlapping sub-FOVs. The second is that in our proposed technique the superposition of the sub-FOVs results in a reduction of the target ghost's dynamic range. This reduction follows the same trend as the multiplex disadvantage. Specifically, let us consider Ns object-space sub-FOVs, each with a dynamic range of [0, R]. Let the target be present in only one sub-FOV and in a region that does not overlap with adjacent sub-FOVs. When we superimpose these sub-FOVs, the dynamic range of the resulting superposition space, neglecting saturation, is [0, NsR], while the dynamic range of the target is still [0, R]. Therefore, the target strength compared to the background is suppressed by a factor of Ns. If the target is in a region of overlap of M sub-FOVs, M < Ns, then this factor is reduced to Ns/M. The above analysis shows that there is a trade-off between the target ghost signal strength and dynamic range, the number of sub-FOVs (Ns), and the size of the object space. (For a given Ns, the object space size is related to the number of overlaps M and the shift resolution δ.) This necessitates that our system be able to deal with the presence of weak targets. There is a two-fold strategy we consider.
First, use a background subtraction technique to remove the background. Second, use correlation filters to locate the target ghosts. We briefly look at each of these. Background subtraction is an intuitive and well-studied paradigm for reducing background clutter and thereby increasing target SNR and dynamic range. In almost all real-life cases the background is non-static, and consequently many statistical background subtraction techniques have been proposed and studied. In one such class of methods, depending on the complexity of the background, the background image pixels are modelled as having Gaussian probability density functions (pdfs) [65] or as mixtures of Gaussians (MoG) [66], [67]. If the parametric Gaussian mixtures do not embody enough complexity, then kernel-based methods have also been proposed in the literature [68]. All these methods fall under the category of non-predictive methods. Predictive methods, on the other hand, generally employ Kalman tracking-based techniques to characterize the state of each pixel for background estimation [69]. Recently, Jodoin et al. [70] proposed a novel spatial approach to background subtraction which works under an assumption of ergodicity: the temporal distribution observed over a pixel corresponds to the statistical distribution observed spatially around that same pixel. They model a pixel using unimodal and multimodal pdfs and train this model over a single frame. This method allows the background to be estimated from a single image frame, which results in a faster algorithm that requires less memory. Since background subtraction is not the main focus of this work, we adopt a simple, albeit slightly crude, averaging method to estimate the background. To do this we observe the scene of interest for a certain length of time to acquire a large number of encoded and superimposed sub-FOV images, which we then average to estimate the background of the superposition space.
The challenge here is the requirement of a time sequence of superposition space image frames having no target motion. This is generally not possible in real scenarios. If, however, we can ensure that the number of targets is relatively small, say by imaging at a certain time of the day, then this averaging procedure does work. Background subtraction removes background clutter, but not necessarily the noise present in the measured data. To further increase measurement robustness against this residual noise, we employ advanced correlation filters. Correlation filters, because of their shift invariance and their distortion tolerance, have been successfully employed in radar signal processing and image analysis for pattern recognition [71], [72], [73]. Here we adapt minimum-variance synthetic discriminant functions (MVSDF) [74] to detect² target ghost locations in the presence of residual noise. The basic idea behind an MVSDF filter is simple: design a filter based on a certain number of training images which, when matched to a test image, outputs a specified value with minimum residual noise variance. This method can be thought of as a generalization of template matching to the case where the test image need not be one of the training images. Such test images are treated by MVSDF as noise-corrupted versions of the training images. The training images here refer to rough templates of the various targets of interest. Since noise robustness is built into MVSDF, the templates need not be as exact as required in classical template matching. Let us denote the P training images, scanned into p-length vectors, by g_i, i = 1, . . . , P (p > P). We can collect these images in a matrix G. Let us also define the MVSDF filter as h.

²The MVSDF filter can also be used for classification. We, however, treat differentiating between two targets as a target decoding problem and use MVSDF exclusively for detecting target ghosts and identifying their locations.
In the absence of noise we want

h^T g_i = 1,  i = 1, . . . , P (p > P).   (3.21)

In the presence of noise, however, there will be an additional contribution h^T n, where n is the noise vector. Our goal, therefore, is to find an h such that the output variance due to noise, h^T C_n h, is minimized while still satisfying (3.21). (Here C_n is the noise covariance matrix.) This results in a constrained optimization problem whose Lagrangian is given by

L = h^T C_n h − Σ_{i=1}^{P} λ_i (h^T g_i − 1).   (3.22)

On differentiating L with respect to h and equating the result to zero, the solution in matrix form is given by

h_opt = C_n^{−1} G m,   (3.23)

where m = [λ_1, · · · , λ_P]^T. Noting that (3.21) can be written as G^T h = 1, where 1 is a P-length vector of 1's, it is easy to see that

m = (G^T C_n^{−1} G)^{−1} 1.   (3.24)

For the inverses to exist, we need C_n to be positive definite and G^T G to be full rank. For our simulation we assume the noise to be i.i.d. Gaussian, resulting in the noise covariance matrix being an identity matrix scaled by the noise variance. We empirically found that using different training images to construct G is sufficient to ensure that G^T G is full rank. Once we have obtained the MVSDF filter, we use it as a mask to correlate with the superposition space image to identify the target locations.

3.2.2 Decoding Procedure for a Single Target

Consider a target in object space that enters the region of overlap O_i between two sub-FOVs, fov_{i−1} and fov_i, as shown in Fig. 3.2(a). The corresponding superposition space looks like Fig. 3.2(b). Under the assumption of a single target, the presence of two ghosts in the superposition space indicates that the target has entered a region where two sub-FOVs overlap. To find the two sub-FOVs creating this overlap, we first measure the distance between the two ghosts in superposition space. Let this distance be d.
Based on knowledge of the shift encoding, we then calculate the set of all possible separations between two ghosts of a single target. We label this set S and call it the separation set. Recalling that the set of adjacent sub-FOV overlaps is the overlap set O, the elements of the separation set are S_i = Wfov − O_i, i = 0, 1, . . . , Ns − 2. The set S can be computed once in advance and stored for future reference. We can now define the set T2 ⊆ S such that it contains only those elements of S which are realizable in the spatial shift encoding scheme. It is important to note that not all elements of S result in a valid element of T2. It is possible that the region of overlap between two sub-FOVs lies within (i.e., is a sub-region of) the region of overlap between those two sub-FOVs and one or more other sub-FOVs. In such a case a target present in the two-sub-FOV overlap region will always produce more than two ghosts in the superposition space. (Target 3 in Fig. 3.4(b) is an example of this scenario.) The subscript '2' in T2 indicates that the target is in the region of overlap between two sub-FOVs only (as opposed to regions covered by more than two sub-FOVs). We now look at the basic principle for decoding a target in the region of overlap between two sub-FOVs. In Fig. 3.2(a), the distances ℓ1 and ℓ2 are the distances from the edges of the two overlapping sub-FOVs. In Fig. 3.2(b), we see that these distances are the same as the distances from the two ghosts to the edge of the superposition space. Since ℓ1 + ℓ2 is equal to the overlap between the sub-FOVs, ℓ1 + ℓ2 is an element of O. If the measured separation d corresponds to the jth element of T2, then we can decode fov_{i−1} as the jth sub-FOV and fov_i as the (j + 1)th sub-FOV.
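The two-ghost decoding rule just described can be sketched as follows (Wfov = 64 and the overlap set are illustrative values of ours; for simplicity the sketch searches all of S rather than the realizable subset T2, and the tolerance is an assumption for pixel-quantized separations):

```python
# Separation set S_i = W_fov - O_i and the two-sub-FOV decoding rule.

def separation_set(w_fov, overlaps):
    """S_i = W_fov - O_i for each adjacent-pair overlap O_i."""
    return [w_fov - o for o in overlaps]

def decode_two_fov(d, w_fov, overlaps, tol=0.5):
    """Indices (j, j + 1) of the adjacent sub-FOV pair whose overlap is
    consistent with the measured ghost separation d, or None."""
    for j, s in enumerate(separation_set(w_fov, overlaps)):
        if abs(d - s) <= tol:
            return (j, j + 1)
    return None

# Example with Ns = 4 and O = {0, 9, 18}: a measured separation of 46
# matches S_2 = 64 - 18, so the target lies in the overlap of fov_2 and fov_3.
```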
Finally, our a priori knowledge of the sub-FOV locations in object space, along with the positions of the corresponding target ghosts in superposition space, can now be used to decode the target's x-coordinate in hypothesis space. Because we have used 1-D spatial shift encoding, the y-coordinate of the target is the same in superposition space, hypothesis space, and object space. We have now completely decoded the target location. The above example considers two overlapping sub-FOVs. We can extend the decoding procedure to the case where a single target enters a region where more than two sub-FOVs overlap. Such a scenario can arise, for example, when the overlaps O_{i−1} and O_i are such that fov_{i+1} not only overlaps with fov_i, but also with fov_{i−1}, as in Fig. 3.4, where sub-FOVs fov_2, fov_3 and fov_4 overlap. In such cases the number of target ghosts appearing in superposition space will be equal to the number of overlapping sub-FOVs covering the target. In general, assuming this number to be M, we first calculate the sequence of pair-wise distances, from left to right, between the M target ghosts in superposition space. This sequence is referred to as the target pattern d_M.

Figure 3.4: Three targets in the object space with the same velocities and y-coordinates, depicting two scenarios both of which result in the same superposition space data. (a) Scenario 1: two targets in the object space, with Target 1 in the non-overlapping region of fov_1 and Target 2 in the region of overlap between fov_1 and fov_2. (b) Scenario 2: a single target (Target 3) in the overlap between the sub-FOVs fov_2, fov_3 and fov_4. (c) The same superposition space arising from the two scenarios. (i) Ghost 1 and Ghost 3 in the superposition space are ghosts of Target 2 in the object space, while Ghost 2 in the superposition space corresponds to Target 1 in the object space. (ii) Ghosts 1 through 3 in the superposition space are ghosts of Target 3 in the object space. (See [61].)
We then compare the sequence to the allowed length-M target patterns in superposition space, which again are known because the shift encoding structure is known. The set of allowed length-M target patterns is denoted by TM. The matching pattern determines the proper set of M sub-FOVs, from which the target's position in hypothesis space can be fully decoded as in the case above for two overlapping sub-FOVs.

3.2.3 Decoding Procedure for Multiple Targets

When our above-mentioned two-fold strategy is able to associate target ghosts with the correct target for multiple targets, we can simply apply the single-target decoding procedure to each target individually. On the other hand, when the targets have (1) identical or only subtly different shapes, or (2) such weak signal strength that only target detection is possible, we need a way to associate the target ghosts with the correct targets. In such scenarios, where direct associations are not possible, we need a procedure for decoding multiple targets. The proposed procedure is essentially the same as for a single target except for a pre-decoding step in which ghosts in superposition space belonging to the same target are associated with each other. The procedure involves the following indirect three-fold strategy (stated here specifically with respect to 1-D shift encoding):

1. Group all detected targets in superposition space according to their y-coordinate values. Since the system has 1-D spatial shift encoding in the x-direction, ghosts belonging to the same target must have the same y-position. However, it is possible for two different targets to also have the same y-coordinate. Therefore:
2. Compare the estimated velocities of potential targets in each group. If multiple velocities are detected, it is assumed that multiple targets are present, and the group is sub-divided.
This step follows from the observation that ghosts belonging to the same target must have the same (2-D) velocity. Members of each target group now have the same velocity and the same y-coordinate. Finally:

3. Begin the decoding process by comparing allowed target patterns to the target patterns of the groups determined in the first two steps. The allowed target patterns are the sets T_i, i = 2, 3, . . . , K, where K is the maximum number of sub-FOVs that overlap. We begin with the highest-order target patterns (TK) and work down to the lowest-order target patterns (T2). When a pattern is detected, the target position is decoded. If an allowed target pattern is not detected, the targets in the group are assumed to reside in regions of object space without overlapping sub-FOVs.

When performed in order, these steps usually enable decoding of the locations of multiple targets. Under certain conditions, however, correct decoding is not possible. Figure 3.4 shows two such scenarios, the first (Fig. 3.4(a)) involving two targets, as explained above, and the second (Fig. 3.4(b)) involving a single target in a region with three overlapping sub-FOVs. On the rare occasions when this occurs and the targets involved happen to have the same y-coordinate and estimated velocity, it is not possible to decode which scenario is the true one (Fig. 3.4(c)), and, according to the decoding rules above, the higher-order shift pattern will be decoded. (In Fig. 3.4, Scenario 2 will be decoded.) We illustrate this general case in Fig. 3.5 through a snapshot of four frames from an example movie. Each frame shows the object space across the top, the superposition space in the middle, and the hypothesis space across the bottom. Object space represents the "truth," while the superposition space represents the actual measurement data. The hypothesis space visualizes how the decoding logic works in real time.
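The first two association steps can be sketched as a grouping pass over the detected ghosts (a minimal sketch of ours; ghosts are (x, y, vx, vy) tuples taken from the per-ghost trackers, and the tolerance is an assumption):

```python
# Pre-decoding association: group ghosts by y-coordinate, then sub-divide
# by estimated 2-D velocity.

def group_ghosts(ghosts, tol=0.5):
    """Return lists of ghosts sharing (within tol) the same y, vx and vy."""
    groups = []
    for g in sorted(ghosts, key=lambda g: (g[1], g[2], g[3], g[0])):
        for grp in groups:
            ref = grp[0]
            if (abs(g[1] - ref[1]) <= tol and      # same y-coordinate
                    abs(g[2] - ref[2]) <= tol and  # same x-velocity
                    abs(g[3] - ref[3]) <= tol):    # same y-velocity
                grp.append(g)
                break
        else:
            groups.append([g])
    return groups
```

Each resulting group is then handed to the pattern-matching step 3; a single-ghost group corresponds to a target in a non-overlapping region.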
We stress that the hypothesis space is not a reconstruction of the object space, but is simply a visualization of the decoding logic. The "truth" background has been added to the hypothesis space simply to provide visual perspective to the viewer. The first two frames in Fig. 3.5 show two targets with the same (2-D) velocity and the same y-coordinates moving through the object space. By unfortunate coincidence, the targets happen to have a horizontal separation which is an element of set T2. As a result, the two targets are decoded as a single target, and their decoded position jumps around. Eventually, as shown in the last two frames in Fig. 3.5, the velocities of the two targets change, the superposition space ghosts are properly grouped, and the two targets are correctly decoded. We remind the reader that here we are assuming the targets have either identical shapes or such subtle differences in shape that association based on shape is not possible or reliable. We will continue to make this assumption throughout this chapter and the next.

3.2.4 Decoding via Missing Ghosts

Thus far, we have described how the difference, or shift, between target ghost positions can be used to properly decode target location by uniquely identifying the overlap region that must have produced the shift. However, additional decoding logic is available to the sensor in the form of missing ghosts. Although this additional logic may not be able to uniquely decode the target, it reduces the number of target locations in hypothesis space. To explain this principle, consider the following scenario. For clarity, we focus on a single target moving through the object space. We also assume that Ns = 4, with the sub-FOVs encoded according to 1-D shifts belonging to the overlap set O = {0, δ, 2δ}. The target is in fov_0 and is moving towards fov_1 in the object space, as illustrated in Fig. 3.6(a).
Since there is zero overlap between fov0 and fov1, superposition space has a single target, as seen in Fig. 3.6(b). Based on the superposition space measurement and the decoding strategy explained above, the target cannot be completely decoded. We can only hypothesize a potential target location in each of the four sub-FOVs. Therefore, hypothesis space looks like Fig. 3.6(c). The local x-coordinate of each hypothesized target in its respective sub-FOV is the same. However, we can apply our knowledge of overlaps to rule out fov2. This sub-FOV cannot be allowed because if the target were truly at this position, it would imply that the target resides in an overlapping region between fov2 and fov3. The absence of a second ghost in superposition space tells us this is not true. Therefore, the target can only be in fov0, fov1, or fov3, and we have reduced the target hypotheses from 4 to 3. If the target continues to move toward fov2, additional sub-FOVs will be ruled out by the same logic. In general, missing ghosts can be used to rule out anywhere from one to all incorrect sub-FOVs, depending on the target location and the encoding structure.

Figure 3.5: Four frames from an example movie showing error in decoding two targets with the same velocity, the same y-coordinate, and a separation which is an element of set S2. Correct decoding is possible only when the targets begin to differ in their velocities. In (a) the targets are incorrectly decoded as a single target. The incorrectly decoded target jumps to a different localized position in (b). A change in relative velocity allows the correct decoding of the targets in (c). Part (d) shows that the targets remain successfully decoded.

Figure 3.6: (a) Encoded object space with 4 sub-FOVs. The target is in fov0. The local fov0 x-coordinate lies between Wfov − O2 and Wfov. (b) The superposition space with a single target, indicating that the target is in a non-overlapping sub-FOV region. (c) Hypothesis space with 4 potential targets. The third hypothesized target, however, lies in a sub-FOV overlap which, if true, would produce ghosts in the superposition space. The absence of ghosts rules out this hypothesized target as a potential true target.

CHAPTER 4
SUPERPOSITION SPACE TRACKING RESULTS

To demonstrate and quantify the efficacy of the proposed 1-D spatial shift encoding scheme, we present results generated from both simulated data and a laboratory experiment. The simulated data results include an example simulation that details the various facets of the decoding procedure. The results also present system performance curves based on two criteria: average decoding time and probability of decoding error. They include a detailed discussion of the performance trade-offs between average decoding time, probability of decoding error, and area coverage efficiency. We conclude the chapter with results from a laboratory experiment showing the success of the system in practical scenarios.

4.1 Simulation

We simulated an object space with Ns = 8 1-D spatial-shift-encoded sub-FOVs. The size of each sub-FOV was 64∆r distance units by 64∆r distance units, where ∆r, the object space resolution, was assumed to be finer than the size of the targets of interest. The corresponding sub-FOV dimensionality (in pixels) is 64 by 64.

Figure 4.1: Examples of valid target patterns. (See [61].)

We first defined the overlap set as O = {0, 1δ, 2δ, 4δ, 3δ, 5δ, 6δ}, where δ was chosen as δ = δmax = floor(Wfov/(Ns − 1)) = floor(64/7) = 9 pixels. The reason for this choice of δ is that a large δ increases the boundary density in the superposition space, which increases the total overlapped area. As a result, the number of target ghosts in superposition space that need to be tracked increases.
If the decoder and tracker can handle a large number of targets in superposition space, we can gain confidence that the decoding procedure is robust. Note that the overlaps in the set O are not monotonically increasing, which shows that the ordering of the overlaps in 1-D shift encoding is arbitrary. Based on this overlap set, the separation set is computed to be S = {64, 55, 46, 28, 37, 19, 10}. The allowed target patterns are then given by T2 = {64, 55, 46, 28, 37, 10}, T3 = {{37, 28}, {19, 37}, {10, 19}}, and T4 = {{10, 19, 37}}. The sets of overlapping sub-FOVs corresponding to these patterns are F2 = {{0, 1}, {1, 2}, {2, 3}, {3, 4}, {4, 5}, {6, 7}}, F3 = {{3, 4, 5}, {4, 5, 6}, {5, 6, 7}}, and F4 = {{4, 5, 6, 7}}. It is important to point out that {5, 6} is absent from the set F2. This does not imply that the two sub-FOVs do not overlap (they do); rather, it means that the region of overlap between fov5 and fov6 is also overlapped by a third or even a fourth sub-FOV, as shown in the sets F3 and F4, respectively. In Fig. 4.1 we illustrate the allowed target patterns with a couple of examples. We simulated a scenario by populating the object space with four identically shaped targets appearing at random locations, with random velocities, at random times, and lasting for random durations. We allowed the starting target locations to be anywhere in the object space with equal probability. The velocities were uniformly distributed between 0 and 3∆r distance units per time step to ensure that the target movement looked realistic. The start and stop times were uniformly distributed between 0 and 100 time steps. Figure 4.2 shows a series of frames from one such example simulation, which best illustrates all the facets of our decoding procedure. We explain the figure in the next paragraph.
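The separation-set bookkeeping above can be reproduced in a few lines. This is a sketch under the Section 4.1 parameters; the variable names are our own.

```python
# Sketch of the 1-D shift-encoding bookkeeping from Section 4.1.
# Names (overlap_multiples, separation_set) are illustrative, not from the text.
W_FOV = 64                       # sub-FOV width in pixels
N_S = 8                          # number of sub-FOVs
DELTA = W_FOV // (N_S - 1)       # shift resolution: delta_max = floor(64/7) = 9

overlap_multiples = [0, 1, 2, 4, 3, 5, 6]           # ordering is arbitrary
overlaps = [m * DELTA for m in overlap_multiples]   # O = {0, 9, 18, 36, 27, 45, 54}

# Each ghost separation is S_i = W_fov - O_i.
separation_set = [W_FOV - o for o in overlaps]

print(separation_set)   # -> [64, 55, 46, 28, 37, 19, 10]
```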
Our algorithm is able to handle many more targets, including identical targets, than the four chosen here, but using only a few targets allows the reader to easily follow the decoding logic shown in Fig. 4.2. The identical shapes for the targets have been chosen to emphasize the robustness of the algorithm to the lack of morphological distinctions between targets.

Figure 4.2: Decoding of 4 targets: Each frame depicts the encoded object space (δ = δmax) at the top, the corresponding measured superposition space (with the static background subtracted out) in the middle, and the resulting hypothesis space at the bottom. Parts (a) and (b) show the red target appearing and moving. In part (c) the white target appears; its ambiguity is then reduced via “missing ghosts” logic in part (d). Red target ambiguity is also reduced in part (e), and both the red and white targets are decoded via “missing ghosts” logic in part (f). Part (g) shows the white target ambiguity increasing. In part (h) the blue target appears in a region of overlap, and it is almost immediately decoded in part (i). Part (j) shows an increase in ambiguity of the red target. Parts (k), (l), and (m) successively show the green, white, and red targets being decoded. In part (n) all four targets have been successfully decoded.

The series of frames in Fig.
4.2 show four identically shaped and color-coded targets in the object space. Color coding allows easy target discrimination for the reader. The first target in the object space is the red target, which is followed by a white target. Both of these targets are decoded via missing-ghosts logic as they travel through the object space. This can be seen in the hypothesis space, where the ambiguity in the locations of these two targets is completely removed even before the targets reach a boundary. The green target appears next and is followed by the blue target. Since the blue target appears in a region of overlap of fov1 and fov2, it is almost immediately decoded. The green target is decoded when it reaches the region of overlap between fov1 and fov2. Thus all four targets are successfully decoded. In Figs. 4.2(g) and 4.2(j), however, we see that the white and red target ambiguity, respectively, increases. This is primarily due to the high speed of the targets while crossing the boundary between the non-overlapping sub-FOVs fov0 and fov1, which did not give the Kalman tracker enough time to decode the target ghosts that appeared in the superposition space while the two targets were on the boundary. This scenario has been included to depict a real-world situation and is elaborated upon in Section 4.2. The two targets are decoded when they reach a region of sub-FOV overlap. Figure 4.2 shows us that the decoding time differs depending on the target location and the target velocity. Also, because of possible errors in measurement, it is possible that we incorrectly decode a target. This is especially true for targets with very low SNR. Therefore, decoding time and measurement errors can affect system performance. We next consider two metrics useful in quantifying this system performance. One metric considers the effect of shift resolution on average decoding time, and the other considers the probability of incorrect decoding.
We investigate these metrics in the following sub-sections.

4.2 Average Decoding Time and Area Coverage Efficiency (η)

A small shift resolution (small α) implies that the degree of overlap between adjacent sub-FOVs is small. One potential disadvantage of a small shift resolution is that the average time it takes to decode a target (the decoding time) increases. The problem arises as follows: when a new target ghost appears in superposition space, velocity estimates are not instantaneously available. Therefore, it takes some time to determine whether the new ghost should be associated with an existing group or is due to a new target altogether. To get velocity information we must wait a few time steps while the Kalman tracker updates the state vector and obtains a stable velocity estimate. This waiting period can present a problem, especially when the target is in a sub-FOV with a small overlap. If the target has a high x-velocity, the target may not stay in the overlap region long enough for the Kalman tracker to ascertain the target velocity. This results in increased decoding time. (This is what happened to the white and red targets in our example simulation.) Moreover, for systems with small shift resolution, the distance between two different overlap regions is relatively large (equivalent to lower boundary density), which again tends to increase decoding time.

Figure 4.3: Plot of decoding time as a function of the area coverage efficiency η. The error bars represent ±1 standard deviation of the decoding time from the mean. (See [61].)

A large shift resolution (α close to 1), on the other hand, suffers less from the above-mentioned disadvantages, but has a smaller area coverage due to larger overlaps. Hence, the shift resolution controls the trade-off between decoding time and area coverage efficiency.
A small shift resolution provides larger area coverage, but longer decoding times. A large shift resolution provides smaller area coverage, but shorter decoding times. We quantify this result by plotting the decoding time as a function of the area coverage efficiency in Fig. 4.3. The plot also shows error bars representing ±1 standard deviation of the decoding time from the mean. The plot was computed by averaging the decoding times of 300 targets which passed through the object space in batches of size uniformly distributed between 5 and 10. In each batch the targets appeared at random times, with random velocities, at random locations, and for random durations, in a manner identical to the targets in the movie in Fig. 4.2. The plot shows an approximately linear relationship between the two metrics. This is expected because displacement and time are linearly related through velocity. An increase in overlap due to a larger shift resolution, for the same target with the same velocity, decreases the time it takes for the target to reach the regions of overlap because these regions are now wider and cover more area. The reduced decoding time is directly related to this reduced travel time. The reasoning is the same when using “missing ghosts” logic to decode targets: the larger the overlaps, the faster we can reduce the ambiguity between hypothesized targets, and the smaller the decoding time. It is important to note here that complete decoding using “missing ghosts” plays a less prominent role in affecting the decoding time because it can completely decode only those targets which are in the first sub-FOV from the left. In all other cases the absence of shifts reduces, but does not completely eliminate, the target ghosts, and as a result does not affect decoding time.
The approximately linear relationship between decoding time and area coverage efficiency leads us to the interesting observation that there is no optimal shift resolution. For any value of the shift resolution we either improve decoding time at the expense of area coverage efficiency, or vice versa. This is true for both 1-D and 2-D spatial shift encoding/decoding strategies. Whether the ordering in the overlap set O is random or monotonically increasing also has no bearing on this linear relationship. It seems possible, however, that by combining spatial shift encoding with rotation and/or magnification encodings we can design an optimal encoding strategy. This is a possibility because by combining different encoding strategies we can ensure that where one encoding scheme is weak at a spatial location, the other can be designed to be stronger. By defining the decoding strength of an encoding strategy, for a given sub-FOV, as the average time to decode the location of a target that first appears in that sub-FOV, we can quantify the decoding strength for each sub-FOV and encoding scheme. Then, using basic graph theory, we can simply trace the optimal path through the nodes (each node being specified by the decoding strength of a sub-FOV and an encoding scheme) that gives us the shortest overall decoding time. (This is not the travelling salesman problem, which is NP-complete.) This approach can be a possible starting point for designing optimal optical multiplexing schemes in the future.

4.3 Probability of Decoding Error

We next consider the probability of decoding error, which is defined as the probability that the target pattern decoded from superposition space measurements is an incorrect pattern. In the presence of noise and other distortions, the estimated position of a target ghost in superposition space will be subject to error.
Therefore, the difference between two ghost positions, which is the criterion for target decoding in a shift-encoded system, will also be subject to error. These errors can lead to the wrong pattern being detected, which will cause the target to be decoded to the wrong location. Furthermore, as the shift resolution δ is decreased, more fidelity in estimating target shifts is required. Figure 4.4 shows the results from a simplified calculation of decoding error for a single target present in a region of overlap between M sub-FOVs. We first consider overlaps between two sub-FOVs and then extend the result to overlaps between three and four sub-FOVs. We assume, as we did for the simulation example in Section 4.1, that the imaging system superimposes 8 sub-FOVs onto a single FPA. The width of each sub-FOV is 500∆r distance units, where ∆r is the object space resolution. We assume that the error in estimating the position of a ghost in superposition space is Gaussian distributed with variance determined by the Cramer-Rao bound (CRB) [46] applicable to this problem. The CRB is

var[x̂] ≥ 1 / (SNR · Brms),   (4.1)

where SNR is proportional to the target intensity and Brms is the root-mean-square (rms) bandwidth of the target's intensity profile in the encoded dimension. Brms is given by

Brms = ( ∫_{−Wt/2}^{Wt/2} (dt(x)/dx)² dx ) / ( ∫_{−Wt/2}^{Wt/2} t²(x) dx )   (4.2)
     = ( ∫_{−∞}^{∞} (2πF)² |T(F)|² dF ) / ( ∫_{−∞}^{∞} |T(F)|² dF ),   (4.3)

where t(x) is the target's intensity profile and T(F) is its Fourier transform. We have used a symmetric triangular intensity pattern, which has a closed-form rms bandwidth of 12/Wt² [pixels⁻²], where Wt is the target width in pixels. We first consider the case of a single target present in a region overlapped by two sub-FOVs (M = 2). The target pattern for this case consists of a single distance between two ghosts. If the position errors on the two ghosts are independent, then the variance of the distance estimate is twice the variance in (4.1).
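The closed-form rms bandwidth of the triangular profile can be checked numerically against the spatial-domain form of (4.2). This is our own sanity-check sketch; the grid size is an arbitrary choice.

```python
# Numerical check (not from the dissertation) that a symmetric triangular
# intensity profile of width Wt has rms bandwidth 12/Wt^2, using the
# spatial-domain form of Eq. (4.2): Brms = ∫(dt/dx)^2 dx / ∫ t^2(x) dx.
Wt = 10.0
n = 20_001                                  # odd, so the peak x = 0 is a grid point
dx = Wt / (n - 1)
xs = [-Wt / 2 + i * dx for i in range(n)]
t = [1.0 - abs(x) / (Wt / 2) for x in xs]   # triangle, peak 1 at x = 0

num = sum(((t[i + 1] - t[i]) / dx) ** 2 for i in range(n - 1)) * dx
den = sum(v * v for v in t) * dx
brms = num / den
print(brms, 12 / Wt**2)                     # both ≈ 0.12
```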
We assume that the target is decoded only if the measured overlap matches an allowed overlap from the overlap set O to within some prescribed tolerance ε. (Note that the distance between the two ghosts is related to the overlap through Si = Wfov − Oi, i = 0, 1, ..., Ns − 2. See Section 3.2 for more details.) If the measured overlap is not within ε of a valid overlap, then the target remains undecoded. For example, if the true shift is mδ, a decoding error is made only if the measured overlap d̂ satisfies (m + k)δ − ε ≤ d̂ ≤ (m + k)δ + ε, where k is a non-zero integer and (m + k)δ is a valid overlap. The probability of decoding error can now be computed by integrating the Gaussian error probability distribution over the error region. This error probability is conditioned on mδ being the true overlap. Therefore, to calculate the total probability of decoding error we have to know the a priori probability of the true overlap being mδ. We assume that this probability is uniformly distributed. The value of ε, in general, depends on the structure of the overlap set O. We, however, have chosen the overlaps to be multiples of the shift resolution δ. As a result, all the overlap values are equally spaced from their neighbors. We can therefore let ε be some fixed tolerance value less than or equal to δ/2, where δ, the shift resolution, is the separation between two successive overlaps. When ε = δ/2 we always make a decoding decision. If ε < δ/2, there are cases where the measured overlap does not lie within the tolerance limit of any element of O, and we let the target remain undecoded. In contrast, if the overlap set contains random but unique overlaps, the tolerance ε is a function of O. For instance, the tolerance value for overlaps with a large spacing between them will differ from the tolerance value for closely spaced overlaps, especially when we always make a decoding decision.
The tolerance value, therefore, will have to be adjusted according to the overlaps.

Figure 4.4: Plot of decoding error versus area coverage efficiency for different SNRs and different values of M, the number of sub-FOVs that overlap. (See [61].)

We can extend the above result to the general case where M sub-FOVs overlap. In our example we can have a maximum of M = 4. For the case of M = 3, instead of measuring a single overlap d̂, we measure two overlaps d̂1 and d̂2 resulting from the pair-wise distances between three target ghosts in the superposition space. Therefore, we now have a 2-D Gaussian error probability distribution. The probability of decoding error is calculated by integrating this 2-D distribution over the region given by (m1 + k1)δ − ε ≤ d̂1 ≤ (m1 + k1)δ + ε and (m2 + k2)δ − ε ≤ d̂2 ≤ (m2 + k2)δ + ε. Here m1δ and m2δ are the true overlaps, and k1 and k2 are non-zero shifts such that (m1 + k1)δ and (m2 + k2)δ are valid overlaps. We again assume that the probability of the true overlaps being m1δ and m2δ is uniformly distributed. The extension to M = 4, where we have a 3-D Gaussian error probability distribution, is straightforward. Figure 4.4 shows the probability of decoding error versus area coverage efficiency for ε = δ/2, Wt = 10∆r distance units, and different values of SNR and M. As the shift resolution decreases, area coverage efficiency increases, but so does the probability of decoding error. Thus, the choice of shift resolution is a compromise between area coverage and the probability of incorrectly decoding the target location. We also observe that for fixed SNR, as we increase M the decoding error decreases.
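The M = 2 recipe above can be sketched directly: the separation estimate d̂ ~ N(mδ, σ²) with σ² = 2/(SNR · Brms), and a decoding error occurs when d̂ falls within ±ε of a different valid overlap. The function name and the particular parameter values below are our own illustrative choices.

```python
# Sketch (ours, following the Section 4.3 recipe) of the M = 2 decoding-error
# probability for a uniform prior over the true overlap m*delta.
import math

def p_decode_error(m, n_overlaps, delta, eps, snr, brms):
    sigma = math.sqrt(2.0 / (snr * brms))   # two independent ghost positions
    phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    p = 0.0
    for k in range(-m, n_overlaps - m):     # (m + k)*delta must be a valid overlap
        if k == 0:
            continue                        # k = 0 is the correct decision
        p += phi((k * delta + eps) / sigma) - phi((k * delta - eps) / sigma)
    return p

# Example parameters: epsilon = delta/2, triangular target with Wt = 10, 10 dB SNR.
delta, eps, brms = 9, 4.5, 12 / 10**2
snr = 10 ** (10 / 10)
n = 7                                       # overlaps 0, delta, ..., 6*delta
total = sum(p_decode_error(m, n, delta, eps, snr, brms) for m in range(n)) / n
print(total)
```

As expected from Fig. 4.4, the averaged error probability is small at this SNR and shrinks rapidly as δ (and hence the gap between valid overlaps) grows relative to σ.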
We therefore conclude that longer target patterns make the decoding process more robust and less prone to decoding errors.

4.3.1 Experimental Results

To illustrate how 1-D spatial shift encoding of multiple sub-FOVs can be performed in real-world applications, we conducted an experiment using the optical setup proposed in Fig. 2.1(b). (The experiment was performed at Duke University in collaboration with Dr. David Brady.) The object space used for the experiment was an aerial map of the Duke University campus, and a laser pointer was moved across it during the video acquisition to simulate a single moving target. The object space was 24 mm high and 162 mm wide, and was imaged using the multiplexer in Fig. 2.1(b) onto a commercial video camera (Sony DCR-SR42). By adjusting the tilts of the mirrors shown in Fig. 2.1(b), we obtained an overlap set O = {3, 7, 11, 20, 14, 24, 28} × 1 mm, which deviated slightly from the ideal scenario of {0, 5, 10, 20, 15, 25, 30} × 1 mm. In building the setup, care was taken to make the path lengths travelled by light from each sub-FOV close to equal. However, slight differences in path length resulted in varying magnification of some sub-FOVs. Therefore, the size of each sub-FOV was not uniform: Wfov = {35, 34, 33, 35, 33, 32, 33, 34} × 1 mm and Hfov = {24, 23, 22, 23, 22, 22, 22, 23} × 1 mm, where the ith elements of Wfov and Hfov are the width and height of fovi, respectively. Figure 4.5 shows three frames illustrating the efficacy of our algorithm using this experimental setup. Each frame shows the measured superposition space along with the corresponding hypothesis space. Using the decoding logic discussed in Section 3.2, we are able to decode the moving target as it enters the region of overlap. The figure shows how the “missing ghosts” logic reduces ambiguity about the target's true location in the hypothesis space.
The small deviations of the overlaps from their true values do not affect the performance because all the overlaps are still unique; uniqueness of the overlaps is the necessary and sufficient condition for the applicability of our decoding strategy. The slight variations in magnification of the sub-FOVs also do not affect the decoding performance.

Figure 4.5: Experimental data frames showing successful decoding of a target moving through the object space. Part (a) shows the undecoded target, whose ambiguity is reduced via “missing ghosts” logic in part (b). Part (c) shows the correctly decoded target.

CHAPTER 5
FEATURE-SPECIFIC DIFFERENCE IMAGING

In this chapter we present FS compressive imagers that estimate the temporal changes in the object scene of interest. The temporal changes are modelled using difference images. Where possible, we design the optimal sensing matrix for the FS imagers. In cases where sensing matrix design is not tractable, we consider plausible candidate sensing matrices that use the available a priori information. We also consider non-adaptive sensing matrices and compare their performance to the knowledge-enhanced ones. In conjunction with sensing matrix design, we also develop closed-form and iterative techniques for estimating the difference images. We specifically look at ℓ2- and ℓ1-based methods. We show that ℓ2-based techniques can directly estimate the difference image from the measurements without first reconstructing the object scene. This direct estimation exploits the spatial and temporal correlations between the object scene at two consecutive time instants. We further develop a method to estimate a generalized difference image from multiple measurements and use it to estimate the sequence of difference images.
For ℓ1-based estimation we consider modified forms of the total variation (TV) method and basis pursuit denoising (BPDN). We also look at a third method that directly exploits the sparsity of the difference image.

5.1 Linear Reconstruction

5.1.1 Data Model

Let x1 and x2 be the object scene at two consecutive time instants t1 and t2, respectively. Following the explanation in Section 2.2, the object scene at both time instants is assumed to be discretized and is represented as a vector of size N × 1. Let us also define Φ1 and Φ2 to be the two corresponding optical sensing matrices of size M × N. The M rows imply taking M measurements of the object scene. Thus, the sensing matrices can be thought of as M × N projection matrices that project the scene from an N-dimensional space to an M-dimensional subspace. Using these sensing matrices, we take measurements of the scene at the two time instants. The data model is given by

y1 = Φ1 x1 + n1,   (5.1)
y2 = Φ2 x2 + n2,   (5.2)

where n1 and n2 represent additive white Gaussian sensor noise (AWGN) with zero mean and variance σ², and y1 and y2 are the respective measurements of the scene at the first two consecutive time instants. Our goal is to estimate the difference image ∆x1, given measurements y1 and y2, by finding the estimation operator that minimizes the ℓ2-norm of the error between the truth difference image and the estimated difference image. If we take a Bayesian approach to the linear model in (5.1) and (5.2), then minimizing the ℓ2-norm is the same as minimizing the Bayesian mean squared error (BMSE). The Bayesian assumption allows us to represent the scene as a stochastic process and, as a consequence, allows us to incorporate the (spatial) auto- and (spatio-temporal) cross-correlation information between the scene at the two time instants in the estimation operator. Using estimation theory terminology, we call the estimation operator the linear minimum mean squared error (LMMSE) estimation operator.
5.1.2 Indirect Image Reconstruction

Before we present our method, we briefly discuss a possible approach, in line with the classical use of MMSE operators, for estimating difference images. We call this method intermediate image reconstruction (IIR). As illustrated in Fig. 5.1(a), it involves reconstructing each object scene separately from its respective measurements and then subtracting these intermediate-stage reconstructions to get the estimated difference image. The measurements can be compressive or non-compressive. The latter approach includes the conventional case of imaging the whole scene (the sensing matrix is an identity matrix), and hence avoids the reconstruction step. The disadvantages, of course, are the larger volume of collected data and the inability to optimally account for noise in that data. There is also the important issue of the cost of the optics required to image the scene with ∆r resolution for large scenes. We therefore focus on reconstruction using compressive measurements. Reconstructing the object scene at both time instants means that (5.1) and (5.2) can be separated into two stand-alone problems. We define the reconstructed intermediate-stage object scene for the two time instants as

x̂1 = F1 y1,   x̂2 = F2 y2,   (5.3)

where Fi, i = 1, 2, are the linear reconstruction operators. For each i we separately minimize the BMSE

J(Fi) = E[||xi − x̂i||²ℓ2]   (5.4)

with respect to Fi. The resulting reconstruction operators F1 and F2 are given by the well-known MMSE equation

Fi = (Rx^−1 + Φi^T Rn^−1 Φi)^−1 Φi^T Rn^−1,   i = 1, 2,   (5.5)

where Rx is the auto-correlation matrix of the object scene and Rn is the noise covariance matrix. We always assume that we have already subtracted the mean from the scene. If this is not the case, we can trivially modify (5.5) to account for the mean. If we make the additional assumption that the first two moments completely describe the scene statistics, then (5.5) will be the optimal solution.
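The IIR pipeline can be sketched in a few lines of numpy. The scene statistics below (an exponentially decaying spatial correlation) and all dimensions are toy choices of ours, not the dissertation's; the sanity check uses the standard matrix-inversion-lemma equivalence of the two Wiener-filter forms.

```python
# Minimal numpy sketch (toy statistics, ours) of intermediate image
# reconstruction: apply the LMMSE operator of Eq. (5.5) to each measurement,
# then subtract the two reconstructions.
import numpy as np

rng = np.random.default_rng(0)
N, M, sigma2 = 16, 6, 0.01

idx = np.arange(N)
Rx = 0.9 ** np.abs(idx[:, None] - idx[None, :])    # toy spatial auto-correlation
Rn = sigma2 * np.eye(M)
Phi = rng.standard_normal((M, N)) / np.sqrt(M)     # random compressive projections

# F = (Rx^-1 + Phi^T Rn^-1 Phi)^-1 Phi^T Rn^-1     -- Eq. (5.5)
F = np.linalg.solve(np.linalg.inv(Rx) + Phi.T @ np.linalg.inv(Rn) @ Phi,
                    Phi.T @ np.linalg.inv(Rn))

# Sanity check: (5.5) equals the measurement-domain Wiener form.
F_alt = Rx @ Phi.T @ np.linalg.inv(Phi @ Rx @ Phi.T + Rn)
assert np.allclose(F, F_alt)

x1 = rng.multivariate_normal(np.zeros(N), Rx)
x2 = x1 + 0.1 * rng.standard_normal(N)             # small temporal change
y1 = Phi @ x1 + np.sqrt(sigma2) * rng.standard_normal(M)
y2 = Phi @ x2 + np.sqrt(sigma2) * rng.standard_normal(M)
dx_hat = F @ y2 - F @ y1                           # subtract reconstructions
print(dx_hat.shape)                                # (16,)
```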
These assumptions, however, are rarely true in practice. Despite this restriction, as shown in Chapter 6, it turns out that LMMSE operators are good and computationally efficient estimation operators. Given the reconstruction operators, the indirectly estimated difference image is

∆x̂1 = x̂2 − x̂1 = F2 y2 − F1 y1.   (5.6)

5.1.3 Direct Difference Image Estimation

The intermediate step involving object scene reconstruction in the IIR method is unnecessary. If we remove it by reconstructing the difference image directly from the measurements y1 and y2, we can better estimate the truth difference image. The reason is that we can now incorporate not only the spatial correlation between the pixels (the auto-correlation of the scene), but also the temporal correlation (the cross-correlation between the scene at the two time instants). We define the estimated difference image as

∆x̂1 = W1 y1 + W2 y2,   (5.7)

where W1 and W2 are the jointly optimized estimation operators. We call this the direct difference image estimation (DDIE) technique. It is visualized in Fig. 5.1(b). We start by looking at the simplest case of no sensor noise.

Figure 5.1: (a) Intermediate image reconstruction, and (b) direct difference image estimation.

5.1.4 DDIE: Noise Absent

Our DDIE approach makes an initial assumption of perfect knowledge of the scene at the instant we start taking measurements. From a systems perspective this is a reasonable assumption. For example, the initial knowledge can be obtained from a sensor that has been observing the scene for a long period of time. Therefore, for time instant t1 we assume perfect knowledge of the scene (Φ1 is an identity matrix), and from t2 onward we begin taking compressive measurements of the scene. We can now re-write the data model (5.1) and (5.2) as

y1 = I x1,   y2 = Φ2 x2.   (5.8)

The BMSE we have to minimize is

J(W1, W2) = E[||∆x1 − ∆x̂1||²ℓ2].
(5.9)

Differentiating (5.9) with respect to W1 and W2 and equating the two derivatives to zero, we get the jointly optimized estimation operators (see Appendix A):

W1 = ((R21 − R11) − Rδ Φ2^T (Φ2 Rδ Φ2^T)^−1 Φ2 R21) R11^−1,   (5.10)
W2 = Rδ Φ2^T (Φ2 Rδ Φ2^T)^−1,   (5.11)

where Rδ = R22 − R12^T R11^−1 R12, R21 = R12^T is the (spatio-temporal) cross-correlation matrix between the scene at the two consecutive time steps, and R11 and R22 are the (spatial) auto-correlation matrices of the scene at the two time instants. Now that we know the reconstruction operators, it is possible to find the optimal sensing matrix (in the ℓ2 sense). In fact, it is given by (see Appendix A)

Φ2 = [X 0] Q^T,   (5.12)

where Q is the matrix of the eigenvectors of Rδ and X is any rank-M orthonormal matrix. This is an expected result: finding a sensing matrix that minimizes the mean-squared error between the truth and the estimated difference images in the absence of noise is analogous to finding a matrix that maximizes the projection variance of the object scene, and this is the principal component solution. Looking at (5.12), we see that [X 0] picks out the first M eigenvectors of Rδ to form an M-dimensional subspace in which the projection variance is maximized. Since X is a rank-M orthonormal matrix, we get a rotated M-dimensional subspace. In the simplest case the orthonormal matrix is an identity matrix, in which case the eigenvectors are the principal components. It is interesting to note that the optimal sensing matrix solution involves the eigenvectors of Rδ. Rδ can be interpreted in the following way: given the spatial auto-correlation of scene 1 and the spatio-temporal cross-correlation between scenes 1 and 2, Rδ contains the extra information we get from the spatial auto-correlation of scene 2. The optimal sensing matrix selects the directions that maximize this information.
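The noise-free DDIE operators and the eigenvector-based sensing matrix can be sketched as follows. The correlation matrices below are toy statistics of our own (a stationary scene with a scaled cross-correlation), chosen only so that the matrices are well conditioned.

```python
# Numpy sketch (toy statistics, ours) of the noise-free DDIE operators
# (5.10)-(5.11) and a sensing matrix built from the leading eigenvectors of
# R_delta as in (5.12), taking X to be the identity.
import numpy as np

N, M = 12, 4
idx = np.arange(N)
R11 = 0.9 ** np.abs(idx[:, None] - idx[None, :])   # spatial auto-correlation
R12 = 0.8 * R11                                    # toy spatio-temporal cross-corr.
R21 = R12.T
R22 = R11                                          # stationary-in-time assumption

R_delta = R22 - R21 @ np.linalg.solve(R11, R12)    # R22 - R21 R11^-1 R12

# Sensing matrix: rows are the M leading eigenvectors of R_delta.
eigvals, Q = np.linalg.eigh(R_delta)               # ascending order
Phi2 = Q[:, ::-1][:, :M].T

W2 = R_delta @ Phi2.T @ np.linalg.inv(Phi2 @ R_delta @ Phi2.T)   # Eq. (5.11)
W1 = ((R21 - R11) - W2 @ Phi2 @ R21) @ np.linalg.inv(R11)        # Eq. (5.10)
print(W1.shape, W2.shape)                          # (12, 12) (12, 4)
```

Note that `np.linalg.eigh` returns eigenvalues in ascending order, so the columns of `Q` are reversed before selecting the leading M eigenvectors.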
5.1.5 DDIE: Noise Present

In the presence of noise the optimal estimation operators must be modified: noise corrupts the correlation information, and hence Rδ. The optimal LMMSE estimation operators in the presence of noise are (see Appendix B)

W1 = ((R21 − R11) − (Rα − Rβ) Φ2^T (Φ2 Rα Φ2^T + Rn2)^−1 Φ2 R21)(R11 + Rn1)^−1, (5.13)
W2 = (Rα − Rβ) Φ2^T (Φ2 Rα Φ2^T + Rn2)^−1, (5.14)

where

Rα = R22 − R21 (R11 + Rn1)^−1 R12, (5.15)
Rβ = R12 − R11 (R11 + Rn1)^−1 R12. (5.16)

Here the no-noise Rδ is replaced by Rα − Rβ. The matrix Rβ reflects the loss of correlation information caused by noise: if the noise were zero, Rβ would vanish and Rα − Rβ would be identical to Rδ. In the presence of noise there is a reduction in the available correlation information, and this reduction is quantified by Rβ.

With noise present, finding an optimal sensing matrix is mathematically intractable. As a consequence we consider a few plausible candidate sensing matrices.

5.1.6 Sensing Matrices

PCA: We start with two kinds of principal components (PC). In the first case we let the rows of the sensing matrix Φ be the eigenvectors of the spatial auto-correlation matrix. This is loosely similar to the no-noise solution (5.12) with X set to an M × M identity matrix; however, it considers only the spatial correlation and ignores the temporal correlation. To utilize the temporal correlation information, we also consider the difference principal components (DPC), i.e., the principal components of the difference image. We compute them from the spatio-temporal correlation matrix of the difference images, defined as

R∆x1 = E[(x2 − x1)(x2 − x1)^T] = 2R11 − R12 − R21.

Since R∆x1 is a symmetric matrix, its spectral factorization gives us the difference principal components. There is a two-fold advantage to DPC.
Firstly, they implicitly use both spatial and temporal correlation information. Secondly, since we are trying to reconstruct the difference images and not the object scene itself, the principal components of the difference image are more suitable than PC.

PCA waterfilling: PCA is a sub-optimal solution in the presence of noise because it does not adjust the energies (eigenvalues) of the eigenvectors with changing SNR. We remedy this with a weighted PCA that redistributes the total available energy among the eigenvectors while accounting for noise. This redistribution is achieved by maximizing the mutual information I(x; y) between the scene x and the measurement y, assumed to be N × 1 and M × 1 random vectors, respectively. We briefly discuss the sub-optimality of PCA and then give the weighted solution.

Let Rx be the correlation matrix of scene x, with eigen-decomposition Rx = UΛU^T, where Λ is a diagonal matrix of the eigenvalues (λi, i = 1, ..., N) in decreasing order and the columns of U are the corresponding eigenvectors. Now let noise with covariance matrix σ²I be added. The eigenvectors in U are also eigenvectors of the noise covariance because

(σ²I)U = U(σ²I). (5.17)

As a result, in the presence of noise the eigenvalues become Λ + σ²I. The noise simply adds its variance to every eigenvalue; the eigenspectrum is not adapted to the given SNR.

The PC sensing matrix Φpc = (U(1:N, 1:M))^T makes a measurement y that lies in the subspace spanned by the first M eigenvectors. We now consider the modified sensing matrix Φwpc = Dw (U(1:N, 1:M))^T, where Dw = diag(w1, ..., wM). We are still in the subspace spanned by the first M eigenvectors, but the diagonal elements wi, i = 1, ..., M, now control the weighting given to each eigenvector.
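A small numpy sketch of this weighted-PC construction, with the weights chosen by the waterfilling rule derived in the text that follows, i.e., wi² = (1/ζ − Pn/λi)^+ under the energy constraint (all names, sizes, and the bisection solver are illustrative assumptions, not the dissertation's code):

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 64, 8           # scene and measurement dimensions (illustrative)
E, Pn = float(M), 0.1  # total energy budget and noise power (illustrative)

# Stand-in eigen-structure with a decaying eigenspectrum.
U, _ = np.linalg.qr(rng.standard_normal((N, N)))
lam = np.sort(rng.exponential(1.0, N))[::-1]

def waterfill(lam, Pn, E):
    """Solve w_i^2 = (c - Pn/lam_i)^+ with sum w_i^2 = E by bisection on c = 1/zeta."""
    lo, hi = 1e-12, 1e12
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        w2 = np.maximum(mid - Pn / lam, 0.0)
        lo, hi = (mid, hi) if w2.sum() < E else (lo, mid)
    return np.maximum(0.5 * (lo + hi) - Pn / lam, 0.0)

w2 = waterfill(lam[:M], Pn, E)
Phi_wpc = np.diag(np.sqrt(w2)) @ U[:, :M].T  # weighted-PC sensing matrix
```

Eigenvectors with large λi/Pn receive most of the energy; weak eigen-directions may receive none.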
We first maximize I(x; y) for an unknown but fixed Φwpc, and then use the result to compute the weights wi, i = 1, ..., M. Maximizing the mutual information over the input distribution of x, we get (see Appendix C)

I(x; y) = Σ_{i=1}^{M} log((wi² λi + Pn)/Pn), (5.18)

where the maximizing input distribution is multi-variate Gaussian and the logarithm base is taken to be 2. We know that a real-world object scene is not normally distributed, but we nevertheless show in Chapter 6 that we still get a marked improvement over PC; in the ideal scenario of a normally distributed scene this solution is optimal.

To find the optimal weights wi, i = 1, ..., M, we differentiate (5.18) with respect to wi² under the constraint Σ_{i=1}^{M} wi² = E, where E is the total energy in the object scene. Using Lagrange multipliers, the objective function can be written as

J(w1², ..., wM²) = Σ_{i=1}^{M} log((wi² λi + Pn)/Pn) − ζ(Σ_{i=1}^{M} wi² − E)
               = Σ_{i=1}^{M} [log((wi² λi + Pn)/Pn) − ζ(wi² − E/M)]. (5.19)

From (5.19) we see that each summand can be differentiated individually, giving

wi² = 1/ζ − Pn/λi. (5.20)

Since wi² cannot be negative, we rewrite (5.20) as

wi² = (1/ζ − Pn/λi)^+, (5.21)

and we choose the value of ζ such that Σ_{i=1}^{M} (1/ζ − Pn/λi)^+ = E. From (5.21) we see that the weight assigned to each eigenvector is a function of λi alone (for fixed Pn). Equation (5.21) says: put the available energy where λi/Pn is large. This is the waterfilling solution [75]. We perform waterfilling for both PC and DPC, resulting in the sensing matrices WPC and WDPC, respectively.

Optimal solution: Since it is not mathematically tractable to find an optimal sensing matrix in the presence of noise, we also search numerically for the optimal solution. We perform this search using stochastic tunneling [76].
For multi-dimensional surfaces with multiple minima (in our case, the mean-squared-error surface as a function of the sensing matrix) it is easy to get trapped in a local minimum without reaching the global minimum. Stochastic tunneling (ST) is a numerical technique that overcomes this disadvantage by applying a non-linear transformation to the error surface.

Stochastic tunneling is a generalization of simulated annealing (SA), a Monte Carlo technique for finding the global minimum of a multi-dimensional potential energy surface. In SA the search for the global minimum is performed by simulating the dynamics of a particle rolling on the potential energy surface. The energy of the particle is controlled by the temperature and cooling-rate parameters. By developing an optimal cooling schedule (decreasing the temperature in controlled steps), the goal is for the rolling particle to reach the global minimum without getting trapped in any of the local minima. Choosing a good cooling schedule, however, is difficult, especially for energy surfaces with rough terrain; consequently, SA is known to suffer from the "freezing" problem, i.e., the reduced probability of the particle escaping a local minimum as the temperature decreases. If, on the other hand, the particle is given enough energy to avoid getting trapped in any local minimum, the probability of the particle overshooting the global minimum increases. In fact, at increased temperature the particle becomes equally likely to stop in any minimum, local or global, and its ability to resolve different energy levels diminishes. In [76], the authors developed ST to avoid these extremes and find the global minimum of a multi-dimensional cost function.

Let us label our cost function J(Φ2); unlike (5.4), its dependence on the sensing matrix Φ2 is now written explicitly.
The non-linear transform of J(Φ2) is then given by

J_ST(Φ2) = 1 − e^{−γ(J(Φ2) − J0)}, (5.22)

where J0 is the lowest cost-function value yet encountered by the particle, and γ > 0 is the parameter that controls the rate at which cost-function values greater than J0 are suppressed. To understand (5.22) better, consider Fig. 5.2. This figure, in the same vein as the example given in [76], illustrates the effect of the non-linear transformation on a simplified 1-D cost function. Figure 5.2(a) depicts a cost function with multiple local minima and a global minimum; the energy barrier between two adjacent local minima is higher than the energy differential between the two minima. In ST, instead of jumping over the barrier, the particle "tunnels" through it. This tunneling is achieved by the non-linear transformation: if the lowest cost-function value encountered by the particle is J0, the transformation collapses all values greater than J0 into the interval (0, 1) while preserving the locations of the minima. This is illustrated in Fig. 5.2(b) for J0 = 0. The particle thus tunnels toward the global minimum, and the optimal sensing matrix, as J0 is adjusted at each iteration and high energy barriers are removed.

Figure 5.2: Given the current position of the particle (the current cost-function value J0), the figure shows the effect of the non-linear mapping. All energy levels higher than the current particle position are mapped to the interval (0, 1), removing irrelevant high-energy surface features, while the locations of the lower-energy minima are preserved.

5.1.7 Multi-step DDIE and Lth Frame Generalized Difference Image Estimation

As depicted in Fig. 5.3(a), we assume knowledge of the scene at the first time instant and from then on take measurements of the scene.
Our model allows a different sensing matrix Φ at every successive time instant. For simplicity, however, we assume Φk = Φ for k > 1. Consequently, the sequence of measurements is {x1, Φx2, Φx3, ..., ΦxL, ...} = {y1, y2, y3, ..., yL, ...}.

Until now we have used the DDIE method to estimate the difference image between the object scene at the first two time instants. We now extend the DDIE method to estimate the sequence of difference images {∆x̂1, ∆x̂2, ..., ∆x̂L−1, ...} from the sequence of measurements {y1, y2, y3, ..., yL, ...}. We call this strategy multi-step DDIE. We also present a different approach that estimates the difference image sequence by jointly using measurements taken over multiple time instants.

Figure 5.3: Multi-step DDIE: (a) Perfect knowledge of the scene is assumed at time instant t1 and measurements are made from t2 on. The difference image is estimated by propagating forward the object-scene knowledge; this propagation is indicated by curved arrows from left to right. (b) At each pair of consecutive time instants tk and tk+1, (5.23) is implemented to estimate the difference image ∆xk and propagate the scene knowledge.

The DDIE method assumes knowledge of scene #1 and takes measurements of scene #2 to estimate the difference image. Therefore, in multi-step DDIE, given measurements of the object scene at tk and tk+1, we need knowledge of the scene at tk. Multi-step DDIE acquires this knowledge by propagating forward the knowledge of the scene at t1 (see Fig. 5.3(a)). The forward propagation is done using the recursive equation

∆x̂k = recon(x̂k, yk+1) = recon((x̂k−1 + ∆x̂k−1), yk+1), k > 1, (5.23)

where recon refers to the DDIE method discussed in Section 5.1.5. For k = 1, we replace x̂1 with x1 because we assume perfect knowledge of the scene at t1. Equation (5.23) takes the estimate of the scene at tk and the measurements at tk+1 and estimates the difference image ∆x̂k.
It then propagates forward the knowledge of the scene at tk by computing the estimate x̂k+1 = x̂k + ∆x̂k at tk+1. This is illustrated in Fig. 5.3(b).

We refer to our second approach as Lth frame generalized difference image estimation (LFGDIE). Given {y1, y2, y3, ..., yL}, LFGDIE directly estimates the generalized difference image ∆x1L between the object scene at t1 and tL. To obtain the LFGDIE estimation operators we first define the LFGDIE data model,

y1 = I x1 + n1, (5.24)
y2 = Φ x2 + n2, (5.25)
···
yL = Φ xL + nL, (5.26)

where I is the identity sensing matrix symbolizing complete knowledge of the initial scene. Note that this model is an extension of (5.1) and (5.2) to multiple measurements. The estimated difference image is then given by

∆x̂1L = Σ_{i=1}^{L} W′i yi, (5.27)

where W′i, i = 1, ..., L, are the estimation operators. Re-writing (5.27) in matrix form, we have

∆x̂1L = [W′1 W′p W′L][y1^T yp^T yL^T]^T, (5.28)

where W′p = [W′2 W′3 ··· W′L−1] and yp = [y2^T y3^T ··· yL−1^T]^T. Let us also define xp = [x2^T x3^T ··· xL−1^T]^T, R11 = E[x1 x1^T], RL1 = E[xL x1^T] = R1L^T, Rp1 = E[xp x1^T] = R1p^T, RpL = E[xp xL^T] = RLp^T, ΦL = Φ and Φp = Ip ⊗ Φ, where Ip is a p × p identity matrix.
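The stacked quantities just defined can be assembled directly; a small numpy sketch (the dimensions and variable names are illustrative assumptions, with p = L − 2 interior frames):

```python
import numpy as np

rng = np.random.default_rng(2)
N, M, L = 16, 3, 5  # block size, measurements per frame, frames (illustrative)
p = L - 2           # number of interior frames x_2 ... x_{L-1}

Phi = rng.standard_normal((M, N))  # shared per-frame sensing matrix
Phi_L = Phi
Phi_p = np.kron(np.eye(p), Phi)    # block-diagonal sensing of the stacked x_p

# Stacked interior scene x_p = [x_2^T ... x_{L-1}^T]^T and its measurements y_p.
xs = [rng.standard_normal(N) for _ in range(L)]
x_p = np.concatenate(xs[1:L - 1])
y_p = Phi_p @ x_p
```

Because Φp = Ip ⊗ Φ is block diagonal, measuring the stacked vector is equivalent to measuring each interior frame separately with Φ.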
Minimizing the BMSE between ∆x1L and ∆x̂1L by differentiating it with respect to W′1, W′p and W′L, and equating the derivatives to zero, the reconstruction operators turn out to be (see Appendix B)

W′1 = (RL1 − R11 − W′p Φp Rp1 − W′L ΦL RL1)(R11 + Rn1)^−1, (5.29)
W′p = (Rαp + RαL ΦL^T (ΦL RLγL ΦL^T + RnL)^−1 ΦL Rβp) Φp^T (Φp RΦL Φp^T + Rnp)^−1, (5.30)
W′L = (RαL + W′p Φp RβL) ΦL^T (ΦL RLγL ΦL^T + RnL)^−1, (5.31)

where

RαL = RLL − R1L − (RL1 − R11)(R11 + Rn1)^−1 R1L, (5.32)
Rαp = RLp − R1p − (RL1 − R11)(R11 + Rn1)^−1 R1p, (5.33)
RβL = Rp1 (R11 + Rn1)^−1 R1L − RpL, (5.34)
Rβp = RL1 (R11 + Rn1)^−1 R1p − RLp, (5.35)
RγL = RL1 (R11 + Rn1)^−1 R1L, (5.36)
Rγp = Rp1 (R11 + Rn1)^−1 R1p, (5.37)
RLγL = RLL − RγL, (5.38)
RΦL = Rpp − Rγp − RβL ΦL^T (ΦL RLγL ΦL^T + RnL)^−1 ΦL Rβp. (5.39)

It is interesting to note that when L = 2, that is p = 0, we have RLγL = Rα and RαL = Rα − Rβ, W′p disappears, W′1 = W1 and W′L = W2; the LFGDIE method thus reduces to the multi-step DDIE method. It is when L > 2 that we see the benefit of employing the LFGDIE method. To see this, let x1 and xL be the object scenes at time instants t1 and tL. The generalized difference image is then given by

∆x1L = xL − x1. (5.40)

Re-writing (5.40), we get

∆x1L = (xL − xL−1) + (xL−1 − xL−2) + ··· + (x3 − x2) + (x2 − x1) = Σ_{i=1}^{L−1} ∆xi, (5.41)

where the right-hand side is a pairwise sum of difference images of the scene at successive time instants. LFGDIE estimates the generalized difference image as a joint estimate of this sum, ∆x̂1L. Estimation of the LFGDIE operator therefore requires joint estimation of all the successive pairwise difference images ∆xi; this joint estimation exploits the spatial and temporal cross-correlations between the scene at all L time steps, as is manifest in equations (5.29) through (5.39). The multi-step DDIE method, on the other hand, estimates Σ_{i=1}^{L−1} ∆x̂i, which exploits only the pairwise cross-correlation between the scene at two successive time steps.
This ability of the LFGDIE method to perform joint estimation leads to its superior performance over multi-step DDIE, as we show in Chapter 6.

5.2 Non-linear Reconstruction

The advantage of linear ℓ2-based estimation lies in its closed-form estimation operators, which minimize the mean-squared error over an entire ensemble of object scenes. Difference images, however, are sparse, and ℓ2-based difference image estimation does not exploit this characteristic. We therefore extend our study to ℓ1-based estimation of the difference images. We are motivated by a few reasons, each of which views the problem from a different perspective. Firstly, as mentioned above, difference images are sparsely represented in pixel space (a finite-dimensional Euclidean space), and exploiting this sparsity for difference image estimation is a natural extension of the image restoration problem. Secondly, modelling optical images as functions of bounded variation (BV) has been used successfully in image denoising and restoration; the ℓ1-based total variation (TV) measure [77] used in this context has been shown to accurately estimate edge features, which are important components of difference images. Thirdly, signal decomposition using overcomplete dictionaries gives sparse signal representations with respect to the atoms of these dictionaries, and basis pursuit (BP) has been shown to give an optimal (in the ℓ1 sense) solution to this decomposition problem [78]. These three approaches fit well into ℓ1-based estimation of the difference image.

We treat ℓ1-based difference image estimation as a linear inverse problem. The linearity comes from the forward data model being defined through a linear transform D, applied to the input s in the presence of noise:

y = Ds + n. (5.42)

The goal of the linear inverse problem is to estimate s given the noisy measurements y.
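Linear inverse problems of this form are typically attacked with an ℓ1-regularized least-squares objective, as in the formulations that follow. As a minimal illustration (not the solver used in this work, and with toy data and names that are purely illustrative), the classic iterative soft-thresholding algorithm (ISTA) for min_s ||y − Ds||² + ξ||s||1 can be sketched as:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t*||.||_1: shrink each entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(D, y, xi, n_iter=500):
    """Iterative soft thresholding for min_s ||y - D s||_2^2 + xi*||s||_1."""
    tau = 0.5 / np.linalg.norm(D, 2) ** 2  # step size below 1/L, L = 2*||D||_2^2
    s = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * D.T @ (D @ s - y)     # gradient of the fitness term
        s = soft_threshold(s - tau * grad, tau * xi)
    return s

# Toy problem: a sparse s recovered from noisy compressive measurements.
rng = np.random.default_rng(3)
D = rng.standard_normal((40, 100)) / np.sqrt(40)
s_true = np.zeros(100)
s_true[[5, 37, 80]] = [2.0, -1.5, 3.0]
y = D @ s_true + 0.01 * rng.standard_normal(40)
s_hat = ista(D, y, xi=0.05)
```

The soft-thresholding step is what enforces sparsity; the quadratic term alone would reduce to an ℓ2 solution.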
The model in (5.42) is typical of ℓ1-based reconstruction problems. The model we have in (5.1) and (5.2), however, is not of this form, so we re-write (5.1) and (5.2) as

y = ΦD x + n, (5.43)

where

ΦD = [ I  0
       0  Φ ],  y = [y1^T y2^T]^T, x = [x1^T x2^T]^T, n = [n1^T n2^T]^T. (5.44)

Incorporating a sparse representation of x with respect to a sparsifying dictionary Ψ in (5.43), we have

y = ΦD Ψ s + n, (5.45)

where s is a sparse representation of x, i.e., x = Ψs. (Comparing (5.42) with (5.45), we have D = ΦD Ψ.) The solution ŝ, and therefore x̂, to the linear inverse problem is then given by solving the optimization problem

arg min_s ||y − ΦD Ψ s||²_ℓ2 + ξ R(s). (5.46)

Here the ℓ2 term is the fitness term, controlling how well the solution fits the measured data, while the regularizer R(s) controls how well the solution meets the desired constraint. We define R(s) to be an ℓ1 convex regularizer, its form decided by the three points of view we are considering. The weighting factor ξ is the regularization parameter.

From the first point of view, difference images are sparse in the pixel basis. The pixel basis can be thought of as the standard basis of a finite-dimensional Euclidean space whose dimension is that of the object scene. Consequently, Ψ is the identity matrix and s = x. To maximize sparsity, however, we define the regularizer R to be a function of ∆x1 instead of s:

R(∆x1) = ||x2 − x1||_ℓ1 = ||∆x1||_ℓ1. (5.47)

This ℓ1 regularizer enforces the sparsity constraint on the difference image by favoring values closer to zero. Note that the regularizer R(s) = ||s||_ℓ1 does not maximize the sparsity of the difference image; it merely minimizes the ℓ1-norm of s and is consequently not optimal.

For the TV restoration problem we again take Ψ to be an identity matrix, because in this formulation the function space of bounded variation is defined on a discrete finite support.
The regularizer R for the TV problem is usually defined as either

R_iso(x) = Σ_i sqrt((∆h_i x)² + (∆v_i x)²), or (5.48)
R_niso(x) = Σ_i |∆h_i x| + |∆v_i x|, (5.49)

where R_iso(x) and R_niso(x) are the isotropic and non-isotropic discrete TV regularizers, respectively, and ∆h_i and ∆v_i are the first-order horizontal and vertical difference operators. Instead of imposing the TV condition on s (= x), however, as in (5.47) we impose it on the difference image ∆x1 = x2 − x1. The regularizer is now defined as either

R_iso(∆x1) = Σ_i sqrt((∆h_i ∆x1)² + (∆v_i ∆x1)²), or (5.50)
R_niso(∆x1) = Σ_i |∆h_i ∆x1| + |∆v_i ∆x1|. (5.51)

It is easy to see that if the object scenes x1 and x2 have bounded total variation, then x2 − x1 also has bounded variation, so this modified form does not violate any TV condition.

For the sake of completeness we also consider difference image estimation using an overcomplete sparsifying dictionary Ψ. Specifically, we take the dictionary to be the symmetric biorthogonal wavelet transform (the Cohen-Daubechies-Feauveau 9/7 wavelet transform [79]) and the regularizer to be R(s) = ||s||_ℓ1. This yields the familiar basis pursuit denoising (BPDN) model, which decomposes the signal as a superposition of the atoms of Ψ such that the ℓ1 norm of s is the smallest of all possible decompositions over the dictionary. Our setup differs slightly from traditional BPDN in that we include the sensing matrix ΦD in the model; in traditional BPDN the sensing matrix is an identity matrix. This modification does not affect the fundamental problem: classical BPDN finds the regularized, denoised sparse representation of the object scene from a noisy version of the scene, whereas our modified BPDN finds it from measurements of the scene made with the sensing matrix ΦD.
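A minimal numpy sketch of the discrete TV regularizers (5.50) and (5.51), applied to a difference image, using first-order forward differences for ∆h and ∆v (an illustrative implementation, not the dissertation's code):

```python
import numpy as np

def tv(delta_x, isotropic=True):
    """Discrete total variation of a 2-D difference image.

    Forward differences play the role of the horizontal (Delta_h) and
    vertical (Delta_v) difference operators; both are trimmed to a
    common support so they can be combined pixel-by-pixel.
    """
    dh = np.diff(delta_x, axis=1)[:-1, :]  # horizontal differences
    dv = np.diff(delta_x, axis=0)[:, :-1]  # vertical differences
    if isotropic:
        return np.sum(np.sqrt(dh ** 2 + dv ** 2))   # isotropic, Eq. (5.50)
    return np.sum(np.abs(dh)) + np.sum(np.abs(dv))  # non-isotropic, Eq. (5.51)

# A sparse difference image: a single bright moving-target blob.
d = np.zeros((8, 8))
d[3:5, 3:5] = 1.0
```

Note that the isotropic value never exceeds the non-isotropic one, since sqrt(a² + b²) ≤ |a| + |b| pixel-wise.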
Notice that by applying (5.45) we use s to estimate x = [x1^T x2^T]^T and not ∆x1. Because of this system constraint, there is the additional step of computing ∆x1 from the estimated x. Estimating s, however, lets us take advantage of the correlation between the object scene at the two time instants: by solving (5.46) with R(s) = ||s||_ℓ1, we compute a joint estimate of x1 and x2 in the form of s. Note that although x̂ is separable into x̂1 and x̂2, s is not. Similarly, in (5.43) we jointly exploit the scene at the two time instants by using the regularization terms (5.47), (5.50) and (5.51), which are functions of the difference image: regularizer (5.47) sparsifies the difference image, while regularizers (5.50) and (5.51) minimize its total variation. In fact, by acting directly on the difference image, (5.43) exploits the correlation between the scene at the two time instants more strongly than (5.45).

Extension to estimating sequences of difference images follows directly from multi-step DDIE, specifically (5.23); we use the ℓ1-based techniques in the multi-step setting. In Chapter 6 we present the performance of these three approaches.

Learning sensing and sparsifying matrices

Given the above approaches to ℓ1-based difference image estimation, a natural question is whether the optimal Φ and Ψ can be learned from available training data. Carvajalino and Sapiro [37] proposed a very interesting scheme to simultaneously learn ΦD and Ψ from training data. Their sparsifying dictionary is assumed to be overcomplete, and this assumption makes the scheme difficult to apply in our context. Consider the forward model for computing the difference image from the object scene at two consecutive time instants t1 and t2:

∆x1 = x2 − x1 = [−I I][x1^T x2^T]^T = [−I I]x. (5.52)

Here [−I I] is an N × 2N matrix, with both x1 and x2 being N × 1 vectors.
We know that ∆x1 is sparse. Ideally, therefore, we would like to find Ψ such that, when x = Ψθ, then θ = ∆x̂1. This, however, is not possible with the algorithm proposed by Carvajalino et al., because Ψ would be a 2N × N matrix, which is not an overcomplete dictionary. Our model has the unique characteristic that the forward representation is overcomplete while the one in the other direction is not. This is unlike most compressive sensing signal models, where overcompleteness of the dictionary in this other direction is exploited to achieve sparsity. We could of course compute the pseudo-inverse of [−I I], but that is an ℓ2 solution. We therefore let ΦD be as defined in (5.44).

CHAPTER 6
FEATURE-SPECIFIC DIFFERENCE IMAGING RESULTS

We now present our results for the ℓ1- and ℓ2-based difference image estimation methods. We evaluate performance using measured video imagery of an urban intersection (the object scene; see Fig. 1.3) as the input to a simulation that models compressive measurements. We use a Panasonic PV-GS500 video camcorder to image the object scene. The reason we use conventionally imaged data as the truth data and simulate the compressive optical imaging system is to gain the flexibility to accurately implement different sensing matrices Φ. This flexibility is required to analyze the performance of our proposed sensing matrices in estimating sequences of difference images under both the ℓ1 and ℓ2 norms. The trade-off is that we must do computationally what a compressive imaging system would do optically. Consequently, instead of considering the entire 480 × 720 object scene, we reduce the problem by processing the scene in 8 × 8, 16 × 16 and 32 × 32 blocks; the blocks are stitched together to reconstruct the difference image. To compute the spatial and temporal correlations, we use a training set comprising 6000 frames of the object scene.
From each 480 × 720 frame, we choose at random 30 blocks of size 8 × 8, 16 × 16 and 32 × 32, giving 180,000 samples for computing the spatial auto-correlation matrix. To obtain the spatio-temporal cross-correlation matrix between object scenes at consecutive time instants, we select pairs of successive frames and draw at random 30 pairs of 8 × 8, 16 × 16 and 32 × 32 blocks from each frame pair, again giving approximately 180,000 sample pairs. A pair of blocks consists of two blocks drawn from the same region of the two consecutive image frames. Similarly, for the LFGDIE method we extend the spatio-temporal correlations over longer time spans by considering multiple consecutive frames instead of two. Once computed, these correlation matrices are stored for use in the testing stage.

The testing data comprise 6000 frames, sub-divided into 100 groups, each comprising the object scene at 60 consecutive time instants. The testing set includes diverse cases of multiple targets moving at different speeds and in different directions. All performance plots (RMSE vs. SNR and RMSE vs. number of measurements per block, M) in the following analysis have been averaged over the 60 time steps and the 100 groups.

As discussed in Chapter 5, there is no mathematically tractable optimal (ℓ2) sensing matrix. We therefore consider the candidate choices for Φ discussed in Chapter 5. To remind the reader, they are: principal components (PC), difference principal components (DPC), waterfilled principal components (WPC), waterfilled difference principal components (WDPC), the numerically computed optimal sensing matrix (Optimal) and the Gaussian random sensing matrix (GRP).

Figure 6.1: Estimated difference image for SNR = 10 dB and M = 5 using (a) WDPC, (b) Optimal, and (c) Truth difference image.
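The training-stage block sampling described above can be sketched as follows; the frame data, reduced frame size and variable names are illustrative stand-ins (the actual training set has 6000 full 480 × 720 frames):

```python
import numpy as np

rng = np.random.default_rng(4)
H, W, B = 120, 160, 8            # reduced frame size and block size (illustrative)
frames = rng.random((20, H, W))  # stand-in for the training video frames

def random_block_pairs(frames, B, per_frame=30):
    """Draw co-located block pairs from successive frames as column vectors."""
    pairs = []
    for f1, f2 in zip(frames[:-1], frames[1:]):
        for _ in range(per_frame):
            r = rng.integers(0, H - B + 1)
            c = rng.integers(0, W - B + 1)
            pairs.append((f1[r:r+B, c:c+B].ravel(), f2[r:r+B, c:c+B].ravel()))
    return pairs

pairs = random_block_pairs(frames, B)
X1 = np.stack([p[0] for p in pairs])  # block samples of the scene at t_k
X2 = np.stack([p[1] for p in pairs])  # co-located samples at t_{k+1}

R11 = X1.T @ X1 / len(pairs)          # spatial auto-correlation estimate
R12 = X1.T @ X2 / len(pairs)          # spatio-temporal cross-correlation estimate
```

Drawing each pair from the same spatial region of two consecutive frames is what makes R12 a spatio-temporal cross-correlation rather than a purely spatial one.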
The GRP sensing matrix represents a set of fixed (asymptotically orthogonal) basis projections that use no a priori information about the scene; its entries are Gaussian distributed with a mean of zero and a variance of one. We also consider the identity matrix (Conventional) to mimic a conventional imager, which we use for baseline performance comparison. The conventional imager always images the whole scene, that is, it always makes 480 × 720 = 345600 measurements per frame.

6.1 ℓ2-based Difference Image Estimation

Figure 6.1 gives an example of a difference image, from a sequence of difference images, estimated using multi-step DDIE. The block size is 8 × 8, the SNR is 10 dB and the number of measurements per block (M) is 5. For the N = 480 × 720 = 345600-dimensional object scene, M = 5 translates to 27000 measurements and a compression of the measured data by 92.2%. The illustrated example has been computed with the WDPC and optimal sensing matrices. The performance of the optimal Φ is visually close to the truth difference image; more importantly, WDPC also estimates the difference image well. Note that it is much easier and computationally more efficient to compute WDPC than to numerically find the optimal sensing matrix.

We quantify these results by plotting the root mean squared error (RMSE) as a function of SNR. We define the RMSE as

RMSE = sqrt( Σ_i (∆xi − ∆x̂i)² / Σ_i (∆xi)² ), (6.1)

where the normalization is by the truth difference image ∆x.

Figure 6.2 plots the multi-step DDIE method's RMSE performance as a function of SNR for M = 1 and M = 5, for 8 × 8 block size.

Figure 6.2: RMSE vs. SNR plots for 8 × 8 block size: (a) M = 1, and (b) M = 5. M is the number of measurements per block.

We see an improvement in quantitative performance with more measurements, with the RMSE decreasing significantly for M = 5. Figure 6.3, however, shows that this does not hold for increasing measurements in general: with increasing M the performance first improves and then can degrade. This behaviour is a direct result of enforcing the photon-count constraint. For small M, every additional measurement adds information; but because the total number of photons is fixed, as the number of measurements increases, the additional information per measurement goes down. Eventually the additive noise overwhelms the incremental information, and we see a degradation in performance for larger M. This is true for the PC, DPC and WPC sensing matrices. Both the WDPC and optimal sensing matrices, however, avoid this degradation because they are optimized for a given SNR: measurements are used only while they improve performance, and once the information per measurement begins to be drowned out by noise, additional measurements are ignored.

Figure 6.3: RMSE vs. number of measurements M for SNR = 20 dB for 8 × 8 block size.

There appears to be a discrepancy between the example estimates in Fig. 6.1 and the plots in Fig. 6.2(b): even though the estimated difference image looks good visually, the plots for M = 5 show a relatively high RMSE. This discrepancy arises because ℓ2-minimization minimizes the mean-squared error over an ensemble and does not explicitly enforce sparsity. Consequently, small deviations from the true pixel values are spread out over the whole estimated difference image.
These small deviations, when quantified against the sparse truth difference image, bias the RMSE upward.

From Fig. 6.2 we can make a few observations about the efficacy of the various sensing matrices. As expected, waterfilling improves the performance of both PC and DPC by weighting the projections according to the noise statistics; this is most evident in Fig. 6.2(b), where we take five measurements per block. Numerically searching for the optimal sensing matrix further improves upon WPC and WDPC. The advantage the waterfilled solutions have over a numerical search, however, is that they are much simpler to compute as the SNR changes: the numerical search involves stochastic tunneling over an M × N multivariate surface, whereas the waterfilling solution scales linearly with M. Searching for the optimal solution is therefore reasonable only if the improvement in performance outweighs the increased computational cost, and Fig. 6.1 shows that this is not the case. For the sake of completeness we have also plotted the RMSE performance of Gaussian random projections. There is no theoretical basis for non-adaptive GRP to outperform sensing matrices that exploit a priori information, and the plots validate this experimentally: GRP performs the worst. Lastly, for ℓ2-based estimation the conventional imager outperforms all the sensing matrices. This is to be expected, as minimizing the mean-squared error alone gives compressive measurements no advantage over a conventional imager; an additional constraint is needed, which in our case is sparsity. We show in the next sub-section that we can actually beat the conventional imager when we enforce sparsity via non-linear estimation. We stress, however, that as seen in Fig. 6.1, the qualitative performance of WDPC is visually close to the truth difference image.
In fact, from a practical perspective, it can be used to provide a good input to a tracker.

Multi-step DDIE, defined in (5.23), forms a closed loop between the estimate of the scene and the difference image. As a result, there will be some degradation in performance over time. To grade the performance of multi-step DDIE, we consider a clairvoyant scenario for estimating the sequence of difference images, which we refer to as single-step DDIE. Assuming we are estimating the difference image of the scene between tk and tk+1, single-step DDIE always assumes perfect knowledge of the scene at tk,

∆x̂k = recon(xk, yk+1). (6.2)

Obviously, single-step DDIE is practically infeasible. However, it bounds the performance of multi-step DDIE and as such allows us to evaluate the efficacy of the multi-step strategy. Figure 6.4(a) shows the performance comparison between the two over 60 time steps for M = 5 and SNR = 20 dB. Since single-step DDIE assumes perfect knowledge at every stage, its RMSE as a function of time is nearly constant. The RMSE of multi-step DDIE is the same as that of single-step DDIE at t1. With passing time, however, the performance of multi-step DDIE degrades. But as can be seen, the rate of degradation is slow, showing that multi-step DDIE is temporally robust. In Fig. 6.4(b) we plot the maximum divergence of multi-step DDIE from the ideal single-step DDIE as a function of SNR. The maximum divergence is the maximum amount by which the multi-step method diverges from the single-step method over the 60 time steps. The resulting plot is a line with a small slope, indicating that multi-step DDIE does not diverge significantly as the SNR changes.
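The closed-loop behaviour described above can be illustrated with a small simulation. The sketch below is a toy model, not the estimators of Chapter 5: `recon` is a hypothetical regularized least-squares stand-in for the recon(·) operator of (5.23)/(6.2), and the scene, sparse changes, and noise are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

def recon(x_ref, y_next, Phi, eps=1e-3):
    # Toy stand-in for recon(.): estimate the difference image from a
    # reference frame and the new compressive measurement y_{k+1}.
    dy = y_next - Phi @ x_ref
    return Phi.T @ np.linalg.solve(Phi @ Phi.T + eps * np.eye(Phi.shape[0]), dy)

N, M, T = 64, 5, 60                       # block size, measurements, time steps
Phi = rng.standard_normal((M, N)) / np.sqrt(M)
x = rng.standard_normal(N)
x_hat = x.copy()                          # multi-step starts from perfect knowledge at t1
rmse_single, rmse_multi = [], []
for _ in range(T):
    dx = np.zeros(N); dx[rng.integers(N)] = 1.0   # sparse scene change
    x_next = x + dx
    y = Phi @ x_next + 0.01 * rng.standard_normal(M)
    dx_single = recon(x, y, Phi)          # clairvoyant: true x_k available every step
    dx_multi = recon(x_hat, y, Phi)       # closed loop: uses its own running estimate
    x_hat = x_hat + dx_multi              # feed the estimate back (source of slow drift)
    x = x_next
    rmse_single.append(np.linalg.norm(dx_single - dx) / np.sqrt(N))
    rmse_multi.append(np.linalg.norm(dx_multi - dx) / np.sqrt(N))
```

At t1 the two curves coincide because the multi-step loop is initialized with the true scene; afterward the fed-back estimate slowly accumulates error, mirroring the behaviour in Fig. 6.4(a).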
Figure 6.4: (a) Performance comparison between the single-step and multi-step DDIE methods for M = 5 and SNR = 20 dB using WDPC; RMSE is compared over time. (b) Maximum divergence between the single-step and multi-step DDIE methods for varying SNR.

We now look at the performance of the LFGDIE method for estimating a sequence of difference images. We claimed that LFGDIE would perform better than multi-step DDIE because it is able to exploit the temporal correlation between all time instants. Figure 6.5 shows that this is indeed the case: the RMSE performance has improved compared to Fig. 6.2(b). As expected, however, the trends remain the same. WDPC still outperforms all other candidate sensing matrices with the exception of the numerically searched optimal sensing matrix.

Figure 6.5: RMSE vs. SNR plots for the LFGDIE method. Block size is 8 × 8 and M = 5.

Until now we have considered a block size of 8 × 8. In Fig. 6.6 we graph the RMSE performance as a function of SNR for a 16 × 16 block size. We see that for M = 1 there is an improvement in performance over the 8 × 8 block size. In fact, the performance for a 16 × 16 block with M = 1 is similar to that of an 8 × 8 block with M = 5. Notice that M = 1 for a 16 × 16 block size implies 1350 measurements for the whole scene, which translates to less than 0.4% of the measurements made by the conventional imager. Thus, for the larger block size we obtain improved performance at a simultaneously higher compression ratio.
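The compression figures quoted above follow from simple bookkeeping; a minimal sketch, taking the scene dimensions from the 720 × 480 sequence used elsewhere in this chapter:

```python
# Compression bookkeeping for the block-based FS imager (values from the text).
scene_pixels = 720 * 480                 # full scene; a conventional imager measures every pixel
block = 16 * 16                          # 16 x 16 block size
M = 1                                    # compressive measurements per block
n_blocks = scene_pixels // block         # 345600 / 256 = 1350 blocks
total_measurements = n_blocks * M        # 1350 compressive measurements for the whole scene
ratio = total_measurements / scene_pixels
print(n_blocks, round(100 * ratio, 2))   # prints: 1350 0.39
```

So M = 1 with 16 × 16 blocks measures roughly 0.39% of what the conventional imager does, matching the "less than 0.4%" figure.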
Figure 6.7, however, shows that the improvement obtained by going from block size 8 × 8 to 16 × 16 is reduced when we go from block size 16 × 16 to 32 × 32. This happens because any stationarity that holds for small block sizes of 8 × 8 begins to break down for larger block sizes. As a result, the spatial structure represented by the sample auto- and cross-correlation matrices obtained from the training data no longer completely represents the true correlation. The LFGDIE method exhibits the same trends as multi-step DDIE. ℓ1-based estimation, on the other hand, does not depend on a stationarity assumption; its goal is to find the best estimate based on the measured data that enforces the sparsity constraint. In the following sub-section we discuss the performance of the ℓ1-based estimation methods.

Figure 6.6: RMSE vs. SNR performance plots for block size 16 × 16, for M = 1.

Figure 6.7: Optimal-sensing-matrix-based performance comparison between the three block sizes 8 × 8, 16 × 16 and 32 × 32 for (a) M = 1, and (b) M = 5.

6.2 ℓ1-based Difference Image Estimation

The advantage of using the ℓ2-norm is that we get precise linear estimation operators, which allow for quick and easy computation of the estimate of the difference image. The disadvantage, however, is the inability to exploit the sparsity of a difference image. Solving the convex optimization problem of (5.46) allows us to overcome this disadvantage. Here we consider the PC, DPC, WPC, WDPC and GRP sensing matrices.
All abbreviations are the same as before. Gaussian sensing matrices (GRP) have been suggested in the theory of compressed sensing because they are incoherent with all representation bases. As a result they have become nearly universal in applications of compressed sensing, being able to reconstruct signals of interest without prior knowledge of the signal structure.

Figure 6.8 shows examples of the estimated difference image for an 8 × 8 block size, using the three ℓ1-based methods discussed in Section 5.2. Visually all three methods are effective, although the TV method performs better than the other two. The isotropic and non-isotropic TV regularizers have similar performance; all the results shown here are for the non-isotropic TV regularizer.

Figure 6.8: Estimated difference image for SNR = 10 dB and M = 5 for (a) the sparsity-enforced difference image method, (b) the TV method, and (c) the BPDN method.

In Fig. 6.9 we plot the RMSE performance of the TV method as a function of SNR. The block size is 8 × 8, and M = 1, 5, 10, 20. For M = 1, 5 all sensing matrices have similar performance. In fact, at low SNR all sensing matrices perform better than the conventional imager. Thus, unlike ℓ2-based estimation, ℓ1-based methods are better able to utilize the concentration of energy into a few measurements. Surprisingly, this is true even for the GRP sensing matrix. At low SNR there is a higher premium on the available energy, and as a consequence a small number of random measurements performs better than the conventional imager, where the small energy is spread over all N = 64 measurements. For M = 10, 20 the curves for the different sensing matrices begin to separate. Yet the performance of the PC and DPC sensing matrices remains similar. This is because ℓ1-based estimation does not directly estimate the difference image but instead jointly estimates the scene at the two time instants.
As a result, we cannot take advantage of the difference-image form that ℓ2-based estimation afforded us. We see that waterfilling improves upon both PC and DPC but, for the same reason, the performance of WPC and WDPC is also similar. The WPC and WDPC sensing matrices give the best RMSE performance.

In Fig. 6.10 we compare the performance of the three ℓ1-based methods by examining the RMSE vs. SNR curves for the WDPC sensing matrix. Among the three ℓ1 strategies, the TV method performs better than the sparsity-enforced difference image method and BPDN. The improvement, though, is small, especially compared to the sparsity-enforced difference image method. By minimizing the gradient of the difference image, the TV method is best able to capture the changes in intensity across the difference image. On the other hand, difference images are also sparse, and hence enforcing sparsity gives good results as well. The BPDN method has a higher RMSE than the other two strategies. This is to be expected because, although BPDN also minimizes the ℓ1-norm with respect to a sparsifying basis, it does not directly utilize the difference image as the regularization terms of the other two methods do. Instead, the BPDN method computes the joint sparse representation s of the object scene at the two time instants, which results in reduced performance. But, as corroborated in Fig. 6.8, the degradation in performance is small.
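To illustrate the kind of sparsity-enforcing estimation these comparisons rely on, the sketch below recovers a sparse difference-image block with a generic iterative soft-thresholding (ISTA) solver. It is a simplified stand-in for the solvers of Section 5.2, under the extra assumption (not from the text) that the same sensing matrix is used at both time instants, so the difference of the two measurement vectors is itself a compressive measurement of the sparse change.

```python
import numpy as np

def ista(Phi, y, tau=0.05, n_iter=300):
    # Iterative soft-thresholding for min_d 0.5*||y - Phi d||^2 + tau*||d||_1,
    # a generic l1 solver standing in for the methods of Section 5.2.
    L = np.linalg.norm(Phi, 2) ** 2            # Lipschitz constant of the gradient
    d = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        g = d + Phi.T @ (y - Phi @ d) / L      # gradient step
        d = np.sign(g) * np.maximum(np.abs(g) - tau / L, 0.0)  # shrinkage step
    return d

rng = np.random.default_rng(1)
N, M = 64, 20                                  # 8 x 8 block, M measurements
Phi = rng.standard_normal((M, N)) / np.sqrt(M)
dx = np.zeros(N); dx[3], dx[40] = 1.0, -0.5    # sparse difference-image block
y = Phi @ dx + 0.01 * rng.standard_normal(M)   # compressive measurement of the change
dx_hat = ista(Phi, y)
```

With only 20 of 64 measurements, the ℓ1 penalty recovers the two active pixels (with the usual small shrinkage bias toward zero), which is exactly the advantage over the ℓ2 operators discussed above.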
Figure 6.9: RMSE vs. SNR curves for 8 × 8 blocks, for the TV method: (a) M = 1, (b) M = 5, (c) M = 10, and (d) M = 20.

Plotting the RMSE as a function of M shows that we are able to beat the performance of the conventional imager. This is illustrated in Fig. 6.11. At low SNR and for fewer measurements there is a wide gap in performance between all the sensing matrices and the conventional imager. Note that the conventional imager always makes 720 × 480 = 345600 measurements. As the number of measurements increases, the performance of all the sensing matrices begins to degrade. The rate of this degradation is a function of SNR: it slows down as the SNR increases. When the SNR is high there is no advantage to be gained from better utilizing the energy, as there is enough energy for all measurements, and the conventional imager performs best. Finally, we note that sensing matrices using a priori information perform better than GRP. The RMSE vs. SNR performance of GRP shows that it has the most error at all SNRs and for any number of measurements; at the same time, the RMSE vs. M plots show that GRP performance degrades fastest among all the sensing matrices.

Increasing the block size leads to improved performance for a smaller number of measurements as a fraction of the total. Unlike the ℓ2-based methods, here we do not suffer from the stationarity assumption. In Fig.
6.12 we plot the performance of the TV method using the WDPC sensing matrix for 8 × 8 and 32 × 32 block sizes. We see that the rate of performance degradation is significantly reduced for the 32 × 32 block size.

Figure 6.10: WDPC sensing matrix RMSE vs. SNR curves for the three ℓ1-based difference image estimation methods. Block size 8 × 8 and M = 1.

The improved performance is mainly because the sparsity condition holds better for larger block sizes. If the block size is very small, then even a sparse image might not be sparse within a given block. With increasing block size, however, we achieve sparsity and as a result obtain better performance. This trend of improved performance with increasing block size augurs well for a future practical implementation of our proposed strategies. A practical system based on the FS imagers discussed in Section 2.2, with the capability to handle different sensing matrices, could optically measure the entire scene at very high speeds. An optical system would not face the limitations we encounter in simulating measurements of a large scene. Subdividing images into smaller blocks serves as an important tool for studying the performance and efficacy of different estimation methods, which has been the primary goal of this work.
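The WPC and WDPC matrices compared throughout this chapter weight the retained principal components by the waterfilling rule derived in Appendix C. A minimal sketch of that construction is given below; the function names and the bisection search on ζ are implementation choices, not from the text.

```python
import numpy as np

def waterfill_weights(eigvals, noise_var, E):
    # Solve w_i^2 = max(0, 1/zeta - noise_var/lam_i) subject to sum_i w_i^2 = E
    # (cf. (C.6)-(C.7) in Appendix C), via bisection on zeta: the total allocated
    # energy is monotonically decreasing in zeta.
    lam = np.asarray(eigvals, dtype=float)
    lo, hi = 1e-12, 1e12
    for _ in range(200):
        zeta = 0.5 * (lo + hi)
        w2 = np.maximum(0.0, 1.0 / zeta - noise_var / lam)
        if w2.sum() > E:
            lo = zeta            # too much energy allocated -> raise the water level
        else:
            hi = zeta
    return np.sqrt(w2)

def wpc_matrix(Rx, M, noise_var, E):
    # Weighted-principal-component sensing matrix Phi = D_w U_M^T:
    # top-M eigenvectors of Rx, rows scaled by the waterfilling weights.
    lam, U = np.linalg.eigh(Rx)
    idx = np.argsort(lam)[::-1][:M]          # top-M eigenpairs (eigh returns ascending)
    w = waterfill_weights(lam[idx], noise_var, E)
    return np.diag(w) @ U[:, idx].T

rng = np.random.default_rng(2)
A = rng.standard_normal((8, 8))
Rx = A @ A.T + 0.5 * np.eye(8)               # a sample correlation matrix (toy training data)
Phi_wpc = wpc_matrix(Rx, M=3, noise_var=0.1, E=np.trace(Rx))
```

Because the weights shrink toward zero for eigen-directions whose SNR is too low, measurements that would be "drowned out by noise" receive no energy, which is the mechanism behind WDPC's robustness to increasing M noted in Section 6.1.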
Figure 6.11: RMSE vs. number of measurements per block, for 8 × 8 blocks, for the TV method: (a) SNR = -20 dB, (b) SNR = -10 dB, (c) SNR = 0 dB, and (d) SNR = 40 dB.

Figure 6.12: Performance comparison for the ℓ1-based TV method using 8 × 8 and 32 × 32 blocks, at SNR = -20 dB.

CHAPTER 7

CONCLUSION AND FUTURE WORK

In this dissertation we presented novel advancements in feature-specific (FS) imaging for large field-of-view surveillance and for the estimation of temporal object-scene changes, utilizing the compressive imaging paradigm. We first presented a novel technique to continuously track targets in a large FOV without conventional image reconstruction. The method is based on optical multiplexing of encoded sub-FOVs to create superposition-space data that can be used to decode target positions in object space. We proposed a class of low-complexity multiplexed imagers to perform the optical encoding and showed that they can be light and cheap, with simple designs in comparison to wide-FOV conventional imagers. We discussed different encoding schemes based on spatial shifts, rotations, and magnification, with special emphasis on 1-D spatial-shift encoding.
We showed, based on both simulation and experimental data, that the proposed method does indeed localize targets in object space and provides continuous target-tracking capability. We also studied the trade-offs between area coverage efficiency, compression ratio, decoding time, and decoding error as functions of shift resolution and SNR. This study of trade-offs raised the important problem of finding optimal encodings, which needs further investigation. We briefly touched on this in Section 4.2, where we showed that the approximate linear relationship between decoding time and area coverage efficiency precludes an optimal overlap in the spatial-encoding scheme. However, it might be possible to develop a hybrid encoding scheme employing a combination of spatial, rotational and magnification encodings; we briefly discussed a possible approach in Section 4.2. There may also be other approaches worth investigating.

In the second part of the dissertation we applied the FS paradigm to perform difference imaging. We presented various FS sensing matrices for the noise-absent and noise-present cases. These sensing matrices were used to take compressive measurements of the object scene. In conjunction with this compressive measurement scheme we presented ℓ2- and ℓ1-based techniques for estimating a sequence of difference images from the sequence of compressive measurements. We also presented qualitative and quantitative results attesting that both techniques successfully estimate the difference images within the FS compressive imaging paradigm. Each technique has its advantage: ℓ2-based techniques give closed-form expressions for the linear estimation operators that are easy to compute, while ℓ1-based methods exploit the natural sparsity of the difference image. Within the ℓ2-based techniques we looked at the multi-step DDIE and LFGDIE methods to directly reconstruct the difference image from the compressive measurements.
The ℓ2-based techniques' use of spatio-temporal correlation matrices requires the assumption of wide-sense stationarity, which seldom holds in practice, with the possible exception of texture images. Despite this, there is considerable literature on using second-order statistics to perform various image processing and imaging tasks [80], [81], [82], [83], [4], and we too have empirically shown that our proposed techniques yield good performance. Methods have also been suggested for transforming non-stationary images to exhibit stationary characteristics [84], [85]. Within the compressive imaging paradigm, incorporating non-stationarity into the FS imager would require a sensing-matrix update model that evolves with the temporal object scene. Developing such a model, however, requires a method to associate the object scene with the measurements. Due to the structure that the sensing matrix possesses, such an association remains a subject of potential future research.

For the ℓ1-based estimation problem, we looked at three different approaches to the linear inverse problem and compared their performance. We found that the modified TV method performs best, although the method that enforces the sparsity condition performs only slightly worse. Lastly, we found that the WDPC sensing matrix had the lowest RMSE for both ℓ2- and ℓ1-based methods, although for the latter WPC did equally well. The performance of all sensing matrices utilizing a priori information was better than that of the non-adaptive GRP sensing matrix. In fact, from a practical perspective, depending on the SNR and the number of measurements that can be taken, any one of them can be used to provide a decent input to a tracker.

APPENDIX A

DERIVATIONS FOR NOISE ABSENT CASE

In this appendix we derive the difference image estimation operators and the ℓ2-optimal sensing matrix for the "noise absent" case. Let x1 and x2 be the object scene at two consecutive time instants.
Let us also define the true difference image as

∆x = x_2 − x_1. (A.1)

The signal model for the "noise absent" case is given by

y_1 = x_1, (A.2)
y_2 = Φ_2 x_2, (A.3)

where Φ_2 is the sensing matrix we want to design. Based on this signal model, the estimate of the difference image is given by

∆x̂ = W_1 y_1 + W_2 y_2, (A.4)

where W_1 and W_2 are the difference image estimation operators. As the first step, we express the estimation operators in terms of the sensing matrix. We then use those expressions to compute the ℓ2-optimal sensing matrix. Toward that end, we minimize the BMSE

J(W_1, W_2) = E[||∆x − ∆x̂||²_ℓ2], (A.5)

re-written as

J(W_1, W_2) = E[tr((∆x − ∆x̂)(∆x − ∆x̂)^T)]. (A.6)

On substituting (A.4) in (A.6), differentiating the result with respect to W_1 and W_2, and equating both derivatives to zero, we get

W_1 = (R_21 − R_11 − W_2 Φ_2 R_21) R_11^{-1}, (A.7)
W_2 = (R_22 − R_12 − W_1 R_12) R_22^{-1}, (A.8)

where R_11 = E[x_1 x_1^T], R_12 = E[x_1 x_2^T] = R_21^T and R_22 = E[x_2 x_2^T]. On further simplification we can write the estimation operators as

W_1 = (I − R_δ Φ_2^T (Φ_2 R_δ Φ_2^T)^{-1} Φ_2) R_21 R_11^{-1} − I, (A.9)
W_2 = R_δ Φ_2^T (Φ_2 R_δ Φ_2^T)^{-1}, (A.10)

where I is an N × N identity matrix and R_δ = R_22 − R_21 R_11^{-1} R_12. Substituting (A.9), (A.10) and (A.4) in (A.6) we get

J(W_1, W_2) = tr((R_δ Φ_2^T (Φ_2 R_δ Φ_2^T)^{-1} Φ_2 − I) R_δ (R_δ Φ_2^T (Φ_2 R_δ Φ_2^T)^{-1} Φ_2)^T). (A.11)

Differentiating (A.11) with respect to Φ_2 and equating the derivative to zero, we get

Φ_2 R_δ² Φ_2^T (Φ_2 R_δ Φ_2^T)^{-1} Φ_2 = Φ_2 R_δ. (A.12)

For the noise absent case, the ℓ2-optimal matrix decomposition of R_δ is given by its eigendecomposition R_δ = Q D Q^T, where D is the diagonal matrix of eigenvalues arranged in descending order with the corresponding eigenvectors as the columns of Q. Substituting this eigendecomposition in (A.12) and denoting Φ_2 Q by Z, we get

Z D² Z^T (Z D Z^T)^{-1} Φ_2 = Z D Q^T,
Z D² Z^T (Z D Z^T)^{-1} Φ_2 Q = Z D. (∵ Q is a unitary matrix) (A.13)

Denoting Z D² Z^T (Z D Z^T)^{-1} by S, we re-write (A.13) as

S Z = Z D. (A.14)

Note that S is an M × M matrix, and therefore there are only M eigenvectors. Consequently, N − M columns of Z must be zero vectors. Let us denote Z by [X 0], where X is an M × M matrix and 0 is an M × (N − M) matrix. Let us also write D as the block-diagonal matrix

D = [D_M 0; 0 D_{N−M}],

where D_M and D_{N−M} are M × M and (N − M) × (N − M) diagonal matrices containing the first M and the remaining N − M eigenvalues, respectively. If we assume that Φ_2 has full row rank M and that X is an invertible matrix, then (A.13) is satisfied. Therefore, using Z = Φ_2 Q, the sensing matrix is given by Φ_2 = [X 0] Q^T.

APPENDIX B

LFGDIE AND DDIE ESTIMATION OPERATORS

We first derive the difference image estimation operators for the LFGDIE method, and then find the estimation operators for the multi-step DDIE method as a special case of the LFGDIE method. The BMSE between ∆x_1L and ∆x̂_1L is

J(W′_1, W′_p, W′_L) = E[||∆x_1L − ∆x̂_1L||²_ℓ2], (B.1)

where ∆x_1L = x_L − x_1 and ∆x̂_1L = [W′_1 W′_p W′_L][y_1^T y_p^T y_L^T]^T, as shown in (5.27). We then re-write (B.1) as

J(W′_1, W′_p, W′_L) = E[tr((∆x_1L − ∆x̂_1L)(∆x_1L − ∆x̂_1L)^T)] (B.2)
= E[tr((∆x_1L ∆x_1L^T) − (∆x_1L ∆x̂_1L^T) − (∆x̂_1L ∆x_1L^T) + (∆x̂_1L ∆x̂_1L^T))]. (B.3)

Note that, with the exception of the first term in (B.3), the last three depend on W′_1, W′_p, W′_L. Their explicit dependence is respectively given by

Term 1: E[tr(∆x_1L ∆x̂_1L^T)] = tr((R_L1 − R_11) W′_1^T + (R_Lp − R_1p) Φ_p^T W′_p^T + (R_LL − R_1L) Φ_L^T W′_L^T), (B.4)

Term 2: E[tr(∆x̂_1L ∆x_1L^T)] = tr(W′_1 (R_1L − R_11) + W′_p Φ_p (R_pL − R_p1) + W′_L Φ_L (R_LL − R_L1)) (B.5)
= Term 1, (∵ tr(G^T) = tr(G)) (B.6)

Term 3: E[tr(∆x̂_1L ∆x̂_1L^T)] = tr(W′_1 R_11 W′_1^T + W′_1 R_1p Φ_p^T W′_p^T + W′_1 R_1L Φ_L^T W′_L^T + W′_1 R_n1 W′_1^T
+ W′_p Φ_p R_p1 W′_1^T + W′_p Φ_p R_pp Φ_p^T W′_p^T + W′_p Φ_p R_pL Φ_L^T W′_L^T + W′_p R_np W′_p^T
+ W′_L Φ_L R_L1 W′_1^T + W′_L Φ_L R_Lp Φ_p^T W′_p^T + W′_L Φ_L R_LL Φ_L^T W′_L^T + W′_L R_nL W′_L^T). (B.7)

Substituting these three terms in (B.3), differentiating with respect to W′_1, W′_p and W′_L, and setting the derivatives equal to zero, we get

W′_1 = (R_L1 − R_11 − W′_p Φ_p R_p1 − W′_L Φ_L R_L1)(R_11 + R_n1)^{-1}, (B.8)
W′_p = (R_Lp − R_1p − W′_1 R_1p − W′_L Φ_L R_Lp) Φ_p^T (Φ_p R_pp Φ_p^T + R_np)^{-1}, (B.9)
W′_L = (R_LL − R_1L − W′_1 R_1L − W′_p Φ_p R_pL) Φ_L^T (Φ_L R_LL Φ_L^T + R_nL)^{-1}. (B.10)

On simplifying (B.8), (B.9) and (B.10) we get equations (5.29) through (5.39). To get the expressions (5.13) through (5.16) for the multi-step DDIE method we simply set L = 2, resulting in p = 0, R_LγL = R_α, R_αL = R_α − R_β, W′_p = 0, W′_1 = W_1 and W′_L = W_2. Note that with p = 0 all correlation matrices with p in the subscript go to 0.

APPENDIX C

WATERFILLING SOLUTION

We first calculate the maximum mutual information between x and y given the sensing matrix Φ_wpc, and then use it to compute the weights w_i, i = 1, . . . , M:

I(x; y) = h(y) − h(y|x) (the elements of x and y take on a continuum of values)
= h(y) − h(n)
= (1/2) log((2πe)^M |R_y|) − (1/2) log((2πe)^M |R_n|) (log is w.r.t. base 2) (C.1)
= (1/2) log(|R_y| / |R_n|), (C.2)

where h represents differential entropy and |·| denotes the determinant. Note that, for a given covariance, the entropy is maximized when y is a multi-variate Gaussian, which is achieved for a multi-variate Gaussian input. (We already know that the noise is AWGN.) In our case, the object scene is not Gaussian, so we will get a sub-optimal solution. Yet, as shown in Chapter 6, we do see the benefit of adapting the principal components according to the given SNR. Here we complete the proof assuming the optimal scenario of a multi-variate Gaussian input.

From (C.2), the problem of maximizing the mutual information is reduced to maximizing log(|R_y|/|R_n|). We have

log(|R_y|/|R_n|)
= log(|Φ_wpc R_x Φ_wpc^T + λ_n I_M| / |λ_n I_M|) (here λ_n is the noise variance σ_n²)
= log(|D_w (U(1:N, 1:M))^T R_x (U(1:N, 1:M)) D_w^T + λ_n I_M| / |λ_n I_M|) (∵ Φ_wpc = D_w (U(1:N, 1:M))^T)
= log(|D_w (U(1:N, 1:M))^T U Λ U^T (U(1:N, 1:M)) D_w + λ_n I_M| / |λ_n I_M|) (∵ R_x = U Λ U^T, D_w^T = D_w)
= log(|D_w I_(M,N) Λ I_(N,M) D_w + λ_n I_M| / |λ_n I_M|) (I_(M,N) = [I_M | 0_(M,N−M)])
= log(|D_w Λ_M D_w + λ_n I_M| / |λ_n I_M|) (Λ_M is a diagonal matrix with the first M eigenvalues)
= log Π_{i=1}^{M} (w_i² λ_i + λ_n)/λ_n
= Σ_{i=1}^{M} log((w_i² λ_i + λ_n)/λ_n). (C.3)

Our goal has now been reduced to maximizing (C.3), subject to the constraint Σ_{i=1}^{M} w_i² = Σ_{i=1}^{N} λ_i = E, which says that the norm of the weights must not exceed the total energy available.¹ Using Lagrange multipliers, our objective is written as

J(w_1², . . . , w_M²) = Σ_{i=1}^{M} log((w_i² λ_i + λ_n)/λ_n) − ζ(Σ_{i=1}^{M} w_i² − E)
= Σ_{i=1}^{M} [log((w_i² λ_i + λ_n)/λ_n) − ζ(w_i² − E/M)]. (C.4)

From (C.4) we see that we can maximize each summation term individually. To maximize the ith (i = 1, . . . , M) term, we differentiate J(w_i²) = log((w_i² λ_i + λ_n)/λ_n) − ζ(w_i² − E/M) with respect to w_i² to get

w_i² = 1/ζ − λ_n/λ_i. (C.5)

Note that w_i² cannot be negative. Therefore, we rewrite (C.5) as

w_i² = max[0, 1/ζ − λ_n/λ_i], (C.6)

or

w_i² = (1/ζ − λ_n/λ_i)⁺. (C.7)

We choose the value of ζ such that Σ_{i=1}^{M} (1/ζ − λ_n/λ_i)⁺ = E.

¹We perform non-coherent imaging.

REFERENCES

[1] D. Whitehouse, “World's oldest telescope?” [Online]. Available: http://news.bbc.co.uk/2/hi/science/nature/380186.stm

[2] D. J. Brady, Optical Imaging and Spectroscopy. John Wiley & Sons, 2009.

[3] S. Tucker, W. T. Cathey, and J. Edward Dowski, “Extended depth of field and aberration control for inexpensive digital microscope systems,” Opt. Express, vol. 4, no. 11, pp. 467–474, 1999. [Online]. Available: http://www.opticsexpress.org/abstract.cfm?URI=oe-4-11-467

[4] H. S. Pal, D. Ganotra, and M. A. Neifeld, “Face recognition by using feature-specific imaging,” Appl. Opt., vol. 44, no. 18, pp. 3784–3794, 2005.
[Online]. Available: http://ao.osa.org/abstract.cfm?URI=ao-44-18-3784

[5] M. A. Neifeld and P. Shankar, “Feature-specific imaging,” Appl. Opt., vol. 42, no. 17, pp. 3379–3389, 2003. [Online]. Available: http://ao.osa.org/abstract.cfm?URI=ao-42-17-3379

[6] S. Prasad, “Information capacity of a seeing-limited imaging system,” Optics Communications, vol. 177, no. 1-6, pp. 119–134, 2000.

[7] E. Clarkson and H. H. Barrett, “Approximations to ideal-observer performance on signal-detection tasks,” Appl. Opt., vol. 39, no. 11, pp. 1783–1793, 2000. [Online]. Available: http://ao.osa.org/abstract.cfm?URI=ao-39-11-1783

[8] W.-C. Chou, M. A. Neifeld, and R. Xuan, “Information-based optical design for binary-valued imagery,” Appl. Opt., vol. 39, no. 11, pp. 1731–1742, 2000. [Online]. Available: http://ao.osa.org/abstract.cfm?URI=ao-39-11-1731

[9] A. Ashok and M. Neifeld, “Information-based analysis of simple incoherent imaging systems,” Opt. Express, vol. 11, no. 18, pp. 2153–2162, 2003. [Online]. Available: http://www.opticsexpress.org/abstract.cfm?URI=oe-11-18-2153

[10] M. P. Christensen, G. W. Euliss, M. J. McFadden, K. M. Coyle, P. Milojkovic, M. W. Haney, J. van der Gracht, and R. A. Athale, “Active-eyes: an adaptive pixel-by-pixel image-segmentation sensor architecture for high-dynamic-range hyperspectral imaging,” Appl. Opt., vol. 41, no. 29, pp. 6093–6103, 2002. [Online]. Available: http://ao.osa.org/abstract.cfm?URI=ao-41-29-6093

[11] D. L. Marks, R. Stack, A. J. Johnson, D. J. Brady, and D. C. Munson, “Cone-beam tomography with a digital camera,” Appl. Opt., vol. 40, no. 11, pp. 1795–1805, 2001. [Online]. Available: http://ao.osa.org/abstract.cfm?URI=ao-40-11-1795

[12] Z. Liu, M. Centurion, G. Panotopoulos, J. Hong, and D. Psaltis, “Holographic recording of fast events on a CCD camera,” Opt. Lett., vol. 27, no. 1, pp. 22–24, 2002. [Online]. Available: http://ol.osa.org/abstract.cfm?URI=ol-27-1-22

[13] P. Potuluri, M. Fetterman, and D.
Brady, “High depth of field microscopic imaging using an interferometric camera,” Opt. Express, vol. 8, no. 11, pp. 624–630, 2001. [Online]. Available: http://www.opticsexpress.org/abstract.cfm?URI=oe-8-11-624

[14] P. K. Baheti and M. A. Neifeld, “Feature-specific structured imaging,” Appl. Opt., vol. 45, no. 28, pp. 7382–7391, 2006. [Online]. Available: http://ao.osa.org/abstract.cfm?URI=ao-45-28-7382

[15] M. A. Neifeld and J. Ke, “Optical architectures for compressive imaging,” Applied Optics, vol. 46, pp. 5293–5303, 2007.

[16] M. B. Wakin, J. N. Laska, M. F. Duarte, D. Baron, S. Sarvotham, D. Takhar, K. F. Kelly, and R. G. Baraniuk, “An architecture for compressive imaging,” in IEEE International Conference on Image Processing, 2006, pp. 1273–1276.

[17] E. J. Candès and T. Tao, “The Dantzig selector: statistical estimation when p is much larger than n,” Annals of Statistics, vol. 35, pp. 2313–2351, 2005. [Online]. Available: http://www-stat.stanford.edu/∼candes/papers/DantzigSelector.pdf

[18] ——, “Decoding by linear programming,” IEEE Transactions on Information Theory, vol. 51, no. 12, pp. 4203–4215, 2005. [Online]. Available: http://doi.ieeecomputersociety.org/10.1109/TIT.2005.858979

[19] A. C. Gilbert, S. Muthukrishnan, and M. Strauss, “Improved time bounds for near-optimal sparse Fourier representations,” in Proc. SPIE Wavelets XI, M. Papadakis, A. F. Laine, and M. A. Unser, Eds., vol. 5914, 2003.

[20] E. J. Candès, J. K. Romberg, and T. Tao, “Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information,” IEEE Transactions on Information Theory, vol. 52, no. 2, pp. 489–509, 2006. [Online]. Available: http://doi.ieeecomputersociety.org/10.1109/TIT.2005.862083

[21] E. J. Candès and T. Tao, “Near-optimal signal recovery from random projections: Universal encoding strategies?” IEEE Transactions on Information Theory, vol. 52, no. 12, pp. 5406–5425, 2006. [Online].
Available: http://doi.ieeecomputersociety.org/10.1109/TIT.2006.885507

[22] E. J. Candès, J. K. Romberg, and T. Tao, “Signal recovery from incomplete and inaccurate measurements,” Comm. Pure Appl. Math., vol. 59, no. 8, pp. 1207–1223, 2005.

[23] E. J. Candès and J. K. Romberg, “Quantitative robust uncertainty principles and optimally sparse decompositions,” Foundations of Computational Mathematics, vol. 6, no. 2, pp. 227–254, 2006. [Online]. Available: http://dx.doi.org/10.1007/s10208-004-0162-x

[24] D. L. Donoho, “Compressed sensing,” Sept. 2004.

[25] D. L. Donoho and X. Huo, “Uncertainty principles and ideal atomic decomposition,” IEEE Transactions on Information Theory, vol. 47, no. 7, pp. 2845–2862, 2001.

[26] D. L. Donoho and M. Elad, “Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1-minimization,” Proc. Natl. Acad. Sci. USA, vol. 100, pp. 2197–2202, 2003.

[27] M. Elad and A. M. Bruckstein, “A generalized uncertainty principle and sparse representation in pairs of bases,” IEEE Transactions on Information Theory, vol. 48, no. 9, pp. 2558–2567, 2002.

[28] R. Gribonval and M. Nielsen, “Sparse representations in unions of bases,” IEEE Transactions on Information Theory, vol. 49, no. 12, pp. 3320–3325, 2003.

[29] A. Feuer and A. Nemirovski, “On sparse representation in pairs of bases,” IEEE Transactions on Information Theory, vol. 49, no. 6, pp. 1579–1581, June 2003.

[30] J. Fuchs, “On sparse representations in arbitrary redundant bases,” IEEE Transactions on Information Theory, vol. 50, no. 6, pp. 1341–1344, June 2004.

[31] D. L. Donoho and I. M. Johnstone, “Minimax estimation via wavelet shrinkage,” Ann. Statist., vol. 26, no. 3, pp. 879–921, 1998.

[32] A. Zandi, J. D. Allen, E. L. Schwartz, and M. Boliek, “CREW: Compression with reversible embedded wavelets,” in Proc. of IEEE Data Compression Conference, March 1995, pp. 212–221.

[33] M. Boliek, M. Gormish, E. L. Schwartz, and A. F.
Keith, “Decoding compression with reversible embedded wavelets (CREW) codestreams,” Journal of Electronic Imaging, vol. 7, no. 3, pp. 402–409, 1998.

[34] M. Lustig, D. L. Donoho, and J. M. Pauly, “Sparse MRI: The application of compressed sensing for rapid MR imaging,” Magnetic Resonance in Medicine, 2007.

[35] M. Herman and T. Strohmer, “High-resolution radar via compressed sensing,” IEEE Transactions on Signal Processing, vol. 57, no. 6, pp. 2275–2284, June 2009.

[36] M. Elad, “Optimized projections for compressed sensing,” IEEE Trans. Signal Process., vol. 55, no. 12, pp. 5695–5702, 2007.

[37] J. M. D. Carvajalino and G. Sapiro, “Learning to sense sparse signals: simultaneous sensing matrix and sparsifying dictionary optimization,” IEEE Trans. Image Process., vol. 18, no. 7, pp. 1395–1408, 2009.

[38] D. W. Xue and Z. M. Lu, “Difference image watermarking based reversible image authentication with tampering localization capability,” International Journal of Computer Sciences and Engineering Systems, vol. 2, pp. 219–226, 2008.

[39] S. K. Lee, Y. H. Suh, and Y. S. Ho, “Reversible image authentication based on watermarking,” in IEEE International Conference on Multimedia & Expo, 2006, pp. 1321–1324.

[40] P. B. Heffernan and R. A. Robb, “Difference image reconstruction from a few projections for nondestructive materials inspection,” Applied Optics, vol. 24, pp. 4105–4110, 1985.

[41] S. G. Kong, “Classification of interframe difference image blocks for video compression,” in Proceedings of SPIE, vol. 4668, 2002, pp. 29–37.

[42] V. Cehver, A. Sankaranarayanan, M. F. Duarte, D. Reddy, R. G. Baraniuk, and R. Chellappa, “Compressive sensing for background subtraction,” in Proc. 10th European Conf. Comp. Vision, 2008, pp. 155–168.

[43] D. Hahn, V. Daum, J. Hornegger, W. Bautz, and T. Kuwert, “Difference imaging of inter- and intra-ictal SPECT images for the localization of seizure onset in epilepsy,” in IEEE Nuclear Science Symposium Conference Record, 2007, pp.
4331–4335.
[44] C. Alcock, R. A. Allsman, D. Alves, T. S. Axelrod, A. C. Becker, D. P. Bennett, K. H. Cook, A. J. Drake, K. C. Freeman, K. Griest, M. J. Lehner, S. L. Marshall, D. Minniti, B. A. Peterson, M. R. Pratt, P. J. Quinn, C. W. Stubbs, W. Sutherland, A. Tomaney, T. Vandehei, and D. L. Welch, “Difference image analysis of Galactic microlensing. I. Data analysis,” Astrophys. J., vol. 521, pp. 602–612, 1999.
[45] L. Bruzzone and D. F. Prieto, “Automatic analysis of the difference image for unsupervised change detection,” IEEE Trans. Geosci. Remote Sens., vol. 38, pp. 1171–1182, 2000.
[46] S. M. Kay, Fundamentals of Statistical Processing, Volume I: Estimation Theory. Prentice Hall, 1993.
[47] J. I. ’t Zand, Ph.D. dissertation, University of Utrecht, 1992.
[48] E. E. Fenimore and T. M. Cannon, “Coded aperture imaging with uniformly redundant arrays,” Appl. Opt., vol. 17, no. 3, pp. 337–347, 1978. [Online]. Available: http://ao.osa.org/abstract.cfm?URI=ao-17-3-337
[49] E. E. Fenimore, “Time-resolved and energy-resolved coded aperture images with URA tagging,” Appl. Opt., vol. 26, no. 14, pp. 2760–2769, 1987. [Online]. Available: http://ao.osa.org/abstract.cfm?URI=ao-26-14-2760
[50] M. Sims, M. Turner, and R. Willingale, “Wide field x-ray camera,” Space Science Instrumentation, vol. 5, pp. 109–127, 1980.
[51] R. Willingale, M. Sims, and M. Turner, “Advanced deconvolution techniques for coded aperture imaging,” Nucl. Instr. Methods Phys. Res., vol. 60, p. 221, 1984.
[52] B. R. Frieden and J. J. Burke, “Restoring with maximum entropy, II: Superresolution of photographs of diffraction-blurred impulses,” J. Opt. Soc. Am., vol. 62, no. 10, pp. 1202–1210, 1972. [Online]. Available: http://www.opticsinfobase.org/abstract.cfm?URI=josa-62-10-1202
[53] G. Daniell, “Image restoration and processing methods,” Nucl. Instr. Methods Phys. Res., vol. 67, p. 221, 1984.
[54] S. R. Gottesman and E. E. Fenimore, “New family of binary arrays for coded aperture imaging,” Appl.
Opt., vol. 28, no. 20, pp. 4344–4352, 1989. [Online]. Available: http://ao.osa.org/abstract.cfm?URI=ao-28-20-4344
[55] M. D. Stenner, P. Shankar, and M. A. Neifeld, “Wide-field feature-specific imaging,” Frontiers in Optics, 2007.
[56] D. J. Brady, “Micro-optics and megapixels,” Optics and Photonics News, vol. 17, pp. 24–29, 2006.
[57] D. Du and F. Hwang, Combinatorial Group Testing and Its Applications, ser. Series on Applied Mathematics. World Scientific, 2000, vol. 12.
[58] C. M. Brown, “Multiplex imaging with random arrays,” Ph.D. dissertation, Univ. of Chicago, 2000.
[59] D. J. Brady and M. E. Gehm, “Compressive imaging spectrometers using coded apertures,” Proc. SPIE, vol. 6246, 2006.
[60] R. H. Dicke, “Scatter-hole cameras for x-rays and gamma rays,” Astrophys. J., vol. 153, pp. L101–L106, 1968.
[61] S. Uttam, N. A. Goodman, M. A. Neifeld, C. Kim, R. John, J. Kim, and D. Brady, “Optically multiplexed imaging with superposition space tracking,” Opt. Express, vol. 17, no. 3, pp. 1691–1713, 2009. [Online]. Available: http://www.opticsexpress.org/abstract.cfm?URI=oe-17-3-1691
[62] A. Biswas, P. Guha, A. Mukerjee, and K. S. Venkatesh, “Intrusion detection and tracking with pan-tilt cameras,” IET Intl. Conf. VIE 06, pp. 565–571, 2006.
[63] A. W. Senior, A. Hampapur, and M. Lu, “Acquiring multi-scale images by pan-tilt-zoom control and automatic multi-camera calibration,” WACV/MOTION’05, pp. 433–438, 2005.
[64] E. Voigtman and J. D. Winefordner, “The multiplex disadvantage and excess low-frequency noise,” Appl. Spectrosc., vol. 41, no. 7, pp. 1182–1184, 1987. [Online]. Available: http://as.osa.org/abstract.cfm?URI=as-41-7-1182
[65] C. R. Wren, A. Azarbayejani, T. J. Darrell, and A. P. Pentland, “Pfinder: Real-time tracking of the human body,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 19, pp. 780–785, 1997.
[66] N. Friedman and S. J. Russell, “Image segmentation in video sequences: a probabilistic approach,” Proc. Uncertainty Artif. Intell. Conf., pp. 175–181, 1997.
[67] C. Stauffer and W. E. L. Grimson, “Learning patterns of activity using real-time tracking,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, pp. 747–757, 2000.
[68] A. Mittal and N. Paragios, “Motion-based background subtraction using adaptive kernel density estimation,” Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 302–309, 2004.
[69] K. P. Karmann and A. Brandt, Moving Object Recognition Using an Adaptive Background Memory, V. Cappellini, Ed. Elsevier, 1990.
[70] P. Jodoin, M. Mignotte, and J. Konrad, “Statistical background subtraction using spatial cues,” IEEE Trans. Circuits Syst. Video Technol., vol. 17, pp. 1758–1763, 2007.
[71] R. Singh, “Advanced correlation filters for multi-class synthetic aperture radar detection and classification,” Master’s thesis, Carnegie Mellon University, 2002.
[72] M. Alkanhal and B. V. K. V. Kumar, “Polynomial distance classifier correlation filter for pattern recognition,” Appl. Opt., vol. 42, pp. 4688–4708, 2003.
[73] B. V. K. V. Kumar, D. W. Carlson, and A. Mahalanobis, “Optimal trade-off synthetic discriminant function filters for arbitrary devices,” Optics Letters, vol. 19, pp. 1556–1558, 1994.
[74] B. V. K. V. Kumar, “Minimum-variance synthetic discriminant functions,” J. Opt. Soc. Am. A, vol. 3, pp. 1579–1584, 1986.
[75] T. M. Cover and J. A. Thomas, Elements of Information Theory. John Wiley & Sons Inc., 2006.
[76] W. Wenzel and K. Hamacher, “A stochastic tunneling approach for global minimization,” Phys. Rev. Lett., vol. 82, pp. 3003–3007, 1999.
[77] L. I. Rudin, S. Osher, and E. Fatemi, “Nonlinear total variation based noise removal algorithms,” Physica D, vol. 60, pp. 259–268, 1992.
[78] S. S. Chen, D. L. Donoho, and M. A. Saunders, “Atomic decomposition by basis pursuit,” SIAM Review, vol. 43, pp. 129–159, 2001.
[79] I. Daubechies, Ten Lectures on Wavelets. SIAM, 1992.
[80] L. Malagón-Borja and O. Fuentes, “An object detection system using image reconstruction with PCA,” Computer and Robot Vision, Canadian Conference, pp. 2–8, 2005.
[81] M. Turk and A. Pentland, “Face recognition using eigenfaces,” IEEE Conf. on Computer Vision and Pattern Recognition, pp. 586–591, 1991.
[82] A. Pentland, B. Moghaddam, and T. Starner, “View-based and modular eigenspaces for face recognition,” MIT, Tech. Rep., 1994.
[83] H. Moon and P. Phillips, “Computational and performance aspects of PCA-based face recognition algorithms,” Perception, vol. 30, pp. 303–321, 2001.
[84] A. Hillery and R. Chin, “Restoration of images with nonstationary mean and autocorrelation,” in ICASSP-88, vol. 2, Apr. 1988, pp. 1008–1011.
[85] R. N. Strickland, “Transforming images into block stationary behavior,” Appl. Opt., vol. 22, no. 10, pp. 1462–1473, 1983.
