New Analysis of Manifold Embeddings and Signal Recovery from Compressive Measurements

Armin Eftekhari and Michael B. Wakin∗

Department of Electrical Engineering and Computer Science, Colorado School of Mines

arXiv:1306.4748v4 [cs.IT] 1 May 2014

June 2013; Revised April 2014

Abstract

Compressive Sensing (CS) exploits the surprising fact that the information contained in a sparse signal can be preserved in a small number of compressive, often random linear measurements of that signal. Strong theoretical guarantees have been established concerning the embedding of a sparse signal family under a random measurement operator and concerning the accuracy to which sparse signals can be recovered from noisy compressive measurements. In this paper, we address similar questions in the context of a different modeling framework. Instead of sparse models, we focus on the broad class of manifold models, which can arise in both parametric and non-parametric signal families. Using tools from the theory of empirical processes, we improve upon previous results concerning the embedding of low-dimensional manifolds under random measurement operators. We also establish both deterministic and probabilistic instance-optimal bounds in ℓ2 for manifold-based signal recovery and parameter estimation from noisy compressive measurements. In line with analogous results for sparsity-based CS, we conclude that much stronger bounds are possible in the probabilistic setting. Our work supports the growing evidence that manifold-based models can be used with high accuracy in compressive signal processing.

Keywords. Manifolds, Compressive Sensing, dimensionality reduction, random projections, manifold embeddings, signal recovery, parameter estimation.

AMS Subject Classification. 53A07, 57R40, 62H12, 68P30, 94A12, 94A29.

1 Introduction

1.1 Concise signal models

A significant byproduct of the Information Age has been an explosion in the sheer quantity of raw data demanded from sensing systems.
From digital cameras to mobile devices, scientific computing to medical imaging, and remote surveillance to signals intelligence, the size (or dimension) N of a typical desired signal continues to increase. Naturally, the dimension N imposes a direct burden on the various stages of the data processing pipeline, from the data acquisition itself to the subsequent transmission, storage, and/or analysis.

Fortunately, in many cases, the information contained within a high-dimensional signal actually obeys some sort of concise, low-dimensional model. Such a signal may be described as having just K ≪ N degrees of freedom for some K. Periodic signals bandlimited to a certain frequency are one example; they live along a fixed K-dimensional linear subspace of RN. Piecewise smooth signals are an example of sparse signals, which can be written as a succinct linear combination of just K elements from some basis such as a wavelet dictionary. Still other signals may live along K-dimensional submanifolds of the ambient signal space RN; examples include collections of signals observed from multiple viewpoints in a camera or sensor network. In general, the conciseness of these models suggests the possibility for efficient processing and compression of these signals.

∗ Email: aeftekha,[email protected] This work was partially supported by NSF grant DMS-0603606, DARPA grant HR0011-08-1-0078, NSF grant CCF-0830320, and NSF CAREER grant CCF-1149225.

1.2 Compressive measurements

Recently, the conciseness of certain signal models has led to the use of compressive measurements for simplifying the data acquisition process. Rather than designing a sensor to measure a signal x ∈ RN, for example, it often suffices to design a sensor that can measure a much shorter vector y = Φx, where Φ is a linear measurement operator represented as an M × N matrix, and where typically M ≪ N.
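As a concrete illustration of this acquisition model, the following sketch forms compressive measurements y = Φx with a random Gaussian Φ. The problem sizes and the random stand-in for x are illustrative assumptions, not a specific sensing design:

```python
import numpy as np

# Illustrative sizes: ambient dimension N, measurement count M << N.
N, M = 1000, 50
rng = np.random.default_rng(0)

# A signal x in R^N (a random vector standing in for real data).
x = rng.standard_normal(N)

# Random Gaussian measurement operator Phi (M x N), entries i.i.d. N(0, 1/M).
Phi = rng.standard_normal((M, N)) / np.sqrt(M)

# Compressive measurements: a much shorter vector y = Phi @ x.
y = Phi @ x

print(y.shape)  # (50,)
```

Even with this twenty-fold reduction, the norm of y concentrates near that of x, which is the phenomenon the embedding results below make precise.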
As we discuss below in the context of Compressive Sensing (CS), when Φ is properly designed, the requisite number of measurements M typically scales with the information level K of the signal, rather than with its ambient dimension N. Surprisingly, the requirements on the measurement matrix Φ can often be met by choosing Φ randomly from an acceptable distribution. Most commonly, the entries of Φ are chosen to be independent and identically distributed (i.i.d.) Gaussian random variables, although the use of structured random matrices is on the rise [25, 37]. Physical architectures have been proposed for hardware that will enable the acquisition of signals using compressive measurements [10, 22, 30, 36]; many of these collect the compressive measurements y of a signal x directly, without explicitly computing a matrix multiplication on board.

The potential benefits for data acquisition are numerous. These systems can enable simple, low-cost acquisition of a signal directly in compressed form without requiring knowledge of the signal structure in advance. Some of the many possible applications include distributed source coding in sensor networks [23], medical imaging [40], and high-rate analog-to-digital conversion [10, 30, 57]. We note that, in all cases, the measurement matrix Φ must be known to any decoder that will be used to process the compressed measurement vector y, but with suitable synchronization between the compressive measurement system and the decoder (e.g., exchanging a seed used to initialize a random number generator), it is not necessary for Φ to be explicitly transmitted along with y.

1.3 Signal understanding from compressive measurements

Having acquired a signal x in compressed form (in the form of a measurement vector y), there are many questions that may then be asked of the signal. These include:

Q1. Recovery: What was the original signal x?

Q2.
Parameter estimation: Supposing x was generated from a K-dimensional parametric model, what was the original K-dimensional parameter that generated x?

Given only the measurements y (possibly corrupted by noise), solving either of the above problems requires exploiting the concise, K-dimensional structure inherent in the signal.1 CS addresses questions Q1 and Q2 under the assumption that the signal x is K-sparse (or approximately so) in some basis or dictionary; in Section 2.1 we outline some key theoretical bounds from CS regarding the accuracy to which these questions may be answered.

1 Other problems, such as finding the nearest neighbor to x in a large database of signals [34], can also be solved using compressive measurements and do not require assumptions about the concise structure in x.

Figure 1: (a) The articulated signal fθ(t) = g(t − θ) is defined via shifts of a primitive function g, where g is a Gaussian pulse. Each signal is sampled at N points, and as θ changes, the resulting signals trace out a 1-D manifold in RN. (b) Projection of the manifold from RN into R3 via a random 3 × N matrix; the color/shading represents different values of θ ∈ [0, 1].

1.4 Manifold models for signal understanding

In this paper, we will address these questions in the context of a different modeling framework for concise signal structure. Instead of sparse models, we focus on the broad class of manifold models, which arise both in settings where a K-dimensional parameter θ controls the generation of the signal and also in non-parametric settings. As a very simple illustration, consider the articulated signal in Figure 1(a). We let g(t) be a fixed continuous-time Gaussian pulse centered at t = 0 and consider a shifted version of g denoted as the parametric signal fθ(t) := g(t − θ) with t, θ ∈ [0, 1].
We then suppose the discrete-time signal x = xθ ∈ RN arises by sampling the continuous-time signal fθ (t) uniformly in time, i.e., xθ (n) = fθ (n/N ) for n = 1, 2, . . . , N . As the parameter θ changes, the signals xθ trace out a continuous one-dimensional (1-D) curve M = {xθ : θ ∈ [0, 1]} ⊂ RN . The conciseness of our model (in contrast with the potentially high dimension N of the signal space) is reflected in the low dimension of the path M. In the real world, manifold models may arise in a variety of settings. A K-dimensional parameter θ could reflect uncertainty about the 1-D timing of the arrival of a signal (as in Figure 1(a); see also [24]), the 2-D orientation and position of an edge in an image, the 2-D translation of an image under study [46], the multiple degrees of freedom in positioning a camera or sensor to measure a scene [15], the physical degrees of freedom in an articulated robotic or sensing system, or combinations of the above. Manifolds have also been proposed as approximate models for signal databases such as collections of images of human faces or of handwritten digits [4, 33, 58]. Consequently, the potential applications of manifold models are numerous in signal processing. In some applications, the signal x itself may be the object of interest, and the concise manifold model may facilitate the acquisition or compression of that signal. Alternatively, in parametric settings one may be interested in using a signal x = xθ to infer the parameter θ that generated that signal. In an application known as manifold learning, one may be presented with a collection of data {xθ1 , xθ2 , . . . , xθn } sampled from a parametric manifold and wish to discover the underlying parameterization that generated that manifold. Multiple manifolds can also be considered simultaneously, for example in problems that require recognizing an object from one of n possible classes, where the viewpoint of the object is uncertain during the image capture process. 
In this case, we may wish to know which of n manifolds is closest to the observed image x. While any of these questions may be answered with full knowledge of the high-dimensional signal x ∈ RN, there is growing theoretical and experimental support that they can also be answered from only compressive measurements y = Φx. In past work [3], we have shown that given a sufficient number M of random measurements, one can ensure with high probability that a manifold M ⊂ RN has a stable embedding in the measurement space RM under the operator Φ, such that pairwise Euclidean and geodesic distances are approximately preserved on its image ΦM. We will discuss this in more detail in Section 3, but a key aspect is that the number of requisite measurements M is linearly proportional to the information level of the signal, i.e., the dimension K of the manifold. In that work, the number of measurements was also logarithmically dependent on the ambient dimension N, although this dependence was later removed in the asymptotic case in [12] using a different set of assumptions on the manifold.

The first contribution of this paper—presented in Section 3—is that we provide an improved lower bound on the number of random measurements to guarantee a stable embedding of a signal manifold. In particular, we make the same assumptions on the manifold as in our past work [3] but provide a measurement bound that is independent of the ambient dimension N. Our bound is non-asymptotic, and we provide explicit constants. Additionally, we point out that this result is generic in the sense that it applies to any compact and smooth submanifold of RN for which certain geometric properties (namely volume, dimension, and condition number) are known.
In order to do this, we use tools from the theory of empirical processes (namely, the idea of “generic chaining” [56]), which have recently been used to develop state-of-the-art RIP results for structured measurement matrices in CS [24, 37, 48, 50–53, 57]. More elementary arguments (e.g., involving simple concentration of measure inequalities) have previously been used in CS (see, e.g., [2]) for deriving RIP bounds for unstructured i.i.d. measurement matrices, and we also used such arguments in [3] to derive a manifold embedding guarantee. However, it appears that the stronger machinery of the empirical process approach is necessary to derive stronger bounds, both in RIP problems and in manifold embedding problems. A chaining argument was employed in [12], and in this paper we present a chaining argument that is suitable for studying the manifold embedding problem under our set of assumptions on the manifold. Because this chaining framework is fairly technical, we develop it entirely in the appendices so that the body of the paper will be as self-contained and expository as possible for someone seeking merely to understand the substance and context of our results. (We do, however, include an example in the body of the paper to provide insight into the machinery developed in the appendices.) We also observe that similar results are attainable through the use of the Dudley inequality [39], but a direct argument (such as the one presented here) has the advantages of potentially better exploiting the geometry of the model and therefore producing tighter bounds, offering improved insight into the problem, and being more amenable to future improvements to our arguments.

As a very simple illustration of the embedding phenomenon, Figure 1(b) presents an experiment where just M = 3 compressive measurements are acquired from each point xθ described in Figure 1(a). We let N = 1024 and construct a randomly generated 3 × N matrix Φ whose entries are i.i.d.
Gaussian random variables with zero mean and variance of 1/3. Each point xθ from the original manifold M ⊂ R1024 maps to a unique point Φxθ in R3; the manifold embeds in the low-dimensional measurement space. Given any y = Φxθ0 for θ0 unknown, then, it is possible to infer the value θ0 using only knowledge of the parametric model for M and the measurement operator Φ. Moreover, as the number M of compressive measurements increases, the manifold embedding becomes much more stable and remains highly self-avoiding.

Indeed, there is strong evidence that, as a consequence of this phenomenon, questions such as Q1 (signal recovery) and Q2 (parameter estimation) can be accurately solved using only compressive measurements of a signal x, and that these procedures are robust to noise and to deviations of the signal x away from the manifold M [14, 54, 61]. Additional theoretical and empirical justification has followed for the manifold learning [32] and multiclass recognition problems [14] described above. Consequently, many of the advantages of compressive measurements that are beneficial in sparsity-based CS (low-cost sensor design, reduced transmission requirements, reduced storage requirements, lack of need for advance knowledge of signal structure, simplified computation in the low-dimensional space RM, etc.) may also be enjoyed in settings where manifold models capture the concise signal structure. Moreover, the use of a manifold model can often capture the structure of a signal in many fewer degrees of freedom K than would be required in any sparse representation, and thus the measurement rate M can be greatly reduced compared to sparsity-based CS approaches.

The second contribution of this paper—presented in Section 4—is that we establish theoretical bounds on the accuracy to which questions Q1 (signal recovery) and Q2 (parameter estimation) may be answered. To do this, we rely largely on the new analytical chaining framework described above.
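The Figure 1(b) experiment described above is easy to reproduce numerically. The sketch below follows the parameters given in the text (N = 1024, M = 3, Gaussian Φ); the pulse width sigma and the sampled θ grid are our own assumptions, since the text does not specify them:

```python
import numpy as np

# Reproduce the Figure 1 experiment: a 1-D manifold of shifted Gaussian
# pulses, projected to R^3 by a random Gaussian matrix.
N, M = 1024, 3
sigma = 0.05                       # assumed width of the Gaussian pulse g
rng = np.random.default_rng(1)

t = np.arange(1, N + 1) / N        # uniform sample times n/N
thetas = np.linspace(0.1, 0.9, 200)

# Row k of X is the sampled signal x_theta for theta = thetas[k]; stacking
# them traces a 1-D curve (the manifold M) in R^N.
X = np.exp(-((t[None, :] - thetas[:, None]) ** 2) / (2 * sigma**2))

# Random 3 x N Gaussian matrix with entry variance 1/3, as in the text.
Phi = rng.standard_normal((M, N)) / np.sqrt(M)
Y = X @ Phi.T                      # image of the manifold in R^3

print(Y.shape)  # (200, 3)
```

Plotting the rows of Y as points in R^3, colored by θ, gives a picture analogous to Figure 1(b).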
We consider both deterministic and probabilistic instance-optimal bounds, and we see strong similarities to analogous results that have been derived for sparsity-based CS. As with sparsity-based CS, we show for manifold-based CS that for any fixed Φ, uniform deterministic ℓ2 recovery bounds for recovery of all x are necessarily poor. We then show that, as with sparsity-based CS, it is possible to provide for any x a probabilistic bound, holding over most Φ, with the desired accuracy. We consider both noise-free and noisy measurement settings and compare our bounds with sparsity-based CS. Finally, it should be noted that our results concerning question Q1 are independent of the parametrization of the manifold, whereas, in contrast, our results concerning question Q2 are specific to the given parametrization of the manifold.

We feel that a third contribution of this paper comes in the form of the analytical tools we use to study the above problems. Our chaining argument allows us to study not only the embedding problem (as in [12]) but also Q1 and Q2. Moreover, in Appendix A, which we call the “Toolbox,” we present a collection of implications of our assumption that the manifold has bounded condition number (see Section 2.2 for the definition). This elementary property, also known as the reach of a manifold in the geometric measure theory literature [26], has become somewhat popular in the analysis of manifold models for signal processing (e.g., see [3, 14, 15, 35, 43, 59, 64]). The seminal paper [43] (also see [26]) contains a collection of implications of bounded condition number that have been used directly or indirectly in numerous works, including [3, 14, 15, 35, 59, 64]. We restate some of these implications in the Toolbox. Unfortunately, after very careful study we were unable to confirm for ourselves some of the original proofs appearing in [43].
Therefore, some of the statements and proofs in the Toolbox below differ slightly from their original counterparts in [43]. We hope that these results will provide a useful reference for the continued study of manifolds with bounded condition number.

1.5 Paper organization

Section 2 provides the necessary background on sparsity-based CS and on manifold models to place our work in the proper context. In Section 3, we state our improved bound regarding stable embeddings of manifolds. In Section 4, we then formalize our criteria for answering questions Q1 and Q2 in the context of manifold models. We first confront the task of deriving deterministic instance-optimal bounds in ℓ2 and then consider probabilistic instance-optimal bounds in ℓ2. We conclude in Section 5 with a final discussion. The Toolbox (Appendix A) establishes a collection of useful results in differential geometry that are frequently used throughout our technical proofs, which appear in the remaining appendices.

2 Background

2.1 Sparsity-Based Compressive Sensing

2.1.1 Sparse models

The concise modeling framework used in CS is sparsity. Consider a signal x ∈ RN and suppose the N × N matrix Ψ = [ψ1 ψ2 · · · ψN] forms an orthonormal basis for RN. We say x is K-sparse in the basis Ψ if for α ∈ RN we can write x = Ψα, where ‖α‖0 = K < N. (The ℓ0-norm notation counts the number of nonzero entries of α.) In a sparse representation, the actual information content of a signal is contained exclusively in the K < N positions and values of its nonzero coefficients. For those signals that are approximately sparse, we may measure their proximity to sparse signals as follows. We define αK ∈ RN to be the vector containing only the largest K entries of α in magnitude, with the remaining entries set to zero. Similarly, we let xK = ΨαK. It is then common to measure the proximity to sparseness using either ‖α − αK‖1 or ‖α − αK‖ (the latter of which equals ‖x − xK‖ because Ψ is orthonormal).
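The best K-term approximation αK described above amounts to simple magnitude thresholding; a minimal sketch, with a made-up coefficient vector for illustration:

```python
import numpy as np

def best_k_term(alpha, K):
    """Keep the K largest-magnitude entries of alpha, zero out the rest."""
    alpha_K = np.zeros_like(alpha)
    keep = np.argsort(np.abs(alpha))[-K:]   # indices of the K largest entries
    alpha_K[keep] = alpha[keep]
    return alpha_K

alpha = np.array([5.0, -0.1, 3.0, 0.2, -4.0, 0.05])
alpha_K = best_k_term(alpha, K=3)
print(alpha_K)                               # [ 5.  0.  3.  0. -4.  0.]
print(np.linalg.norm(alpha - alpha_K, 1))    # l1 proximity to sparseness, ~0.35
```

The two printed residual norms (‖α − αK‖1 here, and ‖α − αK‖ analogously) are exactly the proximity measures that appear on the right-hand sides of the recovery bounds below.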
Here and elsewhere in this paper, ‖·‖ stands for the ℓ2 norm.

2.1.2 Stable embeddings of sparse signal families

CS uses the concept of sparsity to simplify the data acquisition process. Rather than designing a sensor to measure a signal x ∈ RN, for example, it often suffices to design a sensor that can measure a much shorter vector y = Φx, where Φ is a linear measurement operator represented as an M × N matrix, and typically M ≪ N. The measurement matrix Φ must have certain properties in order to be suitable for CS. One desirable property (which leads to the theoretical results we mention in Section 2.1.3) is known as the Restricted Isometry Property (RIP) [6–8]. We say a matrix Φ meets the RIP of order K with respect to the basis Ψ if for some δK > 0,

(1 − δK) ‖α‖ ≤ ‖ΦΨα‖ ≤ (1 + δK) ‖α‖

holds for all α ∈ RN with ‖α‖0 ≤ K. Intuitively, the RIP can be viewed as guaranteeing a stable embedding of the collection of K-sparse signals within the measurement space RM. In particular, supposing the RIP of order 2K is satisfied with respect to the basis Ψ, then for all pairs of K-sparse signals x1, x2 ∈ RN, we have

(1 − δ2K) ‖x1 − x2‖ ≤ ‖Φx1 − Φx2‖ ≤ (1 + δ2K) ‖x1 − x2‖. (1)

Although deterministic constructions of matrices meeting the RIP with few rows (ideally proportional to the sparsity level K) are still a work in progress, it is known that the RIP can often be met by choosing Φ randomly from an acceptable distribution. For example, let Ψ be a fixed orthonormal basis for RN and suppose that

M ≥ C1 K log(N/K) (2)

for some constant C1. Then supposing that the entries of the M × N matrix Φ are drawn as i.i.d. Gaussian random variables with mean 0 and variance 1/M, it follows that with high probability Φ meets the RIP of order K with respect to the basis Ψ.
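The near-isometry in (1) is easy to observe empirically. The sketch below (with Ψ = I and parameter choices that are purely illustrative, not the theorem's constants) draws a Gaussian Φ and records ‖Φx‖/‖x‖ over random K-sparse vectors:

```python
import numpy as np

# Empirical look at the near-isometry (1): a random Gaussian Phi with
# entry variance 1/M approximately preserves the norms of K-sparse vectors.
rng = np.random.default_rng(3)
N, M, K = 1000, 200, 10

Phi = rng.standard_normal((M, N)) / np.sqrt(M)

ratios = []
for _ in range(100):
    x = np.zeros(N)
    x[rng.choice(N, K, replace=False)] = rng.standard_normal(K)
    ratios.append(np.linalg.norm(Phi @ x) / np.linalg.norm(x))

print(min(ratios), max(ratios))   # both close to 1
```

With M = 200 rows, the observed ratios cluster tightly around 1, consistent with an RIP constant well below 1 for these sparsity levels.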
Two aspects of this construction deserve special notice: first, the number M of measurements required is linearly proportional to the information level K (and logarithmic in the ambient dimension N), and second, neither the sparse basis Ψ nor the locations of the nonzero entries of α need be known when designing the measurement operator Φ. Other random distributions for Φ may also be used, all requiring approximately the same number of measurements [25, 37, 49].

2.1.3 Sparsity-based signal recovery

Although the sparse structure of a signal x need not be known when collecting measurements y = Φx, a hallmark of CS is the use of the sparse model in order to facilitate understanding from the compressive measurements. A variety of algorithms have been proposed to answer Q1 (signal recovery), where we seek to solve the apparently undercomplete set of M linear equations y = Φx for N unknowns. The canonical method [5, 8, 18] is known as ℓ1-minimization and is formulated as follows: first solve

α̂ = arg min_{α′ ∈ RN} ‖α′‖1 subject to y = ΦΨα′, (3)

and then set x̂ = Ψα̂. This recovery program can also be extended to account for measurement noise. The following bound is known.

Theorem 1. [9] Suppose that Φ satisfies the RIP of order 2K with respect to Ψ and with constant δ2K < √2 − 1. Let x ∈ RN, and suppose that y = Φx + n where ‖n‖ ≤ ε. Then let

α̂ = arg min_{α′ ∈ RN} ‖α′‖1 subject to ‖y − ΦΨα′‖ ≤ ε,

and set x̂ = Ψα̂. Then

‖x − x̂‖ = ‖α − α̂‖ ≤ C1 K^{−1/2} ‖α − αK‖1 + C2 ε (4)

for constants C1 and C2.

This result is not unique to ℓ1 minimization; similar bounds have been established for signal recovery using the greedy iterative algorithms OMP [16], ROMP [42], and CoSaMP [41]. Bounds of this type are extremely encouraging for signal processing. From only M measurements, it is possible to recover x with quality that is comparable to its proximity to the nearest K-sparse signal, and if x itself is K-sparse and there is no measurement noise, then x can be recovered exactly.
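In the noise-free case, program (3) can be recast as a linear program and handed to an off-the-shelf solver. The sketch below (with Ψ = I and illustrative problem sizes of our choosing) uses the standard positive/negative split α = p − q, p, q ≥ 0, so that ‖α‖1 = Σ(p + q):

```python
import numpy as np
from scipy.optimize import linprog

# Basis-pursuit sketch for the noise-free program (3) with Psi = I:
#   minimize sum(p) + sum(q)  subject to  Phi (p - q) = y,  p, q >= 0.
rng = np.random.default_rng(2)
N, M, K = 100, 50, 4

# A K-sparse ground truth alpha and its compressive measurements y.
alpha = np.zeros(N)
alpha[rng.choice(N, K, replace=False)] = rng.standard_normal(K)
Phi = rng.standard_normal((M, N)) / np.sqrt(M)
y = Phi @ alpha

c = np.ones(2 * N)                       # objective = l1 norm of alpha
A_eq = np.hstack([Phi, -Phi])            # encodes Phi p - Phi q = y
res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
alpha_hat = res.x[:N] - res.x[N:]

print(np.linalg.norm(alpha - alpha_hat))  # small: recovery up to solver tolerance
```

At these sizes (K = 4, M = 50, N = 100), the measurement count comfortably satisfies a bound of the form (2), and the program recovers α exactly up to the LP solver's tolerance.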
Moreover, despite the apparent ill-conditioning of the inverse problem, the measurement noise is not dramatically amplified in the recovery process. These bounds are known as deterministic, instance-optimal bounds because they hold deterministically for any Φ that meets the RIP, and because for a given Φ they give a guarantee for recovery of any x ∈ RN based on its proximity to the concise model. The use of ℓ1 as a measure for proximity to the concise model (on the right-hand side of (4)) arises due to the difficulty in establishing ℓ2 bounds on the right-hand side. Indeed, it is known that deterministic ℓ2 instance-optimal bounds comparable to (4) cannot exist. In particular, for any Φ, ensuring that ‖x − x̂‖ ≤ C3 ‖x − xK‖ holds for all x is known [13] to require that M ≥ C4 N regardless of K. However, it is possible to obtain an instance-optimal ℓ2 bound for sparse signal recovery in the noise-free setting by changing from a deterministic formulation to a probabilistic one [13, 17]. In particular, by considering any given x ∈ RN, it is possible to show that for most random Φ, letting the measurements y = Φx, and recovering x̂ via ℓ1-minimization (3), it holds that

‖x − x̂‖ ≤ C5 ‖x − xK‖. (5)

While the proof of this statement [17] does not involve the RIP directly, it holds for many of the same random distributions that work for RIP matrices, and it requires the same number of measurements (2) up to a constant. Similar bounds hold for the closely related problem of “sketching,” where the goal is to use the compressive measurement vector y to identify and report only approximately K expansion coefficients that best describe the original signal, i.e., a sparse approximation to αK. In the case where Ψ = I, an efficient randomized measurement process coupled with a customized recovery algorithm [29] provides signal sketches that meet a deterministic mixed-norm ℓ2/ℓ1 instance-optimal bound analogous to (4) in the noise-free setting.
A desirable aspect of this construction is that the computational complexity scales with only log(N) (and is polynomial in K); this is possible because only approximately K pieces of information must be computed to describe the signal. Though at a higher computational cost, the aforementioned greedy algorithms (such as CoSaMP) for signal recovery can also be interpreted as sketching techniques in that they produce explicit sparse approximations to αK. Finally, for signals that are sparse in the Fourier domain (Ψ consists of the DFT vectors), probabilistic ℓ2/ℓ2 instance-optimal bounds analogous to (5) have been established for a specialized sketching algorithm [27, 28].

2.2 Manifold models and properties

2.2.1 Overview

As we have discussed in Section 1.4, there are many possible modeling frameworks for capturing concise signal structure. Among these possibilities is the broad class of manifold models. Manifold models arise, for example, in settings where the signals of interest vary continuously as a function of some K-dimensional parameter. Suppose, for instance, that there exists some parameter θ that controls the generation of the signal. We let xθ ∈ RN denote the signal corresponding to the parameter θ, and we let Θ denote the K-dimensional parameter space from which θ is drawn. In general, Θ itself may be a K-dimensional manifold and need not be embedded in an ambient Euclidean space. For example, supposing θ describes the 1-D rotation parameter in a top-down satellite image, we have Θ = S1. Under certain conditions on the parameterization θ ↦ xθ, it follows that M := {xθ : θ ∈ Θ} forms a K-dimensional submanifold of RN. An appropriate visualization is that the set M forms a nonlinear K-dimensional “surface” within the high-dimensional ambient signal space RN.
Depending on the circumstances, we may measure the distance between two points xθ1 and xθ2 on the manifold M using either the ambient Euclidean distance ‖xθ1 − xθ2‖ or the geodesic distance along the manifold, which we denote as dM(xθ1, xθ2). In the case where the geodesic distance along M equals the native distance in parameter space, i.e., when

dM(xθ1, xθ2) = dΘ(θ1, θ2), (6)

we say that M is isometric to Θ. The definition of the distance dΘ(θ1, θ2) depends on the appropriate metric for the parameter space Θ; supposing Θ is a convex subset of Euclidean space, then we can let dΘ(θ1, θ2) = ‖θ1 − θ2‖.

While our discussion above concentrates on the case of manifolds M generated by underlying parameterizations, we stress that manifolds have also been proposed as approximate low-dimensional models within RN for nonparametric signal classes such as images of human faces or handwritten digits [4, 33, 58]. These signal families may also be considered.

The results we present in this paper will make reference to certain characteristic properties of the manifold under study. These terms are originally defined in [3, 43] and are repeated here for completeness. First, our results will depend on a measure of regularity for the manifold. For this purpose, we adopt the notion of the condition number of a manifold, which is also known as the reach of a manifold in the geometric measure theory literature [26, 43].

Definition 1. [43] Let M be a compact Riemannian submanifold of RN. The condition number is defined as 1/τ, where τ is the largest number having the following property: the open normal bundle about M of radius r is embedded in RN for all r < τ.

The condition number 1/τ controls both local properties and global properties of the manifold. Its role is summarized in two key relationships (see the Toolbox and [43] for more detail). First, the curvature of any unit-speed geodesic path on M is bounded by 1/τ.
Second, at long geodesic distances, the condition number controls how close the manifold may curve back upon itself. For example, supposing x1, x2 ∈ M with dM(x1, x2) > τ, it must hold that ‖x1 − x2‖ > τ/2.

We continue with a brief but concrete example to illustrate specific values for these quantities. Let N > 0, κ > 0, Θ = R mod 2π, and suppose xθ ∈ RN is given by

xθ = [κ cos(θ), κ sin(θ), 0, 0, · · · , 0]^T.

In this case, M = {xθ : θ ∈ Θ} forms a circle of radius κ in the x(1), x(2) plane. The manifold dimension K = 1, and the condition number 1/τ = 1/κ. We also refer in our results to the K-dimensional volume of M, denoted by VM, which in this example corresponds to the circumference 2πκ of the circle. We conclude this section with a less trivial example of computing the condition number (or, alternatively, reach).

2.2.2 Example: Complex exponential curve

For an integer fC, set N := 2fC + 1. Let β : R → CN denote the complex exponential curve defined as

βt = β(t) = [e^{−i2πfC t}, e^{−i2π(fC−1)t}, . . . , e^{i2π(fC−1)t}, e^{i2πfC t}]^T (7)

for t ∈ R. The following result, proved in Appendix B, gives an estimate of the condition number (which we denote here by 1/τβ) of the complex exponential curve β.2 The reader may refer to [63] for related computations concerning the complex exponential curve.

Lemma 1. For the complex exponential curve β in CN (as defined in (7)), let 1/τβ denote its condition number. Then, for some integer Nsine and (known) constant αsine < 1, the following holds if N > Nsine:

αsine √N ≤ τβ ≤ √N.

2 Unlike β, which is a subset of CN, its real or imaginary parts live in RN and are perhaps more consistent with the rest of this paper (which studies submanifolds of RN). However, finding the condition number of re(β) or im(β) is far more tedious and is therefore not pursued here for the sake of the clarity of our exposition.
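The circle example gives a quick numerical sanity check of the self-avoidance property above: points at geodesic distance greater than τ stay at Euclidean distance greater than τ/2. For a circle of radius κ, the geodesic distance is κ·Δθ and the chord (Euclidean) distance is 2κ sin(Δθ/2):

```python
import numpy as np

# Self-avoidance check on the circle example: radius kappa, tau = kappa.
kappa = 2.0
tau = kappa

dtheta = np.linspace(0.0, np.pi, 10_000)   # angular separations
geodesic = kappa * dtheta                  # distance along the circle
chord = 2 * kappa * np.sin(dtheta / 2)     # ambient Euclidean distance

# Wherever the geodesic distance exceeds tau, the chord must exceed tau / 2.
far = geodesic > tau
print(chord[far].min() > tau / 2)   # True
```

The chord never exceeds the geodesic distance, and on this example the gap at dM > τ is in fact comfortable: the minimal chord is 2κ sin(1/2) ≈ 0.96 κ, well above τ/2.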
3 Stable embeddings of manifolds

In cases where the signal class of interest M forms a low-dimensional submanifold of RN, we have theoretical justification that the information necessary to distinguish and recover signals x ∈ M can be well-preserved under a sufficient number of compressive measurements y = Φx. In particular, it was first shown in [3] that an RIP-like property holds for families of manifold-modeled signals. The result stated that, under a random projection onto RM, pairwise distances between the points on M are approximately preserved with high probability, provided that M, the number of measurements, is large enough. Mainly, M should scale linearly with the dimension K of M and logarithmically with the ambient dimension N. The dependence on N was later removed in [12], which used a different set of assumptions on the manifold to help derive a sharper lower bound on the requisite number of random measurements. Unfortunately, the results given in [12] hold only as the isometry constant ε → 0 in (10), with asymptotic threshold and constants fixed but unspecified. The manifold properties assumed in [12] are arguably more complicated and less commonly used than the manifold volume and condition number which are at the heart of our results. (On the other hand, there do exist manifolds where the properties assumed in [12] clearly provide a stronger analysis.)

In this section, we establish an improved lower bound on M to ensure a stable embedding of a manifold. We make the same assumptions on the manifold as in our past work [3] but provide a measurement bound that is independent of the ambient dimension N. Our bound holds for every ε ≤ 1/3 and we provide explicit constants. The proof, presented in Appendix C, draws from the ideas in generic chaining [56], which have recently been used to develop state-of-the-art RIP results for structured measurement matrices in CS [24, 37, 48, 50–53, 57].
As in [12], we control the failure probability of the manifold embedding by forming a so-called chain on a sequence of increasingly finer covers on the index set of the random process [39, 56]. Aside from delivering an improved bound (and also allowing us to study Q1 and Q2 in Section 4), we hope that our exposition in this paper will encourage yet more researchers in the field of CS to use this powerful technique.

Theorem 2. Let M be a compact K-dimensional Riemannian submanifold of ℝ^N having condition number 1/τ and volume V_M. Conveniently assume that³

V_M ≥ (τ / (21 √(2K)))^K.    (8)

Fix 0 < ε ≤ 1/3 and 0 < ρ < 1. Let Φ be a random M × N matrix populated with i.i.d. zero-mean Gaussian random variables with variance 1/M, with

M ≥ 18 ε⁻² max( 24K + 2K log(√K / (τ ε²)) + log(2 V_M),  log(8/ρ) ).    (9)

Then with probability at least 1 − ρ the following statement holds: for every pair of points x1, x2 ∈ M,

(1 − ε) ‖x1 − x2‖ ≤ ‖Φx1 − Φx2‖ ≤ (1 + ε) ‖x1 − x2‖.    (10)

³ Theorem 2 still holds, with a worse (larger) lower bound in (9), after relaxing the assumption in (8). One example of a manifold that does satisfy the assumption in (8) is the complex exponential curve described in Section 2.2.2, as long as N ≥ 7.

The proof of the above result can be found in Appendix C. In essence, manifolds with higher volume or with greater curvature have more complexity, which leads to an increased number of measurements in (9). By comparing (1) with (10), we see a strong analogy to the RIP of order 2K. This theorem establishes that, like the class of K-sparse signals, a collection of signals described by a K-dimensional manifold M ⊂ ℝ^N can have a stable embedding in an M-dimensional measurement space. Moreover, the requisite number of random measurements M is once again almost linearly proportional to the information level (or number of degrees of freedom) K.
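The embedding property (10) is easy to probe empirically. The sketch below (our own illustration, with hypothetical sizes) draws a Gaussian Φ with the scaling of Theorem 2 and measures the worst distance distortion over sampled pairs on the circle manifold of Section 2.2.1.

```python
import numpy as np

# Empirical check of the embedding property (10) for the circle manifold,
# using a Gaussian matrix with i.i.d. entries of variance 1/M as in
# Theorem 2.  The sizes N, M and the sampling grid are assumptions of
# this demo, not values prescribed by the theorem.
rng = np.random.default_rng(0)
N, M, kappa = 400, 100, 1.0
Phi = rng.normal(0.0, 1.0 / np.sqrt(M), size=(M, N))

thetas = np.linspace(0, 2 * np.pi, 200, endpoint=False)
X = np.zeros((len(thetas), N))
X[:, 0], X[:, 1] = kappa * np.cos(thetas), kappa * np.sin(thetas)
Y = X @ Phi.T                              # compressive measurements

i, j = np.triu_indices(len(thetas), k=1)   # all sampled pairs x1 != x2
ratios = (np.linalg.norm(Y[i] - Y[j], axis=1)
          / np.linalg.norm(X[i] - X[j], axis=1))
eps_emp = max(1.0 - ratios.min(), ratios.max() - 1.0)
print("empirical isometry constant:", eps_emp)
```

Because all secants of this circle span a two-dimensional subspace, the empirical isometry constant stays small even for modest M, consistent with the K-dependence (rather than N-dependence) of (9).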
It is important to note that in (9), the combined dependence on the manifold dimension K, condition number 1/τ, and volume V_M cannot, generally speaking, be improved. In particular, consider the case where M is a K-dimensional unit ball in ℝ^N, that is, M = B^K. Clearly, in this case, τ = 1. Additionally, from (72), we observe that V_M = V_{B^K} ∝ K^{−K/2}, and so log V_M ∝ −K log K. As a result, plugging V_M back into (9) cancels the term 2K log √K = K log K on the right-hand side of (9). It follows that the lower bound in (9) scales with K (rather than K log K) in this case. We conclude that, in this special case, the lower bound in (9) is optimal (up to a constant factor).

As was the case with the RIP for sparse signal processing, this sort of result has a number of possible implications for manifold-based signal processing. First, individual signals obeying a manifold model can be acquired and stored efficiently using compressive measurements, and it is unnecessary to employ the manifold model itself as part of the compression process. Rather, the model needs only to be used for signal understanding from the compressive measurements. Second, problems such as Q1 (signal recovery) and Q2 (parameter estimation) can be addressed, as we discuss in Section 4. Aside from this theoretical analysis, we have reported promising experimental recovery/estimation results with various classes of parametric signals [14, 61]. Also, taking a different analytical perspective (a statistical one, assuming additive white Gaussian measurement noise), estimation-theoretic quantities such as the Cramér-Rao Lower Bound (for a specialized set of parametric problems) have been shown to be preserved in the compressive measurement space as a consequence of the stable embedding [47].
Third, the stable embedding results can also be extended to the case of multiple manifolds that are simultaneously embedded [14]; this allows both the classification of an observed object to one of several possible models (different manifolds) and the estimation of a parameter within that class (position on a manifold). Fourth, collections of signals obeying a manifold model (such as multiple images of a scene photographed from different perspectives) can be acquired using compressive measurements, and the resulting manifold structure will be preserved among the suite of measurement vectors in ℝ^M [15, 46]. Fifth, we have provided empirical and theoretical support for the use of manifold learning in the reduced-dimensional space [32]; this can dramatically simplify the computational and storage demands on a system for processing large databases of signals.

Before presenting an application of Theorem 2 in the next section, we would like to outline the chaining argument used in its proof through an example. Suppose that M is the unit circle embedded in ℝ^N and that we observe this manifold through a measurement operator Φ ∈ ℝ^{M×N}. To study the quality of the embedding, we first need to identify the set of all secants connecting two points in M. In this example, the set of all normalized secants of M (which we denote by U(M)) also forms a unit circle and equals M, i.e., U(M) = M. Let {C_j} be a sequence of increasingly finer covers of M (or, equivalently, of U(M)). Constructing a sequence of covers for the secants of a manifold in general is studied in Appendix C.1. For an arbitrary normalized secant y = (x1 − x2)/‖x1 − x2‖ with x1, x2 ∈ M, let π_j(y) represent the nearest point to y on the jth cover (for every j ≥ 0). This construction is illustrated in Figure 2. We can use the telescoping sum

y = π_0(y) + Σ_{j≥1} (π_j(y) − π_{j−1}(y))

Figure 2: A sequence of increasingly finer covers for the unit circle.
Also shown is an arbitrary point y and its projection π_j(y) onto each cover.

With this telescoping sum, we can write

P{ sup_{x1,x2∈M} ‖Φx1 − Φx2‖ / ‖x1 − x2‖ > 1 + ε }
  = P{ sup_{y∈U(M)} ‖Φy‖ > 1 + ε }
  = P{ sup_{y∈M} ‖Φy‖ > 1 + ε }
  ≤ P{ sup_{y∈M} ‖Φπ_0(y)‖ + Σ_{j≥1} sup_{y∈M} ‖Φ(π_j(y) − π_{j−1}(y))‖ > 1 + ε }
  ≤ P{ max_{p∈C_0} ‖Φp‖ + Σ_{j≥1} max_{(p,q)∈C_j×C_{j−1}} ‖Φ(p − q)‖ > (1 + ε_0) + Σ_{j≥1} ε_j }
  ≤ #C_0 · max_{p∈C_0} P{ ‖Φp‖ > 1 + ε_0 } + Σ_{j≥1} #C_j · #C_{j−1} · max_{(p,q)∈C_j×C_{j−1}} P{ ‖Φ(p − q)‖ > ε_j },    (11)

where {ε_j}_{j≥0} is an exponentially-fast decreasing sequence of constants such that Σ_{j≥0} ε_j = ε. The third line uses the fact that U(M) = M here. The last line uses the union bound. Therefore, the failure probability of obtaining a stable embedding of M is controlled by an infinite series that only involves the sequence of covers constructed earlier. As we will see in more detail later, given enough measurements, the (exponentially growing) size of the covers #C_j can be balanced by the (exponentially decreasing) failure probabilities in the last line of (11) to guarantee that the overall failure probability is exponentially small. A more general version of this chaining argument is detailed in Appendix C.2.

4 Manifold-based signal recovery and parameter estimation

In this section, we establish theoretical bounds on the accuracy to which problems Q1 (signal recovery) and Q2 (parameter estimation) can be solved under a manifold model. To be specific, let us consider a length-N signal x that is not necessarily K-sparse, but rather one that we assume lives on or near some known K-dimensional manifold M ⊂ ℝ^N. From a collection of measurements

y = Φx + n,

where Φ is a random M × N matrix and n ∈ ℝ^M is an additive noise vector, we would like to recover either x or a parameter θ that generates x. For the signal recovery problem, we will consider the following as a method for estimating x. Solve the program

min_{x′∈M} ‖y − Φx′‖,    (12)

and let x̂, as an estimate of x, be a solution of the above program.
We also let x∗ be a solution of the program

min_{x′∈M} ‖x − x′‖    (13)

and, therefore, an optimal nearest neighbor to x on M. To consider signal recovery successful, we would like to guarantee that ‖x − x̂‖ is not much larger than ‖x − x∗‖.

For the parameter estimation problem, where we presume x ≈ xθ for some θ ∈ Θ, we propose a similar method for estimating θ from the compressive measurements. Solve the program

min_{θ′∈Θ} ‖y − Φx_{θ′}‖,    (14)

and let θ̂, as an estimate of θ, be a solution of the above program. Also let θ∗ be a solution of the program

min_{θ′∈Θ} ‖x − x_{θ′}‖.    (15)

Here, θ∗ is an optimal estimate that could be obtained using the full data x ∈ ℝ^N. (If x = xθ exactly for some θ, then θ∗ = θ; otherwise, this formulation allows us to consider signals x that are not precisely on the manifold M in ℝ^N. This generalization has practical relevance; a local image block, for example, may only approximately resemble a straight edge, which has a simple parameterization.) To consider parameter estimation successful, we would like to guarantee that d_Θ(θ̂, θ∗) is small.

As we will see, bounds pertaining to accurate signal recovery can often be extended to imply accurate parameter estimation as well. However, the relationships between the distance d_Θ in parameter space and the distances d_M and ‖·‖ in the signal space can vary depending on the parametric signal model under study. Thus, for the parameter estimation problem, our ability to provide generic bounds on d_Θ(θ̂, θ∗) will be restricted. In this paper we focus primarily on the signal recovery problem and provide preliminary results for the parameter estimation problem that pertain most strongly to the case of isometric parameterizations.

In this paper, we do not confront in depth the question of how a recovery program such as (12) can be efficiently solved. Some efforts in this direction have recently appeared in [31, 35, 54].
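For a one-dimensional parametric family, programs (12) and (14) can be solved by brute force over a parameter grid. The sketch below (our own illustration; the circle family x_θ and all sizes are assumptions chosen so the search is one-dimensional) recovers θ from noisy compressive measurements.

```python
import numpy as np

# Illustrative sketch of programs (12)/(14): brute-force minimization of
# ||y - Phi x_theta'|| over a grid of parameters, for the circle family
# x_theta = [cos(theta), sin(theta), 0, ..., 0].
rng = np.random.default_rng(1)
N, M = 100, 30
Phi = rng.normal(0.0, 1.0 / np.sqrt(M), size=(M, N))

def x_of(theta):
    x = np.zeros(N)
    x[0], x[1] = np.cos(theta), np.sin(theta)
    return x

theta_true = 1.2
y = Phi @ x_of(theta_true) + 0.01 * rng.normal(size=M)   # y = Phi x + n

grid = np.linspace(0, 2 * np.pi, 4000, endpoint=False)
residuals = [np.linalg.norm(y - Phi @ x_of(t)) for t in grid]  # program (14)
theta_hat = grid[int(np.argmin(residuals))]
x_hat = x_of(theta_hat)                    # the corresponding estimate for (12)
print("parameter error:", abs(theta_hat - theta_true))
```

The grid search is exhaustive and therefore slow; it stands in for whatever solver is appropriate for a given manifold, a point the discussion below returns to.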
In [54], the authors guarantee the success of a gradient-projection algorithm for recovering a signal that lives exactly on the manifold from noisy compressive measurements. The keys to the success of this method are a stable embedding of the manifold (as is guaranteed by [3] or our Theorem 2) and knowledge of the projection operator onto the manifold within ℝ^N. In [35], the authors construct a geometric multi-resolution approximation of a manifold using a collection of affine subspaces. A major contribution of that work is a recovery algorithm that works by assigning a measured signal to the closest projected affine subspace in the compressed domain. Two recovery results are presented. In the first of these, the number of measurements is independent of the ambient dimension, and the recovery error bound holds for any given signal in the ambient space. All of this is analogous to our Theorem 4 (a probabilistic instance-optimal bound in ℓ2), but the recovery is guaranteed for a particular algorithm. Unlike that result, however, our Theorem 4 includes explicit constants, allows for the consideration of measurement noise, and falls nearly for free out of our novel analytical framework based on chaining. A second result appearing in [35] provides a special type of deterministic instance-optimal bound for signal recovery and involves embedding arguments that extend those in [3]. It would be interesting to see if our improved embedding guarantees in the present paper could now be used to remove the dependence on the ambient dimension appearing in that result. In [11], the authors provide a Bayesian treatment of the signal recovery problem using a mixture of (low-rank) Gaussians for approximating the manifold. Furthermore, some discussion of signal recovery is provided in [3], with application-specific examples provided in [14, 61].
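The gradient-projection idea attributed above to [54] can be sketched on a toy manifold whose projection operator is available in closed form. This is our own minimal illustration, not the algorithm of [54] itself: for a unit circle in the first two coordinates, the projection simply rescales those coordinates to unit norm; the step size and sizes are assumptions.

```python
import numpy as np

# Minimal gradient-projection sketch: alternate a gradient step on
# 0.5*||y - Phi x||^2 with a projection onto M.  The projection here is
# exact because M is a unit circle in the first two coordinates.
rng = np.random.default_rng(2)
N, M = 100, 30
Phi = rng.normal(0.0, 1.0 / np.sqrt(M), size=(M, N))

def project(v):
    w = np.zeros_like(v)
    w[:2] = v[:2] / np.linalg.norm(v[:2])   # closest point on the circle
    return w

x_true = np.zeros(N)
x_true[0], x_true[1] = np.cos(1.0), np.sin(1.0)   # a point exactly on M
y = Phi @ x_true                                  # noise-free measurements

x_k = project(np.ones(N))                         # feasible initialization
for _ in range(500):
    x_k = project(x_k + 0.5 * Phi.T @ (y - Phi @ x_k))

print("recovery error:", np.linalg.norm(x_k - x_true))
```

Two ingredients mirror the discussion above: Φ stably embeds this manifold (its secants span a two-dimensional subspace), and the projection onto M is known, so the iteration contracts toward the true signal.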
Unfortunately, it is difficult to propose a single general-purpose algorithm for solving (12) in ℝ^M, as even the problem (13) in ℝ^N may be difficult to solve, depending on certain nuances (such as the topology) of the individual manifold. Additional complications arise when the manifold M is non-differentiable, as may happen when the signals x represent 2-D images. However, just as a multiscale regularization can be incorporated into Newton's method for solving (13) (see [62]), an analogous regularization can be incorporated into a compressive measurement operator Φ to facilitate Newton's method for solving (12) (see [19, 21, 61]). For manifolds that lack differentiability, additional care must be taken when applying results such as Theorem 2. We therefore expect that research on signal recovery and approximation based on low-dimensional manifold models will witness even more growth in the future.

It is also crucial to study the theoretical limits and guarantees in this problem; in what follows, we will consider both deterministic and probabilistic instance-optimal bounds for signal recovery and parameter estimation, and we will draw comparisons to the sparsity-based CS results of Section 2.1.3. Our bounds are formulated in terms of generic properties of the manifold (as mentioned in Section 2.2), which will vary from signal model to signal model. In some cases, calculating these properties may be possible, whereas in other cases it may not. Nonetheless, we feel the results in this paper highlight the relative importance of these properties in determining the requisite number of measurements.

4.1 A deterministic instance-optimal bound in ℓ2

We begin by seeking an instance-optimal bound. That is, for a measurement matrix Φ that meets (10) for all x1, x2 ∈ M, we seek an upper bound for the relative reconstruction error

‖x − x̂‖ / ‖x − x∗‖

that holds uniformly for all x ∈ ℝ^N. We would also like this bound to account for noise in the measurements.
In this section we consider only the signal recovery problem; however, similar bounds would apply to parameter estimation. We have the following result, which is proved in Appendix E.

Theorem 3. Fix 0 < ε ≤ 1/3 and 0 < ρ < 1. Let M be as described in Theorem 2. Assume that Φ satisfies (10) for all pairs of points x1, x2 ∈ M. Take x ∈ ℝ^N, let y = Φx + n, and let the recovered estimate x̂ and an optimal estimate x∗ be as defined in (12) and (13). Then the following holds:

‖x − x̂‖ ≤ (1 + 2ε)(2σ_M(Φ) + 1) ‖x − x∗‖ + (2 + 4ε) ‖n‖,    (16)

where σ_M(Φ) is the largest singular value of Φ.

In particular, it is interesting to consider the case where Φ is a random Gaussian matrix as described in Theorem 2. It is well-known, e.g., [60, Corollary 5.35], that the nonzero singular values of Φ satisfy the following for t > 0:

P{ σ_M(Φ) > √(N/M) + 1 + t } ≤ e^{−t² M/2},    (17)

P{ σ_m(Φ) < √(N/M) − 1 − t } ≤ e^{−t² M/2}.    (18)

Here, σ_M(Φ) and σ_m(Φ) are the largest and smallest (nonzero) singular values of Φ, respectively. Suppose that M satisfies (9), so that the promises of Theorem 2 hold except with a probability of at most ρ. Set t = 1 in (17). Now, since e^{−M/2} ≤ ρ, we have that

σ_M(Φ) ≤ √(N/M) + 2,

except with a probability of at most ρ. In combination with Theorem 3, it finally follows that, except with a probability of at most 2ρ, Φ satisfies (10) for every pair of points on the manifold and

‖x − x̂‖ ≤ (1 + 2ε)(2√(N/M) + 5) ‖x − x∗‖ + (2 + 4ε) ‖n‖,    (19)

for every x ∈ ℝ^N. Here, x̂ and x∗ are as defined in (12) and (13). In the noise-free case (‖n‖ = 0), the multiplicative factor on the right-hand side of (19) grows as √(N/M) when M ≪ N. Unfortunately, this is not desirable for signal recovery. Supposing, for example, that we wish to ensure ‖x − x̂‖ ≤ C6 ‖x − x∗‖ for all x ∈ ℝ^N (assuming no measurement noise), then using the bound (19) we would require that M ≥ C7 N, regardless of the dimension K of the manifold.
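The singular-value tail bounds (17)-(18) are easy to confirm by simulation. The following sketch (our own check, with hypothetical sizes) draws repeated Gaussian matrices with the variance-1/M scaling and counts how often either event occurs at t = 1; the bound e^{−M/2} predicts essentially never.

```python
import numpy as np

# Monte-Carlo check of the singular-value bounds (17)-(18) for Gaussian
# matrices with i.i.d. entries of variance 1/M, whose extreme singular
# values concentrate around sqrt(N/M) -+ 1.
rng = np.random.default_rng(3)
N, M, trials, t = 400, 50, 200, 1.0
root = np.sqrt(N / M)

exceed_hi = exceed_lo = 0
for _ in range(trials):
    Phi = rng.normal(0.0, 1.0 / np.sqrt(M), size=(M, N))
    s = np.linalg.svd(Phi, compute_uv=False)
    exceed_hi += s[0] > root + 1 + t       # event bounded by (17)
    exceed_lo += s[-1] < root - 1 - t      # event bounded by (18)

print("exceedances of (17), (18):", exceed_hi, exceed_lo)
```

With M = 50 the bound e^{−t²M/2} = e^{−25} makes either exceedance astronomically unlikely, so both counters should read zero.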
The weakness of this bound is a geometric necessity; indeed, the bound itself is quite tight in general, as the following simple example illustrates. The proof can be found in Appendix F.

Proposition 1. Fix 0 < ε ≤ 1/3. Let M denote the line segment in ℝ^N joining the origin and e1 := [1, 0, 0, . . . , 0]^T. Suppose that Φ satisfies (10) for all x1, x2 ∈ M and that σ_m(Φ) ≥ 8/3. Then there exists a point x ∈ ℝ^N such that if y = Φx (with no measurement noise), and if x̂ and x∗ are defined as in (12) and (13), then

‖x − x̂‖ / ‖x − x∗‖ ≥ σ_m(Φ) / (2(1 + ε)).

In particular, consider the case where Φ is a random Gaussian matrix as described in Theorem 2. According to (76) and (18) (with t = 1), the following two statements are valid except with a probability of at most 2e^{−Mε²/6} + 2e^{−M/2} ≤ 4e^{−Mε²/6}. First, (10) holds for every x1, x2 ∈ M. Second,

σ_m(Φ) ≥ √(N/M) − 2.    (20)

If we assume that N/M ≥ (14/3)², we can conclude, using Proposition 1, that Φ satisfies (10) for every x1, x2 ∈ M and yet there exists x ∈ ℝ^N such that

‖x − x̂‖ / ‖x − x∗‖ ≥ (1/(4(1 + ε))) √(N/M),

except with a probability of at most 4e^{−Mε²/6} on the choice of Φ.

It is worth recalling that, as we discussed in Section 2.1.3, similar difficulties arise in sparsity-based CS when attempting to establish a deterministic ℓ2 instance-optimal bound. In particular, to ensure that ‖x − x̂‖ ≤ C3 ‖x − x_K‖ for all x ∈ ℝ^N, it is known [13] that this requires M ≥ C4 N regardless of the sparsity level K. In sparsity-based CS, there have been at least two types of alternative approaches. The first is the deterministic "mixed-norm" results of the type given in (4). These involve the use of an alternative norm such as the ℓ1 norm to measure the distance from the coefficient vector α to its best K-term approximation α_K.
While it may be possible to pursue similar directions for manifold-modeled signals, we feel this is undesirable as a general approach because, when sparsity is no longer part of the modeling framework, the ℓ1 norm has less of a natural meaning. Instead, we prefer to seek bounds using ℓ2, as that is the most conventional norm used in signal processing to measure energy and error. Thus, the second type of alternative bounds in sparsity-based CS has involved ℓ2 bounds in probability, as we discussed in Section 2.1.3.

Indeed, the performance of both sparsity-based and manifold-based CS is often much better in practice than a deterministic ℓ2 instance-optimal bound might indicate. The reason is that, for any Φ, such bounds consider the worst-case signal over all possible x ∈ ℝ^N. Fortunately, this worst case is not typical. As a result, it is possible to derive much stronger results that consider any given signal x ∈ ℝ^N and establish that, for most random Φ, the recovery error of that signal x will be small.

4.2 Probabilistic instance-optimal bounds in ℓ2

For a given measurement operator Φ, our bound in Theorem 3 applies uniformly to any signal in ℝ^N. However, a much sharper bound can be obtained by relaxing the deterministic requirement.

4.2.1 Signal recovery

Our first bound applies to the signal recovery problem. The proof of this result is provided in Appendix G and, like that of Theorem 2, involves a generic chaining argument.

Theorem 4. Suppose x ∈ ℝ^N. Let M be a compact K-dimensional Riemannian submanifold of ℝ^N having condition number 1/τ and volume V_M. Conveniently assume that⁴

V_M ≥ (τ / (21 √K))^K.    (21)

Fix 0 < ε ≤ 1/3 and 0 < ρ < 1. Let Φ be an M × N random matrix populated with i.i.d. zero-mean Gaussian random variables with variance 1/M, chosen independently of x, with M satisfying (9). Let n ∈ ℝ^M, let y = Φx + n, and let the recovered estimate x̂ and an optimal estimate x∗ be as defined in (12) and (13).
Then with a probability of at least 1 − 4ρ, the following statement holds:

‖x − x̂‖ ≤ min( (1 + 3ε) ‖x − x∗‖ + ετ/40,  (1 + 2ε)(2√(N/M) + 5) ‖x − x∗‖ ) + (2 + 4ε) ‖n‖.    (22)

Roughly speaking, one can discern two different operating regimes in (22):

• When x is sufficiently far from the manifold (‖x − x∗‖ ≫ τ), then (22) roughly reads

‖x − x̂‖ ≤ (1 + 3ε) ‖x − x∗‖ + (2 + 4ε) ‖n‖.

In particular, by setting ‖n‖ = 0 in the bound above (which corresponds to the noise-free setup), we obtain a multiplicative error bound: the recovery error from compressive measurements, ‖x − x̂‖, is no larger than twice the recovery error from a full set of measurements, ‖x − x∗‖.

• On the other hand, when x is close to the manifold (‖x − x∗‖ ≪ τ √(M/N)), then (22) becomes

‖x − x̂‖ ≤ (1 + 2ε)(2√(N/M) + 5) ‖x − x∗‖ + (2 + 4ε) ‖n‖.

When ‖n‖ = 0, we still obtain a multiplicative error bound but with a larger factor in front of ‖x − x∗‖.

Let us also compare and contrast our bound with the analogous results for sparsity-based CS. Like Theorem 1, we consider the problem of signal recovery in the presence of additive measurement noise. Both bounds relate the recovery error ‖x − x̂‖ to the proximity of x to its nearest neighbor in the concise model class (either x_K or x∗, depending on the model), and both bounds relate the recovery error ‖x − x̂‖ to the amount ‖n‖ of additive measurement noise. However, Theorem 1 is a deterministic bound whereas Theorem 4 is probabilistic, and our bound (22) measures proximity to the concise model in the ℓ2 norm, whereas (4) uses the ℓ1 norm. Our bound can also be compared with (5), as both are instance-optimal bounds in probability, and both use the ℓ2 norm to measure proximity to the concise model. However, we note that unlike (5), our bound (22) allows the consideration of measurement noise.

⁴ Theorem 4 still holds, with worse (larger) constants, after relaxing this assumption.
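Taking the constants of (22) at face value, a small arithmetic sketch (our own, with entirely hypothetical numbers for N, M, ε, and τ) shows which argument of the min is active in each of the two regimes described above.

```python
import math

# Evaluate both arguments of the min in (22) for a signal far from and
# close to the manifold; the constants follow our reading of the bound
# and should be treated as illustrative.
N, M, eps, tau, n_norm = 10000, 100, 1 / 3, 1.0, 0.0

def bound(dist):
    arm1 = (1 + 3 * eps) * dist + eps * tau / 40
    arm2 = (1 + 2 * eps) * (2 * math.sqrt(N / M) + 5) * dist
    return min(arm1, arm2) + (2 + 4 * eps) * n_norm

far = 10.0 * tau                         # far regime: first arm active
close = 0.001 * tau * math.sqrt(M / N)   # close regime: second arm active
print("far from M :", bound(far))
print("close to M :", bound(close))
```

For the far signal the first arm dominates, giving essentially (1 + 3ε)‖x − x∗‖; for the near signal the √(N/M) arm is smaller because ‖x − x∗‖ itself is tiny.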
4.2.2 Parameter estimation

Above, we have derived a bound for the signal recovery problem, with an error metric that measures the discrepancy between the recovered signal x̂ and the original signal x. However, in some applications it may be the case that the original signal x ≈ x_{θ∗}, where θ∗ ∈ Θ is a parameter of interest. In this case, we may be interested in using the compressive measurements y = Φx + n to solve the problem (14) and recover an estimate θ̂ of the underlying parameter. Of course, these two problems are closely related. However, we should emphasize that guaranteeing ‖x − x̂‖ ≈ ‖x − x∗‖ does not automatically guarantee that d_M(x_{θ̂}, x_{θ∗}) is small (and therefore does not ensure that d_Θ(θ̂, θ∗) is small). If the manifold is shaped like a horseshoe, for example, then it could be the case that x_{θ∗} sits at the end of one arm but x_{θ̂} sits at the end of the opposing arm. These two points would be much closer in the Euclidean metric than in the geodesic one. Consequently, in order to establish bounds relevant for parameter estimation, our concern focuses on guaranteeing that the geodesic distance d_M(x_{θ̂}, x_{θ∗}) is itself small. Our next result is proved in Appendix H.

Theorem 5. Suppose x ∈ ℝ^N, and fix 0 < ε ≤ 1/3 and 0 < ρ < 1. Let M and Φ be as described in Theorem 4, assuming that M satisfies (9) and that the convenient assumption (21) holds. Let n ∈ ℝ^M, let y = Φx + n, and let the recovered estimate x̂ and an optimal estimate x∗ be as defined in (12) and (13). If ‖x − x∗‖ + (10/9)‖n‖ ≤ 0.163τ, then with probability at least 1 − 4ρ the following statement holds:

d_M(x̂, x∗) ≤ min( (4 + 6ε) ‖x − x∗‖ + ετ/20,  ((4 + 8ε)√(N/M) + 12 + 20ε) ‖x − x∗‖ ) + (4 + 8ε) ‖n‖.    (23)

In several ways, this bound is similar to (22). Both bounds relate the recovery error to the proximity of x to its nearest neighbor x∗ on the manifold and to the amount ‖n‖ of additive measurement noise.
Both bounds also have an additive term on the right-hand side that is small in relation to τ. In contrast to (22), however, (23) guarantees that the recovered estimate x̂ is near to the optimal estimate x∗ in terms of geodesic distance along the manifold. Establishing this condition required the additional assumption that ‖x − x∗‖ + (10/9)‖n‖ ≤ 0.163τ. Because τ relates to the degree to which the manifold can curve back upon itself at long geodesic distances, this assumption prevents exactly the type of "horseshoe" problem that was mentioned above, where it may happen that d_M(x̂, x∗) ≫ ‖x̂ − x∗‖. Suppose, for example, it were to happen that ‖x − x∗‖ ≈ τ and x was approximately equidistant from both ends of the horseshoe; a small distortion of distances under Φ could then lead to an estimate x̂ for which ‖x − x̂‖ ≈ ‖x − x∗‖ but d_M(x̂, x∗) ≫ 0. Similarly, additive noise could cause a similar problem of "crossing over" in the measurement space. Although our bound provides no guarantee in these situations, we stress that under these circumstances, accurate parameter estimation would be difficult (or perhaps even unimportant) in the original signal space ℝ^N.

Finally, we revisit the situation where the original signal x ≈ x_{θ∗} for some θ∗ ∈ Θ (with θ∗ satisfying (15)), where the measurements y = Φx + n, and where the recovered estimate θ̂ satisfies (14). We consider the question of whether (23) can be translated into a bound on d_Θ(θ̂, θ∗). As described in Section 2.2, in signal models where M is isometric to Θ, this is automatic: we have simply that

d_M(x_{θ̂}, x_{θ∗}) = d_Θ(θ̂, θ∗).

Such signal models do exist. Work by Donoho and Grimes [20], for example, has characterized a variety of articulated image classes for which (6) holds or for which d_M(x_{θ1}, x_{θ2}) = C8 d_Θ(θ1, θ2) for some constant C8 > 0. In other models it may hold that

C9 d_M(x_{θ1}, x_{θ2}) ≤ d_Θ(θ1, θ2) ≤ C10 d_M(x_{θ1}, x_{θ2})

for constants C9, C10 > 0.
Each of these relationships may be incorporated into the bound (23).

5 Conclusions

In this paper, we have provided an improved and non-asymptotic lower bound on the number of requisite measurements to ensure a stable embedding of a manifold under a random linear measurement operator. We have also considered the tasks of signal recovery and parameter estimation using compressive measurements of a manifold-modeled signal, and we have established theoretical bounds on the accuracy to which these questions may be answered. Although these problems differ substantially from the mechanics of sparsity-based CS, we have seen a number of similarities that arise due to the low-dimensional geometry of each of the concise models. First, we have seen that a sufficient number of compressive measurements can guarantee a stable embedding of either type of signal family, and the requisite number of measurements scales essentially linearly with the information level of the signal. Second, we have seen that deterministic instance-optimal bounds in ℓ2 are necessarily weak for both problems. Third, we have seen that probabilistic instance-optimal bounds in ℓ2 can be derived that give the optimal scaling with respect to the signal proximity to the concise model and with respect to the amount of measurement noise. Thus, our work supports the growing evidence that manifold-based models can be used with high accuracy in compressive signal processing.

Most of our analysis in this paper rests on a new analytical framework for studying manifold embeddings that uses tools from the theory of empirical processes (namely, the idea of generic chaining). While such tools are becoming more widely adopted in the analysis of sparsity-based CS problems, we believe they are also very promising for studying the interactions of nonlinear signal families (such as manifolds) with random, compressive measurement operators.
We hope that the chaining argument in this paper will be useful for future investigations along these lines.

Acknowledgements

M.B.W. is grateful to Rich Baraniuk and the Rice CS research team for many stimulating discussions. A.E. thanks Justin Romberg for introducing him to generic chaining and other topics in the theory of empirical processes, Han Lun Yap for his valuable contributions to an early version of the proof of Theorem 2 and many productive discussions about the topic, and Alejandro Weinstein for helpful discussions. Finally, both authors would like to acknowledge the tremendous and positive influence that the late Partha Niyogi has had on our work.

A Toolbox

We begin by introducing some notation that will be used throughout the rest of the appendices. In this paper, ℕ stands for the set of nonnegative integers. The tangent space of M at p ∈ M is denoted T_p. The orthogonal projection operator onto this linear subspace is denoted by ↓_p. We let ∠[·, ·] represent the angle between two vectors after being shifted to the same starting point. Throughout this paper, d_M(·, ·) measures the geodesic distance between two points on M. By an r-ball we refer to a Euclidean (open) ball of radius r > 0. In addition, we denote by B^N the unit ball in ℝ^N, with volume V_{B^N}, and we reserve B^N(p, r) to represent an N-dimensional r-ball centered at p in ℝ^N. For r > 0, let

A_M(p, r) := M ∩ B^N(p, r) − p

denote a (relatively) open neighborhood of p on M after being shifted to the origin. Here, the subtraction is in the Minkowski sense. The K-dimensional ball of radius r in T_p will be denoted by B_{T_p}; this ball is centered at the origin, as T_p is a linear subspace. Unless otherwise stated, all distances are measured in the Euclidean metric. A collection of N-dimensional r-balls that covers M is called an r-cover for M, with their centers forming a so-called r-net for M. Notice that in general we do not require a net for M to be a subset of M.
However, we define the covering number of M at scale r, N_M(r), to be the cardinality of a minimal r-net for M among all subsets of M. (In other words, N_M(r) is the smallest number of r-balls centered on M that it takes to cover M.) A maximal r-separated subset of M is called an r-packing for M. The packing number of M at scale r > 0, denoted by P_M(r), is the cardinality of such a set. It can be easily verified that an r-packing for M is also an r-cover for M, so

N_M(r) ≤ P_M(r).    (24)

The concept of (principal) angle between subspaces will later come in handy. The (principal) angle between two linear subspaces T_p and T_q is defined such that cos(∠[T_p, T_q]) = min_u max_v |⟨u, v⟩|, where the unit vectors u and v belong to T_p and T_q, respectively. It is known that

‖↓_p(·) − ↓_q(·)‖_{2,2} = sin(∠[T_p, T_q]),    (25)

where, as defined above, ↓_p and ↓_q are the linear orthogonal projectors onto the tangent subspaces T_p and T_q, respectively [55, Theorem 2.5].⁵ The norm above is the spectral norm, namely the operator norm from ℝ^N equipped with ℓ2 to itself.

We will also use the following conventions to clarify the exposition. For x1 ≠ x2 ∈ ℝ^N, define

U(x1, x2) := (x2 − x1) / ‖x2 − x1‖.

Additionally, we let U(S1, S2) denote the set of directions of all the chords connecting two sets S1, S2 ⊆ ℝ^N, namely

U(S1, S2) := {U(x1, x2) : x1 ∈ S1, x2 ∈ S2, x1 ≠ x2}.

Clearly, U(S1, S2) ⊆ S^{N−1}, where S^{N−1} is the unit sphere in ℝ^N. Whenever possible, we also simplify our notation by using U(S) := U(S, S).

Below we list a few useful results (mostly from differential geometry) which are used throughout the rest of the paper. We begin with a well-known bound on the covering number of Euclidean balls, e.g., [60, Lemma 5.2].⁶

Lemma 2. A K-dimensional unit ball can be covered by at most (3/r)^K r-balls with r ≤ 1.

We now recall several results from Sections 5 and 6 in [43].
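The packing-versus-covering relationship (24) can be seen concretely. The sketch below (our own demo; the unit circle and sampling density are assumptions) builds a maximal r-separated subset greedily and then verifies that the same points also form an r-net, so N_M(r) ≤ P_M(r).

```python
import numpy as np

# Sketch of (24): a maximal r-separated subset of M (an r-packing) is
# automatically an r-cover.  Greedy selection over a dense sample of the
# unit circle produces such a maximal set.
r = 0.3
thetas = np.linspace(0, 2 * np.pi, 5000, endpoint=False)
pts = np.column_stack([np.cos(thetas), np.sin(thetas)])

packing = []
for p in pts:                 # keep a point only if it is r-far from all kept
    if all(np.linalg.norm(p - q) >= r for q in packing):
        packing.append(p)
packing = np.array(packing)

# maximality: every sampled point lies within r of some packing point,
# so the packing centers also serve as an r-net for the sampled manifold
dists = np.linalg.norm(pts[:, None, :] - packing[None, :, :], axis=-1)
print("packing size:", len(packing), " covering radius:", dists.min(axis=1).max())
```

Any point the greedy pass rejected was, by construction, within r of an already-kept point, which is exactly the maximality argument behind (24).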
Unfortunately, we were unable to confirm for ourselves some of the original proofs appearing in [43]. Therefore, some of the statements and proofs below differ slightly from their original counterparts. The first result is closely related to Lemma 5.3 in [43].

Lemma 3. Fix p, q ∈ M such that ‖q − p‖ < 2τ. Then

∠[q − p, ↓_p(q − p)] ≤ sin⁻¹(‖q − p‖/(2τ)).

Proof. Consider the unit vector v along (q − p) − ↓_p(q − p) ⊥ T_p and the point z := p + τ·v. Observe that z − p is orthogonal to the manifold at p. By the definition of the condition number, the distance from z to the manifold is minimized at p, and we must therefore have ‖z − q‖ ≥ τ.⁷ Now consider the triangle formed by the points p, q, z and the line l passing through z and perpendicular to q − p. Let z′ denote the intersection of l with the line passing through p and q. (See Figure 3.)

⁵ In fact, (25) holds for any two linear subspaces (not only tangent subspaces of a manifold).
⁶ Lemma 5.2 in [60] concerns the unit sphere in ℝ^K, but the result still holds for the unit Euclidean ball using essentially the same argument.
⁷ To see this, consider a sequence of points z_n := p + (τ − 1/n)·v for integer values of n. For each n, ‖z_n − p‖ = τ − 1/n < τ, and z_n − p is orthogonal to the manifold at p. Therefore, by the definition of the condition number, no point q′ ∈ M can satisfy ‖z_n − q′‖ < ‖z_n − p‖. Thus, the distance from z_n to the manifold equals τ − 1/n. Taking the limit as n → ∞ and using the continuity of the distance function, we conclude that the distance from z to M equals τ. So, no point q′ ∈ M can satisfy ‖z − q′‖ < ‖z − p‖ = τ.

Figure 3: See proof of Lemma 3.

It is clear that ∠[q − p, z − p] ≤ π/2. Also, since ‖z − q‖ ≥ ‖z − p‖ = τ, we have ∠[z − q, p − q] ≤ ∠[z − p, q − p] ≤ π/2. Therefore, z′ is indeed between p and q. The angle between l and the line passing through p and z equals the angle between q − p and ↓_p(q − p), that is,

∠[p − z, z′ − z] = ∠[q − p, ↓_p(q − p)] =: θ_{p,q}
(26)

To obtain an upper bound for θp,q , we again note that kz − qk ≥ kz − pk and therefore kz0 − qk ≥ kz0 − pk, i.e., kz0 − pk ≤ (1/2)kq − pk. So, θp,q is bounded by sin−1 (kq − pk/2τ ). This completes the proof of Lemma 3.

Lemma 4. [43, Lemma 5.4] For p ∈ M, the derivative of ↓p : RN → Tp is nonsingular on AM (p, τ /2).

Lemma 5. [43, Proposition 6.1] Let γ(·) denote a smooth unit-speed geodesic curve on M defined on an interval I ⊂ R. For every t ∈ I, the following holds: kγ00 (t)k ≤ 1/τ.

Lemma 6. [43, Proposition 6.2] Fix p, q ∈ M. The angle between Tp and Tq , ∠[Tp , Tq ], satisfies cos(∠[Tp , Tq ]) ≥ 1 − dM (p, q)/τ .

The next lemma guarantees that two points separated by a small Euclidean distance are also separated by a small geodesic distance, so that the manifold does not “curve back” upon itself.

Lemma 7. [43, Proposition 6.3] For p, q ∈ M with kq − pk ≤ τ /2, we have

dM (p, q) ≤ τ − τ (1 − (2/τ )kq − pk)^{1/2}. (27)

Proof. The first part of the proof of Proposition 6.3 in [43] establishes that for any p, q ∈ M,

kq − pk ≥ dM (p, q) − (dM (p, q))^2 / 2τ, (28)

which is satisfied only if (27) is satisfied or if

dM (p, q) ≥ τ + τ (1 − (2/τ )kq − pk)^{1/2} (29)

is satisfied. We provide the following argument to complete the proof.

For fixed p ∈ M, let us consider

qb := arg min kq − pk over q ∈ M with dM (p, q) ≥ τ.

We know the minimizer qb exists because we are minimizing a continuous function over a compact set. We consider two cases. First, if dM (p, qb) = τ , then by (28), we will have kqb − pk ≥ τ /2. Second, if dM (p, qb) > τ , then there must exist an open neighborhood of qb on M over which the distance to p is minimized at qb. This implies that p − qb will be normal to M at qb, which by the definition of the condition number (and the fact that p ∈ M) means that we must have kqb − pk ≥ 2τ . Now, for any p, q ∈ M such that kq − pk < τ /2, (27) would imply that dM (p, q) < τ and (29) would imply that dM (p, q) > τ .
From the paragraph above, we see that if dM (p, q) ≥ τ , then kq − pk ≥ kb q − pk ≥ τ /2, and so we can rule out the possibility that (29) is true. Thus, (27) must hold for any p, q ∈ M with kq − pk < τ /2. For any p, q ∈ M such that kq − pk = τ /2, (27) would imply that dM (p, q) ≤ τ and (29) would imply that dM (p, q) ≥ τ . From the paragraph above involving qb, we see that any point q ∈ M satisfying both dM (p, q) ≥ τ and kq − pk = τ /2 would have to be a local minimizer of kq − pk on the convex set and in fact would have to fall into the first case, implying that dM (p, q) = τ exactly. It follows that (27) must hold for any p, q ∈ M with kq − pk = τ /2. The next lemma concerns the invertibility of ↓p within the neighborhood of p and is closely related to Lemma 5.3 in [43]. Lemma 8. For p ∈ M, ↓p is invertible on AM (p, τ /4). Proof. Lemma 4 states that the derivative of ↓p is nonsingular on AM (p, τ /2). The inverse function theorem then implies that there exists an r > 0 such that ↓p is invertible on AM (p, rτ ); without loss of generality assume that r < 1/4 (otherwise we are done). Now, suppose that there exists c > 0 and distinct points q, z ∈ M such that kq − pk = cτ , kz − pk ≤ cτ , and ↓p (q − p) =↓p (z − p). In particular, this implies that z − q ⊥ Tp . (30) That is, for any unit vector v ∈ Tp , we have hz − q, vi = 0. (31) Our goal is to show that c > 1/4. Suppose, in contradiction that indeed c ≤ 1/4. Let γ(·) be the unit-speed geodesic curve connecting q to z, such that γ(0) = q and γ(dM (q, z)) = z. By applying the fundamental theorem of calculus twice, we realize that z − q = γ (dM (q, z)) − γ(0) Z dM (q,z) = γ 0 (α) dα 0 Z dM (q,z) Z 0 = γ (0) + 0 0 Z α 00 γ (β) dβ 0 dM (q,z) Z α = γ (0) · dM (q, z) + 0 22 0 dα γ 00 (β) dβdα. Invoking Lemma 5, we can write that Z 0 dM (q,z) Z α k(z − q) − γ (0) · dM (q, z)k ≤ 0 kγ 00 (β)k dβdα 0 Z Z 1 dM (q,z) α ≤ dβdα τ 0 0 d2 (q, z) = M . 
2τ Meanwhile, having kz − qk ≤ 2cτ implies, via Lemma 7, that √ dM (q, z) ≤ τ − τ 1 − 4c, which, after plugging back into (32), yields z−q 1 1√ 0 dM (q, z) − γ (0) ≤ 2 − 2 1 − 4c. So, for any unit vector v ∈ Tp , we have 0 z−q γ (0), v ≤ γ 0 (0) − z − q , v + , v dM (q, z) dM (q, z) 0 z−q , v = γ (0) − dM (q, z) 0 z−q ≤ γ (0) − dM (q, z) 1 1√ ≤ − 1 − 4c, 2 2 (32) (33) (34) (35) where the first line follows from the triangle inequality, and the second line uses (31). The last line uses (34). To reiterate, (35) is valid for any unit vector v ∈ Tp . On the other hand, Lemma 6 implies that 1 cos (∠[Tq , Tp ]) ≥ 1 − dM (p, q) r τ 2 ≥ 1 − kq − pk τ √ = 1 − 2c, (36) where the second line follows from Lemma 7, and the last line uses kq − pk = cτ . By the definition of the angle between subspaces, (36) implies that there exists a unit vector v0 ∈ Tp such that √ v0 , γ 0 (0) ≥ 1 − 2c (37) because γ 0 (0) ∈ Tq . Combining this bound with (35) for v = v0 , we realize that √ 1 − 2c ≤ 1 1√ − 1 − 4c. 2 2 This inequality is not met for any c ≤ 1/4. Thus, indeed c > 1/4. In particular, this means that ↓p is invertible on AM (p, τ /4). 23 The next three lemmas are of importance when approximating the long and short chords on M with, respectively, nearby long chords and vectors on the nearby tangent planes. Lemma 9. [12, Implicit in Lemma 4.1] Consider two pair of points a1 , a2 and b1 , b2 , all in √ RN , such that ka1 − b1 k , ka2 − b2 k ≤ r, and that ka1 − a2 k ≥ κ r, for some r, κ > 0. Then √ kU (a1 , a2 ) − U (b1 , b2 )k ≤ 4κ−1 r. Lemma 10. For a, b ∈ M with ka − bk ≤ l1 < τ /2, it holds true that r 2l1 , k ↓a v− ↓b vk ≤ τ for every unit vector v ∈ RN . Proof. It follows from (25) that k ↓a v− ↓b vk ≤ k(↓a − ↓b )(·)k2,2 = sin(∠[Ta , Tb ]). (38) On the other hand, since ka − bk ≤ l1 < τ /2, Lemma 7 implies that r 2l1 dM (a, b) ≤ τ − τ 1 − , τ and thus, using Lemma 6, we arrive at r cos (∠ [Ta , Tb ]) ≥ 1− 2l1 . 
τ Plugging back the estimate above into (38), we conclude that k ↓a v− ↓b vk ≤ p 2l1 /τ , as claimed. Lemma 11. Fix p ∈ M, and take two points x1 , x2 ∈ M such that kx1 −pk ≤ l1 and kx2 −x1 k ≤ l2 , l1 , l2 < τ /2. Then, we have that r 2l1 l2 kU (x1 , x2 )− ↓p U (x1 , x2 )k ≤ + . τ 2τ Proof. The triangle inequality implies that kU (x1 , x2 )− ↓p U (x1 , x2 )k ≤ kU (x1 , x2 )− ↓x1 U (x1 , x2 )k + k ↓x1 U (x1 , x2 )− ↓p U (x1 , x2 )k. (39) Since kx2 − x1 k ≤ l2 < 2τ , Lemma 3 is the right tool to bound the first term on the right hand side of (39): kU (x1 , x2 ) − ↓x1 U (x1 , x2 )k = sin (∠ [U (x1 , x2 ) , ↓x1 U (x1 , x2 )]) = sin (∠ [x2 − x1 , ↓x1 (x2 − x1 )]) l2 ≤ . 2τ (40) Since kx1 − pk ≤ l1 < τ /2, a bound on the second term follows from an application of Lemma 10: r 2l1 k ↓x1 U (x1 , x2 )− ↓p U (x1 , x2 )k ≤ . (41) τ Combining (40) and (41) immediately proves our claim. 24 We will also need the following result regarding the local properties of M, which is closely related to Lemma 5.3 in [43]. Lemma 12. Let p ∈ M and r ≤ τ /4. Then the following holds: volK (AM (p, r)) ≥ r2 1− 2 4τ K2 rK VBK , where volK (·) measures the K-dimensional volume. Proof. As in the proof of Lemma 5.3 in [43], we will show that for some r0 > 0 to be defined below, BTp (r0 ) ⊂↓p (AM (p, r)), as our claim follows directly from the inclusion above. To show the above inclusion, we use the following argument. Let us denote the inverse of ↓p on AM (p, τ /4) with g(·). From Lemma 8, ↓p is invertible on AM (p, r) and therefore ↓p (AM (p, r)) is an open set. Thus there exists s > 0 such that BTp (s) ⊂↓p (AM (p, r)). We can keep increasing s until at s = s∗ we reach a point y on the boundary of the closure of BTp (s∗ ) such that y ∈↓ / p (AM (p, r)). Consider ∗ a sequence {yi } ⊂ BTp (s ) ⊂↓p (AM (p, r)) such that yi → y when i → ∞. 
Note that {g(yi )} ⊂ AM (p, r) and, because every sequence in a compact space contains a convergent subsequence, there exists a convergent subsequence {g(yik )} and a point x in the closure of AM (p, r) such that g(yik ) → x. Since ↓p is continuous, ↓p x = y. Therefore y = ↓p x ∉ ↓p (AM (p, r)), so that x ∉ AM (p, r); thus x is on the boundary of the closure of AM (p, r) and kxk = r. Now we invoke Lemma 3 with q = x + p to obtain that

cos(∠[x, y]) ≥ (1 − r2 /4τ 2 )^{1/2} .

It follows that

s∗ = kyk = cos(∠[x, y]) · r ≥ (1 − r2 /4τ 2 )^{1/2} · r =: r0 ,

and thus BTp (r0 ) ⊂ ↓p (AM (p, r)). This completes the proof of Lemma 12 since

volK (AM (p, r)) ≥ volK (↓p (AM (p, r))) ≥ volK (BTp (r0 )) = (r0 )K VBK ,

where the first inequality holds because projection onto a subspace is non-expansive.

We close this section with a list of properties of the Dirichlet kernel which are later used in the proof of Lemma 1 (about the condition number of the complex exponential curve).

Lemma 13. (Dirichlet kernel) For z ∈ [−1/2, 1/2], the Dirichlet kernel takes z to

DN (z) := sin(πN z) / sin(πz).

If |z| > 2/N , then it holds that

|DN (z)| ≤ α1 N,

with α1 ≈ 0.23. Moreover, there exists some α2 > 0 and N2 := N2 (α2 ), such that the following holds for every N > N2 :

|DN (z)| ≤ N (1 − (N πz)2 /40 + α2 z 2 )

for all |z| ≤ 2/N .

Proof. According to [45, Table 7.2], the relative peak side-lobe amplitude of the Dirichlet kernel is (approximately) −13 decibels. That is, the peak side-lobe of the Dirichlet kernel is no larger than α1 N with α1 ≈ 0.23. It is also easily verified that this peak does not occur further than 2/N away from the origin. To summarize, |DN (z)| ≤ α1 N as long as |z| > 2/N . This completes the proof of the first inequality in Lemma 13.

To prove the second inequality, assume that |z| ≤ 2/N . As N → ∞, any z ∈ [−2/N, 2/N ] approaches zero and we may replace the sine in the denominator of the Dirichlet kernel with its argument.
That is, as N → ∞, z → 0 and |DN (z)| | sin(N πz)| = N N | sin(πz)| | sin(N πz)| = N π|z|(1 + O(z 2 )) | sin(N πz)| ≤ (1 + O(z 2 )) N π|z| 1 |(N πz) − 40 (N πz)3 | ≤ (1 + O(z 2 )) N π|z| (N πz)2 (1 + O(z 2 )), = 1− 40 where the third line uses the fact that 1/(1 − a) ≤ 1 + 2a for all 0 ≤ a ≤ 1/2. The second to last line holds because sin a ≤ a − a3 /40 for all 0 ≤ a ≤ 2π. As a result, for some α2 > 0 and N2 = N2 (α2 ), the following holds for every N > N2 : |DN (z)| (N πz)2 (N πz)2 2 ≤ 1− (1 + O(z )) ≤ 1 − + α2 z 2 , (42) N 40 40 which, to reiterate, holds as long as |z| ≤ 2/N . This completes the proof of Lemma 13. B Proof of Lemma 1 Here, τβ stands for the reach of the complex exponential curve, i.e., the inverse of its condition number. Note that the reach of the complex exponential curve is defined as the largest d ≥ 0 such that every point within an `2 distance less than d from β has a unique nearest point (in the `2 sense) on β. In the rest of the proof, (i) we first find a unit-speed parametrization of β, (ii) we then derive some basic properties of the reparametrized curve, and (iii) finally, we estimate τβ by studying the long and short chords on the reparametrized curve separately. 26 B.1 Unit-speed geodesic on β Let γ : R → CN be a unit-speed geodesic obtained by appropriately normalizing β. For every s ∈ R, there must exist t = t(s) ∈ R such that γs = βt . In particular, we note that β is a constant-speed curve with 1/2 fC X dβt 1 2π 3/2 = O(fC ), (2πn)2 = √ (fC (fC + 1) (2fC + 1))1/2 =: dt = v 3 N n=−f C and therefore we can simply take t(s) = vN s. This gives e−i2πfC vN s −i2π(fC −1)vN s e .. γs = βt(s) = βvN s = . i2π(fC −1)vN s e ei2πfC vN s dγs = vN ds −i2πfC · e−i2πfC vN s −i2π(fC − 1) · e−i2π(fC −1)vN s .. . i2π(fC − 1) · ei2π(fC −1)vN s i2πf · ei2πfC vN s , (43) , and (44) C d2 γs 2 = −vN 2 ds (2πfC )2 · e−i2πfC vN s (2π(fC − 1))2 · e−i2π(fC −1)vN s .. . 2 (2π(fC − 1)) · ei2π(fC −1)vN s (2πf )2 · ei2πfC vN s . 
(45) C To reiterate, (43) and (44) represent γ (a unit-speed parametrization of β) and its tangent vector. In addition, the curvature at any point can be computed as the magnitude of the second derivative in (45). That is, 1/2 −1 1/2 2 fC fC fC X X X d γs −1/2 2 (2πn)4 = (2πn)2 (2πn)4 =: wN = O(fC ), ds2 = vN −fC −fC (46) −fC √ where we used (45). Observe that the curvature is constant and scales like 1/ N for large N . Since γ is periodic, we will use t1 t2 to denote subtraction modulo 1 for any t1 , t2 ∈ R so that t1 − t2 = bt1 − t2 c + (t1 t2 ). (Equivalently, represents the natural subtraction on the unit circle.) We continue by recording a few simple facts about the reparametrized complex exponential curve γ. 27 B.2 Some observations about γ Note that γs as a zero-padded sequence in `2 (Z) can be interpreted as the (reversed) sequence of Fourier series coefficients of the signal in time that, at t ∈ R, takes the value γ̌s (t) = sin(πN (t vN s)) =: DN (t vN s), sin(π(t vN s)) where DN (·) is the Dirichlet kernel of width ∼ 2/N . The Dirichlet kernel is known to decay rapidly outside of an interval of width ∼ 2/N centered at the origin as studied in Lemma 13 in the Toolbox. One immediate consequence of Lemma 13 is that |hDN (· t1 ), DN (· t2 )i| = |DN (t1 t2 )| ≤ α1 N if t1 t2 ∈ [2/N, 1 − 2/N ]. (47) The first identity above holds because circular convolution of the Dirichlet kernel with itself produces the Dirichlet kernel again. Now, for any pair s1 , s2 ∈ R, consider the following correlation: |hγs1 , γs2 i| = |hγ̌s1 , γ̌s2 i| = |hDN (· vN s1 ), DN (· vN s2 )i| = |DN (vN s1 vN s2 )| , (48) where we used the Plancherel identity above. Then it follows from (47) that |hγs1 , γs2 i| ≤ α1 N if vN s1 vN s2 ∈ [2/N, 1 − 2/N ]. (49) In words, (49) captures the long-distance correlations on γ. We now turn our attention to shortdistance correlations. 
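Before doing so, the long-distance bound (49) admits a quick numerical check (a sketch of ours, not part of the original argument). The correlation between two points on the curve reduces to the Dirichlet kernel evaluated at the parameter difference, and its magnitude indeed stays below α1 N ≈ 0.23N once that difference is at least 2/N away from the origin (modulo 1).

```python
import numpy as np

fC = 32
N = 2 * fC + 1
n = np.arange(-fC, fC + 1)

def corr(t1, t2):
    # <gamma_{s1}, gamma_{s2}> for the complex exponential curve reduces
    # to the Dirichlet kernel at the parameter difference t1 - t2.
    g1 = np.exp(1j * 2 * np.pi * n * t1)
    g2 = np.exp(1j * 2 * np.pi * n * t2)
    return np.vdot(g2, g1)

# Long distances: by (49), |<gamma_s1, gamma_s2>| <= alpha1 * N with
# alpha1 ~ 0.23 whenever the difference lies in [2/N, 1 - 2/N].
ts = np.linspace(2 / N, 1 - 2 / N, 2000)
vals = np.array([abs(corr(t, 0.0)) for t in ts])
print(vals.max() / N)  # stays below 0.23
```

In fact, away from the main lobe the first side-lobe past 2/N has height roughly N/(2.5π) ≈ 0.13N, comfortably inside the stated bound.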
According to Lemma 13, for some α2 > 0 and N2 = N2 (α2 ), the following holds for every N > N2 : N 2π2 2 |hγs1 , γs2 i| ≤ N 1 − (vN s1 vN s2 ) + α2 N (vN s1 vN s2 )2 (50) 40 if vN s1 vN s2 ∈ [0, 2/N ], where we used (48) again. If vN s1 vN s2 ∈ [1 − 2/N, 1], then (50) holds with vN s1 vN s2 replaced by 1 − (vN s1 vN s2 ) = vN s2 vN s1 . The conclusion in (50) is a direct consequence of the vanishing derivative of the Dirichlet kernel at the origin. We are now in a position to estimate the reach of the complex exponential curve. B.3 Estimating τβ Consider a point γs ∈ CN on the complex exponential curve for an arbitrary s ∈ R. We deviate from γs by χ to obtain x = γs + χ where χ is assumed to be normal to the complex exponential curve at γs , that is dγs χ, = 0, (51) ds where dγs /ds is the tangent vector at γs (which was computed in (44)). We seek the largest d > 0 such that for all χ with kχk < d and satisfying (51), γs is the unique nearest point to x = γs + χ on the complex exponential curve. For γs to be the unique nearest point to x, it must hold that kχk = kx − γs k < kx − γs0 k = kχ + γs − γs0 k, ∀vN s0 vN s 6= 0. (52) Now (52) is equivalent to Re[hχ, γs0 − γs i] < N − Re[hγs , γs0 i] = N − hγs , γs0 i , ∀vN s0 vN s 6= 0, where we used the fact that the complex exponential curve lives on a sphere of radius We consider two separate cases: 28 (53) √ N in CN . Long distances vN s0 vN s ∈ [2/N, 1 − 2/N ]: In this case, it follows from (49) that |hγs , γs0 i| ≤ α1 N, with α ≈ 0.23 as in Lemma 13. As a result, for the inequality in (53) to hold for long distances, it suffices that the following holds: √ kχk · (kγs k + kγs0 k) = kχk · 2 N < (1 − α1 ) N, ∀vN s0 vN s ∈ [2/N, 1 − 2/N ], which is guaranteed as long as kχk < 1 − α1 √ (1 − α1 ) N √ = N. 2 2 N (54) Short distances vN s0 vN s ∈ [0, 2/N ] ∪ [1 − 2/N, 1]: Without loss of generality assume that vN s0 − vN s ∈ [0, 2/N ]. 
In this case, we first note that * Z 0 + s dγη |hχ, γs0 − γs i| = χ, dη ds s * Z 0 + Z s0 Z η 2 s d γξ dγs dη + dξdη = χ, 2 ds ds s s s * + Z s0 Z η 2 d γξ dγs 0 + dξdη = χ, (s − s) 2 ds s s ds * Z 0 Z + s η 2 d γξ = χ, dξdη 2 s s ds Z s0 Z η 2 d γξ ≤ kχk · ds2 dξdη s s Z s0 Z η = wN kχk dξdη s s − s|2 2 (vN s0 vN s)2 = wN kχk · , 2 2vN = wN kχk · |s0 where we used the fundamental theorem of calculus twice. The fourth line above uses the fact that χ is normal to the tangent of γ at s, namely hχ, dγs /dsi = 0. The sixth line uses the fact that curvature of γ is constant and was calculated in (46). Recall that vN s0 vN s ≤ 2/N . Therefore, for a fixed α2 > 0 and N > N2 = N2 (α2 ), (50) dictates that 2 2 N 2π2 0 0 vN s vN s + α2 N vN s0 vN s . hγs , γs i ≤ N 1 − 40 As a result, for the inequality (53) to hold for short distances, it suffices that the following statement holds: wN kχk · 2 2 (vN s0 vN s)2 N 3π2 < vN s0 vN s − α2 N vN s0 vN s , 2 40 2vN 29 ∀vN s0 − vN s ∈ [0, 2/N ], which in turn holds if kχk < 2 2 N vN π 2 N 3 vN − 2α2 · . · 20 wN wN (55) A lower bound on the reach: From (55) and (54), we overall observe that if √ 2 !! √ 2 √ N vN 1 − α1 π 2 N 5/2 vN kχk < min − 2α2 · N =O N , , · 2 20 wN wN then (53) holds uniformly regardless of the value of s. Therefore, we find the following lower bound on the reach of the complex exponential curve: √ 2 !! 2 √ N vN 1 − α1 π 2 N 5/2 vN τβ ≥ min − 2α2 · N, , · 2 20 wN wN which, to reiterate, holds for some α2 and every N > N2 . Because the factor multiplying α2 scales with N −2 whereas the factor multiplying π 2 /20 scales with 1, the following holds for every N > Nsine for some Nsine : ! √ 2 √ 1 − α1 π 2 N 5/2 vN τβ ≥ min , · N =O N . (56) 2 40 wN An upper bound on the reach: Since the complex exponential curve lives on the unit sphere in CN , γs is normal to γ at any arbitrary s. This immediately implies the following upper bound on the reach: √ τβ ≤ N . (57) Together, (56) and (57) complete the proof of Lemma 1. 
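The scaling in (46) — a constant curvature on the order of 1/√N — can also be verified directly from the frequency sums. The sketch below (ours, for illustration) computes the unit-speed curvature of the complex exponential curve as sqrt(Σ(2πn)^4) / Σ(2πn)^2 over n = −fC, …, fC, and confirms that it decays like 1/√N.

```python
import numpy as np

def curvature(fC):
    # Unit-speed curvature of the complex exponential curve with
    # frequencies n = -fC..fC: ||d^2 gamma/ds^2|| equals
    # sqrt(sum (2 pi n)^4) / sum (2 pi n)^2, cf. (46).
    n = np.arange(-fC, fC + 1, dtype=float)
    return np.sqrt(np.sum((2 * np.pi * n) ** 4)) / np.sum((2 * np.pi * n) ** 2)

for fC in [16, 64, 256, 1024]:
    N = 2 * fC + 1
    # curvature * sqrt(N) is roughly constant (about 1.34),
    # i.e. the curvature scales like 1/sqrt(N).
    print(fC, curvature(fC) * np.sqrt(N))
```

Since the reach is at most the inverse of the maximum curvature, this is consistent with the upper bound τβ ≤ √N in (57) up to a constant.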
C Proof of Theorem 2

It is easily verified that our objective is to find an upper bound for

P { sup_{y∈U (M)} |kΦyk − 1| > ε },

when ε ≤ 1/3. The remainder of this section is divided into two parts. In the first part, we construct a sequence of increasingly finer nets for M. This is in turn used to construct a sequence of covers for the set of all (normalized) secants of M. In the second part, we apply a chaining argument that utilizes this latter sequence of covers to prove Theorem 2.

C.1 Sequence of covers for U (M)

For η > 0, let C0 (η) ⊂ M denote a minimal η-net for M over all η-nets that are a subset of M. Upper and lower bounds for #C0 (η) = NM (η) are known for sufficiently small η [43], where #C0 (η) denotes the cardinality of C0 (η). Since the claim below slightly differs from the one in [43], the proof is included here.

Lemma 14. When η ≤ τ /2, it holds that

#C0 (η) ≤ (2 / (θ(η/4τ ) η))^K · VM / VBK =: c0 (η), (58)

where θ(α) := (1 − α2 )^{1/2} for |α| ≤ 1.

Proof. Using (24) and a simple volume comparison argument, we observe that

#C0 (η) = NM (η) ≤ PM (η) ≤ VM / inf_{p∈M} volK (AM (p, η/2)).

Since η/2 ≤ τ /4, we can apply Lemma 12 from the Toolbox (with r = η/2) and obtain that

#C0 (η) ≤ VM / ((1 − η2 /16τ 2 )^{K/2} (η/2)^K VBK ) = (2 / (θ(η/4τ ) η))^K · VM / VBK .

This completes the proof of Lemma 14.

By replacing η with 4−j η, we can construct a sequence of increasingly finer nets for M, {Cj (η)}, such that Cj (η) ⊂ M is a (4−j η)-net for M, for every j ∈ N. In light of Lemma 14, we have that

#Cj (η) ≤ 4jK · c0 (η). (59)

Construction of a sequence of covers for U (M) demands the following setup. For η0 > 0 and j ∈ N, let C j (η0 ) denote a minimal (2−j η0 )-net for BK . For p ∈ Cj (η), we can naturally map C j (η0 ) to live in the K-dimensional unit ball along Tp (and anchored at the origin). We represent this set of vectors by C j,p (η0 ) and define

Cj0 (η, η0 ) := ∪_{p∈Cj (η)} C j,p (η0 ),

which forms a (2−j η0 )-net for the unit balls along the tangent spaces at every point in Cj (η) ⊂ M.
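The covering-number bound (58) can be sanity-checked on the simplest closed manifold. The sketch below (ours, not from the paper) takes M to be the unit circle in R2, so that K = 1, τ = 1, VM = 2π, and VB1 = 2, and verifies that c0 (η) dominates the true covering number for several admissible scales η ≤ τ /2.

```python
import numpy as np

# Unit circle in R^2: K = 1, tau = 1, V_M = 2*pi, V_{B^1} = 2.
tau, V_M, V_B1 = 1.0, 2 * np.pi, 2.0

def c0(eta):
    # Bound (58) with K = 1: (2 / (theta(eta/4tau) * eta)) * V_M / V_{B^1}.
    theta = np.sqrt(1 - (eta / (4 * tau)) ** 2)
    return (2 / (theta * eta)) * V_M / V_B1

def true_covering(eta):
    # A point at angle phi from a center p on the circle satisfies
    # ||q - p|| = 2*sin(phi/2), so one eta-ball centered on the circle
    # covers an arc of width 4*arcsin(eta/2).
    return int(np.ceil(2 * np.pi / (4 * np.arcsin(eta / 2))))

for eta in [0.05, 0.1, 0.25, 0.5]:
    assert true_covering(eta) <= c0(eta)
print("bound (58) dominates the true covering number of the circle")
```

For instance, at η = 0.1 the circle needs 63 balls while (58) allows about 126, so the bound is loose only by a modest constant factor here.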
For δ > 0, let us specify η p and η 0 as functions of δ. For C2 , C3 > 0 to be set later, take η = η(δ) = 2 2 0 0 C2 τ δ and η = η (δ) = C3 η/τ = C2 C3 δ. Now, for every j ∈ N, simply set Tj (δ) := U (Cj (η)) ∪ Cj0 (η, η 0 ). It turns out that U (Cj (η)), the set of all directions in Cj (η), provides a net for the directions of long chords on M. In contrast, Cj0 (η, η 0 ) forms a net for the directions in U (M) that correspond to the short chords on M. It is therefore not surprising that {Tj (δ)} proves to be a sequence of increasingly finer covers for U (M). This discussion is formalized in the next lemma. We remark that Lemma 15 holds more generally for all constants C2 , C3 that satisfy the conditions listed in the proof. √ Lemma 15. Set C2 = 0.4 and C3 = 1.7 − 2. For every j ∈ N, Tj (δ), as constructed above, is a (2−j δ)-net for U (M), when δ ≤ 1/2. Under the mild assumption that VM 21 K √ ≥ , (60) τK 2 K 31 it also holds that #Tj (δ) ≤ 2 · 42jK √ !2K VM 2 6.12 K =: tj (δ). δ2 τK (61) Proof. Consider two arbitrary but distinct points x1 , x2 ∈ M. For C4 > 0 to be p set later in the proof, we separate the treatment of long and short chords, i.e., kx2 − x1 k/τ > C4 η/τ =: γ and kx2 − x1 k/τ ≤ γ, and in this strategy we follow [3, 12]. Short chords are distinct in that, as we will see later, they have to be approximated with nearby tangent vectors. For convenience, let us also define Uγl (M) := {U (z1 , z2 ) : kz2 − z1 k > γτ, z1 , z2 ∈ M}, Uγs (M) := {U (z1 , z2 ) : 0 < kz2 − z1 k ≤ γτ, z1 , z2 ∈ M}. Of course, Uγl (M) ∪ Uγs (M) = U (M), although their intersection might not be empty. p Suppose that kx2 − x1 k/τ > γ = C4 η/τ so that U (x1 , x2 ) ∈ Uγl (M). Since C0 (η) is an η-net for M, there exist p1 and p2 in C0 (η) such that kx1 − p1 k, kx2 − p2 k ≤ η. It then follows from Lemma 9 (with a1 = x1 , a2 = x2 , b1 = p1 , and b2 = p2 ) that r 4 η 4C2 δ kU (x1 , x2 ) − U (p1 , p2 )k ≤ = . 
(62) C4 τ C4 Now, assuming that 4C2 = C4 , (63) and leveraging the fact that the choice of x1 , x2 ∈ M was arbitrary, we conclude that U (C0 (η)) is a δ-net for Uγl (M). p p On the other hand, suppose that 0 < kx2 − x1 k/τ ≤ γ = C4 η/τ = 4C2 η/τ so that U (x1 , x2 ) ∈ Uγs (M). We assume that η = C22 δ 2 < min τ 1 1 , 2 64C2 2 , (64) so that, in particular, kx2 −x1 k < τ /2. Since C0 (η) is an η-net for M, there exists a point p ∈ C0 (η) √ such that kx1 − pk ≤ η < τ /2. Lemma 11 (with l1 = η and l2 = 4C2 τ η) then implies that the direction of the chord connecting x1 to x2 can be approximated with a tangent vector in Tp , that is r r 2η η √ + 2C2 = 2 + 2C2 C2 δ. (65) kU (x1 , x2 )− ↓p U (x1 , x2 )k ≤ τ τ Recall that C 0,p (η 0 ) is an η 0 -net for the unit ball centered at p and along Tp . So, there also exists a vector v ∈ C 0,p (η 0 ) such that k ↓p U (x1 , x2 ) − vk ≤ η 0 = C2 C3 δ. Using the triangle inequality, we therefore arrive at √ kU (x1 , x2 ) − vk ≤ 2 + 2C2 + C3 C2 δ. (66) Assuming that √ 2 + 2C2 + C3 = C2−1 , (67) and leveraging the fact that the choice of x1 , x2 ∈ M was arbitrary, we conclude that C00 (η, η 0 ) is a δ-net for Uγs (M). Overall, under (63), (64), and (67), T0 (δ) = U (C0 (η)) ∪ C00 (η, η 0 ) is a δ-net for U (M). By repeating the argument above (with η, δ, η 0 , γ replaced with η/4j , δ/2j , η 0 /2j , γ/2j ) we observe that Tj (δ) is a (2−j δ)-net for U (M), for every j ∈ N. In particular, the choice of 32 √ C2 = 0.4, C3 = 1.7 − 2, C4 = 1.6 satisfies the conditions above for every δ ≤ 1/2 and completes the proof of the first statement in Lemma 15. In order to bound the cardinality of Tj (δ), we begin with estimating #Cj0 (η, η 0 ). According to Lemma 2, we can write that K 3 · 2j 0 0 #Cj η, η ≤ · #Cj (η) , (68) η0 which holds assuming that η 0 ≤ 1, i.e., C2 C3 δ ≤ 1. (Our choice of C2 , C3 above satisfies this condition.) It is possible now to write that #Tj (δ) ≤ (#Cj (η))2 + #Cj0 (η, η 0 ) K 3 · 2j 2 ≤ (#Cj (η)) + · #Cj (η) η0 K ! 
3 · 2j jK · 4jK c0 (η), ≤ 2 max 4 c0 (η), η0 (69) where we used (68) in the second line and (59) in the last line. To guarantee that the first term dominates the maximum in (69), it suffices (according to the definition of c0 (η) in (58)) to enforce that K K 2 VM 3 ≥ , θ(η/4τ )η VBK η0 which, after plugging in for η and η 0 in terms of δ and using the hypothesis that δ ≤ 1, is satisfied under the mild assumption that VM 3C2 K K ≥ 2.5 VBK ≥ VBK . (70) τK 2C3 The assumption in (70) allows us to simplify (69) and obtain that #Tj (δ) ≤ 2 · 42jK c20 (η). (71) It follows from (71) and the definition of c0 (η) in (58) that 2K 2 VM 2 12.52 2K VM 2 2jK 2jK #Tj (δ) ≤ 2 · 4 ≤2·4 , VBK τ δ2 VBK θ(C22 /4)C22 τ δ 2 where we used the fact that δ ≤ 1. We remind the reader that K/2 4π π K/2 2eπ K/2 ≤ ≤ VBK = , K +2 K +2 Γ K2 + 1 (72) where the inequalities follow from the fact that (K/e)K−1 ≤ Γ (K) ≤ (K/2)K−1 for K ∈ N [44]. Here Γ(·) denotes the Gamma function. The above inequality leads us to √ !2K 6.12 K VM 2 2jK #Tj (δ) ≤ 2 · 4 , δ2 τK √ which holds under the mild assumption that VM /τ K ≥ (21/ K)K . Indeed, this assumption is obtained by plugging our choice of C2 , C3 into (70). This completes the proof of Lemma 15. 33 C.2 Applying the chaining argument Every y ∈ U (M) can be represented with a chain of points in {Tj (δ)}. Let πj (y) be the nearest point to y in Tj (δ). This way we obtain a sequence {πj (y)} that represents y via an almost surely convergent telescoping sum, that is X y = π0 (y) + (πj+1 (y) − πj (y)) . (73) j∈N Note that, for every j ∈ N and every y ∈ M, the length of the chord connecting πj (y) to πj+1 (y) is no longer than 2−j+1 δ. We are now ready to state a generic chaining argument that allows us to bound the failure probability of obtaining a stable embedding of M in terms of its geometrical properties. The interested reader is referred to [56] for more information about the generic chaining. Lemma 16. Fix 0 < δ < 1 < ≤ 1/3, and 2 > 0 such that 1 + 2 = . 
Choose C5 , C6 > 0 so 1+C5 that 1 /δ ≥ 1−C and 2 /δ ≥ C6 . Then, under (60), we have that 5 ( ) sup |kΦyk − 1| > P ≤ 2t0 (δ) · max P {|kΦt0 k − kt0 k| > C5 1 kt0 k} t0 ∈T0 (δ) y∈U (M) +2 X t2j+1 (δ) · j∈N max (tj ,sj )∈Qj (δ) P kΦsj − Φtj k > 8−1 C6 (j + 1)ksj − tj k , (74) where {tj (δ)} were previously defined in Lemma 15. For j ∈ N, Qj (δ) is defined as Qj (δ) := {(tj , sj ) : πj (y) = tj and πj+1 (y) = sj for some y ∈ U (M)} . Proof. For notational convenience, let us denote the infinite sum in (73) by Σ(y). Then, using the triangle inequality, we observe that ( ) sup kΦyk > 1 + = P sup kΦπ0 (y) + ΦΣ(y)k > 1 + 1 + 2 P y y∈U (M) ≤ P sup kΦπ0 (y)k + sup kΦΣ(y)k > 1 + 1 + 2 y y ≤ P sup kΦπ0 (y)k − 1 > 1 + P sup kΦΣ(y)k > 2 , y y and similarly, P inf kΦyk < 1 − = P inf kΦπ0 (y) + ΦΣ(y)k < 1 − 1 − 2 y y∈U (M) ≤ P inf kΦπ0 (y)k − sup kΦΣ(y)k < 1 − 1 − 2 y y ≤ P sup 1 − kΦπ0 (y)k > 1 + P sup kΦΣ(y)k > 2 . y We can therefore argue that ( ) P sup |kΦyk − 1| > y∈U (M) y ≤ P sup kΦyk > 1 + + P inf kΦyk < 1 − y y ≤ 2 P sup |kΦπ0 (y)k − 1| > 1 y 34 + 2 P sup kΦΣ(y)k > 2 . y (75) Consider the first probability on the last line of (75): ) ( P sup |kΦπ0 (y)k − 1| > 1 ≤ P sup |kΦπ0 (y)k − kπ0 (y)k| + sup |kπ0 (y)k − 1| > 1 y y∈U (M) y ≤ P sup |kΦπ0 (y)k − kπ0 (y)k| > 1 − δ y |kΦπ0 (y)k − kπ0 (y)k| 1 − δ ≤ P sup > kπ0 (y)k 1+δ y |kΦπ0 (y)k − kπ0 (y)k| ≤ P sup > C5 1 kπ0 (y)k y |kΦt0 k − kt0 k| ≤ P max > C5 1 kt0 k t0 ∈T0 (δ) ≤ #T0 (δ) · max P {|kΦt0 k − kt0 k| > C5 1 kt0 k} t0 ∈T0 (δ) where the first line uses the triangle inequality. The second and third lines hold on account of T0 (δ) being a net for a subset of SN −1 , namely U (M). An application of the union bound gives the last line above. Now consider the second probability on the last line of (75). 
By the definition of Σ(y), we observe that ( ) sup kΦΣ(y)k > 2 P y∈U (M) X Φπj+1 (y) − Φπj (y) > 2 = P sup y X ≤P max kΦsj − Φtj k > C6 δ (tj ,sj )∈Qj (δ) j X X =P max kΦsj − Φtj k > C6 (j + 1)2−j−2 δ (tj ,sj )∈Qj (δ) j j X ≤ P max kΦsj − Φtj k > C6 (j + 1)2−j−2 δ (tj ,sj )∈Qj (δ) j ≤ X j ≤ X j P max (tj ,sj )∈Qj (δ) 2 #Tj+1 (δ) kΦsj − Φtj k > 8 max (tj ,sj )∈Qj (δ) −1 C6 (j + 1)ksj − tj k P kΦsj − Φtj k > 8−1 C6 (j + 1)ksj − tj k . The third line above uses the triangle inequality and the assumption on 2 , while the fifth and last lines use the union bound. It can be easily verified that the infinite sum on the right hand side of the inequality in the fourth line equals one. In the sixth line, we used the observation that (tj , sj ) ∈ Qj (δ) implies that ksj − tj k ≤ 2−j δ + 2−j−1 δ ≤ 2−j+1 δ. Having upper bounds for both 35 terms on the last line of (75), we overall arrive at ( ) P sup |kΦyk − 1| > ≤ 2#T0 (δ) · max P {|kΦt0 k − kt0 k| > C5 1 kt0 k} t0 ∈T0 (δ) y∈U (M) +2 X 2 #Tj+1 (δ) j max (tj ,sj )∈Qj (δ) P kΦsj − Φtj k > 8−1 C6 (j + 1)ksj − tj k . From Lemma 15, #Tj (δ) ≤ tj (δ). This establishes Lemma 16. There are two type of probabilities involved in the upper bound above. One controls the large deviations of kΦt0 k from its expectation, and the other corresponds to very large (one sided) deviations of kΦsj − Φtj k from its expectation. As claimed in the next lemma and proved in Appendix D, both of these probabilities are exponentially small when M is large enough. Lemma 17. Fix 0 ≤ λ ≤ 1/3 and λ0 ≥ 1/5. Then, for fixed y ∈ RN , we have M λ2 P {|kΦyk − kyk| > λkyk} ≤ 2e− 6 M λ0 P kΦyk > 1 + λ0 kyk ≤ e− 7 . (76) (77) p Now fix ≤ 1/3 and set 1 = 9/10. Taking C5 = 6/7, C6 = 16, δ = /160 and finally assuming (60) guarantees that Lemma 15 is in force. 
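In the proof of Lemma 16 above, the tail budget is split across scales j with weights (j + 1) 2^{−j−2}, and the fourth line of the corresponding display relies on these weights summing to exactly one (a standard arithmetic-geometric series). A one-line numerical confirmation (ours):

```python
# sum_{j >= 0} (j + 1) * 2^{-j-2} = (1/4) * 1/(1 - 1/2)^2 = 1.
total = sum((j + 1) * 2 ** (-j - 2) for j in range(200))
print(total)  # -> 1.0 (up to floating point)
```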
Under this setup, note that an upper bound for the first term on the right hand side of (74) can be found by applying (76) (after plugging in for C5 ): ( ) r M 2 6 1 2t0 (δ) · max P |kΦt0 k − kt0 k| > 1 kt0 k ≤ 2t0 (δ) · 2e− 7 , 7 t0 ∈T0 (δ) and, assuming that M ≥ 14−2 1 log t0 (δ), (78) we arrive at ( r 2t0 (δ) · max P |kΦt0 k − kt0 k| > t0 ∈T0 (δ) ) M 2 6 1 1 kt0 k ≤ 4e− 14 . 7 (79) In order to bound the second term on the right hand side of (74), we proceed as follows. Consider the maximum inside the summation. After plugging in for C6 and applying (77), we can bound this maximum as max (tj ,sj )∈Qj (δ) P {kΦsj − Φtj k > 2(j + 1)ksj − tj k} ≤ e− (2j+1)M 7 . Using the estimate above and Lemma 15, we get an upper bound for the second term on the right hand side of (74): X 2 t2j+1 (δ) · max P {kΦsj − Φtj k > 2(j + 1)ksj − tj k} (tj ,sj )∈Qj (δ) j∈N M ≤ 2t20 (δ)e− 7 44K X 2 44jK e− 7 jM . j∈N 36 (80) Assuming that M ≥ max(32 log t0 (δ), 310K) (81) allows us to continue simplifying (80), therefore arriving at X 2 t2j+1 (δ) · max P {kΦsj − Φtj k > 2(j + 1)ksj − tj k} (tj ,sj )∈Qj (δ) ≤ 4e −M 17 . (82) We can now combine (79) and (82) to obtain M 2 M 2 M 1 1 P sup |kΦyk − 1| > ≤ 4e− 14 + 4e− 17 ≤ 8e− 14 , y where the second inequality follows since 1 ≤ 1/3 and thus 21 /14 ≤ 1/17. In particular, to achieve a failure probability of at most ρ ≤ 1, we need M ≥ 14−2 1 log (8/ρ) . (83) Assuming that (60) holds and that ≤ 1/3, we verify that (81) may be absorbed into (78) (i.e., (78) implies (81)). We are now left with (78) and (83), which are in turn lumped into a single lower bound on M (after plugging in for δ), that is √ ! ! K 8 2 M ≥ 18−2 max log(2VM ) + 24K + 2K log , log 2 τ ρ ! √ 2K 6.12 K 8 −2 2 ≥ 18 max log 2VM , log 2 τδ ρ = 18−2 max (log t0 (δ), log (8/ρ)) ≥ 14−2 1 max (log t0 (δ), log (8/ρ)) . (84) Therefore, we proved that ( ) sup |kΦyk − 1| > P ≤ ρ, y∈U (M) provided that M satisfies (84). This completes the proof of Theorem 2. 
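The two concentration estimates of Lemma 17 drive the bounds above. As an illustration (not part of the proof), the Monte Carlo sketch below checks (76) for a Φ with i.i.d. zero-mean Gaussian entries of variance 1/M: the empirical failure frequency sits well below the stated tail bound 2 exp(−Mλ2 /6).

```python
import numpy as np

rng = np.random.default_rng(1)

# Check (76): for Phi with i.i.d. N(0, 1/M) entries and a fixed unit
# vector y, P{ | ||Phi y|| - 1 | > lam } <= 2 exp(-M lam^2 / 6).
M, lam, trials = 100, 0.3, 20000

# By rotation invariance, ||Phi y|| for any unit y has the same law as
# the norm of an M-vector of N(0, 1/M) entries.
norms = np.linalg.norm(rng.standard_normal((trials, M)) / np.sqrt(M), axis=1)
empirical = np.mean(np.abs(norms - 1.0) > lam)
bound = 2 * np.exp(-M * lam**2 / 6)

print(empirical, bound)  # empirical failure rate is below the bound
```

Here the bound evaluates to about 0.45 while the empirical rate is essentially zero, reflecting that (76) is far from tight at these parameters; its role in the proof is only to be exponentially small in M.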
D Proof of Lemma 17 The proof is elementary. It is easily verified that E kΦyk2 = kyk2 , and we then note that P {|kΦyk − kyk| > λkyk} = P {kΦyk > (1 + λ)kyk} + P {kΦyk < (1 − λ)kyk} n o n o ≤ P kΦyk2 > (1 + λ)kyk2 + P kΦyk2 < (1 − λ)kyk2 −M 2 ≤ 2e 3 λ2 − λ3 2 ≤ 2e− M λ2 2 ( 12 − 19 ) ≤ 2e− M λ2 6 , 37 where the third line uses a well-known concentration bound [1]. The fourth line holds because λ ≤ 1/3. This establishes the first inequality in Lemma 17. For the second inequality, assume, without loss of generality, that kyk = 1. We begin by observing that P kΦyk > 1 + λ0 kyk = P kΦyk > 1 + λ0 n o ≤ P kΦyk2 > 1 + 2λ0 ( ) M X = P M −1 n2i − 1 > 2λ0 i=1 =P (M X ) n2i − M > 2λ0 M , (85) i=1 where n1 , n2 , · · · , nM are zero-mean and unit-variance Gaussian random variables. The third line above follows since the entries of the vector Φy are distributed as i.i.d. zero-mean Gaussians with variance of 1/M . We now recall Lemma 1 in [38], which states that ) (M X √ 2 (86) P ni − M > 2 M α + 2α ≤ e−α , i=1 for α > 0. Comparing the last line in (85) to the inequality above, we observe that taking α= 2 M √ 1 + 4λ0 − 1 4 allows us to continue simplifying (85) to obtain that (M ) X √ 0 2 P kΦyk > 1 + λ ≤P ni − M > 2 M α + 2α ≤ e−α . (87) i=1 It is easily verified that √ 1 + 4λ0 − 1 ≥ (3 − α≥ and consequently, √ √ 5) λ0 when λ0 ≥ 1/5. It follows that √ M · (3 − 5)2 λ0 ≥ M λ0 /7, 4 M λ0 P kΦyk > 1 + λ0 kyk ≤ e− 7 , as claimed. This establishes the second inequality in Lemma 17 and completes the proof. E Proof of Theorem 3 Fix α ∈ [1 − , 1 + ]. We consider any two points wa , wb ∈ M such that kΦwa − Φwb k = α, kwa − wb k and supposing that x is closer to wa , i.e., kx − wa k ≤ kx − wb k , 38 (88) but y = Φx + n is closer to Φwb , i.e., ky − Φwb k ≤ ky − Φwa k , we seek the maximum value that kx − wb k may take. 
In other words, we wish to bound the worst possible "mistake" (according to our error criterion) between two candidate points on the manifold whose distance is scaled by the factor $\alpha$. This can be posed as an optimization problem:
\[
\max_{x\in\mathbb{R}^N,\ w_a,w_b\in\mathcal{M}}\ \|x-w_b\|\quad\text{s.t.}\quad \|x-w_a\|\le\|x-w_b\|,\quad \|y-\Phi w_b\|\le\|y-\Phi w_a\|,\quad \frac{\|\Phi w_a-\Phi w_b\|}{\|w_a-w_b\|}=\alpha.
\]
For simplicity, we may expand the constraint set to include all $w_a,w_b\in\mathbb{R}^N$; the solution to this larger problem is an upper bound for the solution to the case where $w_a,w_b\in\mathcal{M}$. This leaves
\[
\max_{x,w_a,w_b\in\mathbb{R}^N}\ \|x-w_b\|\quad\text{s.t.}\quad \|y-\Phi w_b\|\le\|y-\Phi w_a\|,\quad \frac{\|\Phi w_b-\Phi w_a\|}{\|w_b-w_a\|}=\alpha,
\]
where we have also dropped the first constraint (because of its relation to the objective function). Under the constraints above, the objective satisfies
\[
\begin{aligned}
\|x-w_b\|&\le \|x-w_a\|+\|w_b-w_a\|\\
&= \|x-w_a\|+\|\Phi w_b-\Phi w_a\|/\alpha\\
&\le \|x-w_a\|+2\|y-\Phi w_a\|/\alpha\\
&\le \|x-w_a\|+2\left(\|\Phi x-\Phi w_a\|+\|n\|\right)/\alpha\\
&\le \|x-w_a\|+\frac{2\sigma_M(\Phi)}{1-\epsilon}\|x-w_a\|+\frac{2\|n\|}{1-\epsilon}\\
&\le \frac{1}{1-\epsilon}\left(2\sigma_M(\Phi)+1\right)\|x-w_a\|+\frac{2\|n\|}{1-\epsilon}.
\end{aligned}
\]
The first line follows from the triangle inequality. The identity in the second line uses the second constraint. The first constraint (via the triangle inequality) implies that $\|\Phi w_b-\Phi w_a\|\le 2\|y-\Phi w_a\|$, and the third line thus follows. The fourth line uses the triangle inequality one more time. The fifth line follows from the possible range of $\alpha$ (namely $\alpha\ge 1-\epsilon$) together with $\|\Phi x-\Phi w_a\|\le\sigma_M(\Phi)\|x-w_a\|$. To reiterate, the above conclusion holds for any observation $x$ that could be mistakenly paired with $w_b$ instead of $w_a$ (under a $\Phi$ that scales the distance $\|w_a-w_b\|$ by $\alpha$). This completes the proof of Theorem 3 after noting that $(1-\epsilon)^{-1}\le 1+2\epsilon$ when $\epsilon\le 1/2$.

F  Proof of Proposition 1

Set $\delta:=(1+\epsilon)\left(\sigma_m(\Phi)\right)^{-1}$, and let
\[
x=e_1+\delta u,
\]
where $u$, with $\|u\|\le 1$, belongs to the row span of $\Phi$ and satisfies $\Phi x=\Phi(e_1+\delta u)=0$. Finding such a $u$ is possible because
\[
\|\Phi e_1\|\le(1+\epsilon)\|e_1\|=1+\epsilon=\delta\cdot\sigma_m(\Phi)=\delta\cdot\sigma_m(\Phi)\|v\|\le\delta\|\Phi v\|
\]
for every unit vector $v$ in the row span of $\Phi$. The first inequality holds because $e_1,0\in\mathcal{M}$ and $\Phi$ stably embeds $\mathcal{M}$.
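Such a $u$ can be computed explicitly. The sketch below is our illustration (not the paper's prescription): we take $u$ to be the minimum-norm least-squares solution of $\Phi u=-\Phi e_1/\delta$, which automatically lies in the row span of $\Phi$, and we instantiate $\epsilon$ as $\|\Phi e_1\|-1$ so that $\|\Phi e_1\|=1+\epsilon$ holds with equality. We then verify $\Phi x=0$ and $\|u\|\le 1$ numerically:

```python
import numpy as np

rng = np.random.default_rng(3)
N, M = 30, 12
Phi = rng.standard_normal((M, N)) / np.sqrt(M)
e1 = np.zeros(N)
e1[0] = 1.0

sigma_m = np.linalg.svd(Phi, compute_uv=False)[-1]  # smallest singular value σ_m(Φ)
eps = np.linalg.norm(Phi @ e1) - 1                  # illustrative: ‖Φe1‖ = 1 + eps
delta = (1 + eps) / sigma_m                         # δ := (1 + ε) σ_m(Φ)^{-1}

# minimum-norm solution of Phi @ u = −Phi @ e1 / delta lies in the row span of Phi
u = np.linalg.lstsq(Phi, -Phi @ e1 / delta, rcond=None)[0]
x = e1 + delta * u

print("‖Φx‖ =", np.linalg.norm(Phi @ x))           # ≈ 0, so the recovery is x̂ = 0
print("‖u‖  =", np.linalg.norm(u))                 # ≤ 1, as the proof requires
```

Indeed, $\|u\|=\|\Phi^{\dagger}\Phi e_1\|/\delta\le\|\Phi e_1\|/(\sigma_m(\Phi)\,\delta)=1$ for this choice, matching the existence argument above.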
The second equality holds by our choice of $\delta$, and the last inequality holds because $v$ belongs to the row span of $\Phi$. With our choice of $x$ above, we have $\Phi x=0$ and therefore $\widehat{x}=0$. On the other hand, $\|x-x^*\|\le\|x-e_1\|\le\delta$. It follows that
\[
\frac{\|x-\widehat{x}\|}{\|x-x^*\|}\ge\frac{\|x-\widehat{x}\|}{\|x-e_1\|}\ge\frac{\|x\|}{\delta}\ge\frac{1-\delta}{\delta}\ge\frac{1}{2\delta}=\frac{\sigma_m(\Phi)}{2(1+\epsilon)}.
\]
Indeed, one can verify that $\delta\le 1/2$ because (by hypothesis) $\epsilon\le 1/3$ and $\sigma_m(\Phi)\ge 8/3$; this immediately implies the second-to-last inequality above. This completes the proof of Proposition 1.

G  Proof of Theorem 4

Our success in stably embedding $\mathcal{M}$ via random linear measurements (and what distinguished Theorem 2 from the embedding of a point cloud) relied on the smoothness of the manifold. This assumption enabled us to control the behavior of short chords on $\mathcal{M}$. However, $x$ does not generally belong to the manifold and hence, in general, we cannot control the direction of short chords connecting $x$ to $\mathcal{M}$. To deal with this issue, we proceed as follows. For fixed $\gamma>0$ to be specified later, define
\[
\mathcal{M}_\gamma:=\{z\in\mathcal{M}:\|z-x\|>\gamma\tau\},
\]
and let $\mathcal{M}_\gamma^C:=\mathcal{M}\setminus\mathcal{M}_\gamma$, i.e., the complement of $\mathcal{M}_\gamma$ in $\mathcal{M}$. Note that one of the two sets may be empty. Our first step towards a proof is to show that, for every $z\in\mathcal{M}_\gamma$ with an appropriately chosen $\gamma$, we have
\[
(1-\epsilon)\|z-x\|\le\|\Phi z-\Phi x\|\le(1+\epsilon)\|z-x\|. \tag{89}
\]
In other words, we first study the stable embedding of the directions of all the chords connecting $x$ to $\mathcal{M}_\gamma$, namely $U(\mathcal{M}_\gamma,x)$, for an appropriate $\gamma$. This is addressed next.

Lemma 18. Choose $0<\epsilon\le 1/3$ and $0<\rho<1$, and assume that (21) holds. If
\[
M\ge 18\epsilon^{-2}\max\left(11K+K\log\frac{\sqrt{K}}{\epsilon^2\tau}+\log V_M,\ \log\frac{8}{\rho}\right), \tag{90}
\]
then, except with a probability of at most $\rho$, (89) holds for every $z\in\mathcal{M}_{\epsilon/40}$.

Proof. The proof strategy is identical to that in Appendix C. We will prove that (89) holds for every $z\in\mathcal{M}_{\epsilon/40}$, with high probability and provided that $M$ is large enough. As before, this is achieved by finding an upper bound on
\[
P\left\{\sup_{y\in U(\mathcal{M}_{\epsilon/40},\,x)}\big|\|\Phi y\|-1\big|>\epsilon\right\} \tag{91}
\]
for $\epsilon\le 1/3$. We begin again by constructing a sequence of increasingly finer covers for $U(\mathcal{M}_\gamma,x)$, with $\gamma$ to be set later. We denote this sequence by $\{L_j(\delta)\}$; each $L_j(\delta)$ is a $(2^{-j}\delta)$-net for $U(\mathcal{M}_\gamma,x)$. For $0<\delta\le 1/\sqrt{2}$, set $\eta=\delta^2\tau$ and $\gamma=4\delta$. We form $\{L_j(\delta)\}$ from $\{C_j(\eta)\}$, the sequence of covers for $\mathcal{M}$ constructed in Appendix C.1. Indeed, the same argument as in that section proves that $U(C_j(\eta),x)$ is a $(2^{-j}\delta)$-cover for $U(\mathcal{M}_\gamma,x)$. It also holds that
\[
\#L_j(\delta)\le\#C_j(\eta)\le 4^{jK}\,\frac{K^{K/2}\,V_M}{\theta(\delta^2/4)\,\delta^2\tau\,V_{B^K}}\le 4^{jK}\left(\frac{2\sqrt{K}}{\delta^2\tau}\right)^{K}V_M=:l_j(\delta). \tag{92}
\]
As before, we can represent every $y\in U(\mathcal{M}_\gamma,x)$ with an infinite chain of points from the sequence of covers $\{L_j(\delta)\}$. After setting $\delta=\epsilon/160$, using the same argument as the one in Appendix C.2, and exploiting the estimates above, one can verify that the failure probability in (91) is at most $\rho$, provided that (90) holds.

We now combine Lemma 18 with an elementary argument to complete the proof of Theorem 4. There are two cases to consider: $\widehat{x}\in\mathcal{M}_{\epsilon/40}^C$ and $\widehat{x}\in\mathcal{M}_{\epsilon/40}$. Clearly,
\[
\|x-x^*\|\le\|x-\widehat{x}\|\le\frac{\epsilon\tau}{40},\quad\text{when }\widehat{x}\in\mathcal{M}_{\epsilon/40}^C. \tag{93}
\]
If, however, $\widehat{x}\in\mathcal{M}_{\epsilon/40}$, then a more detailed analysis is required. An application of Lemma 17 implies that (89) holds for $z=x^*$, except with a probability of at most $\rho$ and provided that $M\ge 6\epsilon^{-2}\log(2/\rho)$. Suppose the assumptions of Lemma 18 are met. Then (89) holds for every $z\in\mathcal{M}_{\epsilon/40}\cup\{x^*\}$, except with a probability of at most $2\rho$. Also, by the definitions of $x^*$ and $\widehat{x}$, it holds that
\[
\|x-x^*\|\le\|x-\widehat{x}\|\quad\text{and}\quad \|(\Phi x+n)-\Phi\widehat{x}\|\le\|(\Phi x+n)-\Phi x^*\|.
\]
Now, combining all of these bounds and using several applications of the triangle inequality, we obtain
\[
\begin{aligned}
\|x-\widehat{x}\|&\le(1-\epsilon)^{-1}\|\Phi x-\Phi\widehat{x}\|\\
&\le(1-\epsilon)^{-1}\|(\Phi x+n)-\Phi\widehat{x}\|+(1-\epsilon)^{-1}\|n\|\\
&\le(1-\epsilon)^{-1}\|(\Phi x+n)-\Phi x^*\|+(1-\epsilon)^{-1}\|n\|\\
&\le(1-\epsilon)^{-1}\|\Phi x-\Phi x^*\|+2(1-\epsilon)^{-1}\|n\|\\
&\le\frac{1+\epsilon}{1-\epsilon}\|x-x^*\|+2(1-\epsilon)^{-1}\|n\|.
\end{aligned}
\]
Since $\epsilon\le 1/3$, one can easily check that $(1-\epsilon)^{-1}\le 1+2\epsilon$ and $\frac{1+\epsilon}{1-\epsilon}\le 1+3\epsilon$.
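The two scalar estimates just invoked can also be confirmed over the entire range $\epsilon\in[0,1/3]$; algebraically they reduce to $1\le(1+2\epsilon)(1-\epsilon)$, valid for $\epsilon\le 1/2$, and $1+\epsilon\le(1+3\epsilon)(1-\epsilon)$, valid for $\epsilon\le 1/3$. A quick grid check:

```python
import numpy as np

# check (1 − ε)^{-1} ≤ 1 + 2ε and (1 + ε)/(1 − ε) ≤ 1 + 3ε on [0, 1/3]
eps = np.linspace(0.0, 1 / 3, 100_001)
assert np.all(1 / (1 - eps) <= 1 + 2 * eps + 1e-12)
assert np.all((1 + eps) / (1 - eps) <= 1 + 3 * eps + 1e-12)
print("both bounds hold on [0, 1/3]")
```

Note that the second bound is tight at $\epsilon=1/3$, where both sides equal $2$.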
Consequently, we obtain that
\[
\|x-\widehat{x}\|\le(1+3\epsilon)\|x-x^*\|+(2+4\epsilon)\|n\|,\quad\text{when }\widehat{x}\in\mathcal{M}_{\epsilon/40}. \tag{94}
\]
Combining (93) and (94), we overall obtain
\[
\|x-\widehat{x}\|\le\max\left(\frac{\epsilon\tau}{40},\ (1+3\epsilon)\|x-x^*\|+(2+4\epsilon)\|n\|\right)
\le(1+3\epsilon)\|x-x^*\|+\frac{\epsilon\tau}{40}+(2+4\epsilon)\|n\|, \tag{95}
\]
which, to emphasize, is valid under the assumptions of Lemma 18 and except with a probability of at most $2\rho$. On the other hand, according to Theorem 3 and the remarks that followed it (see (19)), it holds that
\[
\|x-\widehat{x}\|\le(1+2\epsilon)\left(2\sqrt{\frac{N}{M}}+5\right)\|x-x^*\|+(2+4\epsilon)\|n\|, \tag{96}
\]
except with a probability of at most $2\rho$ and as long as both (8) and (9) hold. From (95) and (96), we conclude that
\[
\|x-\widehat{x}\|\le\min\left((1+3\epsilon)\|x-x^*\|+\frac{\epsilon\tau}{40},\ (1+2\epsilon)\left(2\sqrt{\frac{N}{M}}+5\right)\|x-x^*\|\right)+(2+4\epsilon)\|n\|,
\]
except with a probability of at most $4\rho$ and as long as both (9) and (21) hold. This completes the proof of Theorem 4.

H  Proof of Theorem 5

Using the triangle inequality and (22), we have
\[
\|\widehat{x}-x^*\|\le\|x-\widehat{x}\|+\|x-x^*\|
\le\min\left((2+3\epsilon)\|x-x^*\|+\frac{\epsilon\tau}{40},\ \left((2+4\epsilon)\sqrt{\frac{N}{M}}+6+10\epsilon\right)\|x-x^*\|\right)+(2+4\epsilon)\|n\|. \tag{97}
\]
Now, since both $\widehat{x}$ and $x^*$ belong to $\mathcal{M}$, we can invoke Lemma 7 from the Toolbox, which states that if $\|\widehat{x}-x^*\|\le\tau/2$, then
\[
d_{\mathcal{M}}(\widehat{x},x^*)\le\tau-\tau\sqrt{1-2\|\widehat{x}-x^*\|/\tau}. \tag{98}
\]
To apply this lemma, it is sufficient to know that
\[
(2+3\epsilon)\|x-x^*\|+\frac{\epsilon\tau}{40}+(2+4\epsilon)\|n\|\le\tau/2,
\]
i.e., that
\[
\|x-x^*\|+\frac{1+2\epsilon}{1+3\epsilon/2}\,\|n\|\le\frac{\tau}{4}\cdot\frac{1-\epsilon/20}{1+3\epsilon/2}.
\]
For the sake of neatness, we may tighten this condition to $\|x-x^*\|+\frac{10}{9}\|n\|\le 0.163\tau$, which implies the sufficient condition above (since $\epsilon\le 1/3$). Thus, if $\|x-x^*\|$ and $\|n\|$ are sufficiently small (on the order of the condition number $\tau$), then we may combine (97) and (98), giving
\[
\begin{aligned}
d_{\mathcal{M}}(\widehat{x},x^*)
&\le\tau-\tau\sqrt{1-\frac{2}{\tau}\min\left((2+3\epsilon)\|x-x^*\|+\frac{\epsilon\tau}{40},\ \left((2+4\epsilon)\sqrt{\frac{N}{M}}+6+10\epsilon\right)\|x-x^*\|\right)-\frac{2(2+4\epsilon)}{\tau}\|n\|}\\
&=\tau-\tau\sqrt{1-\min\left(\frac{4+6\epsilon}{\tau}\|x-x^*\|+\frac{\epsilon}{20},\ \tau^{-1}\left((4+8\epsilon)\sqrt{\frac{N}{M}}+12+20\epsilon\right)\|x-x^*\|\right)-\frac{4+8\epsilon}{\tau}\|n\|}.
\end{aligned}\tag{99}
\]
Under the assumption that $\|x-x^*\|+\frac{10}{9}\|n\|\le 0.163\tau$, it follows that the term inside the square root in the last line above is nonnegative, and therefore (23) holds. This completes the proof of Theorem 5.

References

[1] D. Achlioptas. Database-friendly random projections. In Proc. Symp. on Principles of Database Systems (PODS '01), pages 274–281. ACM, 2001.
[2] R. Baraniuk, M. Davenport, R. DeVore, and M. Wakin. A simple proof of the Restricted Isometry Property for random matrices. Constr. Approx., 28(3):253–263, 2008.
[3] R. G. Baraniuk and M. B. Wakin. Random projections of smooth manifolds. Found. Comput. Math., 9(1):51–77, 2009.
[4] D. S. Broomhead and M. J. Kirby. The Whitney Reduction Network: A method for computing autoassociative graphs. Neural Comput., 13(11):2595–2616, November 2001.
[5] E. Candès, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Inform. Theory, 52(2):489–509, February 2006.
[6] E. Candès, J. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Comm. Pure Appl. Math., 59(8):1207–1223, August 2006.
[7] E. Candès and T. Tao. Decoding via linear programming. IEEE Trans. Inform. Theory, 51(12):4203–4215, December 2005.
[8] E. Candès and T. Tao. Near optimal signal recovery from random projections: Universal encoding strategies? IEEE Trans. Inform. Theory, 52(12):5406–5425, December 2006.
[9] E. J. Candès. The Restricted Isometry Property and its implications for compressed sensing. Compte Rendus de l'Academie des Sciences, Paris, 346:589–592, 2008.
[10] E. J. Candès and M. B. Wakin. An introduction to compressive sampling. IEEE Signal Proc. Mag., 25(2):21–30, 2008.
[11] M. Chen, J. Silva, J. Paisley, C. Wang, D. Dunson, and L. Carin. Compressive sensing on manifolds using a nonparametric mixture of factor analyzers: Algorithm and performance bounds. IEEE Trans. Signal Proc., 58(12):6140–6155, 2010.
[12] K. L. Clarkson. Tighter bounds for random projections of manifolds. In Proc. Symp. Comput. Geom., pages 39–48. ACM, 2008.
[13] A. Cohen, W. Dahmen, and R. DeVore. Compressed sensing and best k-term approximation. J. Amer. Math. Soc., 22(1):211–231, 2009.
[14] M. A. Davenport, M. F. Duarte, M. B. Wakin, J. N. Laska, D. Takhar, K. F. Kelly, and R. G. Baraniuk. The smashed filter for compressive classification and target recognition. In Proc. Comput. Imaging V at SPIE Electronic Imaging, January 2007.
[15] M. A. Davenport, C. Hegde, M. F. Duarte, and R. G. Baraniuk. Joint manifolds for data fusion. IEEE Trans. Image Proc., 19(10):2580–2594, 2010.
[16] M. A. Davenport and M. B. Wakin. Analysis of orthogonal matching pursuit using the restricted isometry property. IEEE Trans. Inform. Theory, 56(9):4395–4401, 2010.
[17] R. DeVore, G. Petrova, and P. Wojtaszczyk. Instance-optimality in probability with an ℓ1-minimization decoder. Appl. Comput. Harmonic Anal., 27(3):275–288, 2009.
[18] D. Donoho. Compressed sensing. IEEE Trans. Inform. Theory, 52(4), April 2006.
[19] D. Donoho and Y. Tsaig. Extensions of compressed sensing. Signal Proc., 86(3):533–548, March 2006.
[20] D. L. Donoho and C. Grimes. Image manifolds which are isometric to Euclidean space. J. Math. Imaging Comp. Vision, 23(1):5–24, July 2005.
[21] M. Duarte, M. Davenport, M. Wakin, J. Laska, D. Takhar, K. Kelly, and R. Baraniuk. Multiscale random projections for compressive classification. In Proc. IEEE Conf. Image Proc. (ICIP), September 2007.
[22] M. F. Duarte, M. A. Davenport, D. Takhar, J. N. Laska, T. Sun, K. F. Kelly, and R. G. Baraniuk. Single-pixel imaging via compressive sampling. IEEE Signal Proc. Mag., 25(2):83–91, 2008.
[23] M. F. Duarte, M. B. Wakin, D. Baron, S. Sarvotham, and R. G. Baraniuk. Measurement bounds for sparse signal ensembles via graphical models. IEEE Trans. Inform. Theory, 59(7):4280–4289, July 2013.
[24] A. Eftekhari, J. Romberg, and M. B. Wakin. Matched filtering from limited frequency samples. IEEE Trans. Inform. Theory, 59(6):3475–3496, June 2013.
[25] A. Eftekhari, H. L. Yap, C. J. Rozell, and M. B. Wakin. The Restricted Isometry Property for random block diagonal matrices. arXiv preprint arXiv:1210.3395, 2012.
[26] H. Federer. Curvature measures. Trans. Amer. Math. Soc., 93(3):418–491, 1959.
[27] A. C. Gilbert, S. Muthukrishnan, and M. J. Strauss. Improved time bounds for near-optimal sparse Fourier representations. In Proc. SPIE Wavelets XI, 2005.
[28] A. C. Gilbert, M. J. Strauss, and J. A. Tropp. A tutorial on fast Fourier sampling. IEEE Signal Proc. Mag., 25(2):57–66, 2008.
[29] A. C. Gilbert, M. J. Strauss, J. A. Tropp, and R. Vershynin. One sketch for all: Fast algorithms for compressed sensing. In Proc. Symp. Theory Comput. (STOC), 2007.
[30] D. Healy and D. J. Brady. Compression at the physical interface. IEEE Signal Proc. Mag., 25(2):67–71, 2008.
[31] C. Hegde and R. G. Baraniuk. Signal recovery on incoherent manifolds. IEEE Trans. Inform. Theory, 58(12):7204–7214, 2012.
[32] C. Hegde, M. B. Wakin, and R. G. Baraniuk. Random projections for manifold learning. In Proc. Neural Inform. Proc. Sys. (NIPS), December 2007.
[33] G. E. Hinton, P. Dayan, and M. Revow. Modeling the manifolds of images of handwritten digits. IEEE Trans. Neural Networks, 8(1):65–74, January 1997.
[34] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proc. Symp. Theory Comput. (STOC), 1998.
[35] M. A. Iwen and M. Maggioni. Approximation of points on low-dimensional manifolds via random linear projections. Information and Inference, 2013.
[36] S. Kirolos, J. Laska, M. Wakin, M. Duarte, D. Baron, T. Ragheb, Y. Massoud, and R. Baraniuk. Analog-to-information conversion via random demodulation. In Proc. IEEE Dallas Circuits Systems Workshop (DCAS), October 2006.
[37] F. Krahmer, S. Mendelson, and H. Rauhut. Suprema of chaos processes and the Restricted Isometry Property. arXiv preprint arXiv:1207.0235, 2012.
[38] B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection. Ann. Stat., 28(5):1302–1338, 2000.
[39] M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Ergebnisse der Mathematik und ihrer Grenzgebiete. Springer, 2011.
[40] M. Lustig, D. L. Donoho, J. M. Santos, and J. M. Pauly. Compressed sensing MRI. IEEE Signal Proc. Mag., 25(2):72–82, 2008.
[41] D. Needell and J. A. Tropp. CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. Appl. Comput. Harmonic Anal., 26(3):301–321, 2009.
[42] D. Needell and R. Vershynin. Signal recovery from incomplete and inaccurate measurements via regularized orthogonal matching pursuit. IEEE J. Sel. Topics Signal Proc., 4(2):310–316, 2010.
[43] P. Niyogi, S. Smale, and S. Weinberger. Finding the homology of submanifolds with high confidence from random samples. Discrete Comput. Geom., 39(1):419–441, 2008.
[44] F. W. J. Olver and National Institute of Standards and Technology (U.S.). NIST Handbook of Mathematical Functions. Cambridge University Press, 2010.
[45] A. V. Oppenheim and R. W. Schafer. Discrete-Time Signal Processing. Prentice-Hall Signal Processing Series. Pearson Education, 2011.
[46] J. Y. Park and M. B. Wakin. A geometric approach to multi-view compressive imaging. EURASIP J. Advances Signal Proc., 2012(1):37, 2012.
[47] D. Ramasamy, S. Venkateswaran, and U. Madhow. Compressive estimation in AWGN: General observations and a case study. In Proc. Asilomar Conf. Signals, Systems, and Computers, 2012.
[48] H. Rauhut. Circulant and Toeplitz matrices in compressed sensing. In Proc. SPARS'09, Saint-Malo, France, 2009.
[49] H. Rauhut. Compressive sensing and structured random matrices. In M. Fornasier, editor, Theoretical Foundations and Numerical Methods for Sparse Recovery, pages 1–92. De Gruyter, 2010.
[50] H. Rauhut, J. Romberg, and J. Tropp. Restricted isometries for partial random circulant matrices. Appl. Comput. Harmonic Anal., 32(2):242–254, 2012.
[51] J. Romberg. A uniform uncertainty principle for Gaussian circulant matrices. In Proc. Int. Conf. Digital Signal Proc., 2009.
[52] J. Romberg and R. Neelamani. Sparse channel separation using random probes. Inverse Problems, 26(11):115015, 2010.
[53] M. Rudelson and R. Vershynin. On sparse reconstruction from Fourier and Gaussian measurements. Comm. Pure Appl. Math., 61(8):1025–1045, 2008.
[54] P. Shah and V. Chandrasekaran. Iterative projections for signal identification on manifolds: Global recovery guarantees. In Allerton Conf. Comm., Control, Comput. (Allerton), pages 760–767, 2011.
[55] G. W. Stewart. Error and perturbation bounds for subspaces associated with certain eigenvalue problems. SIAM Rev., 15(4):727–764, 1973.
[56] M. Talagrand. The Generic Chaining: Upper and Lower Bounds of Stochastic Processes. Springer Monographs in Mathematics. Springer, 2005.
[57] J. A. Tropp, J. N. Laska, M. F. Duarte, J. K. Romberg, and R. G. Baraniuk. Beyond Nyquist: Efficient sampling of sparse bandlimited signals. IEEE Trans. Inform. Theory, 56(1):520–544, 2010.
[58] M. Turk and A. Pentland. Eigenfaces for recognition. J. Cogn. Neurosci., 3(1):71–83, 1991.
[59] N. Verma. Distance preserving embeddings for general n-dimensional manifolds. J. Machine Learning Research, 14(Aug):2415–2448, 2013.
[60] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Y. C. Eldar and G. Kutyniok, editors, Compressed Sensing: Theory and Applications. Cambridge University Press, 2012.
[61] M. B. Wakin. The Geometry of Low-Dimensional Signal Models. PhD thesis, Department of Electrical and Computer Engineering, Rice University, Houston, TX, 2006.
[62] M. B. Wakin, D. L. Donoho, H. Choi, and R. G. Baraniuk. The multiscale structure of non-differentiable image manifolds. In Proc. Wavelets XI at SPIE Optics and Photonics, August 2005.
[63] H. L. Yap, M. B. Wakin, and C. J. Rozell. Some geometric properties of sampled sinusoids. Technical report, School of Elect. and Comput. Eng., Georgia Inst. of Technol., Atlanta, GA, USA, June 2013.
[64] H. L. Yap, M. B. Wakin, and C. J. Rozell. Stable manifold embeddings with structured random matrices. IEEE J. Sel. Topics Signal Proc., 7(4):720–730, August 2013.
