Generalized sampling and infinite-dimensional compressed sensing

Ben Adcock, Department of Mathematics, Simon Fraser University, Burnaby, BC V5A 1S6, Canada
Anders C. Hansen, DAMTP, Centre for Mathematical Sciences, University of Cambridge, Wilberforce Rd, Cambridge CB3 0WA, United Kingdom

Abstract

We introduce and analyze an abstract framework, and corresponding method, for compressed sensing in infinite dimensions. This extends the existing theory from signals in finite-dimensional vector spaces to the case of separable Hilbert spaces. We explain why such a new theory is necessary, and demonstrate that existing finite-dimensional techniques are ill-suited for solving a number of important problems. This work stems from recent developments in generalized sampling theorems for classical (Nyquist rate) sampling that allow for reconstructions in arbitrary bases. The main conclusion of this paper is that one can extend these ideas to allow for both reconstructions in arbitrary bases and significant subsampling of sparse or compressible signals.

1 Introduction

Compressed sensing (CS) has, with little doubt, been one of the great successes of applied mathematics in the last decade [12, 15, 17, 19, 21, 22]. It allows one to sample at rates dramatically lower than conventional wisdom suggests – namely, the Nyquist rate – provided the signal to be recovered is sparse in a particular basis, and the sampling vectors are relatively incoherent. However, compressed sensing is currently only a finite-dimensional theory. It concerns the recovery of vectors in some finite-dimensional space (typically R^N or C^N) whose nonzero entries with respect to a particular basis are small in number in comparison to N. Herein lies a problem. Real-world signals are typically analog, or continuous-time, and thus are modeled more faithfully in infinite-dimensional function spaces [8]. Any finite-dimensional model may therefore not be well suited in practice (we give examples of this in §2).
Although this issue has been widely recognized, there have been almost no attempts made thus far to extend the existing theory to a more realistic, infinite-dimensional model. With this in mind, the purpose of this paper is to provide such a generalization. The first step in this regard is a move away from the current matrix-vector model. In particular, we consider the following more realistic scenario: a signal f is viewed as an element of a separable Hilbert space H, and its measurements are modeled as a sequence of linear functionals ζ_j : H → C. This gives rise to the countable collection

ζ_1(f), ζ_2(f), ζ_3(f), ..., (1.1)

of samples of f. Now suppose that the signal f is sparse in some orthonormal basis {ϕ_j}_{j∈N} of H. The main questions we answer in this paper are the following: can we recover f by subsampling from the collection (1.1), and how can this be realized by a numerical algorithm? In doing so, we obtain a full theory for so-called infinite-dimensional compressed sensing, valid for (almost) arbitrary pairs of sampling schemes {ζ_j}_{j∈N} and reconstruction bases {ϕ_j}_{j∈N}.

The theory we introduce in this paper stems from recent developments in classical sampling of signals. In [1, 3, 4] a new sampling theory, known as generalized sampling, was introduced for reconstructions of signals in arbitrary bases {ϕ_j}_{j∈N} from their samples (1.1) (see §1.4 and §4 for further details). The framework we propose in this paper is a continuation of this work that allows one to take advantage of sparsity to allow for subsampling. To illustrate, one of the important results we prove in this paper is as follows:

Theorem 1.1. For f = ∑_{j=1}^{∞} α_j ϕ_j write Δ = {j : α_j ≠ 0} and suppose that Δ ⊂ {1, ..., M} for some M ∈ N. Let ε > 0 be arbitrary. Then, there exists an integer N ∈ N (specific estimates will be given in §7) depending on M and |Δ| only such that the following holds: if Ω ⊂ {1, ..., N}, |Ω| = m, is chosen uniformly at random, then, with probability greater than 1 − ε, f can be recovered exactly from the samples {ζ_j(f) : j ∈ Ω} given that m is proportional to |Δ| · log(ε⁻¹ + 1) · log(N √(M|Δ|)).

As we explain in §6 and §7, the values m, N are specified explicitly by a system of inequalities involving |Δ|, M and ε. The somewhat surprising result is that taking N = M will typically not be sufficient. As we demonstrate in §2, there are straightforward examples where |Δ| may be very small, but choosing N = M will give disastrous results. However, choosing a particular value N > M (specific bounds will be given later) will allow for dramatic subsampling (see §2.3).

1.1 An example

The model detailed above is particularly important in Magnetic Resonance Imaging (MRI), and this application shall serve as our primary example in this paper. In this setting f, the image, is sampled via inner products with complex exponentials, and the infinite collection {ζ_j(f)}_{j∈N} of samples corresponds to the Fourier coefficients of f (typically H = L²(R²)). Suppose now that we know that f is sparse in, say, a wavelet basis. The question is, how can we recover f exactly by subsampling from the given Fourier coefficients {ζ_j(f)}_{j∈N}? In particular, to what extent can we subsample (i.e. how does the number of samples m we require depend on the sparsity, where m is as in Theorem 1.1), and what is the range N from which we need to draw our samples (clearly, it makes no sense to draw samples uniformly with indices ranging over all of N)? The MRI problem served as a strong motivation for much of the original development of CS [15]. Work by Lustig et al. has led to the successful application of finite-dimensional techniques in this area [33]. However, as we shall explain in §2, there are a number of significant issues in tackling the infinite-dimensional MRI problem with the current techniques.
Indeed, the current approach may well fail quite dramatically, even in extremely simple examples (see §2.1), and any attempt to remedy it whilst remaining in the finite-dimensional setting may also not succeed (see §2.2). Fortunately, the approach we propose in this paper, which is based on the aforementioned infinite-dimensional model, successfully overcomes these shortcomings (§2.3).

1.2 The need for a new general theory

It is natural to ask whether such a new theory is necessary or not. Given the problem described above, one must at some stage discretize. It therefore seems plausible that finite-dimensional CS could be readily applied once one had restricted the problem from the underlying Hilbert space H to a suitable finite-dimensional space (e.g. C^N). In particular, if f is sparse in an, albeit countably-infinite, basis {ϕ_j}_{j∈N} (i.e. it only has finitely many nonzero coefficients in this basis), it seems plausible that the corresponding sparse recovery problem is inherently finite-dimensional. In some limited cases this is indeed so: one may treat the problem solely in finite dimensions with existing CS tools (see Remark 2.3). However, as we discuss in §2, in general this problem cannot be tackled in such a way. Indeed, 'discretizing' the problem – that is, reducing the infinite amount of information contained in the samples {ζ_j(f)}_{j∈N} – so as to make it amenable to computations, is fraught with difficulties. In particular, the most obvious, and ultimately most naïve, discretizations may well destroy the original structure of the problem. This means that exact recovery may well not be possible with finite-dimensional techniques, since the key structure that allows for subsampling has been destroyed by the discretization. This issue is discussed further in §2.1 and §2.2. Aside from how the discretization is carried out, there is another problem with the finite-dimensional viewpoint.
To illustrate this issue, consider now a more realistic scenario, where the signal to be recovered f = g + h is not sparse but compressible. In other words, g is sparse in {ϕ_j}_{j∈N}, whereas h is not sparse but is small in norm. This is now a fully infinite-dimensional problem: both the set of samples {ζ_j(f)}_{j∈N} and the support of f in {ϕ_j}_{j∈N} have infinite cardinality. Of course, it is now no longer possible to recover f exactly. However, it is natural to ask whether we can recover f up to some error determined only by h. In particular, it is critically important that this error behave well as both the total number of samples used and the range from which such samples are drawn are varied. That is, rather than exact recovery, one must also consider the important, and fundamentally infinite-dimensional, notion of approximation. Arguably, there is little hope of being able to do this using solely existing finite-dimensional CS tools.

To impress this point further on the reader, consider for the moment an analogy. A huge component of the field of numerical analysis is devoted to both the design and study of numerical discretizations of partial differential equations (PDEs). Typically the discretization takes the form of a solution to a linear system of equations [11, 36]. However, to study the key approximation properties of the discretization (i.e. how rapidly the method converges as the discretization parameter – step/mesh size etc. – is varied), one must also have knowledge of the underlying PDE. In particular, in the analysis of finite element methods, for example, one establishes existence and uniqueness of finite-dimensional discretizations, as well as error estimates, by first proving regularity results for the given infinite-dimensional PDE (coercivity conditions, for example) [11, 18].
Moreover, reversing the argument, regularity properties of the PDE (in other words, the infinite-dimensional structure of the problem) also inform the design of new numerical methods. In summary, the message is as follows: the discretization cannot be studied in isolation from the underlying PDE. As it transpires, much the same is true for the sparse recovery problem: formulating the problem in infinite dimensions is necessary both in designing a discretization, and in analyzing the resulting numerical method.

1.3 Discretizing infinite-dimensional problems

How such a discretization is performed is a crucial issue in this paper. In general, discretizing infinite-dimensional problems is a difficult and subtle issue which cannot be carried out successfully without a thorough understanding of the particular problem at hand. Unless done carefully, it is quite possible to end up with a discretization whose properties contrast starkly with those of the original problem, and consequently a numerical method that is neither stable nor convergent. With this in mind, our approach is based on the following fundamental philosophy:

Keep the infinite-dimensional structure and crucial properties of the original problem when discretizing. (Ph)

By correctly following this principle we obtain a method for infinite-dimensional CS that allows significant subsampling of sparse and compressible signals, and thereby extends existing techniques to infinite-dimensional problems. Notice that the approach we propose in this paper is a significant departure from the current received wisdom. Indeed, rather than discretizing first, our initial step is to devise an appropriate infinite-dimensional formulation of the sparse recovery problem. Truncation is then carried out in the second step, leading to a finite-dimensional problem which can be solved numerically. It is also worth mentioning that (Ph) is by no means unique to this particular problem.
Whilst it is often implicitly followed in numerical PDEs, most relevantly for this paper it was recently employed in [29] to solve the long-standing computational spectral problem. A number of ideas in this article stem from [29], and the contributions of this paper may be viewed as a continuation of this work.

1.4 Generalized sampling (GS)

The framework we propose in this paper has its direct origins in recent developments in classical, i.e. Nyquist-rate, sampling; specifically, in the question of how to recover signals (not necessarily sparse or compressible) in arbitrary bases from their samples (1.1). Although this question has long been studied in sampling theory, there had been little effort given over to the issue of discretizations, and the key notion of approximation [41] (see also [1, 3]). By employing (Ph), a new approach, known as generalized sampling (GS), was developed in [1, 3, 4]. GS is a new type of sampling theory which incorporates the critical issues of approximation and stability, leading to the so-called stable sampling rate [3]. The resulting numerical method allows for guaranteed recovery of any signal in an arbitrary basis from a collection of its samples in a manner which is both numerically stable and, in a certain sense, optimal. In this paper we build on this work in the following way: we show that, in the case that the signal to be reconstructed is sparse or compressible, reconstruction can also be performed with significant subsampling. We refer to the corresponding method as generalized sampling with compressed sensing (GS–CS). One important instance of both GS and GS–CS is the recovery of a function from its Fourier samples (the MRI problem, in particular).
Although the classical Shannon Sampling Theorem [31, 41] allows for reconstruction in terms of an infinite series of sinc functions or complex exponentials, the slow convergence of these series, as well as the appearance of the Gibbs phenomenon [32], means that such a reconstruction is often not practical. Consequently, Shannon's theorem may not be used in practice [41], even for Nyquist-rate sampling. Nonetheless, in many circumstances it is well known that the given signal can be well-represented (i.e. it is sparse or has rapidly decaying coefficients) in a new basis of functions; be they splines, wavelets, curvelets, etc. [26]. GS allows one to reconstruct in such a basis in a manner that is both accurate and numerically stable. The method we develop in this paper, GS–CS, permits one to dramatically undersample whenever the signal is sparse or compressible.

1.5 Summary and outline

As has now been made clear to the reader, the viewpoint of this work is very much that of an infinite-dimensional (or analog) world. In other words, we consider images and signals as infinite-resolution objects, and not as objects of fixed resolution defined uniquely by a finite number of their samples. Thus, at best, an arbitrary image can only be approximated by a finite collection of its samples (e.g. by its Fourier series in the case of MRI). However, a key conclusion of this paper is that one may obtain far better approximations than, for example, the Fourier series, by reconstructing in another basis, especially if the image is sparse or compressible in such a basis. This stance is by no means unique to this paper, and it has long been recognized that such a viewpoint is better suited to many problems. For example, the idea of finite rates of innovation, developed by Vetterli et al. [8, 23, 43], is very similar to this principle, although the resulting technique is completely different to that which we propose. The outline of the remainder of this paper is as follows.
In §2 we explain why the current CS techniques are insufficient for infinite-dimensional problems, and why a new theory is therefore necessary. In §3 we introduce the specific types of models we consider in this paper. §4 recaps GS, and in §5 we introduce our method for infinite-dimensional compressed sensing: namely, GS–CS. The main results of this paper – recovery of sparse and compressible signals with GS–CS – are presented in §7, and in §8 we provide numerical examples. In §9–§11 we give proofs of the main results.

2 Why do we need a new approach?

Consider the following very simple model problem (which will form the primary example throughout this paper):

Problem 2.1. Suppose that f ∈ L²(R) has support contained in [−1, 1], and let {ϕ_j}_{j∈N} be the orthonormal basis of Haar wavelets on L²(−1, 1). Define

ζ_j(f) = Ff(jε), j ∈ Z, (2.1)

to be the Fourier coefficients of f (this is precisely the type of sampling one encounters in MRI). Here Ff denotes the Fourier transform of f, and ε ≤ 1/2 is arbitrary. Throughout, we shall take ε = 1/2. We will also assume that f is sparse, or compressible, in the basis {ϕ_j}_{j∈N} of Haar wavelets. Thus, the problem is to recover f from the measurements (2.1).

We shall return to this problem throughout this paper. Recall that the classical Shannon Sampling Theorem (see §4) gives that f can be recovered exactly (in the L² sense) from the infinite collection {ζ_j(f)}_{j∈Z}. However, since f is known to be sparse in the Haar wavelet basis {ϕ_j}_{j∈N}, the question is as follows: is there a way this additional information could be used to allow f to be recovered from only a finite number of its samples? Moreover, if this is true, how many such samples are required, and is it possible to subsample? These are questions we now wish to answer.

2.1 First example - the standard approach fails

Let us consider the simplest possible example: namely, let f = χ_[0,1/2) − χ_[1/2,1) be the classical Haar wavelet.
Note that f is extremely sparse in the Haar basis. To recover f exactly from (2.1), at some stage one needs to discretize, so as to reduce the infinite amounts of information to something finite.

Figure 1: The rather disappointing f_{N,m}(t) (left) and f(t) − f_{N,m}(t) (right) against t for 2N = 256 and m = 130, where f_{N,m}(t) = ∑_{j=1}^{2N} ξ_j ϕ_j(t) and ξ = {ξ_j}_{j=1}^{2N} is a minimizer of (2.3).

The standard technique, especially in sparse MRI [33], involves two steps. First, one replaces the infinite collection of samples with the finite vector

y = ζ(f) = {ζ_j(f) : j = −N+1, ..., N}, N ∈ N. (2.2)

Second, one uses a combination of the discrete Fourier and discrete wavelet transforms (DFT and DWT respectively) to formulate the corresponding measurement matrix. To this end, let U_df, V_dw ∈ C^{2N×2N} be the matrices of these transforms. The classical discrete approximation to the problem of inverting the Fourier transform is then given by y = U_df x, where x is a vector approximating pointwise values of f on an equispaced grid in [−1, 1]. Since f is sparse in the Haar basis it is very tempting to think that x′ = V_dw x is also sparse, if V_dw is the discrete Haar transform. One should therefore be able to recover f perfectly with finite-dimensional compressed sensing using only relatively few of its samples y = ζ(f). More precisely, if Ω ⊂ {1, ..., 2N}, |Ω| = m < 2N, is chosen uniformly at random, then the usual approach would be to solve the convex optimization problem

min_{η ∈ C^{2N}} ‖η‖_{l¹} subject to P_Ω U_df V_dw⁻¹ η = P_Ω y. (2.3)

Here P_Ω : C^{2N} → C^{2N} is the orthogonal projection onto span{e_j : j ∈ Ω} and {e_j}_{j=1}^{2N} is the canonical basis for C^{2N}. If ξ is a minimizer of this problem, then one could hope that ξ agrees with the vector x′ with high probability, and hence that we could recover x = V_dw⁻¹ x′. To test (2.3), let us consider the case where 2N = 256 and m = 130, i.e. we sample nearly 50% of the samples in the range −N+1, ..., N. Write f_{N,m} = ∑_{j=1}^{2N} ξ_j ϕ_j, where ξ is a minimizer of (2.3). Note that f_{N,m} takes the values of the vector V_dw⁻¹ ξ at the grid points. As Figure 1 demonstrates, the result is extremely disappointing. The function f has not been recovered anywhere near exactly, and the reconstruction f_{N,m} computed from (2.3) commits rather large errors (especially near the jumps in f, i.e. t = −1, 0, 1/2, 1). To be more precise, even though f only has one nonzero coefficient in the Haar wavelet basis, and despite the fact that we use m = 130 Fourier samples of f, we do not get anywhere near to perfect recovery. The question is now: what went wrong, and why can we seemingly not recover f perfectly with this approach? The answer is given in the next section.

2.1.1 The DFT destroys sparsity

The source of the catastrophic failure of (2.3) is the discretization employed: namely, the DFT. The problem is that the DFT is not exact – it commits an error. To illustrate this, consider the matrix U_df⁻¹. This maps the vector of Fourier coefficients ζ(f) of a function f to a vector consisting of pointwise values on an equispaced 2N-grid in [−1, 1]. However, this mapping is not exact: for an arbitrary function f, the result is only an approximation to the grid values of f. The question is, how large is the approximation error, and how does it affect (2.3) and its solutions? To quantify this effect, consider the vector x ∈ C^{2N} defined by U_df x = ζ(f). This vector represents grid values of an approximation to f. In fact, it is simple to see that x consists precisely of the values of the function

f_N(t) = ε ∑_{j=−N+1}^{N} Ff(jε) e^{2πijεt}, ε = 1/2, (2.4)

on this grid. This function is nothing more than the truncated version of the reconstruction given by the Shannon Sampling Theorem (see §4). In other words, the approximation involved in the transform U_df is equivalent to replacing a function f by its partial Fourier series f_N.
Let us consider the discrete wavelet transform x′ = V_dw x ∈ C^{2N} of x. The right-hand side of the equality constraint in (2.3) now reads P_Ω U_df V_dw⁻¹ x′. Thus, for the method (2.3) to be successful we require x′ = V_dw x to be a sparse vector. However, this can never happen. Sparsity of x′ is equivalent to stipulating that the partial Fourier series f_N be sparse in the Haar wavelet basis. The function f_N consists of smooth complex exponentials, and thus cannot have a sparse representation in a basis of piecewise smooth functions. Therefore, although it has a unitary and incoherent measurement matrix, (2.3) is not a sparse recovery problem. Consequently there is little or no hope of recovering the sparse vector α of Haar wavelet coefficients of f from (2.3). This explains the complete failure witnessed in Figure 1. From this argument, we now conclude the following: by forming the approximation (2.4), we have destroyed the structure (i.e. the sparsity) of the original problem. In particular, we have violated the guiding principle (Ph).

Remark 2.1 Note that this loss of sparsity is not exclusive to the Haar wavelet basis. In fact, if f is sparse in any basis of compactly supported wavelets then, by insisting on using the Shannon Sampling Theorem, we also witness the same problem: namely, f_N can never be sparse in the same basis.

2.1.2 The DFT leads to the Gibbs phenomenon

Whilst the loss of sparsity is significant, there is another important issue with the use of the DFT. Given the loss of sparsity just described, suppose now, as an exercise, that we forgo any subsampling. That is, we let m = 2N. The problem (2.3) now has a unique solution η. However, by the arguments given above, the entries of η are not the Haar wavelet coefficients of f, but rather coefficients of the approximation f_N. Thus, by solving (2.3) (both with and without subsampling) we are not actually computing Haar wavelet coefficients of f, but those of the partial Fourier series f_N instead.
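The loss of sparsity described in §2.1.1 is easy to observe numerically. The sketch below is our own illustration (not the authors' code): the orthonormal Haar transform and the closed-form Ff are our implementations, and the leading factor ε in the partial sum is the Fourier-series normalization assumed here. It compares the discrete Haar coefficients of exact grid samples of f with those of the partial Fourier series f_N of (2.4).

```python
import numpy as np

def haar_transform(x):
    # Orthonormal discrete Haar transform; len(x) must be a power of two.
    x = np.asarray(x, dtype=float)
    detail = []
    while len(x) > 1:
        detail = list((x[0::2] - x[1::2]) / np.sqrt(2)) + detail
        x = (x[0::2] + x[1::2]) / np.sqrt(2)
    return np.array(list(x) + detail)

N = 128
t = -1 + np.arange(2 * N) / N               # 2N equispaced points on [-1, 1)
f = (np.where((t >= 0) & (t < 0.5), 1.0, 0.0)
     - np.where((t >= 0.5) & (t < 1.0), 1.0, 0.0))

# Values of the partial Fourier series f_N of (2.4) on the same grid
eps = 0.5
w = np.arange(-N + 1, N + 1) * eps
num = (1 - np.exp(-1j * np.pi * w)) ** 2     # closed-form Ff; numerator is 0 at w = 0
den = np.where(w == 0, 1.0, 2j * np.pi * w)  # guard the w = 0 division
Fsamp = num / den
fN = eps * (np.exp(2j * np.pi * np.outer(t, w)) @ Fsamp)

nnz_exact = int(np.sum(np.abs(haar_transform(f)) > 1e-8))
nnz_fourier = int(np.sum(np.abs(haar_transform(fN.real)) > 1e-3))
```

Here `nnz_exact` is 1 (the sampled f is literally one discrete Haar basis vector), while `nnz_fourier` is large: replacing f by f_N has destroyed the sparsity on which (2.3) relies.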
Thus, we cannot hope to obtain a better (i.e. more accurate) reconstruction of f than f_N. The question is, how good an approximation is f_N? Given that f is piecewise smooth, it turns out to be very poor. In fact, as N → ∞, f_N does not converge uniformly to f, and only converges very slowly in the weaker L² norm. One also witnesses the unpleasant Gibbs phenomenon, with its associated O(1) oscillations, near each jump in f. These effects are visualized in Figure 2. The fact that (2.3) leads to a Haar wavelet approximation to f_N, as opposed to f, can be observed by comparing the left panels in Figures 2 and 1 respectively. Of course, one may attempt various remedies to this problem, such as increasing N or using the total variation norm instead. However, the key point is, regardless of how clever we are, if we insist on performing reconstruction via U_df, then we cannot hope to obtain anything better than the (extremely poor) approximation f_N, the partial Fourier series.

Figure 2: The figure shows f_N(t) (left) and f(t) − f_N(t) (right) against t for 2N = 256.

Remark 2.2 This particular issue, poor convergence of f_N, is also not unique to Haar wavelets. Indeed, any piecewise smooth (in particular, nonperiodic) function will have the same problem. Only functions f which are smooth and periodic on [−1, 1] will be well approximated by their Fourier series f_N. However, it is typically very rare for real-world signals and images to have such properties [41].

Remark 2.3 The well-informed reader may object and suggest that if we have a signal f that is sparse in the Haar basis, why do we not measure f with noiselets [20] (e.g. ζ_j(f) = ⟨f, ψ_j⟩, where the ψ_j's are noiselets). This approach would give a finite-dimensional recovery problem suffering from none of the aforementioned issues. However, the luxury to choose the measurement system is very rare in applications.
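Returning to the convergence of f_N discussed in §2.1.2, the two failure modes – no uniform convergence at the jumps and slow L² decay – can be quantified in a few lines. This is again our own sketch, under the same assumed transform convention and normalization as above.

```python
import numpy as np

def Ff(w):
    # Closed-form transform of f = chi_[0,1/2) - chi_[1/2,1) (our convention);
    # the numerator vanishes at w = 0, so Ff(0) = 0 with the guarded denominator.
    w = np.asarray(w, dtype=float)
    den = np.where(w == 0, 1.0, 2j * np.pi * w)
    return (1 - np.exp(-1j * np.pi * w)) ** 2 / den

t = np.linspace(-1, 1, 4097)[:-1]            # fine grid on [-1, 1)
f = (np.where((t >= 0) & (t < 0.5), 1.0, 0.0)
     - np.where((t >= 0.5) & (t < 1.0), 1.0, 0.0))

eps = 0.5
err_inf, err_l2 = [], []
for N in (32, 128, 512):
    w = np.arange(-N + 1, N + 1) * eps
    fN = eps * (np.exp(2j * np.pi * np.outer(t, w)) @ Ff(w))
    e = f - fN.real
    err_inf.append(np.max(np.abs(e)))        # stays O(1): no uniform convergence
    err_l2.append(np.sqrt(2 * np.mean(e ** 2)))  # decays only slowly with N
```

Running this, `err_inf` remains of size one for every N (the Gibbs oscillations and the jump values do not shrink), while `err_l2` decreases but only at the slow root-N rate typical of a nonperiodic piecewise smooth function.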
In particular, in the example that we just considered (which serves as a model for MRI) we are given {Ff(jε)}_{j∈Z}, and this cannot be readily altered. Thus, to properly solve these types of problems, we must have a model for which the sampling scheme {ζ_j}_{j∈N} is fixed.

2.2 Second example - a common misunderstanding

The failure of (2.3) can be interpreted as a violation of the fundamental principle (Ph). In particular, the crucial property that f is sparse in the Haar wavelet basis is destroyed when the DFT is applied. With this in mind, it may seem to the reader that, since f is a finite sum of Haar wavelets, there is a simple remedy to this problem: namely, replace the DFT and DWT by the measurement matrix

U_N = [ ζ_1(ϕ_1)     ···   ζ_1(ϕ_{2N})
           ⋮           ⋱        ⋮
        ζ_{2N}(ϕ_1)  ···   ζ_{2N}(ϕ_{2N}) ], (2.5)

randomly sample Ω ⊂ {1, ..., 2N} with |Ω| = m, obtain a minimizer ξ to the problem

min_{η ∈ C^{2N}} ‖η‖_{l¹} subject to P_Ω U_N η = P_Ω ζ(f), (2.6)

and form the reconstruction f_{N,m} = ∑_{j=1}^{2N} ξ_j ϕ_j. Clearly this approach, unlike (2.3), preserves the sparsity of the original problem. In this case we have, for convenience, reindexed in the natural way the Fourier samples {ζ_j}_{j∈N} over N rather than Z.

Let us consider an example of (2.6). Suppose that f(t) = ∑_{j=1}^{2N} α_j ϕ_j(t), where 2N = 768. We have chosen a function f such that |supp(f)| = |{j : α_j ≠ 0}| = 5. In particular, the function f is very sparse in the Haar wavelet basis. In Figure 3 we display the reconstruction given by (2.6) using m = 760. As is evident, f is recovered extremely poorly by (2.6): although we have used nearly 98% of its Fourier samples in the range 1, ..., 2N, the reconstruction error ‖f − f_{N,m}‖ is very large (roughly 2.43, with ‖f‖ = 3.21). We repeated the experiment fifty times with the same outcome. This example is disastrous. Despite altering the standard CS approach (2.3) to ensure that sparsity is preserved, we still obtain an extremely poor reconstruction.
Indeed, since f is sparse in the Haar basis and we sample using nearly all its Fourier samples in the range 1, ..., 2N, one may have reasonably hoped to recover f perfectly in this example. However, as seen in Figure 3, this is certainly not the case.

Figure 3: The figure shows the disastrous error f(t) − f_{N,m}(t) against t (left) as well as the much more pleasant error f(t) − g_{Ñ,m̃}(t) (right). Note that f_{N,m} requires m = 760 samples whereas g_{Ñ,m̃} requires only m̃ = 50 samples.

From a conventional CS viewpoint, this failure appears quite surprising. We have formed a measurement matrix in a standard way by taking inner products of one orthonormal basis (the complex exponentials {e^{2πikε·}}_{k=−N+1}^{N}) with another (Haar wavelets). Surely the standard finite-dimensional results should apply [12]? However, they do not, as evidenced by Figure 3. The reason is that U_N is not unitary (or even close to unitary). The unexpected nature of this failure is most likely due to the following common misconception: that a change-of-basis matrix of the form (2.5) is unitary. In fact, such a matrix is only unitary when the two bases span the same space. This is clearly not the case in (2.5), where the sampling and reconstruction bases consist of the first 2N complex exponentials and Haar wavelets respectively. The key here is the underlying infinite dimensionality. One needs infinitely many complex exponentials to span a space large enough to contain even finitely many Haar wavelets (or vice versa). From this discussion we draw the following conclusion. Simply thinking (since f has only finitely many non-zero Haar wavelet coefficients) that the problem can be embedded in C^N and solved using finite-dimensional CS is incorrect. In fact the failure of U_N can also be interpreted as a violation of (Ph).
As the above argument indicates, U_N is not unitary, whereas the 'infinite measurement matrix'

U = [ ζ_1(ϕ_1)  ζ_1(ϕ_2)  ···
      ζ_2(ϕ_1)  ζ_2(ϕ_2)  ···
         ⋮         ⋮       ⋱  ], (2.7)

formed by combining the full countably-infinite bases, does possess this property (both Haar wavelets and complex exponentials form orthonormal bases for the infinite-dimensional Hilbert space L²(−1, 1)). It is precisely the loss of this structure when 'discretizing' U via U_N that is the source of the failure observed above.

2.3 Third example - a new approach

With these examples in mind, the purpose of the remainder of this paper is to describe a new approach for infinite-dimensional CS, known as generalized sampling with compressed sensing (GS–CS), which overcomes the aforementioned failings. This brings us to the purpose of this section, and really the essence of this paper: namely, why infinite dimensions? Put simply, this is because the search for the coefficients α = {α_1, α_2, ...} of f results in an infinite system of equations. By formulating reconstruction directly in an infinite-dimensional way, and then discretizing (as opposed to discretizing first), we are able to completely avoid the pitfalls described above.

We will now present a straightforward example of this approach. Let f(t) = ∑_{j=1}^{2N} α_j ϕ_j(t) be as in §2.2. Note that

[ ζ_1(f)     [ u_11  u_12  ···   [ α_1
  ζ_2(f)  =    u_21  u_22  ···     α_2
    ⋮    ]       ⋮     ⋮    ⋱  ]     ⋮  ],    u_ij = ζ_i(ϕ_j).

It is by attacking this infinite set of equations directly that we find a successful approach. The key is to realize that, although f has only five non-zero wavelet coefficients, it has infinitely many Fourier coefficients {ζ_j(f)}_{j∈N}. Thus, recovering f from these sampled Fourier coefficients is fundamentally not a finite-dimensional problem. Consider now the following. Let U = {u_ij}_{i,j∈N}. We will use an uneven section of this infinite matrix (this is a trick that stems from [29]).
Specifically, for K ∈ N let P_K denote the projection onto span{e_1, ..., e_K}, where {e_k}_{k∈N} is the canonical basis for l²(N). Now, for Ñ ∈ N, we choose Ω ⊂ {1, ..., Ñ} uniformly at random, with |Ω| = m̃, and numerically compute a minimizer ξ to

inf_{η ∈ l¹(N)} ‖η‖_{l¹} subject to P_Ω P_Ñ U P_M η = P_Ω y, y = {ζ_1(f), ζ_2(f), ...}, (2.8)

where M ∈ N. Let Ñ = 1351, m̃ = 50, and M = 768. One can now easily reconstruct the function f of §2.2. In particular, by letting g_{Ñ,m̃} = ∑_{j=1}^{M} ξ_j ϕ_j, where ξ is a minimizer of (2.8), we obtain ‖f − g_{Ñ,m̃}‖_{L²} = 1.15 × 10⁻¹¹ (this is the average of 50 trials). This error is displayed in Figure 3. Note the dramatic improvement over the approach of §2.2. Specifically, the error has been reduced from O(1) to O(10⁻¹⁰). Moreover, the GS–CS reconstruction uses less than 7% of the number of sampled Fourier coefficients that were used to form the extremely poor reconstruction in (2.6).

Remark 2.4 The true error committed by GS–CS is most likely zero. However, the numerical error is polluted by the limited accuracy of the solver used to compute the relevant l¹ optimization problem.

So, why do we get such a great result with this new approach? After all, we used far fewer samples than in §2.2. Also, how did we choose Ñ? These questions will be answered in this paper, and the key is what we refer to as the Balancing Property (see §6.2). We shall present further numerical examples illustrating the effectiveness of this new approach in §8.

3 Infinite-dimensional compressed sensing

To develop a theory for infinite-dimensional CS it is useful to now introduce the types of signal models we consider in this paper, and to formally define the problem we consider. Suppose that H is a separable Hilbert space with an orthonormal basis {ϕ_j}_{j∈N}. Let f = ∑_{j=1}^{∞} α_j ϕ_j be the signal we wish to recover, and suppose that we have access to the countable collection of samples ζ_1(f), ζ_2(f), ζ_3(f), . . .
,   (3.1)

where ζj : H → C are continuous linear functionals on H. The problem throughout this paper will be to recover f in terms of {ϕj}j∈N from the samples (3.1). Since we consider sparse signals f it is useful to introduce the notation supp(f):

    supp(f) = {j ∈ N : αj ≠ 0}.

Here the αj are the coefficients of f in the basis {ϕj}j∈N. When its meaning is clear, we shall also write ∆ for supp(f).

3.1 The semi-infinite dimensional model

We first assume that f is exactly sparse in the basis {ϕj}j∈N. In other words, there exists an M ∈ N such that

    ∆ = supp(f) ⊂ {1, ..., M}.   (3.2)

Naturally, we do not know ∆; however, we may have information about M. We refer to this model as semi-infinite dimensional: although f has only finite support in {ϕj}j∈N, we can access samples ζj(f) from the countable collection (3.1).

In practice the assumption that f is perfectly sparse is often unrealistic. Thus, a more reasonable scenario is the following:

    f = g + h,   ∆ = supp(g) ⊂ {1, ..., M},   supp(h) ⊂ {1, ..., M}.   (3.3)

In this case, one can no longer expect perfect recovery of f with subsampling. However, it is highly desirable to have a bound on the error in reconstructing f. Specifically, we wish to show perfect recovery of f with high probability, up to an error determined only by some appropriate norm of h.

3.2 The fully infinite-dimensional model

Whilst (3.3) is more realistic than the exact sparsity model (3.2), it is rare in practice that supp(h) is finite. In particular, in the fully infinite-dimensional model we consider the significantly more general setting:

    f = g + h,   ∆ = supp(g) ⊂ {1, ..., M},   |supp(h)| = ∞.   (3.4)

This model is termed fully infinite-dimensional since the support of f is infinite, as is its set of samples. Again we may pose the same question: how well can we reconstruct f, and how does the error behave in terms of h? Let us at this stage notice that (3.3) and (3.4) are, in many senses, very different problems.
Correspondingly, the theorems we present about each problem are quite different in character. Whilst the skeptical reader may still think it plausible that (3.3) could be tackled by existing finite-dimensional CS techniques (despite the discussion in §2), there is little hope of doing the same for (3.4).

4 Generalized sampling: guaranteed recovery in arbitrary bases

Before discussing how to subsample infinite-dimensional signals, it is first necessary to consider the more basic case where no sparsity is present. The question is, how can one actually reconstruct arbitrary signals f from their measurements {ζj(f)}j∈N? Or, in other words, if f = Σ_{j=1}^∞ αj ϕj, how do we recover the infinite vector α = {α1, α2, ...} from the samples ζ1(f), ζ2(f), ...? Only once this problem has been solved can one properly consider the issue of subsampling.

Fortunately, the technique of generalized sampling (GS) was developed precisely to solve this problem [1, 3, 4]. We now recap this approach. Under some assumptions on {ζj}j∈N (e.g. each ζj is continuous and {ζj(f)}j∈N ∈ l2(N) for all f ∈ H), we can view the full recovery problem as the infinite-dimensional system of linear equations

    U α = ζ(f),   (4.1)

where α = {α1, α2, ...}, ζ(f) = {ζ1(f), ζ2(f), ...} and U is the infinite measurement matrix

    U = ( ζ1(ϕ1)  ζ1(ϕ2)  ··· )
        ( ζ2(ϕ1)  ζ2(ϕ2)  ··· )   (4.2)
        (   ⋮       ⋮      ⋱  ).

Clearly, if we were able to invert U, and provided we had access to all samples of f, then we could recover α (and hence f) exactly. However, this is never the case in practice. Instead, we must consider truncations of (4.1), and look to compute approximations α̃1, ..., α̃N to the first N coefficients of α. At this point we make the following critical remark. Whatever strategy we use for computing such approximate coefficients, the result α̃[N] = {α̃1, ..., α̃N} ∈ C^N must be a good approximation to the first N exact coefficients α1, ..., αN.
Recall that the whole premise for recovering f in the basis {ϕj}j∈N is that we know that f is well represented in this basis. In other words, the coefficients {αj}j∈N decay rapidly as j → ∞, or, in the case where f is sparse, only a finite number are nonzero. Therefore, whichever method we use for solving (4.1), it is vital that the error

    ‖α − α̃[N]‖_{l2}   (4.3)

is small (here and later, for convenience, we shall not make a distinction between the finite vector α̃[N] = {α̃1, ..., α̃N} ∈ C^N and its embedding {α̃1, ..., α̃N, 0, 0, ...} in l2(N)). Clearly, the examples of §2 violate this condition: specifically, the failure witnessed is a result of the error (4.3) being large.

4.1 Finite sections: a warning from spectral theory

The most obvious approach for discretising (4.1) follows from taking finite sections of U. In other words, if P_N : l2(N) → span{ej : j = 1, ..., N} is the orthogonal projection, we consider solutions α̃[N] to the N × N system of equations

    P_N U P_N α̃[N] = P_N ζ(f).   (4.4)

Note that

    P_N U P_N = ( ζ1(ϕ1)  ···  ζ1(ϕN) )
                (   ⋮      ⋱     ⋮   )
                ( ζN(ϕ1)  ···  ζN(ϕN) )

is nothing more than the leading N × N submatrix (i.e. the finite section) of U. Finite sections [9, 10, 28] are extremely widely used in practice. However, for general operators U there is no guarantee either that α̃[N] exists, or that α̃[N] (if it exists) actually converges to α as N → ∞. In fact, it is easy to devise pairs of bases {ϕj}j∈N and sampling schemes {ζj}j∈N for which the error ‖α − α̃[N]‖_{l2} blows up as N → ∞, whenever α̃[N] is the result of the finite section method [1, 2]. Another significant issue is that the finite section matrix P_N U P_N ∈ C^{N×N} may be extremely poorly conditioned, even though U and its inverse U⁻¹ are bounded. Examples of operators U whose finite sections exhibit exponentially poor conditioning were given in [1].
In particular, the measurement matrix formed by sampling in the Fourier basis and reconstructing in Haar wavelets (the principal example of this paper) suffers from this phenomenon. As a result, the numerical method based on finite sections is not just nonconvergent, it is also extremely unstable and highly sensitive to noise (see also [2, 3]). The failure of the finite section method for solving (4.1) can be viewed as a violation of the principle (Ph). Finite sections have been studied extensively from the viewpoint of computational spectral theory. Therein one typically wishes to gain information about the spectrum of U by considering discretizations of the form P_N U P_N [7, 28]. The main conclusion is that, unless U satisfies some very stringent restrictions (such as positive self-adjointness), its finite sections P_N U P_N may have wildly different (spectral) properties. This violates (Ph), and thus makes finite sections typically unsuitable for solving (4.1). In particular, for operators U of the form (4.2) an important property is unitarity. Whenever the measurements ζj(f) = ⟨f, ψj⟩ for some orthonormal system {ψj}j∈N (as is the case with Fourier sampling), the matrix U is a unitary operator on l2(N). In general, however, there is no guarantee that the finite sections P_N U P_N are unitary, in violation of (Ph). Note that the finite section P_N U P_N is precisely the measurement matrix (2.5) encountered in the finite-dimensional CS approach (2.6). As commented in §2.2, the loss of the unitary structure of U accounts for the failure seen therein.

4.2 Uneven sections and generalized sampling (GS)

Fortunately, there is a simple, albeit far less common, way to overcome the failure of the finite section method, based on taking rectangular, as opposed to square, sections of U. In [1, 4] it was proposed to replace (4.4) with

    A α̃[M] = P_M U* P_N ζ(f),   A = P_M U* P_N U P_M,   (4.5)

where M ∈ N (the number of coefficients α̃1, ...
, α̃M computed) is appropriately chosen (typically M ≤ N). The result is known as generalized sampling (GS). Note that A ≡ (P_N U P_M)* P_N U P_M, where P_N U P_M is the N × M uneven section of U. The main idea is that, by allowing M to vary independently of N, one can obtain both a numerically stable and accurate reconstruction of the first M values α1, ..., αM. Note that this means that we typically recover fewer of the coefficients α1, α2, ... than in the finite section method. However, unlike the latter, it is possible to guarantee both the stability and accuracy of this approach. In other words, by being less greedy in the number of coefficients we seek to recover, we actually obtain a far better result. The main theorem proved in [1, 3] is as follows:

Theorem 4.1. Let U ∈ B(l2(N)) be an isometry and suppose that M ∈ N is given. Then there exists an N0 ∈ N such that, for every N ≥ N0, there is a unique solution α̃[M] to (4.5). Furthermore, we have the sharp bound

    ‖α − α̃[M]‖ ≤ (1 / C_{N,M}) ‖P_M^⊥ α‖,   (4.6)

where

    C_{N,M} = 1 − ‖P_M − P_M U* P_N U P_M‖.   (4.7)

Specifically, N0 is the least N such that C_{N,M} > 0.

It can be shown that the quantity C_{N,M} → 1 as N → ∞, for any fixed M. Thus, one deduces from (4.6) that α̃[M] can be made arbitrarily close to P_M α – the best M-term approximation to α – by varying N suitably. Hence, a good reconstruction can always be guaranteed with this approach. Furthermore, the resulting method is also stable. The condition number of the matrix A scales like C_{N,M}⁻¹ [3]. That is to say, precisely the same quantity that ensures accuracy of the reconstruction also guarantees numerical stability. Note that C_{N,M} can be easily computed numerically by finding the norm of an M × M matrix. Hence, the conditions of Theorem 4.1 can be verified numerically. Having said this, in numerous circumstances of interest one can also obtain analytical bounds [1, 4].
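The behaviour of the constant C_{N,M} = 1 − ‖P_M − P_M U* P_N U P_M‖ is easy to observe numerically. In the following sketch (our illustration; a random orthogonal matrix stands in for the isometry U, not the paper's Fourier-Haar operator, and all sizes are made up), C_{N,M} grows towards 1 as N increases for fixed M, while the square case N = M is badly balanced.

```python
import numpy as np

# Illustration of C_{N,M} = 1 - ||P_M - P_M U* P_N U P_M||: for fixed M it
# increases towards 1 as N grows.  A random K x K orthogonal matrix is a
# hypothetical stand-in for U (not the paper's Fourier-Haar operator).
rng = np.random.default_rng(1)
K = 400
Q, _ = np.linalg.qr(rng.standard_normal((K, K)))

def C(N, M):
    S = Q[:N, :M]                         # the uneven section P_N U P_M
    return 1.0 - np.linalg.norm(np.eye(M) - S.T @ S, 2)

M = 20
vals = [C(N, M) for N in (20, 80, 160, 320, 400)]
# vals increases towards 1; the square case N = M = 20 is close to 0,
# reflecting the ill-conditioning of finite sections.
```

Note that C(N, M) here equals the smallest eigenvalue of (P_N U P_M)* P_N U P_M, which is why C_{N,M} > 0 is exactly the invertibility condition for A in (4.5).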
To connect generalized sampling with the principle (Ph), note that the uneven section P_N U P_M inherits the structure of U, whenever M and N are chosen suitably. In fact, for fixed M,

    P_M U* P_N U P_M → P_M U* U P_M = P_M,   N → ∞,

since U* U = I, where I : l2(N) → l2(N) is the identity. Thus, P_N U P_M is also an isometry on the range of P_M in the limit N → ∞.

4.3 A generalized Shannon Sampling Theorem

One of the main instances of GS, and a central reason for its development, is the case where {ζj(f)}j∈Z corresponds to the Fourier samples (2.1) (we replace the index set N with Z in this instance). Although the famous Shannon Sampling Theorem ensures that both f and its Fourier transform Ff can be recovered exactly via the infinite sums

    f(·) = ε Σ_{j∈Z} Ff(jε) e^{2πijε·},   Ff(t) = Σ_{j∈Z} Ff(jε) sinc(t/ε + j)

(note that the first converges in L2, whereas the second converges both uniformly and in L2), in practice one has to truncate these series, leading to the approximations

    f_N(t) = ε Σ_{j=−⌊N/2⌋}^{⌊N/2⌋} Ff(jε) e^{2πijεt},   Ff_N(t) = Σ_{j=−⌊N/2⌋}^{⌊N/2⌋} Ff(jε) sinc(t/ε + j).   (4.8)

As discussed, these are typically very poor reconstructions of f and Ff respectively. However, suppose now we know another basis {ϕj}j∈N in which f is well represented. Then we can apply GS to obtain an improved reconstruction in this basis. This leads to the following generalization of Shannon's theorem:

Theorem 4.2 (Generalized Sampling Theorem [1, 4]). Let F denote the Fourier transform on R, and suppose that {ϕj}j∈N is an orthonormal set in L2(R) satisfying supp(ϕj) ⊂ [−T, T] for all j ∈ N and some T > 0. For 0 < ε ≤ 1/(2T) let

    ζj(f) = √ε Ff(ρ(j)ε),   j ∈ N,   f ∈ L2(R),

where ρ : N → Z is some bijection, and suppose that U is given by (4.2). Then, for each M ∈ N there is an N ∈ N such that there exists a unique solution α̃[M] ∈ C^M to (4.5), for any f ∈ span{ϕj}j∈N.
Moreover, if

    f_{N,M} = Σ_{j=1}^M α̃j ϕj,   g_{N,M} = Σ_{j=1}^M α̃j Fϕj,   (4.9)

then

    ‖f − f_{N,M}‖_{L2(R)} ≤ (1 / C_{N,M}) ‖P_M^⊥ f‖_{L2(R)},

and

    ‖g − g_{N,M}‖_{L2(R)} ≤ (√(2T) / C_{N,M}) ‖P_M^⊥ f‖_{L2(R)},

where g = Ff, C_{N,M} is given by (4.7) and P_M denotes the projection onto span{ϕ1, ..., ϕM}.

Figure 4: The figure shows the disappointing error f(t) − f_N(t) against t (left) and the more pleasant error f(t) − f_{N,M}(t) (right) for N = 51 and M = 12. Note that f_N and f_{N,M} use exactly the same samples.

Note that this theorem is just the special case of Theorem 4.1 corresponding to Fourier samples. It is also a straightforward exercise to extend it to the multivariate setting, where F corresponds to the Fourier transform on L2(R^d) [1]. This theory extends the classical Shannon Sampling Theorem as well as its many fundamental generalizations [6, 25, 41, 42]. The key point is that, if we know that f is well represented in {ϕj}j∈N, then we can recover f optimally (up to a multiplicative constant) in terms of the first M basis functions ϕ1, ..., ϕM using only its first N Fourier coefficients.

An important issue that we shall not address in full detail in this paper is how the constant C_{N,M} behaves. In particular, for ε > 0, how large must N be, for a given M, to ensure that C_{N,M} > 1 − ε? Whilst this condition can always be verified numerically (see above), in the important case we consider in this paper (namely, reconstructions in Haar wavelets from Fourier samples) one can show that the scaling N = c(ε)M (for some c(ε) > 0) is sufficient [1]. In particular, whilst setting N = M (this corresponds to the finite section) leads to a divergent and unstable numerical method, in many cases the scaling N = 2M results in a stable and convergent method.
4.4 An example - the effectiveness of generalized sampling

To demonstrate the use of generalized sampling let us consider the following function:

    f(t) = t⁵ e^{−t},   t ∈ [−1, 1].

Suppose we can sample the Fourier coefficients of f: in particular, we have access to {Ff(jε)}j∈Z for ε = 1/2. To reconstruct f from these samples we will try two different techniques. First, we test the truncated Fourier series f_N defined in (4.8). Due to the fact that f is not periodic we cannot expect rapid convergence of f_N to f. However, the Generalized Sampling Theorem 4.2 allows us to reconstruct in any basis. Thus (due to the analyticity of f), we will choose the reconstruction basis {ϕj}j∈N consisting of orthonormal Legendre polynomials on [−1, 1]. In particular, we define f_{N,M} as in (4.9), where ρ : N → Z is given by

    ρ(1) = 0, ρ(2) = 1, ρ(3) = −1, ρ(4) = 2, ρ(5) = −2, ... .   (4.10)

In Figure 4 we have displayed the errors f − f_N and f − f_{N,M}. Note that both reconstructions, f_N and f_{N,M}, use the same samples, yet the improvement of f_{N,M} compared to f_N is dramatic. In particular, we go from an O(1) error to roughly machine precision. One question we shall not address here is how to determine the relationship between M and N. Note that this is a case where setting M = N gives complete nonsense. We refer to [1, 3, 4] for a complete analysis.

We will repeat the experiment above with another function:

    f(t) = sin(5t) + Σ_{j=1}^L αj ψj(t),   t ∈ [−1, 1],

where {ψj}j∈N are the Haar wavelets on [−1, 1], L = 1700 and the αj's are some arbitrarily chosen coefficients. We will assume, as above, that we can sample the Fourier coefficients {Ff(jε)}j∈Z of f (with ε = 1/2 once more).

Figure 5: The figure shows the large error f(t) − f_N(t) against t (left) as well as the substantially smaller error f(t) − f_{N,M}(t) (right) for N = 2401 and M = 1750. Note that f_N and f_{N,M} use exactly the same samples.
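The first experiment above is straightforward to reproduce in outline. The following sketch (ours, not the authors' code) builds the uneven section P_N U P_M for Fourier samples with ε = 1/2 and an orthonormal Legendre reconstruction basis, computing the required Fourier transforms by Gauss-Legendre quadrature; we oversample more aggressively than in Figure 4 (N = 225 samples for M = 15 Legendre modes, our choice) to keep the least-squares problem comfortably well conditioned.

```python
import numpy as np
from numpy.polynomial import legendre

# Sketch of the Section 4.4 experiment: GS reconstruction of
# f(t) = t^5 exp(-t) on [-1, 1] from Fourier samples, in orthonormal
# Legendre polynomials.  Quadrature and truncation sizes are our choices.
eps = 0.5
M, N = 15, 225
f = lambda t: t**5 * np.exp(-t)

nodes, weights = legendre.leggauss(400)    # quadrature on [-1, 1]

def ftransform(g_vals, omega):
    # F g(omega) = int_{-1}^{1} g(t) exp(-2 pi i omega t) dt
    return np.sum(weights * g_vals * np.exp(-2j * np.pi * omega * nodes))

def phi(k, t):
    # Orthonormal Legendre polynomial of degree k
    c = np.zeros(k + 1); c[k] = 1.0
    return legendre.legval(t, c) * np.sqrt((2 * k + 1) / 2)

# The ordering rho(1) = 0, rho(2) = 1, rho(3) = -1, ... of (4.10)
rho = [0]
j = 1
while len(rho) < N:
    rho += [j, -j]; j += 1
freqs = eps * np.array(rho[:N], dtype=float)

# Samples zeta_j(f) = sqrt(eps) F f(rho(j) eps), and the section P_N U P_M
samples = np.array([np.sqrt(eps) * ftransform(f(nodes), w) for w in freqs])
A = np.array([[np.sqrt(eps) * ftransform(phi(k, nodes), w) for k in range(M)]
              for w in freqs])

alpha = np.linalg.lstsq(A, samples, rcond=None)[0]   # solves (4.5)

t = np.linspace(-1, 1, 9)
f_NM = sum(alpha[k] * phi(k, t) for k in range(M))
err = np.max(np.abs(f(t) - f_NM))
```

Because f is entire, its Legendre coefficients decay super-geometrically, so even this modest M drives the reconstruction error far below anything the truncated Fourier series can achieve with the same samples.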
Due to the vast number of discontinuities of f we cannot expect the truncated Fourier series f_N to be a good approximation to f. However, by the Generalized Sampling Theorem 4.2 we can choose the reconstruction basis {ϕj}j∈N to be the Haar wavelets, and construct f_{N,M} as in (4.9). In Figure 5 we have displayed the errors f − f_N and f − f_{N,M}. Note that both reconstructions, f_N and f_{N,M}, use the same samples, yet the improvement of f_{N,M} compared to f_N is substantial.

5 Generalized sampling with compressed sensing (GS–CS)

An immediate consequence of Theorems 4.1 and 4.2 is that, if we know that f is sparse in {ϕj}j∈N – i.e. supp(f) ⊂ {1, ..., M} – then we can recover f perfectly from its first N samples, whenever N is suitably large. However, given that N ≥ M typically for GS to succeed (in particular, when using Haar wavelets, one typically requires N = 2M [1]), the question is: is it possible to use the same ideas in combination with CS techniques to attain subsampling? The answer turns out to be yes (given some minor assumptions), and the key is to follow a similar approach, again based on uneven sections, to formulate the reconstruction appropriately. The result is known as generalized sampling with compressed sensing (GS–CS).

Let us suppose that f = Σ_{j=1}^∞ αj ϕj is sparse in {ϕj}j∈N and is sampled via {ζj}j∈N. As opposed to the failed approaches of §2, which were loosely based on discretising first, the technique we now propose involves first formulating the sparse recovery problem in infinite dimensions. To this end, let Ω ⊂ N be of size |Ω| = m ∈ N and consider the (infinite-dimensional) optimization problem

    min_{η∈l1(N)} ‖η‖_{l1} subject to P_Ω U η = P_Ω ζ(f),   (5.1)

where U is the infinite measurement matrix (4.2) and ζ(f) = {ζ1(f), ζ2(f), ...} is the infinite vector of samples. Recall that GS relies on a well-posed infinite-dimensional recovery problem (4.1) before discretization can proceed.
Seeking similar notions for (5.1), we are led to the following questions:

(i) How do we choose Ω? Obviously there is no unique choice, but it makes sense to choose Ω uniformly at random from {1, ..., N}, where N ∈ N. This raises the following question: how large must N be?

(ii) Suppose that η is a minimizer of (5.1) (note that η need not be unique). How large is ‖η − α‖, where α is the infinite vector of coefficients of f in the basis {ϕj}j∈N? In particular, how does ‖η − α‖ depend on both m (the total number of samples) and N (the range from which the samples are drawn)?

(iii) If f is exactly sparse in {ϕj}j∈N, do we recover its coefficient vector α exactly (with high probability) from (5.1), and what conditions on m and N ensure this recovery?

Let us suppose for the moment that we have answers to these questions. Naturally we cannot solve (5.1) numerically, hence we must discretize. For this, we follow the same ideas that lead to generalized sampling. Thus, we introduce a parameter k ∈ N and consider the finite-dimensional optimization problem

    min_{η∈C^k} ‖η‖_{l1} subject to P_Ω U P_k η = P_Ω ζ(f).   (5.2)

We refer to this approach as generalized sampling with compressed sensing. This of course leads to another set of questions:

(iv) Will (5.2) have a solution? Note that (5.2) need not have a solution for all k, since P_Ω ζ(f) need not be in the range of P_Ω U P_k (although, as we shall show, this is always the case for sufficiently large k). Moreover, will solutions of (5.2) converge to solutions of (5.1)? We will answer this affirmatively.

(v) If f is not sparse but compressible, how large is the error ‖η − α‖ when η is a solution to (5.2) and α is the vector of coefficients of f? In particular, if f belongs to either of the models (3.3) or (3.4), can ‖η − α‖ be bounded above in terms of some appropriate norm of h?

Answers to these questions will be provided in §7, where we state the main results of this paper.
6 Notation and definitions

6.1 Notation

We now introduce some additional notation that will be used in the remainder of this paper. Write H = l2(N), and let ‖·‖ be the standard norm on H. All other norms will be specified. Let {ej}j∈N be the natural basis of l2(N), and, if Γ ⊂ N, define P_Γ to be the orthogonal projection onto cl(span{ej : j ∈ Γ}). If Γ = {1, ..., N}, then we simply write P_N. If ξ ∈ H and j ∈ N, then ξ(j) = ⟨ξ, ej⟩ (we will also sometimes use the notation ξj). For Γ ⊂ N, we denote the natural embedding operator by ι_Γ : l2(Γ) → H. Note that ι_Γ* η = η|_Γ for η ∈ H. For any vector ξ ∈ H we write supp(ξ) = {j ∈ N : ξ(j) ≠ 0}. We also define the sign sgn(ξ) ∈ l∞(N) of ξ ∈ l∞(N) as follows:

    sgn(ξ)(j) = ξ(j)/|ξ(j)| if ξ(j) ≠ 0, and sgn(ξ)(j) = 0 otherwise.

For an operator U ∈ B(H) we define the incoherence parameter

    υ(U) = sup_{i,j∈N} |u_ij|,   u_ij = ⟨U ej, ei⟩,   (6.1)

i.e. the max norm of the operator U with respect to {ej}j∈N. Also, if U = {u_ij}i,j∈N is an infinite matrix, we define the maximum row norm of U by

    ‖U‖_mr = sup_{i∈N} ( Σ_{j∈N} |u_ij|² )^{1/2}.

This quantity forms a vector space norm on the vector space of all infinite matrices (although not an algebra norm). Finally, for convenience, we will define the following crucial function that will be used frequently in the exposition. For M ∈ N and U ∈ B(H) let ω̃_{M,U} : {1, ..., M} × R+ × N → N be given by

    ω̃_{M,U}(r, s, N) = sup{ i ∈ N : max_{Γ1⊂{1,...,M}, |Γ1|=r, Γ2⊂{1,...,N}} ‖P_{Γ1} U* P_{Γ2} U ei‖ > s }.   (6.2)

Observe also that the mapping s ↦ ω̃_{M,U}(r, s, N) is a decreasing function.

6.2 Key definition

Definition 6.1. Let U ∈ B(H) be an isometry. Then N and m satisfy the weak Balancing Property with respect to U, M and |∆| if

    ‖P_M U* P_N U P_M − P_M‖ ≤ ( 4 √(log₂( 4N√|∆| / m )) )⁻¹   (6.3)

and

    max_{|Γ|=|∆|, Γ⊂{1,...,M}} ‖P_M P_Γ^⊥ U* P_N U P_Γ‖_mr ≤ 1/(8√|∆|)   (6.4)

are satisfied.
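As a quick sanity check on these definitions, the quantities υ(U) and ‖U‖_mr are straightforward to compute for any concrete matrix. The following tiny sketch (a made-up 4 × 4 example, not from the paper) does so for a scaled Hadamard isometry.

```python
import numpy as np

# A toy 4 x 4 isometry (scaled Hadamard matrix, made up for illustration)
U = 0.5 * np.array([[1,  1,  1,  1],
                    [1,  1, -1, -1],
                    [1, -1,  1, -1],
                    [1, -1, -1,  1]])

upsilon = np.max(np.abs(U))                         # incoherence (6.1)
mr = np.max(np.sqrt(np.sum(np.abs(U)**2, axis=1)))  # maximum row norm
# Every row of a unitary matrix has unit norm, so mr == 1 here, while
# upsilon == 1/2 reflects the maximal spreading of the Hadamard entries.
```

The smaller υ(U) is, the more the entries of U are spread out, and (as the theorems of §7 make precise) the more subsampling the GS-CS framework permits.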
We say that N and m satisfy the strong Balancing Property with respect to U, M and |∆| if (6.3) holds, as well as

    max_{|Γ|=|∆|, Γ⊂{1,...,M}} ‖P_Γ^⊥ U* P_N U P_Γ‖_mr ≤ 1/(8√|∆|).   (6.5)

Remark 6.1 The inequality in (6.4) is somewhat inconvenient. However, it can be replaced by the far simpler, although weaker, condition

    ‖P_M U* P_N U P_M − diag(P_M U* P_N U P_M)‖_mr ≤ 1/(8√|∆|).   (6.6)

Here diag(B) denotes the diagonal matrix whose entries correspond to the diagonal entries of the matrix B. In particular, condition (6.6) is a requirement on the magnitude of the off-diagonal entries of the matrix P_M U* P_N U P_M. In much the same manner, (6.5) can also be replaced by the much more convenient (however much stronger) condition

    ‖U* P_N U P_M − diag(U* P_N U P_M)‖_mr ≤ 1/(8√|∆|).

The following proposition establishes that the Balancing Property is well defined:

Proposition 6.2. If U, M and |∆| are as in Definition 6.1, then there always exist integers N and m that satisfy the weak and strong Balancing Properties with respect to U, M and |∆|.

Proof. Note that since P_N → I strongly as N → ∞ we have that P_N U → U strongly. However, for any Γ ⊂ N with |Γ| < ∞ we have by compactness that P_N U P_Γ → U P_Γ in norm as N → ∞. The fact that U is an isometry yields the assertion.

Remark 6.2 Note that it is through the Balancing Property that we determine the number N, the crucial number that determines the set {1, ..., N} from which we will draw the samples (see §2.3).

7 Main results

We now present the main results on GS–CS. Proofs of these results form the content of the remainder of this paper.

7.1 The semi-infinite dimensional model

The first results concern the semi-infinite dimensional model (see §3.1):

Theorem 7.1. Let U ∈ B(H) be an isometry, M ∈ N, ε > 0 and suppose that x0 ∈ l1(N) with supp(x0) = ∆, where ∆ ⊂ {1, ..., M}. Suppose that N and m satisfy the weak Balancing Property with respect to U, M and |∆|, and let Ω ⊂ {1, ...
, N} be chosen uniformly at random with |Ω| = m. If ζ = U x0 then, with probability exceeding 1 − ε, the problem

    inf_{η∈l1(N)} ‖η‖_{l1} subject to P_Ω U P_M η = P_Ω ζ   (7.1)

has a unique solution ξ, and this solution coincides with x0, given that m satisfies

    m ≥ C · N · υ²(U) · |∆| · (log(ε⁻¹) + 1) · log( M N √|∆| / m ),   (7.2)

for some universal constant C. Furthermore, if m = N then ξ is unique and ξ = x0 with probability 1.

The main conclusion of this theorem is as follows: a sparse signal x0 can be recovered perfectly (with high probability) by subsampling from the coefficients ζ, provided (6.3), (6.4) and (7.2) hold. Note that this result gives answers to the questions (i) and (iii) posed in §5. Note also that Theorem 7.1 confirms Theorem 1.1 of §1.

Recall that the second scenario in the semi-infinite dimensional model corresponds to signals y0 = x0 + h, where x0 is sparse and supp(h) ⊂ {1, ..., M}. The following theorem concerns this case:

Theorem 7.2. Let U ∈ B(H) be an isometry, M ∈ N, ε > 0 and suppose that x0, h ∈ l1(N) with supp(x0) = ∆, where ∆ ⊂ {1, ..., M}, and supp(h) ⊂ {1, ..., M}. Define y0 = x0 + h. Suppose that N and m satisfy the weak Balancing Property with respect to U, M and |∆|, and let Ω ⊂ {1, ..., N} be chosen uniformly at random with |Ω| = m. If ζ = U y0 and ξ ∈ H is a minimizer of (7.1) then, with probability exceeding 1 − ε, we have that

    ‖ξ − y0‖ ≤ ( 20N/m + 11 + m/(2N) ) ‖h‖_{l1},   (7.3)

given that m satisfies (7.2). If m = N then (7.3) holds with probability 1.

This theorem demonstrates recovery for compressible signals y0 = x0 + h of the form (3.3): we witness perfect recovery, up to an error determined solely by the magnitude of the nonsparse signal component h. In particular, this result answers part of question (v) posed previously.

Remark 7.1 It is important to notice that there need not be a unique solution to (7.1). However, this is not an issue.
Theorem 7.2, and, in particular, equation (7.3), states that all solutions to (7.1) will be close to y0 in norm.

7.2 The fully infinite-dimensional model

Recall that the semi-infinite dimensional model (3.3) places the restriction that the support of the nonsparse term h is contained in {1, ..., M}. As discussed in §3, this assumption is quite rare in practice, and a more realistic setting is provided by the fully infinite-dimensional model. Here we assume that y0 = x0 + h, where x0 is sparse and |supp(h)| is infinite. To address this setting, it is first necessary to scrutinize an infinite-dimensional optimization problem of the form (5.1):

Theorem 7.3. Let U ∈ B(H) be an isometry, M ∈ N, ε > 0 and suppose that x0, h ∈ l1(N) with supp(x0) = ∆, where ∆ ⊂ {1, ..., M}. Define y0 = x0 + h. Suppose that N and m satisfy the strong Balancing Property with respect to U, M and |∆|, and let Ω ⊂ {1, ..., N} be chosen uniformly at random with |Ω| = m. If ζ = U y0 and ξ ∈ H is a minimizer of

    inf_{η∈l1(N)} ‖η‖_{l1} subject to P_Ω U η = P_Ω ζ,   (7.4)

then, with probability exceeding 1 − ε, we have that

    ‖ξ − y0‖ ≤ ( 20N/m + 11 + m/(2N) ) ‖h‖_{l1},   (7.5)

given that m satisfies

    m ≥ C · N · υ²(U) · |∆| · (log(ε⁻¹) + 1) · log( ω N √|∆| / m ),   ω = ω̃_{M,U}(|∆|, s, N),   s = m / ( 32N √|∆| log(e⁴ ε⁻¹) ),   (7.6)

for some universal constant C (recall ω̃_{M,U} from (6.2)). If m = N then (7.5) holds with probability 1.

Remark 7.2 The quantity ω in (7.6) can also be replaced by a much more convenient (and of course much less sharp) estimate. In particular, we have that ω ≤ M̃, where

    M̃ = min{ r ∈ N : ‖P_M U* P_N‖ ‖P_N U P_r^⊥‖ ≤ m / ( 32N √|∆| log(e⁴ ε⁻¹) ) }

for fixed N. Note that M̃ is finite, since ‖P_N U P_r^⊥‖ → 0 as r → ∞.

This theorem, much like Theorem 7.2, confirms recovery of y0 up to an error determined solely by h. Note that it resolves questions (i)–(iii) posed in §5. However, note that the optimization problem (7.4) is infinite-dimensional.
In practice, one always replaces (7.4) with the finite-dimensional problem

    inf_{η∈l1(N)} ‖η‖_{l1} subject to P_Ω U P_k η = P_Ω ζ,   (7.7)

where k ∈ N is suitably chosen. The obvious question now arises: how do solutions of (7.7) compare to those of (7.4) as k → ∞? For this we have the following:

Proposition 7.4. Let U ∈ B(H), x0 ∈ l1(N) and let P_Ω be a finite rank projection. Then, for all sufficiently large k ∈ N, there exists a ξk ∈ H satisfying

    ‖ξk‖_{l1} = inf_{η∈H} { ‖η‖_{l1} : P_Ω U P_k η = P_Ω U x0 }.

Moreover, for every ε > 0 there is a K ∈ N such that, for all k ≥ K, we have ‖ξk − ξ̃k‖_{l1} ≤ ε, where ξ̃k satisfies

    ‖ξ̃k‖_{l1} = inf_{η∈H} { ‖η‖_{l1} : P_Ω U η = P_Ω U x0 }.   (7.8)

In particular, if x0 is the unique minimiser of (7.8), then ξk → x0 in the l1 norm.

This proposition states that the computed solutions of (7.7) will be approximate minimisers of (7.4) for all sufficiently large k. In particular, computed solutions will approximately satisfy (7.5). Note that it resolves question (iv) posed in §5.

Remark 7.3 The amount of subsampling depends on the incoherence parameter υ(U). For a specific operator U this is fixed, although it can be arbitrarily small. The fact that it is fixed suggests that for large enough M and N subsampling will not be possible – i.e. we must take m = N. However, if U has the property that υ(U P_k^⊥) → 0 as k → ∞, one can circumvent this problem. This is achieved via semi-random subsampling techniques. This is not within the scope of this paper but will be treated elsewhere [5] (see also §12).

7.3 Theorems on finite-dimensional CS

As mentioned, GS–CS generalizes standard finite-dimensional CS to an infinite-dimensional setting. It is therefore unsurprising, but important to note nonetheless, that results concerning the latter can be obtained as straightforward corollaries of Theorems 7.1–7.3. In particular, we have:

Theorem 7.5. Let U ∈ C^{n×n} be an isometry, and suppose that x0 ∈ C^n with supp(x0) = ∆.
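The feasibility issue resolved by Proposition 7.4 can be seen in a small finite proxy (our toy model; a DCT matrix stands in for U and all sizes are hypothetical): the constraint in (7.7) only becomes satisfiable once k is large enough.

```python
import numpy as np
from scipy.fft import dct
from scipy.optimize import linprog

# Toy illustration of Proposition 7.4: the discretized problem (7.7) is
# infeasible for small k, and becomes solvable once k is sufficiently large.
# The DCT matrix and all sizes are hypothetical stand-ins.
rng = np.random.default_rng(3)
K, m = 256, 30
U = dct(np.eye(K), axis=0, norm="ortho")

x0 = np.zeros(K)
x0[[3, 11, 24, 40, 59]] = rng.standard_normal(5)   # sparse, support known
Omega = rng.choice(200, size=m, replace=False)
b = (U @ x0)[Omega]                                # P_Omega U x0

feasible = {}
for k in (8, 16, 64, 128):
    A = U[np.ix_(Omega, np.arange(k))]             # P_Omega U P_k
    res = linprog(np.ones(2 * k), A_eq=np.hstack([A, -A]), b_eq=b,
                  bounds=(0, None))
    feasible[k] = (res.status == 0)                # status 2 means infeasible
```

For k = 8 or 16 the right-hand side lies outside the range of the truncated section and the linear program is infeasible, whereas for k = 64 or 128 (which cover the support of x0) a solution exists, mirroring the behaviour reported in Table 2.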
For ε > 0 suppose that m ∈ N is such that

    m ≥ C · n · υ²(U) · |∆| · (log(ε⁻¹) + 1) · log n,   (7.9)

for some universal constant C, and let Ω ⊂ {1, ..., n} be chosen uniformly at random with |Ω| = m. If ζ = U x0 then, with probability exceeding 1 − ε, the problem

    min_{η∈C^n} ‖η‖_{l1} subject to P_Ω U η = P_Ω ζ

has a unique solution ξ and this solution coincides with x0.

Theorem 7.6. Let U ∈ C^{n×n} be an isometry, and suppose that y0 = x0 + h ∈ C^n with supp(x0) = ∆. For ε > 0 suppose that m ∈ N satisfies (7.9), and let Ω ⊂ {1, ..., n} be chosen uniformly at random with |Ω| = m. If ζ = U y0 then, with probability exceeding 1 − ε, any minimizer ξ of the problem

    min_{η∈C^n} ‖η‖_{l1} subject to P_Ω U η = P_Ω ζ

satisfies

    ‖ξ − y0‖ ≤ ( 20n/m + 11 + m/(2n) ) ‖h‖_{l1}.

Proof. U extends in the obvious way to a partial isometry Ũ on H. Note that (Ũ)* P_N Ũ = P_N for N = n. We may, in an obvious way, extend Ũ to an isometry Û on H such that υ(Û) = υ(U). Therefore, the weak Balancing Property is automatically satisfied for M = N and any m ∈ N. We now apply Theorem 7.1 or Theorem 7.2.

Remark 7.4 Similar results on finite-dimensional compressed sensing have recently been proved by Candès & Plan [13]. However, we stress that their analysis is strictly for the finite-dimensional case. In particular, it cannot be applied to the infinite-dimensional setting considered in this paper.

8 Numerical examples

Before presenting proofs of these theorems, it is useful to see some further examples of GS–CS. We will demonstrate the main premise of this paper in practice – in particular, one of the original motivations for GS: provided one knows that the function f has a good representation in terms of a different basis, then one can obtain a far better reconstruction of f than that guaranteed by the Shannon Sampling Theorem. Consider the problem of reconstructing g = Ff and f from the samples {ζj(f)}j∈N, where ζj(f) = √ε Ff(ρ(j)ε), ε > 0 (we will use ε = 1/2) and ρ is defined in (4.10).
We now compare three methods for approximating f and g:

(i) The Shannon reconstructions fN and gN (see (4.8)).
(ii) The GS reconstructions fN,M and gN,M (see Theorem 4.2).
(iii) The GS–CS reconstructions

fN,m,k(t) = Σj=1..k αj ϕj(t),  gN,m,k(t) = Σj=1..k αj Fϕj(t),

where α = {α1, . . . , αk} is computed via the convex optimization problem (5.2).

Note that fN,M and gN,M use exactly the same samples as fN and gN. Moreover, fN,m,k and gN,m,k use only a subset of these samples; in particular, less sampling information is needed. If f is sparse or has rapidly decaying coefficients in Haar wavelets, then we expect (i) to give a very poor reconstruction. However, both the GS and GS–CS methods should give very good reconstructions, with the latter taking advantage of the sparsity to reduce the number of Fourier coefficients sampled (recall that GS does not exploit any sparsity: it offers guaranteed recovery for all functions f by using the full range of Fourier coefficients).

8.1 First example

As a first example, let us consider the function g = Ff, where

f(t) = Σj=1..200 αj ϕj(t) + cos(2πt)χ[1/2,9/16](t),  t ∈ [0, 1], (8.1)

{ϕj}j∈N are Haar wavelets on [0, 1] and χ[1/2,9/16] is the indicator function of the interval [1/2, 9/16]. Suppose that |{j : αj 6= 0}| = 25, so that f can be decomposed into a sparse component and a remainder. Note that the remainder has infinite support in the Haar wavelet basis, so this function belongs to the fully-infinite dimensional model (see §3.2).

Figure 6: The figure displays the errors |g(t) − gN(t)| (left), |g(t) − gN,M(t)| (middle) and |g(t) − gN,m,k(t)| (right) against t, for N = 601, M = 200, m = 230 and k = 650.

N      kg − gN kL∞    kg − gN,M kL∞             kg − gN,m,k kL∞ (avg. 20 trials)
601    1.43           4.74 × 10−5 (M = 200)     4.73 × 10−5 (m = 230, k = 550)
1201   0.85           2.36 × 10−5 (M = 400)     2.38 × 10−5 (m = 460, k = 1400)

Table 1: The table displays the errors for the reconstructions gN, gN,M and gN,m,k.

In Figure 6 we display the errors committed by the approximations (i)–(iii) for this function. As expected, the expansion in sinc functions (i) gives an extremely poor reconstruction, whereas both the GS and GS–CS approximations are far better. Specifically, by replacing the sinc series (i) with either (ii) or (iii) one reduces the error by a factor of roughly 10,000. Moreover, and also as expected, the GS–CS approximation attains the same numerical error as the GS approximation using only around 38% of the Fourier samples. These observations are confirmed in Table 1.

Whilst the GS and GS–CS methods give very similar numerical errors, it is important to notice that the reconstructions fN,M and fN,m,k are typically very different. In particular, in GS one reconstructs approximately the first M Haar wavelet coefficients α1, . . . , αM, where typically M < N. On the other hand, in GS–CS one computes k such coefficients, where typically (although not always) k > N. This discrepancy can be explained by examining the equations (4.5) and (5.2). In GS, which corresponds to (4.5), one requires M < N to ensure invertibility of the operator A. On the other hand, unless k is taken sufficiently large, (5.2) need not have a solution, since the right-hand side PΩ ζ(f) may not lie in the range of the (finite-dimensional) section PΩ U Pk : Ck → C|Ω|. In particular, this may well be the case whenever k < N. Fortunately, as shown in Proposition 7.4, this cannot happen if k is sufficiently large. The effect of increasing k for the example (8.1) is illustrated in Table 2: once k is sufficiently large, the problem (5.2) has a solution, and the error drops accordingly.
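The range condition just described is easy to observe numerically. In the sketch below (with a hypothetical DCT isometry standing in for the Fourier/Haar matrix U), the least-squares residual of PΩ U Pk η = PΩ U x0 measures whether the right-hand side lies in the range of the section PΩ U Pk; since the columns of the sections are nested, the residual is non-increasing in k, and it vanishes once k is large enough.

```python
import numpy as np
from scipy.fft import dct

n, m = 64, 20
U = dct(np.eye(n), norm="ortho", axis=0)       # stand-in isometry
rng = np.random.default_rng(1)
Omega = rng.choice(n, size=m, replace=False)
x0 = np.zeros(n)
x0[[2, 9, 30, 55]] = [1.0, -1.0, 2.0, 0.5]
b = U[Omega] @ x0

residuals = []
for k in [4, 8, 16, 32, 64]:
    A_k = U[Omega][:, :k]                      # P_Omega U P_k as an m x k section
    eta, *_ = np.linalg.lstsq(A_k, b, rcond=None)
    residuals.append(np.linalg.norm(A_k @ eta - b))
print(residuals)                               # non-increasing; last entry is ~0
```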
8.2 Second example

Consider now the function

f(t) = Σj=1..500 αj ϕj(t), (8.2)

where |{j : αj 6= 0}| = 100. This function is sparse in the Haar wavelet basis (and hence is an example of the semi-infinite dimensional model). The task is to reconstruct f from its Fourier samples. Unlike the previous example, we expect exact reconstruction of this function using both the GS and GS–CS approaches, provided the parameters are chosen correctly. This is confirmed in Table 3. Observe that the Fourier series of f requires over 50,000 Fourier samples to achieve four digits of accuracy. Conversely, the GS approximation recovers f exactly using only 1501 such samples. Furthermore, the GS–CS approximation improves over GS by a factor of three: it requires only 450 Fourier samples in total (the reader is referred to Remark 2.4 once more here).

N = 601 (m = 230):   EN,230,200 = ∞   EN,230,350 = ∞   EN,230,550 = 4.759 × 10−5   EN,230,850 = 4.727 × 10−5
N = 1201 (m = 460):  EN,460,400 = ∞   EN,460,500 = ∞   EN,460,1000 = 2.384 × 10−5   EN,460,1300 = 2.392 × 10−5

Table 2: The table shows the error EN,m,k = kg − gN,m,k kL∞ (avg. 20 trials) for different values of N, m and k (the notation EN,m,k = ∞ means that (5.2) does not have a solution).

N       kf − fN kL2    kf − fN,M kL2              kf − fN,m,k kL2 (avg. 20 trials)
1001    4.19           8.47 × 10−2 (M = 500)      5.53 × 10−1 (m = 450, k = 900)
1501    1.43           4.74 × 10−15 (M = 500)     1.06 × 10−10 (m = 450, k = 900)
2001    1.39           4.33 × 10−15 (M = 500)     1.99 × 10−10 (m = 450, k = 900)
3001    1.37           4.45 × 10−15 (M = 500)     1.98 × 10−10 (m = 450, k = 900)
50001   2.84 × 10−4

Table 3: The table shows the errors corresponding to the reconstructions fN, fN,M and fN,m,k of the function (8.2).

8.3 Interlude: outline of the remainder of the paper

In the first half of this paper we explained why conventional CS techniques are ill-suited to infinite-dimensional problems, and introduced a new framework, GS–CS, that successfully overcomes this problem. The main results concerning GS–CS are stated in §7.
In the remainder of this paper we present proofs of these results. Some of these are quite technical in nature. The reader who is not concerned with their details may go straight to §12.

9 Infinite-dimensional optimization and Proposition 7.4

We begin the second half of this paper with a proof of Proposition 7.4. As the informed reader will have noticed, this is really a question of infinite-dimensional optimization: in particular, showing the existence of minimizers of the finite-rank discretizations of an infinite-dimensional optimization problem, and their convergence to minimizers of that problem. For this reason, we now recap some of the basics of this field. The well-informed reader may proceed directly to Proposition 9.4. Also, some of the results below, although new, are included only for completeness, and the reader interested only in the proofs of the main theorems may go directly to Proposition 10.4.

9.1 Infinite-dimensional optimization

The field of infinite-dimensional convex optimization is certainly not new [24, 35]. However, it is much less standard than the more thoroughly investigated topic of finite-dimensional convex optimization. We will now cover some of the basic tools that will subsequently prove useful.

In this paper we consider complex vector spaces, whereas standard optimization theory is usually developed over the reals; this is also the case in [24] (the main reference we use herein for the field of infinite-dimensional optimization). To be able to quote [24] freely we use the standard trick of considering any complex Banach space X as a real vector space. In particular, if X̃ is the real Banach space induced by X, then

X̃∗ = {Re(x∗) : x∗ ∈ X∗}.

This follows from the observation that if x∗ ∈ X∗ and u = Re(x∗), then u is a real linear functional; conversely, if u ∈ X̃∗ and x∗ : X → C is defined by x∗(x) = u(x) − iu(ix), then x∗ ∈ X∗. To avoid unnecessary clutter we will (with slight abuse of notation) write X for X̃.

Definition 9.1. Let X be a Banach space and let F : X → R̄. The polar function F∗ : X∗ → R̄ is defined by

F∗(x∗) = sup x∈X {Re(x∗(x)) − F(x)},

where R̄ = R ∪ {−∞, ∞}.

Definition 9.2. Let X be a Banach space, let F : X → R̄ be convex and consider the following problem:

(P) : inf{F(x) : x ∈ X}.

If Y is a Banach space and Φ : X × Y → R ∪ {∞} is a convex lower semi-continuous function such that Φ(x, 0) = F(x) for all x ∈ X, then the dual problem (P∗) with respect to Φ is defined by

(P∗) : sup{−Φ∗(0, y∗) : y∗ ∈ Y∗}.

If Φ is not specified we will simply say that (P∗) is a dual problem for (P).

Let X and Y be Banach spaces and suppose that T ∈ B(X, Y) and y0 ∈ Y. Consider the problem

(P1) : inf{kxk : x ∈ X, T x = y0}.

Note that (P1) can be written as the equivalent convex optimization problem

(P̃1) : inf{F(x) + G(T x) : x ∈ X}, (9.1)

where F(x) = kxk and G : Y → R ∪ {∞} is defined by G(z) = δ{0}(z − y0). Here the function δC : Y → R ∪ {∞}, where C ⊂ Y is convex, is defined by δC(z) = 0 if z ∈ C and δC(z) = ∞ if z ∈/ C. Moreover, by letting Φ : X × Y → R ∪ {∞} be defined by

Φ(x, y) = F(x) + G(T x + y), (9.2)

and observing that Φ∗(x∗, y∗) = F∗(x∗ − T′y∗) + G∗(y∗), where T′ : Y∗ → X∗ denotes the dual mapping, we also obtain the following dual problem with respect to Φ:

(P1∗) : sup{−F∗(−T′y∗) − G∗(y∗) : y∗ ∈ Y∗}.

Much like (P1) and (P̃1), the dual problem (P1∗) also has an equivalent form. In fact, since F∗(x∗) = 0 if kx∗kX∗ ≤ 1 and F∗(x∗) = ∞ if kx∗kX∗ > 1, together with the observation that

G∗(y∗) = sup{Re(y∗(y)) − δ{0}(y − y0) : y ∈ Y} = Re(y∗(y0)),

we find that

(P1∗) : sup{Re(y∗(y0)) : kT′y∗kX∗ ≤ 1, y∗ ∈ Y∗}.

Using these ideas we obtain the following well-known result [24]:

Proposition 9.3. Let X and Y be Banach spaces and suppose that T ∈ B(X, Y) and y0 ∈ Y.
If T is onto, then inf{kxk : x ∈ X, T x = y0 } = sup{Re(y ∗ (y0 )) : kT 0 y ∗ kX ∗ ≤ 1, y ∗ ∈ Y ∗ }. Proof. Let F, G and Φ be as in (9.1) and (9.2) respectively, and define h : Y → R ∪ {∞} by h(y) = inf Φ(x, y). x∈X Then h is convex and, since T is onto, h is finite for all y ∈ Y . Therefore, by convexity, h is also continuous, and, in particular continuous at zero. The result follows now by [24, Prop. 3.3.5]. 9.2 Proof of Proposition 7.4 We are now in a position to prove Proposition 7.4. We first require the following: Lemma 9.4. Let U ∈ B(H) and P be a finite rank projection. Then, for every χ ∈ Ran(P U ), there exists ξ ∈ H satisfying inf kηkl1 subject to P U η = χ. 1 η∈l (N) 22 Proof. Recall that (c0 )∗ = l1 . By weak∗ compactness there is a sequence {ξk } ⊂ l1 and a ξ ∈ l1 such that P U ξk = χ, kξk kl1 & inf{kηkl1 : P U η = χ} and hξk , ej i → hξ, ej i as k → ∞ for all j ∈ N. It follows that kξkl1 ≤ limk→∞ kξk kl1 . Since ξk → ξ weakly as elements in H it follows by the fact that P U is compact (since P is of finite rank) that P U ξk → P U ξ. Hence, P U ξ = χ, as required. We now give a proof of Proposition 7.4: Proof of Proposition 7.4. To see the existence of ξk for large k it suffices to observe that Ran(PΩ U ) and Ran(PΩ U Pk ) coincide for all sufficiently large k, since PΩ has finite rank. For the second part of the proposition, it is easy to see that it suffices to show that every subsequence of {ξk }k∈N has a convergent subsequence in the l1 norm with limit ξ satisfying kξkl1 = inf {kηkl1 : PΩ U η = PΩ U x0 } . η∈H (9.3) Let therefore {ξk }k∈N be a subsequence of the original sequence (we use the same notation for simplicity). Since kξk kl1 ≥ kξk+1 kl1 for all large k it follows that {ξk } is bounded. So by weak∗ compactness of the l1 ball we have that, by possibly passing to a subsequence, there is a ξ ∈ H such that ξk → ξ weakly (as elements in H) as k → ∞. 
By compactness of PΩ U we find that PΩ U ξk → PΩ U ξ as k → ∞, and, since PΩ U ξk = PΩ U x0, it follows that PΩ U ξ = PΩ U x0. To see that ξ satisfies (9.3) we argue as follows. We claim that for any λ > 0 we have

kξk kl1 ≤ inf η∈H {kηkl1 : PΩ U η = PΩ U x0} + λ, (9.4)

for all sufficiently large k. Let r = dim(Ran(PΩ U)) < ∞, and let ê1, . . . , êr be coordinate vectors such that span{PΩ U êj}j=1..r = Ran(PΩ U). Then every η ∈ Ran(PΩ U) with kηk = 1 can be written as η = c1 PΩ U ê1 + . . . + cr PΩ U êr, where the cj's are bounded by some 1 ≤ c < ∞. Now let ξ̃ be a minimizer of (9.3) (the existence of such a minimizer is guaranteed by Proposition 9.4), and choose k so large that {êj}j=1..r ⊂ Ran(Pk), kPΩ U Pk⊥ ξ̃k ≤ λ/(2cr) and kPk⊥ ξ̃kl1 ≤ λ/2. Let c1, . . . , cr be chosen such that PΩ U Pk⊥ ξ̃ / kPΩ U Pk⊥ ξ̃k = c1 PΩ U ê1 + . . . + cr PΩ U êr, and set η̃ = Pk ξ̃ + (c1 ê1 + . . . + cr êr) kPΩ U Pk⊥ ξ̃k. It follows that PΩ U η̃ = PΩ U ξ̃ = PΩ U x0, kη̃kl1 ≤ kξ̃kl1 + λ and η̃ ∈ Ran(Pk). Hence kξk kl1 ≤ kξ̃kl1 + λ, and we have shown (9.4). Now choose m ∈ N such that kPm⊥ ξkl1 ≤ λ. Then kξkl1 ≤ kPm ξkl1 + kPm⊥ ξkl1. But Pm ξk → Pm ξ and ξk satisfies (9.4), thus kξkl1 ≤ inf η∈H {kηkl1 : PΩ U η = PΩ U x0} + 2λ for any λ > 0. Therefore ξ satisfies (9.3), as required.

For the final part of the proof, we are required to show that kξk − ξkl1 → 0 as k → ∞. By possibly passing to another subsequence, it follows by (9.4) that

kξk kl1 ≤ inf η∈H {kηkl1 : PΩ U η = PΩ U x0} + 1/k. (9.5)

Note also that, for fixed m ∈ N, we have Pm(ξk − ξ) → 0 as k → ∞. But by (9.5) we also have kPm ξk kl1 + kPm⊥ ξk kl1 ≤ kPm ξkl1 + kPm⊥ ξkl1 + 1/k. So

lim m→∞ lim sup k→∞ kPm⊥ ξk kl1 = 0.

It thus follows that ξk → ξ (in l1) as k → ∞, and we are done.

9.3 Existence of unique minimizers

In what follows it will be useful to have several results on the existence of unique minimizers of such problems.
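In finite dimensions, the primal-dual equality of Proposition 9.3 is simply strong duality for a linear program, and it can be checked directly. The sketch below solves both (P1) and (P1∗) for a hypothetical random surjective T (real-valued for simplicity, so that Re(y∗(y0)) is an ordinary dot product); nothing here is specific to the paper's operators.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(2)
mT, nT = 5, 12
T = rng.standard_normal((mT, nT))   # a generic 5 x 12 matrix is onto
y0 = rng.standard_normal(mT)

# (P1): min ||x||_1 s.t. T x = y0, via the split x = u - v with u, v >= 0
primal = linprog(np.ones(2 * nT), A_eq=np.hstack([T, -T]), b_eq=y0,
                 bounds=(0, None), method="highs")

# (P1*): max y0 . y  s.t.  ||T' y||_inf <= 1, with y a free variable
dual = linprog(-y0, A_ub=np.vstack([T.T, -T.T]), b_ub=np.ones(2 * nT),
               bounds=(None, None), method="highs")

print(primal.fun, -dual.fun)        # the two optimal values agree
```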
The finite-dimensional version of the following proposition has become a standard tool for showing existence of unique minimizers in finite-dimensional CS problems [15]. Fortunately, the extension to infinite dimensions is rather straightforward:

Proposition 9.5. Let U ∈ B(H) be unitary and let Ω, ∆ ⊂ N be such that |Ω|, |∆| < ∞. Suppose that x0 ∈ H and that supp(x0) = ∆. Consider the optimization problem

inf η∈H kηkl1 subject to PΩ U η = PΩ U x0. (9.6)

If x0 is the unique minimizer of (9.6), then there exists a vector ρ ∈ H such that

(i) ρ = U∗ PΩ η for some η ∈ H,
(ii) hρ, ej i = hsgn(x0), ej i, j ∈ ∆,
(iii) |hρ, ej i| < 1, j ∈/ ∆.

Conversely, if (i)–(iii) are satisfied and in addition PΩ U P∆ : P∆H → PΩH has full rank, then x0 is the unique minimizer of (9.6).

This proposition (for a proof, see the Appendix) may be a little hard to work with in practice. However, a more convenient result with somewhat relaxed assumptions can also be obtained.

Proposition 9.6. Let U ∈ B(H) with kUk ≤ 1 and suppose that ∆ and Ω are finite subsets of N. Let x0 ∈ H be such that supp(x0) = ∆. Let M ∈ N and suppose that M is so large that ∆ ⊂ {1, . . . , M}. Let ξ, ξM ∈ H be such that

kξkl1 = inf η∈H {kηkl1 : PΩ U η = PΩ U x0},  kξM kl1 = inf η∈H {kηkl1 : PΩ U PM η = PΩ U x0}.

Suppose that there is a ρ ∈ ran(U∗ PΩ) and a q > 0 with the following properties:

(i) kq−1 P∆ U∗ PΩ U P∆ − P∆k ≤ 1/2,
(ii) kP∆ ρ − sgn(x0)k ≤ √q/4,
(iii) kP∆⊥ ρkl∞ ≤ 1/2.

Then ξ = x0. Also, if (i) and (ii) are satisfied and (iii) is replaced with kPM P∆⊥ ρkl∞ ≤ 1/2, then ξM = x0.

Proof. Let ζ = ξ − x0. We will show that ζ = 0. We begin by showing that kP∆ ζk ≤ √(2/q) kP∆⊥ ζk. This follows from some simple observations. First note that by a small computation and (i) we have

kPΩ U P∆ ζk2 ≥ q(1 − kq−1 P∆ U∗ PΩ U P∆ − P∆k) kP∆ ζk2 ≥ (q/2) kP∆ ζk2.

Also, by assumption, we obviously have kPΩ U P∆⊥ ζk ≤ kP∆⊥ ζk. Thus, if kP∆ ζk > √(2/q) kP∆⊥ ζk we get

kPΩ U P∆ ζk > kP∆⊥ ζk ≥ kPΩ U P∆⊥ ζk.
Since PΩ U ζ = 0 this is a contradiction. Let us now note the following: for j ∈ ∆ we have |(x0 + ζ)(j)| = ||(x0 )(j)| + ζ(j)sgn(x0 )(j)| ≥ |(x0 )(j)| + Re(ζ(j)sgn(x0 )(j)). Since supp(x0 ) = ∆ we obtain kx0 + ζkl1 ≥ kx0 kl1 + Rehζ, sgn(x0 )i + X |ζ(j)|, (9.7) j∈∆c where ∆c = N\∆. Also, by the assumption that ρ ∈ ran(U ∗ PΩ ) and the fact that PΩ U ζ = 0, it follows that ζ ⊥ ρ.q Thus, using (9.7) we obtain (by applying (ii), (iii), Hölder’s inequality and finally the observation ⊥ kP∆ ζk ≤ 2q kP∆ ζk) ⊥ kx0 + ζkl1 ≥ kx0 kl1 + Rehζ, sgn(x0 ) + P∆ sgn(ζ) − ρi ⊥ ⊥ ≥ kx0 kl1 + kP∆ ζkl1 − (|hζ, sgn(x0 ) − P∆ ρi| + |hζ, P∆ ρi|) √ q 1 ⊥ ⊥ ≥ kx0 kl1 + kP∆ ζkl1 − kP∆ ζkl1 + kP∆ ζkl1 4 2 ! √ 2 ⊥ 1 ⊥ ⊥ ≥ kx0 kl1 + kP∆ ζkl1 − kP∆ ζkl1 + kP∆ ζkl1 . 4 2 (9.8) Thus, if ζ 6= 0 this gives kx0 + ζkl1 > kx0 kl1 contradicting the fact that kξkl1 ≤ kx0 kl1 . Hence ζ = 0, and this gives the first part of the proposition. The argument for the second part of the proposition is almost identical. By letting ζ = ξM − x0 we may use exactly the same analysis as previously, except for the transition from the second line in (9.8) to the third line. In that case, since ζ ∈ ran(PM ), we only need the ⊥ requirement that kPM P∆ ρkl∞ ≤ 1/2. 24 10 Stability analysis for infinite-dimensional convex optimization In the previous section we established conditions that guarantee recovery of x0 ∈ l1 (N) by solving min kηkl1 subject to PΩ U η = PΩ U x0 , η∈l1 (N) (10.1) and its finite-dimensional approximations min kηkl1 subject to PΩ U Pk η = PΩ U x0 . η∈l1 (N) (10.2) In particular, we gave a proof of Proposition 7.4. We now consider the issue of stability in such optimization problems. In other words, we consider the effect of replacing x0 by x0 + h, where h is small in norm, on the minimizers ξ and ξk of (10.1) and (10.2) respectively. 
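For a concrete finite-dimensional instance, the certificate of Propositions 9.5 and 9.6 can be built explicitly: whenever P∆ U∗ PΩ U P∆ is invertible, the vector ρ = U∗ PΩ U P∆ (P∆ U∗ PΩ U P∆)−1 sgn(x0) lies in ran(U∗ PΩ) and agrees with sgn(x0) on ∆ by construction, so only the off-support bound remains to be inspected. A sketch with hypothetical choices of U (a DCT matrix), Ω and ∆:

```python
import numpy as np
from scipy.fft import dct

n = 32
U = dct(np.eye(n), norm="ortho", axis=0)      # stand-in isometry
Delta = np.array([1, 7, 20])                  # hypothetical support
Omega = np.arange(0, n, 2)                    # hypothetical sample set
x0 = np.zeros(n)
x0[Delta] = [1.0, -1.0, 2.0]
sgn = np.sign(x0)

A = U[Omega][:, Delta]                        # P_Omega U P_Delta
G = A.T @ A                                   # P_Delta U* P_Omega U P_Delta
rho = U[Omega].T @ (A @ np.linalg.solve(G, sgn[Delta]))   # lies in ran(U* P_Omega)

off_support = np.max(np.abs(np.delete(rho, Delta)))
print(off_support)    # condition (iii) of Proposition 9.5 asks for this to be < 1
```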
Note that this is the first step towards a proof of Theorems 7.2 and 7.3 concerning the recovery of compressible signals which are described by the semi/fully infinite-dimensional models §3. However, at this moment we do not consider either sparsity or randomness. This comes in §11, in which the results proved in this and the previous section are applied to the sparse recovery problems (7.1) and (7.4) to yield proofs of Theorems 7.1–7.3. 10.1 Stability Stability turns out to be a rather subtle issue. We now illustrate why. Definition 10.1. Let Ω, ∆ be finite subsets of N, U ∈ B(H) and let f : R+ → R+ be a continuous function such that limt→0 f (t) = 0. If ξ ∈ H, supp(ξ) = ∆, is the unique minimizer of inf{kηkl1 : PΩ U η = PΩ U ξ}, (10.3) and for any > 0 and ζ ∈ H such that kξ − ζkl1 ≤ , we have that kx − ξkl1 ≤ f (), where x is a minimizer of inf{kηkl1 : PΩ U η = PΩ U ζ}, then we say that {U, Ω, ∆} is locally f -stable at ξ. If f (t) = Ct for some constant C > 0 then {U, Ω, ∆} is said to be locally linearly stable at ξ. We say that {U, Ω, ∆} is globally f -stable (linearly stable) if the above statements hold for all ξ ∈ H, supp(ξ) = ∆, such that ξ is the unique minimizer of (10.3). Proposition 10.2. Let U ∈ B(H) be unitary and let Ω, ∆ be finite subsets of N. Suppose that {U, Ω, ∆} is globally f -stable. Suppose also that there exists x ∈ H, supp(x) = ∆, such that x is the unique minimizer of inf{kηkl1 : PΩ U η = PΩ U x}. Then, if (PΩ U P∆ )∗ PΩ U P∆ |P∆ H is invertible, and y ∈ H, supp(y) = ∆, is arbitrary, then y is the unique minimizer of inf{kηkl1 : PΩ U η = PΩ U y}. Proposition 10.3. Let U ∈ B(H) be unitary and let Ω, ∆ be finite subsets of N. Suppose that for any ξ ∈ H, supp(ξ) = ∆, then ξ is the unique minimizer of inf{kηkl1 : PΩ U η = PΩ U ξ}, and also that (PΩ U P∆ )∗ PΩ U P∆ |P∆ H is invertible. Then, {U, Ω, ∆} is globally linearly stable. 
These results establish the relationship between global stability and the existence of unique minimizers (proofs are given in the Appendix). In particular, existence of unique minimizers for all y with supp(y) = ∆ is (almost) equivalent to global stability. Thus, global stability is a rather strict condition and may be difficult to achieve. However, we will be concerned with a fixed signal to recover, and hence global stability may not be necessary. Conditions that establish local stability are the topic of the next section.

10.2 The key result

The key result of this section, which will later lead to the proofs of Theorems 7.1–7.3, is the following:

Proposition 10.4. Let U ∈ B(H) with kUk ≤ 1, and suppose that ∆ and Ω are finite subsets of N. Let x0, h ∈ H be such that supp(x0) = ∆ and khkl1 < ∞, and suppose that ∆ ⊂ {1, . . . , M} for some M ∈ N. Let ξ, ξM ∈ H satisfy

kξkl1 = inf η∈H {kηkl1 : PΩ U η = PΩ U (x0 + h)}, (10.4)

kξM kl1 = inf η∈H {kηkl1 : PΩ U PM η = PΩ U (x0 + PM h)}.

If there exists ρ ∈ ran(U∗ PΩ) and q > 0 with the following properties:

(i) kq−1 P∆ U∗ PΩ U P∆ − P∆k ≤ 1/2,
(ii) kP∆ ρ − sgn(x0)k ≤ q/8,
(iii) kP∆⊥ ρkl∞ ≤ 1/2,

then

kξ − x0k ≤ (20/q + 10 + q/2) khkl1. (10.5)

Also, if (i) and (ii) hold and (iii) is replaced with kPM P∆⊥ ρkl∞ ≤ 1/2, then

kξM − x0k ≤ (20/q + 10 + q/2) kPM hkl1. (10.6)

Proof. Note that (10.4) and (i) yield

PΩ U (x0 − P∆ ξ) = PΩ U (P∆⊥ ξ − h)
⇒ P∆ U∗ PΩ U (x0 − P∆ ξ) = P∆ U∗ PΩ U (P∆⊥ ξ − h) (10.7)
⇒ x0 − P∆ ξ = (P∆ U∗ PΩ U P∆)−1 P∆ U∗ PΩ U (P∆⊥ ξ − h)

(note that (i) implies that P∆ U∗ PΩ U P∆ is invertible). Hence, from (i) and (10.7), and by using the fact that kUk ≤ 1, we obtain

kx0 − P∆ ξk ≤ (2/q) kP∆⊥ ξ − hk. (10.8)

Thus,

kx0 − ξk ≤ (2/q) kP∆⊥ ξ − hk + kP∆⊥ ξk ≤ (2/q + 1) kP∆⊥ ξkl1 + (2/q) khkl1. (10.9)

The rest of the proof is therefore devoted to showing that kP∆⊥ ξkl1 is bounded by a constant times khkl1.
∗ Note that the fact that ρ ∈ ran(U PΩ ) and PΩ U (ξ − (x0 + h)) = 0 implies that hξ, ρi = hx0 + h, ρi. Thus, it follows, by appealing to (iii), that Re(hx0 , ρi) + Re(hh, ρi) = Re(hξ, ρi) ≤ Re(hξ, P∆ ρi) + 1 X |ξ(j)|. 2 c (10.10) j∈∆ Hence, from (10.10), (ii) and (iii) we get Re(hx0 − ξ, P∆ ρi) − (1 + q/8)khkl1 ≤ 1 X |ξ(j)|, 2 c j∈∆ and, by some simple adding and subtracting, we obtain 1 ⊥ ⊥ ξkl1 . Re(hx0 − ξ, P∆ ρi) − kP∆ ξkl1 −(1 + q/8)khkl1 ≤ − kP∆ 2 (10.11) We return to this equation, but for the meantime we will continue to investigate the quantity Re(hx0 − ⊥ ξ, P∆ ρi) − kP∆ ξkl1 . Note that ⊥ kx0 kl1 − kξkl1 + Re(hx0 − ξ, P∆ ρ − sgn(x0 )i) ≤ Re(hx0 − ξ, P∆ ρi) − kP∆ ξkl1 . (10.12) Moreover, because of (10.4) and (10.12), it follows that ⊥ −khkl1 + Re(hx0 − ξ, P∆ ρ − sgn(x0 )i) ≤ Re(hx0 − ξ, P∆ ρi) − kP∆ ξkl1 , so by appealing to (10.13), (10.8) and (ii) we find that ⊥ ⊥ −khkl1 − 1/4kP∆ ξ − hk ≤ Re(hx0 − ξ, P∆ ρi) − kP∆ ξkl1 , 26 (10.13) and thus 5 1 ⊥ ⊥ − khkl1 − kP∆ ξkl1 ≤ Re(hx0 − ξ, P∆ ρi) − kP∆ ξkl1 . 4 4 By inserting the latter into (10.11) we finally obtain ⊥ kP∆ ξkl1 ≤ (9 + q/2)khkl1 . (10.14) Substituting (10.14) into (10.9) now yields (10.5). The proof of (10.6) is almost identical, and we omit the details. 11 Proofs of the main results 11.1 The Idea Before we present proofs of Theorems 7.1–7.3, we would like to sketch the key ideas. Our approach is to use Proposition 10.4 to show the existence of some ρ ∈ ran(U ∗ PΩ ) with the following properties (i) kθ−1 P∆ U ∗ PΩ U P∆ − P∆ k ≤ 1/2, (ii) kP∆ ρ − sgn(x0 )k ≤ θ/8 ⊥ (iii) kPM P∆ ρkl∞ ≤ 1/2, for some θ > 0 (recall the setup in Theorems 7.1 and 7.2). Throughout the paper we will be concerned with randomly choosing a set Ω ⊂ {1, . . . , N }. In our models we will choose Ω uniformly at random, however, in some of the proofs we will also use another approach that renders the analysis possible, whilst not affecting the model unduly. We will typically take a sequence {δ1 , . . . 
δN } of independent identically distributed Bernoulli random variables taking values 0 and 1 with P(δj = 1) = q for all j and let Ω = {j : δj = 1}. We will refer to this type of random selection of Ω as the Bernoulli model and we will denote such a procedure by {N, . . . , 1} ⊃ Ω ∼ Ber(q). We will assume that {N, . . . , 1} ⊃ Ω ∼ Ber(θ), for some finite N ∈ N. However, we will construct Ω in an equivalent, but slightly different way. Namely, we let Ω = Ω1 ∪ Ω2 ∪ · · · ∪ Ωµ , Ωj ∼ Ber(qj ), where the specific value of µ will be determined later. Note that as long as the qj s are chosen according to θ this is equivalent to letting Ω ∼ Ber(θ). Indeed, we have that Ω ∼ Ber(θ) is equivalent to Ωc ∼ Ber(1 − θ). So, for k ∈ {1, . . . , N }, we have P(k ∈ Ωc ) = (1 − θ), where Ωc = {1, . . . , N }\Ω. But P(k ∈ (Ω1 ∪ Ω2 ∪ · · · ∪ Ωµ )c ) = (1 − q1 )(1 − q2 ) · · · (1 − qµ ). Thus, if we let (1 − q1 )(1 − q2 ) · · · (1 − qµ ) = (1 − θ) (11.1) it is easy to see (by independence) that the two models are equivalent. Note that, obviously, there might be overlaps between the Ωj s. This automatically gives us the following: q1 + q2 + . . . + qµ ≥ θ. This fact will be used several times in the arguments that follow and is a very crucial observation. We can now present the Golfing Scheme. 11.2 The Golfing Scheme Let U ∈ B(H) be an isometry and let {N, . . . , 1} ⊃ Ωj ∼ Ber(qj ) for j = 1, . . . , µ for some µ ∈ N where the qj s satisfy (11.1) for some 0 < θ ≤ 1. Suppose also that x0 ∈ H. Define the operator EΩj = U ∗ PΩj U, 27 j = 1, . . . , µ. The construction of ρ is based on the following idea. Let ρ = Yµ , Yi = i X qj−1 EΩj Zj−1 (11.2) j=1 Zi = sgn(x0 ) − P∆ Yi , Z0 = sgn(x0 ), where the specific value of µ will be determined later. The construction suggested in (11.2) will be referred to as the golfing scheme, and is a variant of the extremely useful original golfing scheme introduced in [27] by David Gross. 
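The recursion (11.2) is straightforward to implement, and the algebraic identity behind it, Zi = (P∆ − qi−1 P∆ EΩi P∆)Zi−1 (used later as (11.11)), can be checked numerically. The sketch below uses a hypothetical DCT stand-in for U and equal Bernoulli parameters; no claim about the decay rate of kZik is tested, only the identity itself.

```python
import numpy as np
from scipy.fft import dct

rng = np.random.default_rng(3)
n = 32
U = dct(np.eye(n), norm="ortho", axis=0)       # stand-in isometry
Delta = np.array([1, 7, 20])
sgn = np.zeros(n)
sgn[Delta] = [1.0, -1.0, 1.0]

def P_Delta(v):
    out = np.zeros_like(v)
    out[Delta] = v[Delta]
    return out

q, mu = 0.4, 5
Z = [sgn.copy()]                               # Z_0 = sgn(x0)
Y = np.zeros(n)
for i in range(mu):
    mask = (rng.random(n) < q).astype(float)   # Omega_i ~ Ber(q)
    E = U.T @ (mask[:, None] * U)              # E_{Omega_i} = U* P_{Omega_i} U
    Y = Y + E @ Z[-1] / q                      # Y_i = sum_j q_j^{-1} E_{Omega_j} Z_{j-1}
    Z.append(sgn - P_Delta(Y))                 # Z_i = sgn(x0) - P_Delta Y_i
    pred = P_Delta(Z[-2] - E @ Z[-2] / q)      # (P_Delta - q^{-1} P_Delta E P_Delta) Z_{i-1}
    assert np.allclose(Z[-1], pred)            # the identity (11.11)
print([round(np.linalg.norm(z), 4) for z in Z])
```

Each Zi stays supported on ∆, which is what makes the one-step form of (11.11) possible.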
The actual construction will differ slightly from the one suggested here, however, this should give the reader an idea about the approach. Before we can prove the theorems we need to establish some results that will be crucial in the construction of ρ. 11.3 The Proofs We first require the following three results. Proofs are found in the Appendix: Proposition 11.1. Let U ∈ B(H) be an isometry. Let {N, . . . , 1} ⊃ Ω ∼ Ber(q) for some 0 < q ≤ 1, and ∆ ⊂ N with |∆| < ∞. Also, let M ∈ N be so large that ∆ ⊂ {1, . . . , M } and define EΩ = U ∗ PΩ U . Then, for η ∈ H and t, γ > 0 ⊥ ⊥ ∗ P kq −1 PM P∆ EΩ P∆ ηkl∞ > (t + kPM P∆ U PN U P∆ kmr )kηk ≤ γ (11.3) provided q≥ ! √ 2 2p 4 c 4 + |∆| · log |∆ ∩ {1, . . . , M }| · υ 2 (U ). t2 3t γ Also, ⊥ ⊥ ∗ P kq −1 P∆ EΩ P∆ ηkl∞ > (t + kP∆ U PN U P∆ kmr )kηk ≤ γ whenever q≥ (11.4) ! √ 4 2 2p + |∆| · log (4ω/γ) · υ 2 (U ), t2 3t where ω = ω̃M,U (|∆|, tq, N ) (recall ω̃M,U from (6.2)). In addition, if q = 1,the left-hand sides of (11.3) and (11.4) are equal to zero. Proposition 11.2. Let U ∈ B(H) be an isometry, ∆ ⊂ N with |∆| < ∞ and {N, . . . , 1} ⊃ Ω ∼ Ber(q) for some 0 < q ≤ 1. Then, for η ∈ H and t, γ > 0, we have P q −1 P∆ U ∗ PΩ U P∆ − P∆ η > (t + kP∆ U ∗ PN U P∆ − P∆ k) kηk ≤ γ, provided q(1 − q)−1 ≥ 4t−2 · υ 2 (U ) · |∆|, and t max{q −1 − 1, 1} 2K K −1 2 log 1 + ≥ max{q − 1, 1} · υ (U ) · |∆| · log , 2q −1 (1 − q) t γ where K is the constant in Talagrand’s Theorem 14.2. Theorem 11.3. There exists a constant C > 0 with the following property. Suppose that U ∈ B(H) is an isometry, ∆ a finite subset of N and {N, . . . , 1} ⊃ Ω ∼ Ber(θ) for some 0 < θ ≤ 1 . Then, for > 0 and γ > 0 we have that 1 P θ−1 P∆ U ∗ PΩ U P∆ − P∆ ≥ + kP∆ U ∗ PN U P∆ − P∆ k ≤ , (11.5) γ provided that θ ≥ C · γ · υ 2 (U ) · |∆| · log(|∆|), θ ≥ C · γ · υ 2 (U ) · |∆| · log(C−1 ) · log 1 + If θ = 1 then the left hand side of (11.5) is equal to zero. 28 1 γ (1 − θ) −1 (11.6) . 
With these results established, we can now embark on the task of proving the main theorems of this paper. Proof of Theorem 7.1 and Theorem 7.2. The set Ω ⊂ {1, . . . , N } is chosen uniformly at random with |Ω| = m. By Proposition 10.4 it suffices to show that there exists a ρ ∈ ran(U ∗ PΩ ) such that (i) kθ−1 P∆ U ∗ PΩ U P∆ −P∆ k ≤ 1/2, (ii) kP∆ ρ−sgn(x0 )k ≤ θ/8, ⊥ (iii) kPM P∆ ρkl∞ ≤ 1/2, (11.7) with large probability. Note that we may (without loss of generality) replace this way of choosing Ω with the model that {N, . . . , 1} ⊃ Ω ∼ Ber(θ) for θ = m/N (θ will have this value throughout the proof). Doing so may only change the constant C in (7.2). This trick has almost become standard in the literature and we will thus skip the specifics (see [14, 15] for details). Note that, as discussed in Section 11.1, the model {N, . . . , 1} ⊃ Ω ∼ Ber(θ) is equivalent to choosing Ω as Ω = Ω1 ∪ Ω2 ∪ · · · ∪ Ωµ , Ωj ∼ Ber(qj ), for some µ ∈ N with (1 − q1 )(1 − q2 ) · · · (1 − qµ ) = (1 − θ). (11.8) The latter model is the one we will use throughout the proof and the specific value of µ will be chosen later. The theorems will follow if we can show that the conditions in (11.7) occur with probability exceeding 1 − , and what follows is a setup to ensure this eventually. We will focus on (ii) and (iii) in (11.7) and deal with (i) at the end of the proof. The proof will proceed in a number of steps. Step I (The construction of ρ): Let ν be a positive number such that ν ≤ µ and let {α1 , . . . , αµ } and {β1 , . . . , βµ } be sequences of positive numbers. The values of µ, ν, {αi }µi=1 and {βi }µi=1 will be carefully chosen later in the proof. Consider now the following construction of ρ : let Z0 = sgn(x0 ), and define recursively the sequences {Zi }µi=0 ⊂ H, {Yi }µi=1 ⊂ H and {Θi }µi=1 ⊂ N as follows: first define Zi = sgn(x0 ) − P∆ Yi , Yi = i X qj−1 EΩj Zj−1 , i = 1, 2, j=1 where EΩj = U ∗ PΩj U, and {q1 , . . . , qµ } stem from (11.8). 
The precise values of the qj ’s will be chosen later. Let also Θ1 = {1} and Θ2 = {1, 2}. Then define recursively, for i ≥ 3, the following: −1 ≤ αi kZi−1 k , Θi−1 ∪ {i} if P∆ − qi P∆ EΩi P∆ Zi−1 ⊥ Θi = (11.9) and qi−1 PM P∆ EΩi P∆ Zi−1 l∞ ≤ βi kZi−1 k, Θi−1 otherwise, (P −1 j∈Θi qj EΩj Zj−1 if i ∈ Θi , Yi = i ≥ 3, Yi−1 otherwise, ( sgn(x0 ) − P∆ Yi if i ∈ Θi , Zi = i ≥ 3. Zi−1 otherwise, Now, let {Ai }2i=1 and {Bi }4i=1 denote the following events P∆ − q −1 P∆ EΩi P∆ Zi−1 ≤ αi kZi−1 k , Ai : i = 1, 2, i −1 ⊥ q PM P∆ EΩi P∆ Zi−1 ∞ ≤ βi kZi−1 k, Bi : i = 1, 2, i l B3 : |Θµ | ≥ ν, B4 : (∩2i=1 Ai ) ∩ (∩3i=1 Bi ). Also, let τ (j) denote the j th element in Θµ (e.g. τ (1) = 1, τ (2) = 2 etc.) and finally define ρ by ( Yτ (ν) if B4 occurs , ρ= sgn(x0 ) otherwise. 29 (11.10) Note that, clearly, ρ ∈ ran(U ∗ PΩ ) if B4 occurs. Now make the following observations: a straightforward calculation (using the fact that Z0 = sgn(x0 )) yields −1 −1 Zτ (i) = sgn(x0 ) − P∆ qτ−1 E sgn(x ) + q E Z + . . . + q E Z Ω 0 Ω 1 Ω τ (i−1) τ (1) τ (2) τ (i) (1) τ (2) τ (i) (11.11) −1 = (P∆ − qτ (i) P∆ EΩτ (i) P∆ )Zτ (i−1) , i ≤ |Θµ |. Hence, if the event B4 occurs, we have kP∆ ρ − sgn(x0 )k = kZτ (ν) k ≤ p |∆| ν Y ατ (i) , (11.12) i=1 ⊥ kPM P∆ ρkl∞ ≤ ν X ⊥ ∞ kqτ−1 (i) PM P∆ EΩτ (i) Zτ (i−1) kl i=1 ≤ ν X βτ (i) kZτ (i−1) k ≤ p |∆| ν X βτ (i) i=1 i=1 i−1 Y (11.13) ατ (j) , j=1 and ρ ∈ ran(U ∗ PΩ ). We will now show that with a certain choice of parameters ν, {βj }µj=1 and {αj }µj=1 then (ii) and (iii) in (11.7) are satisfied when the event B4 occurs. We delay specifying a the value for µ until Step IV. Let L ≥ 2, (we will give a value for L in a moment) and α1 = α2 = 1 1/2 2 log2 (L) 1 β1 = β2 = p , 4 |∆| It follows that , αi = 1/2, 3 ≤ i ≤ µ, p log2 (4θ−1 |∆|) p βi = , 4 |∆| ν Y p |∆| ατ (i) = i=1 3 ≤ i ≤ µ. p |∆| . 2ν log2 (L) Hence, if l p m ν = log2 8θ−1 |∆| , (11.14) then it follows by (11.12) that kP∆ ρ − sgn(x0 )k ≤ θ/8 (recall that L ≥ 2) yielding (ii) in (11.7). 
Also, after inserting the values of ν, {βj }µj=1 and {αj }µj=1 into (11.13) we get: p |∆| ν X i=1 1 = 4 βτ (i) i−1 Y ατ (j) j=1 ! p p p 1 1 1 log2 (4θ−1 |∆|) 1 log2 (4θ−1 |∆|) 1 log2 (4θ−1 |∆|) 1+ + + + . . . + ν−1 2 log1/2 (L) 4 log2 (L) 8 log2 (L) 2 log2 (L) 2 1 ≤ , 2 if we let L = 4θ−1 p |∆|. Thus, by (11.13) we have ⊥ kPM P∆ ρkl∞ ≤ 1/2, yielding (iii) in (11.7). In particular, we have showed that, if ν, {βj }µj=1 and {αj }µj=1 are chosen as above, then (ii) and (iii) are satisfied when B4 occurs. Thus, we have now obtained a framework for proving (ii) and (iii) in (11.7) with a certain probability. To do this, we will make a careful choice of µ and then provide bounds on P(B4c ). The way this latter step is carried out is by giving estimates for P(Ac1 ∪ Ac2 ), P(B1c ∪ B2c ) and P(B3c ). This is the content of Steps II–IV. 30 Step II: We claim that, if γ > 0, then P(Ac1 ∪ Ac2 ) ≤ 2γ, provided N , q1 , q2 are chosen such that kP∆ U ∗ PN U P∆ − P∆ k ≤ and 1 p 1/2 4 log2 (4θ−1 |∆|) , p q1 = q2 ≥ C · υ 2 (U ) · |∆| · log γ −1 + 1 · log θ−1 |∆| , (11.15) (11.16) for some universal constant C > 0. Also, if q1 = q2 = 1, then P(Ac1 ∪ Ac2 ) = 0. To deduce the claim, we first observe that by Proposition 6.2 these requirements are well defined. Now note that Proposition 11.2 gives, for i = 1, 2, t, γ > 0 and η ∈ H, that P qi−1 P∆ U ∗ PΩi U P∆ − P∆ η > (t + kP∆ U ∗ PN U P∆ − P∆ k) kηk ≤ γ, (11.17) if qi (1 − qi )−1 ≥ 4t−2 · υ 2 (U ) · |∆|, (11.18) and 2K t max{qi−1 − 1, 1} ≥ log 1 + max{qi−1 − 1, 1} · υ 2 (U ) · |∆| · log Kγ −1 , t 2qi−1 (1 − qi ) (11.19) where K is the constant in Talagrand’s Theorem 14.2. Thus, by (11.17), (11.18) and (11.19) (and a small computation using Taylor’s Theorem), we can choose t = αi /2 and deduce the first assertion in Step II. As for the second assertion, clearly, if q1 = q2 = 1 then (11.18) and (11.19) are satisfied for all γ > 0. Thus the left-hand side of (11.17) must be zero, and hence the last assertion follows. 
Step III: We claim that, for γ > 0, then P(B1c ∪ B2c ) ≤ 2γ, if N , q1 and q2 are chosen such that 1 ⊥ ∗ , kPM P∆ U PN U P∆ kmr ≤ p 8 |∆| (11.20) q1 = q2 ≥ C · υ 2 (U ) · |∆| · log γ −1 M + 1 , (11.21) and for some universal constant C > 0. Also, if q1 = q2 = 1, then P(B1c ∪ B2c ) = 0. To prove the claim, recall that Proposition 11.1 gives, for i = 1, 2 and t, γ > 0, that ⊥ ⊥ ∗ EΩi P∆ η l∞ > (t + kPM P∆ U PN U P∆ k)kηk ≤ γ, η ∈ H, P qi−1 PM P∆ if qj ≥ ! √ 4 2 2p 4 c + |∆| · log |∆ ∩ {1, . . . , M }| · υ 2 (U ). t2 3t γ Choosing t = βi /2 automatically yields the first assertion in Step III. Also, the fact that P(B1c ∪ B2c ) = 0, q1 = q2 = 1, follows automatically from Proposition 11.1. Step IV: We claim that, for γ > 0, then P(B3c ) ≤ γ, if µ, (recall µ and ν from Step I) N and {q3 , . . . , qµ } are chosen such that µ = d4(log(γ −1 ) + ν)e, (11.22) kP∆ U ∗ PN U P∆ − P∆ k ≤ 1/4, and ⊥ ∗ kPM P∆ U PN U P∆ kmr ≤ p log2 (4θ−1 |∆|) p , 8 |∆| (11.23) (11.24) and also q3 = q4 = . . . = qµ = q, where 2 q ≥ C · υ (U ) · |∆| · ! log (M ) p +1 , log2 (4θ−1 |∆|) 31 (11.25) for some universal constant C > 0. Also, if q3 = q4 = . . . = qµ = 1, then P(B3c ) = 0. To prove the claim we start by determining the condition (11.22) on µ. Define the random variables X1 , . . . Xµ−2 by ( 0 Zj+2 6= Zj+1 , Xj = 1 Zj+2 = Zj+1 . We immediately observe that P(B3c ) = P(|Θµ | < ν) = P(X1 + . . . + Xµ−2 > µ − ν). Let P > 0 (a specific value for P will be assigned later) be such that P ≥ P(Xj = 1), j = 1, . . . , µ − 2, e1 , . . . , X eµ−2 be independent binary random variables, with P(X ek = 1) = P and P(X ek = 0) = and let X 1 − P for each k. Then, e1 + . . . + X eµ−2 > µ − ν . P (|Θµ | < ν) ≤ P X Note that by the standard Chernoff bound ([34, Theorem 2.1]) it follows that, for t > 0, e1 + . . . + X eµ−2 ≥ (µ − 2)(t + P ) ≤ e−2(µ−2)t2 . P X (11.26) If we let t = (µ − ν)/(µ − 2) − P , then (11.26) and some easy calculations give, for γ > 0, that e1 + . . . 
+ X eµ−2 > µ − ν ≤ γ, P X whenever µ ≥ x, (x − 2) x−ν −P x−2 2 − log(γ −1/2 ) = 0, (11.27) and x is the largest root satisfying the right equation of (11.27). In particular, we have shown that P(B3c ) ≤ γ when (11.27) is satisfied. We now need to determine a P > 0 and find conditions such that the random variables X1 , . . . , Xµ−2 have P(Xk = 1) ≤ P for each k. We choose P = 1/2. This choice of P gives (by (11.27)) x ≤ 4(ν + log(γ −1/2 )), hence (11.22) yields (11.27). For the rest of the proof of Step IV we need to determine the conditions on N and {q3 , . . . , qµ } such that P(Xk = 1) ≤ P for each k = 1, . . . , µ − 2. Note that Xk = 1 if and only if one of the following events occur: D1 : P∆ − qj−1 P∆ EΩj P∆ Zj−1 > αj kZj−1 k , j = k + 2, −1 (11.28) ⊥ D2 : q PM P∆ EΩj P∆ Zj−1 > βj kZj−1 k, j = k + 2. j l∞ Observe that we may argue exactly as in the proof of Step II (via Proposition 11.2) and deduce that P(D1 ) ≤ 1/4 when N and qj , are chosen such that kP∆ U ∗ PN U P∆ − P∆ k ≤ αj /2, qj ≥ C · υ 2 (U ) · |∆| · αj−2 · (log (4) + 1) , j = k + 2, (11.29) for some universal constant C > 0. Observe also that we may argue exactly as in the proof of Step III (via Proposition 11.1) and deduce that P(D2 ) ≤ 1/4 when N and qj are chosen such that ⊥ ∗ kPM P∆ U PN U P∆ kmr ≤ βj /2, ! 1 1p 2 + |∆| · (log (4M ) + 1) , qj ≥ C · υ (U ) · βj2 βj 32 j = k + 2, (11.30) for some universal constant C > 0. Thus, P(Xk = 1) ≤ P(D1 ∪ D2 ) ≤ P whenever (11.29) and (11.30) are satisfied. But (11.29) and (11.30) follow from (11.23), (11.24) and (11.25) (with a possibly different, however universal, C) and thus, the first part of the claim is proved. The fact that if q3 = q4 = . . . = qµ = 1 then P(B3c ) = 0 follows from Propositions 11.2 and 11.1. 
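The Chernoff bound (11.26) invoked in Step IV can be sanity-checked by simulation. The sketch below is illustrative only (the values of n, P and t are ours): it compares the empirical tail probability of a sum of i.i.d. Bernoulli(P ) variables with the bound exp(−2nt2 ).

```python
import numpy as np

# Monte Carlo check of the Chernoff bound used in (11.26):
# P(X_1 + ... + X_n >= n*(P + t)) <= exp(-2*n*t^2) for iid Bernoulli(P).
rng = np.random.default_rng(0)
n, P, t = 30, 0.5, 0.2                   # illustrative parameters
trials = 20000

sums = rng.binomial(1, P, size=(trials, n)).sum(axis=1)
empirical = np.mean(sums >= n * (P + t)) # empirical tail probability
chernoff = np.exp(-2 * n * t ** 2)       # the bound from (11.26)

assert empirical <= chernoff             # the bound holds (with room to spare)
```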
Step V: We claim that for γ > 0, then ⊥ P kP∆ ρ − sgn(x0 )k > θ/8 ∪ kPM P∆ ρkl∞ > 1/2 ≤ 5γ, (11.31) when N ∈ N and θ > 0 are chosen according to (6.3) and p θ ≥ C · υ 2 (U ) · |∆| · log γ −1 + 1 · log M θ−1 |∆| , (11.32) for some universal constant C > 0. Also, if θ = 1 then the left hand side of (11.31) is equal to zero. To prove this, recall the events A1 , A2 , B1 , B2 , B3 , B4 from Step I. We have already established in Step ⊥ I that if the event B4 occurs then kP∆ ρ − sgn(x0 )k ≤ θ/8 and kPM P∆ ρkl∞ ≤ 1/2. It therefore suffices to show that, given the conditions (6.3), (6.4) and (11.32), it holds that P (B4c ) ≤ 5γ. (11.33) To do this we begin by making some observations. First P (B4c ) ≤ P(Ac1 ∪ Ac2 ) + P(B1c ∪ B2c ) + P(B3c ), (11.34) q1 + q2 + . . . + qµ ≥ θ. (11.35) and second P(Ac1 Ac2 ) Recall from Step II we have that ∪ ≤ 2γ whenever (11.15) and (11.16) are satisfied. Also, by Step III, P(B1c ∪ B2c ) ≤ 2γ whenever (11.20) and (11.21) are fulfilled. Finally, from Step IV we have that P(B3c ) ≤ γ provided l l p mm (11.36) µ = 4 log(γ −1 ) + log2 8θ−1 |∆| and (11.23), (11.24) and (11.25) are satisfied. In particular, using (11.34) we find that (11.33) follows from (11.15), (11.16), (11.20), (11.21), (11.23), (11.24) and (11.25). We must then show that these equations follow from (6.3) and (11.32). Now let q1 = q2 = θ/4. Then, by (11.32), we have that (11.16) follows (with a possibly different constant), and similarly (11.21) follows. Let q = q3 = . . . = qµ . By (11.35) and (11.36) we have l l p mm 8q log(γ −1 ) + log2 8θ−1 |∆| ≥ θ, and hence (11.25) follows. The only thing left to do is to deal with the requirements on N . In particular, we need to show that (11.15), (11.20), (11.23) and (11.24) follow when (6.3) is satisfied. Note that (11.23) and (11.24) are weaker than (11.15) and (11.20). Thus, we only need to concentrate on (11.15) and (11.20). 
To see that (6.3) and (6.4) imply (11.15) and (11.20), note that (since PM ≥ P∆ ) P∆ U ∗ PN U P∆ − P∆ = P∆ (PM U ∗ PN U PM − PM )P∆ , and so kP∆ U ∗ PN U P∆ − P∆ k ≤ kPM U ∗ PN U PM − PM k. Hence (11.15) follows from (6.3). The fact that (11.20) follows from (6.4) is clear. Also, the fact that the left-hand side of (11.31) is equal to zero when θ = 1 follows from Steps II - IV and the fact that when θ = 1 we have q1 = . . . = qµ = 1. Step VI: We claim that, for γ > 0, P(kθ−1 P∆ U ∗ PΩ U P∆ − P∆ k > 1/2) ≤ γ, when N ∈ N and θ > 0 are chosen such that kP∆ U ∗ PN U P∆ − P∆ k ≤ 1/4, θ ≥ C · υ 2 (U ) · |∆| · log γ −1 |∆| + 1 , 33 (11.37) for some universal constant C. Also, if θ = 1 then the left hand side of (11.37) is equal to zero. To prove this claim note that, by Theorem 11.3, there is a K > 0 such that 1 P θ−1 (PΩ U P∆ )∗ PΩ U P∆ − P∆ ≥ + kP∆ U ∗ PN U P∆ − P∆ k ≤ γ, 4 provided θ ≥ 4K · υ 2 (U ) · |∆| · log(|∆|), and θ ≥ 4K · υ 2 (U ) · |∆| · log(Cγ −1 ) · log 1 + 4 (1 − θ) −1 . This yields the asserted claim. The fact that the left hand side of (11.37) is equal to zero when θ = 1 is clear. Step VII: In this final step we will patch the different parts of the proof together. Recall that our initial goal was to show that (11.7) follows with probability exceeding 1 − . Note that in Step V we have shown that if γ > 0, then (ii) and (iii) in (11.7) are satisfied with probability exceeding 1 − 5γ, provided (6.3) and (11.32) is satisfied. We are thus only left to show that (i) follows with a certain probability. However, we immediately recognize that the conditions in Step VI follow from (6.4) and (11.32), and hence (i) in (11.7) follows with probability exceeding 1 − γ. This implies that (i), (ii) and (iii) in (11.7) hold with probability exceeding 1 − 6γ. By choosing γ such that 6γ = we observe that (11.32) follows (with possibly a different C) from the conditions in Theorems 7.1 and 7.2 and we have finally proved the first assertions in Theorem 7.1 and Theorem 7.2. 
The last assertions follow from the fact that θ = 1 when m = N (and hence also q1 = . . . = qµ = 1), together with Steps V and VI. Proof of Theorem 7.3. We will follow the proof of Theorem 7.2 almost word for word, and we will only point out where the differences lie. The first such difference is the set of conditions provided by Proposition 10.4. In particular, we must show that there exists a ρ ∈ ran(U ∗ PΩ ) such that (i) kθ−1 P∆ U ∗ PΩ U P∆ − P∆ k ≤ 1/2, (ii) kP∆ ρ − sgn(x0 )k ≤ θ/8, ⊥ (iii) kP∆ ρkl∞ ≤ 1/2, (11.38) is true with probability exceeding 1 − ε. (Note that only condition (iii) is changed from the proof of Theorem 7.2.) Step I: Almost as in the proof of Theorem 7.2, except that (11.9) should read −1 Θi−1 ∪ {i} if P∆ − qi P∆ EΩi P∆ Zi−1 ≤ αi kZi−1 k , −1 ⊥ Θi = and qi P∆ EΩi P∆ Zi−1 l∞ ≤ βi kZi−1 k, Θi−1 otherwise, and the events B1 and B2 in (11.10) should be −1 ⊥ q P∆ EΩj P∆ Zj−1 ≤ βj kZj−1 k, Bj : j l∞ j = 1, 2. Also, (11.13) must be changed to ⊥ kP∆ ρkl∞ ≤ ν X ⊥ ∞ kqτ−1 (i) P∆ EΩτ (i) Zτ (i−1) kl i=1 ≤ ν X βτ (i) kZτ (i−1) k ≤ i=1 ν i X Y p |∆| βτ (i) ατ (j) . i=1 j=1 Step II: Exactly as in the proof of Theorem 7.1. Step III: We claim that, for γ > 0, P(B1c ∪ B2c ) ≤ 2γ, if N , q1 and q2 are chosen such that 1 ⊥ ∗ kP∆ U PN U P∆ kmr ≤ p , 8 |∆| (11.39) q1 = q2 ≥ C · υ 2 (U ) · |∆| · log γ −1 ω1 + 1 , (11.40) where p ω1 = ω̃M,U (|∆|, q1 (8 |∆|)−1 , N ) (recall ω̃M,U from (6.2)), for some universal constant C > 0. Also, if q1 = q2 = 1, then P(B1c ∪ B2c ) = 0. The claim follows exactly as in the proof of Step III in the proof of Theorem 7.1, using the last part of Proposition 11.1. Step IV: We claim that, for γ > 0, P(B3c ) ≤ γ, if µ (recall µ and ν from Step I), N and {q3 , . . . , qµ } are chosen according to (11.22), (11.23) and p log2 (4θ−1 |∆|) ⊥ ∗ p , (11.41) kP∆ U PN U P∆ kmr ≤ 8 |∆| and also that q3 = q4 = . . . = qµ = q, where 2 q ≥ C · υ (U ) · |∆| · and ω2 = ω̃M,U ! log (ω2 ) p +1 , log2 (4θ−1 |∆|) (11.42) !
p log2 (4θ−1 |∆|) p ,N , |∆|, q 8 |∆| (recall ω̃M,U from (6.2)) for some universal constant C > 0. Also, if q3 = q4 = . . . = qµ = 1, then P(B3c ) = 0. The proof is almost as in the proof of Theorem 7.2, except that the last part of (11.28) should read ⊥ EΩj P∆ Zj−1 l∞ > βj kZj−1 k, D2 : qj−1 P∆ j = k + 2, and (11.30) should be ⊥ ∗ kP∆ U PN U P∆ kmr ≤ βj /2, ! p 1 1 qj ≥ C · υ 2 (U ) · + |∆| · (log (4ω2 ) + 1) , βj2 βj j = k + 2. Step V: We claim that, for γ > 0, ⊥ P kP∆ ρ − sgn(x0 )k > θ/8 ∪ kP∆ ρkl∞ > 1/2 ≤ 5γ, when N ∈ N and θ > 0 are chosen according to (6.3), (6.5) and p θ ≥ C · υ 2 (U ) · |∆| · log γ −1 + 1 · log ωθ−1 |∆| , where ω = ω̃M,U (|∆|, s, N ), s= (11.43) (11.44) θ p , 32 |∆| log(e4 γ −1 ) and ω̃M,U is defined in (6.2), for some universal constant C > 0. Also, if θ = 1 then the left hand side of (11.43) is equal to zero. The strategy is almost as in the proof of Step V in Theorem 7.1. In particular, we argue by using Step II - IV that P (B4c ) ≤ 5γ when (11.15), (11.16), (11.39), (11.40), (11.23), (11.41) and (11.42) are satisfied, and thus (11.43) follows. We then need to show that these equations follow from (6.3), (6.5) and (11.44). To do this, let q1 = q2 = θ/4. Then, by (11.44), we have that (11.16) follows (with a possibly different constant). To show that (11.40) is implied by (11.44) it suffices to show that ω ≥ ω1 . This will follow by the definition (6.2) of ω̃M,U (recall that the mapping s 7→ ω̃M,U (|∆|, s, N ) is a decreasing function), and by observing that p −1 p q1 (8 |∆|)−1 > s = θ 32 |∆| log(e4 γ −1 ) . To show that (11.42) follows from (11.44) it suffices to show that ω ≥ ω2 . To do this (as argued above) it is sufficient to prove that p log2 (4θ−1 |∆|) p q ≥ s. (11.45) 8 |∆| 35 To see why the latter inequality is true, note that q1 + q2 + . . . + qµ ≥ θ. So, by recalling the value of µ (from (11.22)) from Step IV and noting that q = q3 = . . . = qµ , we get l l p mm 8q log(γ −1 ) + log2 8θ−1 |∆| ≥ θ. 
In particular, it follows that q log2 (4θ−1 √|∆|) ≥ θ log2 (4θ−1 √|∆|) / ( 8( log(γ −1 ) + log2 (8θ−1 √|∆|) + 1 ) ) > θ / ( 4 log(e4 γ −1 ) ). (11.46) Thus, we have shown (11.45). We are now left with the task of showing that (11.15), (11.39), (11.40), (11.23) and (11.41) follow from (6.3) and (6.5); this follows by arguing exactly as in Step V of the proof of Theorem 7.1. Step VI and Step VII: Exactly as in the proof of Theorem 7.1. 12 Conclusions and future work We have presented a theoretical framework that allows for CS in infinite dimensions (see also [30, 37] for related ideas). As a result of the theorems proved, one can reconstruct arbitrary signals in separable Hilbert spaces from their measurements, and when the additional structure of sparsity is present, one can also dramatically undersample. In developing this theory, we have extended current finite-dimensional CS to an infinite-dimensional signal model. This paper marks the first foray into a very large topic – infinite-dimensional CS – which brings with it an abundance of new challenges and open problems. We now briefly describe a number of such issues. Noisy data. In this paper we have assumed that the measurements ζ = U x0 are noiseless. However, in practice one always encounters noise, and therefore, rather than solving the equality-constrained l1 -minimization problem min η∈l1 (N) kηkl1 subject to PΩ U η = PΩ ζ, (12.1) one instead considers min η∈l1 (N) kηkl1 subject to kPΩ U η − PΩ ζk < ε, (12.2) where ε is appropriately chosen according to the noise level. Even in finite dimensions, analyzing (12.2) is significantly more involved, and requires different techniques from those used for (12.1). In particular, in finite dimensions the analysis of (12.2) is usually carried out via the Restricted Isometry Property (RIP) [16].
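To make the discussion concrete, the following sketch solves a small finite-dimensional instance of the equality-constrained problem (12.1) approximately, by running iterative soft thresholding (ISTA) on the corresponding LASSO with a small penalty λ. This is our illustration, not the paper's algorithm: the matrix, sparsity level and parameter values are arbitrary choices, and as λ → 0 the LASSO minimizer approaches the basis pursuit solution.

```python
import numpy as np

# Approximate basis pursuit, min ||x||_1 s.t. Ax = b, on a small finite
# section via ISTA on the LASSO  min 0.5*||Ax - b||^2 + lam*||x||_1.
# All dimensions and parameters below are illustrative.
rng = np.random.default_rng(1)
n, m = 32, 16                            # ambient dimension and number of samples
A = rng.standard_normal((m, n)) / np.sqrt(m)
x0 = np.zeros(n)
x0[rng.choice(n, 2, replace=False)] = [1.0, -1.0]   # 2-sparse signal
b = A @ x0                               # noiseless measurements

lam = 1e-3
L = np.linalg.eigvalsh(A.T @ A).max()    # Lipschitz constant of the gradient

def soft(v, t):
    """Soft-thresholding operator, the proximal map of t*||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

x = np.zeros(n)
for _ in range(5000):
    x = soft(x - (A.T @ (A @ x - b)) / L, lam / L)

assert np.linalg.norm(x - x0) < 0.1      # the sparse signal is recovered
```

In the noisy setting (12.2) one would instead keep λ proportional to the noise level ε rather than letting it tend to zero.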
When analyzing (12.2) in infinite dimensions we are faced with exactly the same questions as those addressed in Theorems 7.1–7.3: namely, how large must N and m be, and, of course, how large is the error (in terms of ε)? Unfortunately, any attempt to obtain conditions similar to those of Theorems 7.1–7.3 for (12.2) by using an RIP framework is likely to be unsuccessful, since the RIP is a very strong condition. Fortunately, recent developments in so-called RIPless CS in finite dimensions [13] indicate that such a condition need not be necessary. Developing a full theory for the infinite-dimensional noisy case (12.2) is a subject of current work. Incoherence and semi-random subsampling. The key results of this paper, Theorems 7.1–7.3, demonstrate that subsampling is achievable. However, the amount of subsampling possible is limited by the incoherence parameter υ(U ). Although in many cases of interest this number is small, meaning that one can attain a high degree of undersampling, this need not always be the case. Much as in finite-dimensional CS, there are cases for which this cannot be avoided. However, there are numerous examples in infinite dimensions where the overall incoherence υ(U ) is large, but the incoherence of the ‘tail operator’ U Pk⊥ tends to zero as k → ∞. This is the phenomenon of local coherence: the first few columns of U are coherent, but the remainder are relatively incoherent. This suggests a new approach, semi-random subsampling, in which one carries out full sampling on those indices corresponding to the coherent part of U , and random subsampling on the incoherent part. This is currently work in progress; see [5] for details. 13 Acknowledgments The authors would like to thank Akram Aldroubi, Emmanuel Candès, Holger Rauhut, Jared Tanner, Gerd Teschke, Joel Tropp, Martin Vetterli, Christopher White and Pengchong Yan for useful discussions and comments. 14 Appendix The appendix contains all the proofs that have not been displayed so far.
However, before doing this, there are two results that are absolutely crucial. The first is due to Rudelson [38]. Lemma 14.1. (Rudelson) Let η1 , . . . , ηM ∈ Cn and let ε1 , . . . , εM be independent Bernoulli variables taking values 1, −1 with probability 1/2. Then E k Σ_{i=1}^M εi η̄i ⊗ ηi k ≤ (3/2) √( log(n) ) · max_{i≤M} kηi k · k Σ_{i=1}^M η̄i ⊗ ηi k^{1/2} . Note that the original lemma in [38] does not apply in this case; we need the complex version proved in [40]. We will, however, still refer to it as Rudelson’s Lemma. One should also note that the original lemma only has a constant C in the bound; this constant has been bounded by 3/2 by Tropp in [40]. The following theorem is also indispensable [39]: Theorem 14.2. (Talagrand) There exists a number K with the following property. Consider n independent random variables Xi valued in a measurable space Ω, and a (countable) class F of measurable functions on Ω. Consider the random variable Z = sup_{f ∈F} Σ_{i≤n} f (Xi ), and set S = sup_{f ∈F} kf k∞ , V = E sup_{f ∈F} Σ_{i≤n} f (Xi )2 . Then for each t > 0, we have P(|Z − E(Z)| ≥ t) ≤ K exp( − (t/(KS)) log( 1 + tS/V ) ). Proof of Proposition 9.5. Suppose that x0 is the unique minimizer of (9.6). For n ∈ N let Pn be the projection onto span{e1 , . . . , en }. Then, for all sufficiently large n, it follows that x0 is the unique minimizer of the finite-dimensional optimization problem inf{kxkl1 : x ∈ Pn H, PΩ U Pn x = PΩ U x0 }. Proposition 9.5 is well known to be true in finite dimensions. It follows that there is a yn such that, for ρn = Pn U ∗ PΩ yn , we have |hρn , ej i| < 1 when j ∈ / ∆ and j ≤ n, and hρn , ej i = sgn(hx0 , ej i) for j ∈ ∆. We now claim that there is a constant M < ∞ such that kyn kl∞ ≤ M for all large n. First note that if we consider U ∗ PΩ and Pn U ∗ PΩ as operators from PΩ H to c0 (the set of sequences decaying at infinity), where both PΩ H and c0 are equipped with the l∞ norm, then kU ∗ PΩ − Pn U ∗ PΩ k → 0 as n → ∞.
Also, it is clear (recall that U is unitary) that there is a γ > 0 such that inf ξ∈PΩ H,kξkl∞ =1 kU ∗ PΩ ξkl∞ ≥ γ, and therefore, by the operator norm convergence, there is a γ̃ > 0 such that inf{kPn U ∗ PΩ ξkl∞ : ξ ∈ PΩ H, kξkl∞ = 1} ≥ γ̃ for all sufficiently large n. And the claim follows. Now choose n so large that kPn⊥ U ∗ PΩ ξkl∞ < 1, ∀ ξ ∈ PΩ H, kξkl∞ ≤ M. 37 Then, since kyn kl∞ ≤ M we can define ρ = U ∗ PΩ yn . Then ρ = ιρn + Pn⊥ U ∗ PΩ yn , where ι : Pn H → H is the inclusion map, and thus ρ satisfies the requirements (i), (ii) and (iii). As for the other direction, suppose that (i), (ii) and (iii) are satisfied and that PΩ U P∆ has full rank. Then there is a ρ ∈ l∞ (N) such that ρ = U ∗ PΩ y for some y ∈ PΩ H and kρkl∞ ≤ 1. Also, by (ii) X Re(hPΩ U P∆ x0 , yi) = Re(hx0 , P∆ ρi) = sign(hx0 , ej i)hx0 , ej i = kx0 kl1 . j∈∆ Thus, by using duality (recall Proposition 9.3), in particular the fact that PΩ U : H → PΩ H is onto (this follows since U is unitary) and that inf{kxkl1 : PΩ U x = PΩ U x0 } = sup{Re(hPΩ U x0 , yi) : kPn U ∗ PΩ ykl∞ ≤ 1}, it follows that x0 is a minimizer. But hρ, ej i < 1 for j ∈ / ∆ so if ξ is another minimizer then supp(ξ) = ∆. However, PΩ U P∆ has full rank, so ξ = x0 . Proof of Proposition 10.2. Let α = |∆| and also ω = {ωj }α j=1 be a sequence, where ωj ∈ C. Now define ⊥ ⊥ Vω = I∆c ⊕ Sω : P∆ H ⊕ P∆ H → P∆ H ⊕ P∆ H, (14.1) ⊥ where Sω = diag({ωj }α j=1 ) on P∆ H and I∆c is the identity on P∆ H. Define U (ω) = U Vω . Note that to prove our claim it suffices to show that Vω x is the unique minimizer of inf{kηkl1 : PΩ U η = PΩ U (ω)x} for all ω, where ω ∈ Λ = {(eiθ1 , . . . , eiθα ) ∈ Cα : θj ∈ [0, 2π), 1 ≤ j ≤ α}. (14.2) Indeed, if that is the case then, by Proposition 9.5, for every ω ∈ Λ there exists ζω ∈ PΩ H such that π ω = U ∗ PΩ ζ ω , P∆ πω = sgn(Vω x), kP∆c πω kl∞ < 1. (14.3) Thus, for any y ∈ H such that supp(y) = ∆ choose ω ∈ Λ such that sgn(y) = sgn(Vω x). 
Then, since (PΩ U P∆ )∗ PΩ U P∆ P∆ H is invertible it follows by 14.3 and Proposition 9.5 that y is the unique minimizer of inf{kηkl1 : PΩ U η = PΩ U y}. Note also that if ω ∈ Λ then Vω is clearly unitary and also an isometry on l1 (N). Thus, it is easy to see that Vω ζ is a minimizer of inf{kηkl1 : PΩ U η = PΩ U (ω)x} if and only if ζ is a minimizer of inf{kηkl1 : PΩ U (ω)η = PΩ U (ω)x}. We will therefore consider the latter minimization problem and show that x is the unique minimizer for that for all ω ∈ Λ. To do that, it suffices, by Proposition 9.5 and the fact that U (ω) is unitary, to show that there exists a vector ρ ∈ H such that PΩc U (ω)ρ = 0, P∆ ρ = sgn(x), kP∆c ρkl∞ < 1. (14.4) Now, for > 0 (we will specify the value of later), define the function ϕ : ∪a∈Λ B(a, ) → R+ , where B(a, ) denotes the -ball around a, in the following way. Let W = I∆ ⊕ PΩc U P∆c : P∆ H ⊕ P∆c H → P∆ H ⊕ PΩc H, and define ϕ(ω) = inf{kP∆c ρkl∞ : W ρ = ι∗∆ sgn(x) ⊕ −PΩc U (ω)P∆ ι∗∆ sgn(x)}, where ι∆ : P∆ H → H is the inclusion operator. Then (14.4) is satisfied if and only if ϕ(ω) < 1. Thus, to show 14.4 we must show that ϕ(ω) < 1 for all ω ∈ Λ. Suppose for the moment that is chosen such that ϕ is defined on its domain. We will show that ϕ is continuous. It suffices to show that ϕ is continuous on B(a, ) for a ∈ Λ. Note that, by the fact that B(a, ) is open it suffices to show that ϕ is convex. To see that ϕ is convex, let ω1 , ω2 ∈ B(a, ) and t ∈ (0, 1). Let also ξ, η ∈ H such that W ξ = ι∗∆ sgn(x) ⊕ −PΩc U (ω1 )P∆ ι∗∆ sgn(x), W η = ι∗∆ sgn(x) ⊕ −PΩc U (ω2 )P∆ ι∗∆ sgn(x). Note that the existence of such vectors is guaranteed by the assumption that ϕ is defined on its domain. Now, observe that ϕ(tω1 + (1 − t)ω2 ) ≤ kP∆c (tξ + (1 − t)η)kl∞ ≤ tkP∆c ξkl∞ + (1 − t)kP∆c ηkl∞ . 38 Thus, taking infimum on the right hand side yields ϕ(tω1 + (1 − t)ω2 ) ≤ tϕ(ω1 ) + (1 − t)ϕ(ω2 ), and we have shown the assertion that ϕ is convex. 
Returning to the question on the domain of ϕ, note that if (PΩ U P∆ )∗ PΩ U P∆ P∆ H is invertible, then PΩ U (ω)P∆ )∗ PΩ U (ω)P∆ P∆ H is invertible if kU (ω̃) − U (ω)k is small and ω̃ ∈ Λ. Letting ρ = U (ω)∗ PΩ U (ω)P∆ ((PΩ U (ω)P∆ )∗ PΩ U (ω)P∆ P∆ H )−1 sgn(x) we get PΩc U P∆c ρ = −PΩc U (ω)P∆ sgn(x). Thus, ϕ is defined on its domain for small . Let Γ denote the subset of all ω ∈ Λ such that x is the unique minimizer of inf{kηkl1 : PΩ U (ω)η = PΩ U (ω)x}. Note that Γ is closed. Indeed, if ω ∈ Γ and {ωn } ⊂ Γ is a sequence such that ωn → ω then ω ∈ Γ. To see that, observe that since {U, Ω, ∆} is weakly f stable, it follows that for ξ ∈ H satisfying kξkl1 = inf{kηkl1 : PΩ U (ω)η = PΩ U (ω)x} we have kξ − xkl1 ≤ f (kω − ωn kl∞ ), ∀ n ∈ N. Thus, ξ = x and hence ω ∈ Γ. Note also that Γ is open. Indeed, for if ω̃ ∈ Γ then there exist ρ ∈ H such that ρ satisfies 14.4 (with ω replaced by ω̃) e.g. ϕ(ω̃) < 1. But, by continuity of ϕ it follows that ϕ is strictly less than one on a neighborhood of ω̃. Since (PΩ U P∆ )∗ PΩ U P∆ P∆ H is invertible, then it is easy to see that PΩ U (ω)P∆ )∗ PΩ U (ω)P∆ P∆ H is invertible, for all ω ∈ Λ thus it follows by Proposition 9.5 that (14.4) is satisfied for all ω ∈ Λ in a neighborhood of ω̃ and hence Γ is open. The fact that Γ is open and closed yields that either Γ = ∅ or Γ = Λ. The fact that {1, . . . , 1} ∈ Γ by assumption yields the theorem. Proof of Proposition 10.3. Let Vω and Λ be defined as in (14.1) and (14.2) respectively. Suppose that y ∈ H such that supp(y) = ∆. Then, by assumption, Vω y is the unique minimizer of inf{kηkl1 : PΩ U η = PΩ U Vω y}. Thus, by Proposition 9.5 it follows that there exists a ρω ∈ H such that PΩc U ρω = 0, P∆ ρω = sgn(Vω y), kP∆c ρω kl∞ < 1. (14.5) Let β = supω∈Λ {kP∆c ρω kl∞ }. Note that β < 1, since Λ is closed. Thus, for every y ∈ H with supp(y) = ∆ there exists ρω ∈ H satisfying (14.5) where kP∆c ρω kl∞ ≤ β. 
It is now easy to show that (see the proof of Lemma 2.1 in [17]) there exists a constant C > 0 (depending on β) such that, if ξ ∈ H, supp(ξ) = ∆, is the unique minimizer of inf{kηkl1 : PΩ U η = PΩ U ξ}, ζ ∈ H and x is a minimizer of inf{kηkl1 : PΩ U η = PΩ U ζ} then kP∆c xkl1 ≤ Ckξ − ζkl1 . Thus, since PΩ U P∆ (x − ξ) = PΩ U (ζ − ξ) − PΩ U P∆c x, and (PΩ U P∆ )∗ PΩ U P∆ |P∆ H is invertible, the proposition follows. Proof of Proposition 11.1. Without loss of generality we may assume that kηk = 1. Let {δj }N j=1 be random Bernoulli variables with P(δj = 1) = q. We will split the proof into two steps, where we will prove the finite-dimensional part of the proposition in Step I, and then tweak these ideas to fit the infinite-dimensional part of the proposition in Step II. Step I: We start by noting that, clearly (by using the fact that U is an isometry), we have ⊥ q −1 PM P∆ EΩ P∆ η = q −1 N X ⊥ ∗ PM P∆ U δj (ej ⊗ ej )U P∆ η j=1 =q −1 N X (14.6) ⊥ ∗ PM P∆ U (δj − q)(ej ⊗ ej )U P∆ η + j=1 39 ⊥ ∗ ⊥ PM P ∆ U PN U P∆ η. Our goal is to eventually use Bernstein’s inequality and the following is therefore a setup for that. Define, for 1 ≤ j ≤ N the random variables ⊥ ∗ Yj = q −1 PM P∆ U (δj − q)(ej ⊗ ej )U P∆ η, Xji = hq −1 U ∗ (δj − q)(ej ⊗ ej )U P∆ η, ei i, i ∈ ∆c ∩ {1, . . . , M }. Thus, by (14.6) it follows that for s > 0 we have N X −1 ⊥ ∗ ⊥ ⊥ Yj + PM P∆ U PN U P ∆ η P q PM P∆ EΩ P∆ η l∞ > s = P >s ∞ j=1 l N X X i ⊥ ∗ ⊥ Xj + hPM P∆ U PN U P∆ η, ei i > s ≤ P j=1 i∈∆c ∩{1,...,M } N X X i ⊥ ∗ U PN U P∆ kmr , Xj > s − kPM P∆ ≤ P j=1 i∈∆c ∩{1,...,M } where we have used the fact that U is an isometry and hence ⊥ ∗ ⊥ ∗ ⊥ PM P ∆ U PN U P∆ = −PM P∆ U P N U P∆ . ⊥ ∗ Thus, by choosing s = t + kPM P∆ U PN U P∆ kmr it follows that ⊥ ∗ ⊥ U PN U P∆ kmr ≤ EΩ P∆ η l∞ > t + kPM P∆ P q −1 PM P∆ X N P Xji > t . (14.7) j=1 i∈∆c ∩{1,...,M } X To get a bound on the right hand side of (14.7) we will be using Bernstein’s inequality, and in order to do that we need a couple of observations. 
First note that E |Xji |2 = q −2 E |hU P∆ η, (δj − q)(ej ⊗ ej )U ei i|2 = q −2 E (δj − q)2 |hU P∆ η, ej ihU ei , ej i|2 = q −1 (1 − q)|hU P∆ η, ej ihU ei , ej i|2 , i ∈ ∆c ∩ {1, . . . , M }. Thus N X E |Xji |2 ≤ q −1 (1 − q)kηk2 υ 2 (U ) = q −1 (1 − q)υ 2 (U ), i ∈ ∆c ∩ {1, . . . , M }. (14.8) p |Xji | = q −1 |(δj − q)||hη, P∆ U ∗ (ej ⊗ ej )U ei i| ≤ max{(1 − q)/q, 1}υ 2 (U ) |∆|, (14.9) j=1 Also, observe that i for 1 ≤ j ≤ N and i ∈ ∆c ∩ {1, . . . , M }. Now applying Bernstein’s inequality to Re(X1i ), . . . , Re(XN ) i i and Im(X1 ), . . . , Im(XN ) we get that ! N 2 X i t /4 p √ P Xj > t ≤ 4 exp − , (14.10) q −1 (1 − q)υ 2 (U ) + max{q −1 (1 − q), 1}υ 2 (U ) |∆|t/3 2 j=1 for all i ∈ ∆c ∩ {1, . . . , M }. Thus, by invoking (14.10) and (14.7) it follows that ⊥ ⊥ ∗ P q −1 PM P∆ EΩ P∆ η l∞ > t + kPM P∆ U PN U P∆ kmr ≤ γ when q≥ ! √ 2 2p 4 4 c + |∆| log |∆ ∩ {1, . . . , M }| υ 2 (U ) t2 3t γ 40 and the first part of the proposition follows. The fact that the left hand side of (11.3) when q = 1 is clear from (14.8) and (14.9). Step II: To prove the second part of the proposition we will use the same ideas, however, we are now ⊥ ⊥ faced with the problem that P∆ EΩ P∆ η (contrary to PM P∆ EΩ P∆ η) actually has infinitely many com⊥ ponents. This is an obstacle since the proof of the bound on PM P∆ EΩ P∆ η was based on bounding the ⊥ probability of the deviation of every component of PM P∆ EΩ P∆ η and thus, if there are infinitely many components to take care of, the task would be impossible. To overcome this obstacle we proceed as follows. Note that, just as argued in the previous case in Step I, we have that ⊥ q −1 P∆ EΩ P∆ η = N X ⊥ ∗ ⊥ Yej + P∆ U PN U P∆ η, ⊥ ∗ Yej = q −1 P∆ U (δj − q)(ej ⊗ ej )U P∆ η. (14.11) j=1 Define (as we did above) the random variables Xji = hq −1 U ∗ (δj − q)(ej ⊗ ej )U P∆ η, ei i, i ∈ ∆c . 
Note that we now have infinitely many Xji s. However, suppose for a moment that for every t > 0 there exists a non-empty set Λt ⊂ N with |∆c \ Λt | < ∞ such that P( sup_{i∈Λt} | Σ_{j=1}^N Xji | > t ) = 0. (14.12) If that were the case, we would immediately get (by arguing as in Step I and using (14.11) and the assumption that kηk = 1) that P( kq −1 P∆⊥ EΩ P∆ ηkl∞ > t + kP∆⊥ U ∗ PN U P∆ kmr ) = P( k Σ_{j=1}^N Ỹj + P∆⊥ U ∗ PN U P∆ ηkl∞ > t + kP∆⊥ U ∗ PN U P∆ kmr ) ≤ Σ_{i∈∆c \Λt} P( | Σ_{j=1}^N Xji | > t ). Thus, we could use the analysis provided above, via (14.10), and deduce that P( kq −1 P∆⊥ EΩ P∆ ηkl∞ > t + kP∆⊥ U ∗ PN U P∆ kmr ) ≤ γ when q ≥ ( 4/t2 + (2√2/(3t)) √|∆| ) · log( 4 |∆c \ Λt | / γ ) · υ 2 (U ). (14.13) Hence, if we can show the existence of Λt and provide a bound on |∆c \ Λt |, we can appeal to (14.11) and (14.13) and be done. To do that, define Λt = { i ∈ / ∆ : k Σ_{j=1}^N P∆ U ∗ δj (ej ⊗ ej )U ei k ≤ tq }. Note that (ej ⊗ ej )U ei → 0 as i → ∞ for all j ≤ N . Thus, Λt 6= ∅. Moreover, we also automatically get that |∆c \ Λt | < ∞. Note also that (14.12) follows from the fact that Xji = hη, q −1 P∆ U ∗ δj (ej ⊗ ej )U ei i and the Cauchy–Schwarz inequality. With the existence of Λt established, we now continue with the task of estimating |∆c \ Λt |. Note that to estimate |∆c \ Λt | we need information about the location of ∆, which is not assumed; we only assume knowledge of some M ∈ N such that PM ≥ P∆ . Thus (although an estimate of |∆c \ Λt | would be sharper than what we will eventually obtain) we define Λ̃q (|∆|, M, t) = { i ∈ N : max_{Γ1 ⊂{1,...,M }, |Γ1 |=|∆|, Γ2 ⊂{1,...,N }} kPΓ1 U ∗ PΓ2 U ei k ≤ tq }. Note that it is straightforward to show that Λ̃q (|∆|, M, t) ⊂ Λt . Also, Λ̃q (|∆|, M, t) depends only on known quantities. Observe that, clearly, for any Γ1 ⊂ {1, . . . , M } and Γ2 ⊂ {1, . . . , N }, we have kPΓ1 U ∗ PΓ2 U ei k → 0 as i → ∞.
Thus, |∆c \ Λq (|∆|, M, t)| < ∞ and since Λq (|∆|, M, t) ⊂ Λt it follows that max kPΓ1 U ∗ PΓ2 U ei k > tq |∆c \ Λq (∆, t)| ≤ i ∈ N : Γ ⊂{1,...,M },|Γ |=|∆| 1 1 Γ ⊂{1,...,N } 2 and the second part of the proposition follows. The fact that the left hand side of (11.4) is zero when q = 1 is clear from (14.8) and (14.9). Proof of Proposition 11.2. Without loss of generality we may assume that kηk = 1. Let {δj }N j=1 be random Bernoulli variables with P(δj = 1) = q. Let also, for k ∈ N, ξk = (U P∆ )∗ ek . Observe that, since U is an isometry, N ∞ X X q −1 (PΩ U P∆ )∗ PΩ U P∆ = q −1 δk ξk ⊗ ξ¯k , P∆ = ξk ⊗ ξ¯k , (14.14) k=1 k=1 and 1 (PΩ U P∆ )∗ PΩ U P∆ − P∆ η q ! N X −1 ≤ (q δk − 1)ξk ⊗ ξ¯k η + k(P∆ U ∗ PN U P∆ − P∆ )ηk, (14.15) k=1 where the infinite series in (14.14) converges in operator norm.P Also, (14.15) follows directly from (14.14). N To get the desired result we first focus on getting bounds on k( k=1 (q −1 δk − 1)ξk ⊗ ξ¯k )ηk The goal is to use Talagrand’s formula, and the following is really a setup for that. In particular, let ζ ∈ H be a unit vector, and denote the mapping H 3 ξ 7→ Re(hξ, ζi) by ζ̂. Also, let F be a countable collection of unit vectors such that for any ξ ∈ H we have that kξk = supζ∈F ζ̂(ξ). Now define Z = kXk, X= N X Zk = ((q −1 δk − 1)ξk ⊗ ξ¯k )η. Zk , k=1 Observe that the following is clear (and note how this immediately gives us the setup for Talagrand’s Theorem) N ! ! N N X X X −1 Zk = sup ζ̂(Zk ). Z= (q δk − 1)ξk ⊗ ξ¯k η = sup ζ̂ ζ∈F ζ∈F k=1 k=1 To use Talagrand’s Theorem we must estimate the following quantities: ! N X 2 V = E sup ζ̂(Zk ) , S = sup kζ̂k∞ , R=E ζ∈F ζ∈F k=1 k=1 N ! X Zk . k=1 Note that for V we get the following estimate: ! N X X 2 E sup ζ̂(Zk )2 ≤ E sup q −1 δk − 1 |hξk , ζi|2 |hξk , ηi|2 ζ∈F ζ∈F k=1 ≤q −1 ≤q −1 k≤N (1 − q) X kξk k2 |hek , U P∆ ηi|2 k≤N (1 − q)υ 2 (U )|∆|, where we have used the fact that U is an isometry in the step going from the second to the third inequality. And S can be estimated as follows. 
Note that ζ̂(Zk ) = | q −1 δk − 1 |hξk , ζi||hξk , ηi| ≤ max{q −1 − 1, 1}υ 2 (U )|∆|, k ≤ N, (14.16) 42 thus S ≤ max{q −1 − 1, 1}υ 2 (U )|∆|, (14.17) where (14.17) is a direct consequence of (14.16). Finally, we can estimate R as follows 2 N N X X X = E Zk E(kZk k2 ) + E(hZk , Zj i) k=1 k=1 k6=j X ≤ q −1 (1 − q) kξk k2 |hek , U P∆ ηi|2 k≤N ≤q −1 (1 − q)υ 2 (U )|∆|, again using the fact that U is an isometry. Therefore, v 2 u u X u X p u ≤ Z Z E q −1 (1 − q)υ 2 (U )|∆|. E k ≤ k t k≤N k≤N With the estimates on V, S and R now established we may appeal to Theorem 14.2 and deduce that there is a constant K > 0 such that for θ > 0 it follows that ! ! N X p −1 P (q δk − 1)ξk ⊗ ξ¯k η ≥ θ + q −1 (1 − q)υ 2 (U )|∆| k=1 (14.18) θ θ max{q −1 − 1, 1} −1 2 −1 ≤ K exp − (max{q − 1, 1}υ (U )|∆|) log 1 + . K q −1 (1 − q) But by (14.15) it follows that for any r > 0, we have 1 ∗ P (PΩ U P∆ ) PΩ U P∆ − P∆ η ≥ r q ! ! N X −1 ∗ ¯ ≤P (q δk − 1)ξk ⊗ ξk η ≥ r − k(P∆ U PN U P∆ − P∆ )k . (14.19) k=1 Therefore, by appealing to (14.19) and (14.18) we obtain that for θ > 0 p 1 ∗ −1 (1 − q)υ 2 (U )|∆| + Ξ P (P U P ) P U P − P η ≥ θ + q Ω ∆ ∆ q Ω ∆ θ max{q −1 − 1, 1} θ , ≤ K exp − (max{q −1 − 1, 1}υ 2 (U )|∆|)−1 log 1 + K q −1 (1 − q) where Ξ = k(P∆ U ∗ PN U P∆ − P∆ )k. Choosing θ = t/2 yields the proposition. Proof of Theorem 11.3. The proof is quite similar to the proof of Proposition 11.2. Let {δj }N j=1 be random Bernoulli variables with P(δj = 1) = θ. Note that we may argue as in (14.14) and observe that −1 θ (PΩ U P∆ )∗ PΩ U P∆ − P∆ N X (14.20) −1 ≤ (θ δk − 1)ξk ⊗ ξ¯k + k(P∆ U ∗ PN U P∆ − P∆ )k , k=1 PN where ξk = (U P∆ )∗ ek . To get the desired result we first focus on getting bounds on k k=1 (θ−1 δk − 1)ξk ⊗ ξ¯k k. As in the proof of Proposition 11.2 the goal is to use Talagrand’s powerful inequality and the PN first step is to estimate E (kZk) , where Z = k=1 (θ−1 δk − 1)ξk ⊗ ξ¯k . Claim: We claim that E (kZk) ≤ 5 log(|∆|)θ−1 υ 2 (U )|∆|, (14.21) 43 when θ ≥ 3 log(|∆|)υ 2 (U )|∆|. 
To prove the claim we simply rework the techniques used in [38]. This is now standard and has also been used in [14, 40]. Since the setup here is a little different, we write the argument for the convenience of the reader. We start by observing that, letting δ̃ = {δ̃k }N k=1 be independent copies of δ = {δk }N k=1 , Eδ (kZk) = Eδ ( k Z − Eδ̃ ( Σ_{k=1}^N (θ−1 δ̃k − 1) ξk ⊗ ξ̄k ) k ) ≤ Eδ Eδ̃ ( k Z − Σ_{k=1}^N (θ−1 δ̃k − 1) ξk ⊗ ξ̄k k ), (14.22) by Jensen’s inequality. Let ε = {εj }N j=1 be a sequence of Bernoulli variables taking values ±1 with probability 1/2. Then, by (14.22), symmetry, Fubini’s Theorem and the triangle inequality, it follows that Eδ (kZk) ≤ Eε Eδ Eδ̃ ( k Σ_{k=1}^N εk (θ−1 δk − θ−1 δ̃k ) ξk ⊗ ξ̄k k ) ≤ 2 Eδ Eε ( k Σ_{k=1}^N εk θ−1 δk ξk ⊗ ξ̄k k ). (14.23) Note that the setup in (14.23) is now ready for the use of Rudelson’s Lemma (Lemma 14.1). However, as specified before, it is the complex version that is crucial here. Now, by Lemma 14.1 we get that Eε ( k Σ_{k=1}^N εk θ−1 δk ξk ⊗ ξ̄k k ) ≤ (3/2) √( log(|∆|) θ−1 ) max_{1≤k≤N} kξk k · ( k Σ_{k=1}^N θ−1 δk ξk ⊗ ξ̄k k )^{1/2} . (14.24) And hence, by using (14.23) and (14.24), it follows that Eδ (kZk) ≤ 3 √( log(|∆|) θ−1 υ 2 (U )|∆| ) · √( Eδ ( k Z + Σ_{k=1}^N ξk ⊗ ξ̄k k ) ). Thus, by using the easy calculus fact that, for r > 0, if r2 ≤ r + 1 then r < 5/3, and the fact that U is an isometry (so that k Σ_{k=1}^N ξk ⊗ ξ̄k k ≤ 1), it is easy to see that the claim follows. To be able to use Talagrand’s inequality there are some preparations that have to be done. First write Z = Σ_{k=1}^N Zk , Zk = (θ−1 δk − 1) ξk ⊗ ξ̄k . Clearly, since Z is self-adjoint, we have that kZk = supη∈F |hZη, ηi|, where F is a countable set of unit vectors. Let, for η ∈ F, the mapping B(H) ∋ T 7→ |hT η, ηi| be denoted by η̂. Then, for k = 1, . . . , N we have η̂(Zk ) = |θ−1 δk − 1| |h(ξk ⊗ ξ̄k )η, ηi| ≤ θ−1 kξk k2 . Thus, after restricting η̂ to the ball of radius θ−1 maxk≤N kξk k2 it follows that S = sup kη̂k∞ ≤ θ−1 max kξk k2 ≤ θ−1 υ 2 (U )|∆|.
k≤N η∈F (14.25) Also, note that V = E sup η̂∈F X η̂(Zk )2 ≤ E sup η̂∈F k≤N ≤ max kξk k2 θ−1 − 1 sup k≤N ≤ θ −1 η̂∈F X X 2 θ−1 δk − 1 |hξk , ηi|4 k≤N |hek , U P∆ ηi|2 k≤N − 1 max kξk k ≤ θ−1 − 1 υ 2 (U )|∆|, 2 k≤N 44 (14.26) where the third inequality follows from the fact that U is an isometry. It follows by Talagrand’s inequality (Theorem 14.2), by using the claim, (14.25) and (14.26), that there is a constant K > 0 such that for t > 0 N ! X −1 −1 2 P (θ δk − 1)ξk ⊗ ξ¯k ≥ t + 5 log(|∆|)θ υ (U )|∆| k=1 (14.27) t −1 2 tθ−1 −1 ≤ K exp − (θ υ (U )|∆|) log 1 + −1 . K (θ − 1) But by (14.20) it follows that for any r > 0, we have 1 ∗ P (PΩ U P∆ ) PΩ U P∆ − P∆ ≥ r θ ! N X ≤ P (θ−1 δk − 1)ξk ⊗ ξ¯k ≥ r − k(P∆ U ∗ PN U P∆ − P∆ )k . (14.28) k=1 Therefore, by appealing to (14.28) and (14.27) we obtain that for t > 0 1 ∗ −1 2 P (PΩ U P∆ ) PΩ U P∆ − P∆ ≥ t + 5 log(|∆|)θ υ (U )|∆| + Ξ θ tθ−1 t , Ξ = k(P∆ U ∗ PN U P∆ . − P∆ )k. ≤ K exp − (θ−1 υ 2 (U )|∆|)−1 log 1 + −1 K (θ − 1) Choosing t = 1 2γ yields the first part of the theorem. The last statement of the theorem is clear. References [1] B. Adcock and A. C. Hansen. A generalized sampling theorem for stable reconstructions in arbitrary bases. J. Fourier Anal. Appl. (accepted), 2011. [2] B. Adcock and A. C. Hansen. Reduced consistency sampling in Hilbert spaces. In Proceedings of the 9th International Conference on Sampling Theory and Applications, 2011. [3] B. Adcock and A. C. Hansen. Sharp bounds and optimality for generalised sampling in Hilbert spaces. Submitted, 2011. [4] B. Adcock and A. C. Hansen. Stable reconstructions in Hilbert spaces and the resolution of the Gibbs phenomenon. Appl. Comput. Harmon. Anal. (to appear), 2011. [5] B. Adcock, A. C. Hansen, E. Herrholz, and G. Teschke. Generalized sampling: extension to frames, ill-posed problems, and infinite-dimensional compressed sensing. Submitted, 2011. [6] A. Aldroubi. Oblique projections in atomic spaces. Proc. Amer. Math. Soc., 124(7):2051–2060, 1996. [7] W. Arveson. 
C*-algebras and numerical linear algebra. J. Funct. Anal., 122(2):333–360, 1994.
[8] T. Blu, P. L. Dragotti, M. Vetterli, P. Marziliano, and L. Coulot. Sparse sampling of signal innovations. IEEE Signal Process. Mag., 25(2):31–40, 2008.
[9] A. Böttcher. Infinite matrices and projection methods. In Lectures on Operator Theory and its Applications (Waterloo, ON, 1994), volume 3 of Fields Inst. Monogr., pages 1–72. Amer. Math. Soc., Providence, RI, 1996.
[10] A. Böttcher and B. Silbermann. Introduction to Large Truncated Toeplitz Matrices. Universitext. Springer-Verlag, New York, 1999.
[11] S. C. Brenner and L. R. Scott. The Mathematical Theory of Finite Element Methods. Springer, 2nd edition, 2005.
[12] E. J. Candès. An introduction to compressive sensing. IEEE Signal Process. Mag., 25(2):21–30, 2008.
[13] E. J. Candès and Y. Plan. A probabilistic and RIPless theory of compressed sensing. Submitted, 2010.
[14] E. J. Candès and J. Romberg. Sparsity and incoherence in compressive sampling. Inverse Problems, 23(3):969–985, 2007.
[15] E. J. Candès, J. Romberg, and T. Tao. Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Inform. Theory, 52(2):489–509, 2006.
[16] E. J. Candès, J. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Comm. Pure Appl. Math., 59(8):1207–1223, 2006.
[17] E. J. Candès and T. Tao. Near-optimal signal recovery from random projections: universal encoding strategies? IEEE Trans. Inform. Theory, 52(12):5406–5425, 2006.
[18] P. G. Ciarlet. The Finite Element Method for Elliptic Problems. SIAM, 2002.
[19] A. Cohen, W. Dahmen, and R. DeVore. Compressed sensing and best k-term approximation. J. Amer. Math. Soc., 22(1):211–231, 2009.
[20] R. Coifman, F. Geshwind, and Y. Meyer. Noiselets. Appl. Comput. Harmon. Anal., 10(1):27–44, 2001.
[21] D. L. Donoho. Compressed sensing. IEEE Trans. Inform. Theory, 52(4):1289–1306, 2006.
[22] D. L. Donoho and J. Tanner. Counting faces of randomly projected polytopes when the projection radically lowers dimension. J. Amer. Math. Soc., 22(1):1–53, 2009.
[23] P. L. Dragotti, M. Vetterli, and T. Blu. Sampling moments and reconstructing signals of finite rate of innovation: Shannon meets Strang–Fix. IEEE Trans. Signal Process., 55(5):1741–1757, 2007.
[24] I. Ekeland and T. Turnbull. Infinite-Dimensional Optimization and Convexity. Chicago Lectures in Mathematics. University of Chicago Press, Chicago, IL, 1983.
[25] Y. C. Eldar. Sampling with arbitrary sampling and reconstruction spaces and oblique dual frame vectors. J. Fourier Anal. Appl., 9(1):77–96, 2003.
[26] Y. C. Eldar and T. Michaeli. Beyond bandlimited sampling. IEEE Signal Process. Mag., 26(3):48–68, 2009.
[27] D. Gross. Recovering low-rank matrices from few coefficients in any basis. Submitted, 2010.
[28] A. C. Hansen. On the approximation of spectra of linear operators on Hilbert spaces. J. Funct. Anal., 254(8):2092–2126, 2008.
[29] A. C. Hansen. On the solvability complexity index, the n-pseudospectrum and approximations of spectra of operators. J. Amer. Math. Soc., 24(1):81–124, 2011.
[30] E. Herrholz and G. Teschke. Compressive sensing principles and iterative sparse recovery for inverse and ill-posed problems. Inverse Problems, 26(12):125012, 2010.
[31] A. J. Jerri. The Shannon sampling theorem – its various extensions and applications: a tutorial review. Proc. IEEE, 65:1565–1596, 1977.
[32] T. W. Körner. Fourier Analysis. Cambridge University Press, 1988.
[33] M. Lustig, D. L. Donoho, J. M. Santos, and J. M. Pauly. Compressed sensing MRI. IEEE Signal Process. Mag., 25(2):72–82, 2008.
[34] C. McDiarmid. Concentration. In Probabilistic Methods for Algorithmic Discrete Mathematics, volume 16 of Algorithms Combin., pages 195–248. Springer, Berlin, 1998.
[35] S. K. Mitter. Convex optimization in infinite dimensional spaces. In Recent Advances in Learning and Control, volume 371 of Lecture Notes in Control and Inform. Sci., pages 161–179. Springer, London, 2008.
[36] A. Quarteroni and A. Valli. Numerical Approximation of Partial Differential Equations. Springer-Verlag, 1994.
[37] H. Rauhut and R. Ward. Sparse Legendre expansions via ℓ1-minimization. Submitted, 2010.
[38] M. Rudelson. Random vectors in the isotropic position. J. Funct. Anal., 164(1):60–72, 1999.
[39] M. Talagrand. New concentration inequalities in product spaces. Invent. Math., 126(3):505–563, 1996.
[40] J. A. Tropp. On the conditioning of random subdictionaries. Appl. Comput. Harmon. Anal., 25(1):1–24, 2008.
[41] M. Unser. Sampling – 50 years after Shannon. Proc. IEEE, 88(4):569–587, 2000.
[42] M. Unser and A. Aldroubi. A general sampling theory for nonideal acquisition devices. IEEE Trans. Signal Process., 42(11):2915–2925, 1994.
[43] M. Vetterli, P. Marziliano, and T. Blu. Sampling signals with finite rate of innovation. IEEE Trans. Signal Process., 50(6):1417–1428, 2002.
