# On Sketching Matrix Norms and the Top Singular Vector

```On Sketching Matrix Norms and the Top Singular Vector
Yi Li
University of Michigan, Ann Arbor
leeyi@umich.edu
Huy L. Nguyễn
Princeton University
hlnguyen@princeton.edu
David P. Woodruff
dpwoodru@us.ibm.com
Abstract
1
Sketching is a prominent algorithmic tool for processing
large data. In this paper, we study the problem of sketching
matrix norms. We consider two sketching models. The first
is bilinear sketching, in which there is a distribution over
pairs of r × n matrices S and n × s matrices T such that for
any fixed n × n matrix A, from S · A · T one can approximate
kAkp up to an approximation factor α ≥ 1 with constant
probability, where kAkp is a matrix norm. The second is
general linear sketching, in which there is a distribution over
2
linear maps L : Rn → Rk , such that for any fixed n × n
2
matrix A, interpreting it as a vector in Rn , from L(A) one
can approximate kAkp up to a factor α.
We study some of the most frequently occurring matrix
norms, which correspond to Schatten p-norms for p ∈
{0, 1, 2, ∞}. The p-th Schatten
Pr norm of a rank-r matrix
A is defined to be kAkp = ( i=1 σip )1/p , where σ1 , . . . , σr
are the singular values of A. When p = 0, kAk0 is defined to
be the rank of A. The cases p = 1, 2, and ∞ correspond to
the trace, Frobenius, and operator norms, respectively. For
bilinear sketches we show:
Sketching is an algorithmic tool for handling big data. A
linear skech is a distribution over k × n matrices S, k n, so that for any fixed vector v, one can approximate a
function f (v) from Sv with high probability. The goal
is to minimize the dimension k.
The sketch Sv thus provides a compression of v, and
is useful for compressed sensing and for achieving lowcommunication protocols in distributed models. One
can perform complex procedures on the sketch which
would be too expensive to perform on v itself, and this
has led to the fastest known approximation algorithms
for fundamental problems in numerical linear algebra
and nearest neighbor search. Sketching is also the only
technique known for processing data in the turnstile
model of data streams [22, 38], in which there is an
underlying vector v initialized to 0n which undergoes a
long sequence of additive positive and negative updates
of the form vi ← vi + ∆ to its coordinates vi . Given
an update of the form vi ← vi + ∆, resulting in a new
vector v 0 = v + ∆ · ei , we can add S(∆ · ei ) to Sv to
obtain Lv 0 (here ei is the i-th standard unit vector).
A well-studied problem in the sketching literature is approximating the frequency moments Fp (v),
which
Pn is equivalent to estimating the p-norms kvkp =
( i=1 |vi |p )1/p , for p ∈ [0, ∞]. This problem was introduced by Alon, Matias, and Szegedy [1], and many
insights in information complexity [6, 9] and sketching
[1, 3, 7, 24] were made on the path to obtaining optimal
bounds. In addition to being of theoretical interest, the
problems have several applications. The value kvk0 , by
continuity, is equal to the support size of x, also known
as the number of distinct elements [16, 17]. The norm
kvk1 is the Manhattan norm, which is a robust measure
of distance and is proportional to the variation distance
between distributions [19, 21, 28]. The Euclidean distance kvk2 is important in linear algebra problems [44],
and corresponds to the self-join size in databases [1].
1. For p = ∞ any sketch must have r · s = Ω(n2 /α4 )
dimensions. This matches an upper bound of Andoni
and Nguyen (SODA, 2013), and implies one cannot
approximate the top right singular vector v of A by
a vector v 0 with kv 0 − vk2 ≤ 21 with r · s = õ(n2 ).
2. For p ∈ {0, 1} and constant α, any sketch must have
r · s ≥ n1− dimensions, for arbitrarily small constant
> 0.
3. For even integers p ≥ 2, we give a sketch with r ·
s = O(n2−4/p −2 ) dimensions for obtaining a (1 + )approximation. This is optimal up to logarithmic
factors, and is the first general subquadratic upper
bound for sketching the Schatten norms.
For general linear sketches our results, though not optimal,
are qualitatively similar, showing that for p = ∞, k =
√
Ω(n3/2 /α4 ) and for p ∈ {0, 1}, k = Ω( n). These give
separations in the sketching complexity of Schatten-p norms
with the corresponding vector p-norms, and rule out a table
lookup nearest-neighbor search for p = 1, making progress
on a question of Andoni.
Introduction
Often one wishes to find or approximate the largest coordinates of x, known as the heavy hitters [10, 13], and
kvk∞ is defined, by continuity, to equal maxi |vi |.
In this paper, we are interested in the analogous
problem of sketching matrix norms. We study some
of the most frequently occurring matrix norms, which
correspond to Schatten p-norms for p ∈ {0, 1, 2, ∞}.
The p-th Schatten norm P
of an n × n rank-r matrix A
r
is defined to be kAkp = ( i=1 σip )1/p , where σ1 , . . . , σr
are the singular values of A. When p = 0, kAk0 is
defined to be the rank of A. The cases p = 1, 2, and ∞
correspond to the trace norm, the Frobenius norm, and
the operator norm, respectively. These problems have
found applications in several areas; we refer the reader
to [11] for graph applications for p = 0, to differential
privacy [20, 33] and non-convex optimization [8, 15] for
p = 1, and to the survey on numerical linear algebra for
p ∈ {2, ∞} [35].
In nearest neighbor search (NNS), one technique often used is to first replace each of the input objects
(points, images, matrices, etc.) with a small sketch,
then build a lookup table for all possible sketches to
support fast query time. In his talks at the Barriers
in Computational Complexity II workshop and Mathematical Foundations of Computer Science conference,
Alexandr Andoni states that a goal would be to design
a NNS data structure for the Schatten norms, e.g., the
trace or Schatten 1-norm (slide 31 of [2]). If a sketch
for a norm has small size, then building a table lookup
is feasible.
Sketching Model. We give the first formal study
of the sketching complexity of the Schatten-p norms.
A first question is what does it mean to have a linear
sketch of a matrix, instead of a vector as is typically
studied. We consider two sketching models.
The first model we consider is bilinear sketching, in
which there is a distribution over pairs of r × n matrices
S and n × s matrices T such that for any fixed n × n
matrix A, from S · A · T one can approximate kAkp
up to an approximation factor α ≥ 1 with constant
probability, where kAkp is a matrix norm. The goal
is to minimize r · s. This model has been used in several
streaming papers for sketching matrices [4, 14, 23], and
as far as we are aware, all known sketches in numerical
linear algebra applications have this form. It also has
the advantage that SAT can be computed quickly if S
and T have fast matrix multiplication algorithms.
The second model is more general, which we dub
general linear sketching, and interprets the n × n matrix
2
A as a vector in Rn . The goal is then to design a
2
distribution over linear maps L : Rn → Rk , such that
for any fixed n × n matrix A, interpreting it as a vector
2
in Rn , from L(A) one can approximate kAkp up to
a factor α with constant probability. The goal is to
minimize k.
Previous Results. Somewhat surprisingly the
only known o(n2 ) upper bound for either model is
for p = 2, in which case one can achieve a bilinear
sketch with r · s = O(1) [23]. Moreover, the only
lower bounds known were those for estimating the pnorm of a vector v, obtained for p > 2 by setting
A = diag(v) and are of the form k = Ω(n1−2/p log n)
[5, 34, 40]. We note that the bit complexity lower
bounds of [6, 9, 43] do not apply, since a single linear
sketch (1, 1/M, 1/M 2 , 1/M 3 , . . . , 1/M n−1 )T v is enough
to recover v if its entries are bounded by M . The sketching model thus gives a meaningful measure of complexity in the real RAM model.
Thus, it was not even known if a sketching dimension of r · s = O(1) was sufficient for bilinear sketches
to obtain a constant-factor approximation to the rank
or Schatten 1-norm, or if k = Ω(n2 ) was required for
general linear sketches.
Our Results. We summarize our results for the
two sketching models in Table 1. We note that, prior
to our work, for all p ∈
/ {2, ∞}, all upper bounds in
the table were a trivial O(n2 ) while all lower bounds for
p ≤ 2 were a trivial Ω(1), while for p > 2 they were a
weaker Ω(n1−2/p log n).
For the bilinear sketching model, we have the
following results. For the spectral norm, p = ∞, we
prove an Ω(n2 /α4 ) bound for achieving a factor αapproximation with constant probability, matching an
upper bound achievable by an algorithm of [4]. This
generalizes to Schatten-p norms for p > 2, for which we
prove an Ω(n2−4/p ) lower bound, and give a matching
O(n2−4/p ) upper bound for even integers p. For odd
integers p we are only able to achieve this upper bound
if we additionally assume that A is positive semi-definite
(PSD). For the rank, p = 0, we prove an Ω(n2 ) lower
bound, showing that no non-trivial sketching is possible.
Finally for p = 1, we prove an n1−ε lower bound for
arbitrarily small constant ε > 0. Note that our bounds
are optimal in several cases, e.g., for p = ∞, for even
integers p > 2, and for p = 0.
For the general sketching model, we show the following. For the spectral norm, our bound is Ω(n3/2 /α3 ),
which although is a bit weaker than the bilinear case,
is super-linear in n. It implies, for example, that any
general sketching algorithm for approximating the top
right (or left) singular vector v of A by a vector v 0 with
kv − v 0 k2 ≤ 21 requires k = Ω(n3/2 ). Indeed, otherwise, with a sketch of O(n) dimensions one could store
gA, for a random Gaussian vector g, and together with
v 0 obtain a constant approximation to kAk∞ , which
our lower bound rules out. This bound naturally gen-
Schatten p-norm
p=∞
p>2
p=0
p=1
Bilinear sketches
Lower Bound
Upper Bound
Ω(n2 /α4 )
O(n2 /α4 ) [4]
2−4/p
Ω(n
)
O(n2−4/p ) even p
2
Ω(n )
O(n2 )
1−ε
n
for any ε > 0
O(n2 )
General Sketches
Lower Bound
Upper Bound
Ω(n3/2 /α3 )
O(n2 /α4 ) [4]
(3/2)(1−2/p)
Ω(n √
) O(n2−4/p ) even p
Ω(√n)
O(n2 )
Ω( n)
O(n2 )
Table 1: Our results for approximating the Schatten-p norm up to a constant factor (except for the p = ∞ case)
with constant probability in both the bilinear sketching and general sketching models. For the bilinear case, we
look at the minimal r · s value, while for general linear sketches we look at the minimal value of k. For p = ∞,
α ≥ 1 is the desired approximation factor.
eralizes to an Ω(n(3/2)(1−2/p) ) bound for
√ p > 2. For
p ∈ {0, 1} our bound is now a weaker Ω( n). However,
it is the first super-constant lower bound for rank and
Schatten-1 norm, which in particular rules out a naïve
table lookup solution to the NNS problem, addressing
a question of Andoni.
Our Techniques for Bilinear Sketches. A standard technique in proving lower bounds is Yao’s minimax principle which implies if there exists a distribution
on sketches that succeeds on all n × n inputs matrices
A with large probability, then for any distribution L on
inputs A, there is a fixed pair S and T of r × n and
n × s matrices, respectively, which succeeds with large
probability over A ∼ L. Moreover, we can assume the
rows of S are orthonormal, as well as the columns of T .
This is because, given SAT , we can compute U SAT V ,
where U and V are arbitrary invertible r × r and s × s
matrices, respectively. Thus, it suffices to give two distributions L1 and L2 on A for which the kAkp values
differ by a factor α w.h.p. in the two distributions, but
for any matrix S with orthonormal rows and T with
orthonormal columns, the induced distributions L01 and
L02 on SAT , when A ∼ L1 and A ∼ L2 , respectively,
have low total variation distance dT V (L01 , L02 ).
Since S has orthonormal rows and T has orthonormal columns, then if L1 and L2 are rotationally invariant distributions, then SAT is equal in distribution to
an r × s submatrix√of A. This observation already suffices to get an Ω( n) bound on r · s for all Schatten
p-norms for a fixed constant p 6= 2, √using a √
result of
Jiang [26] which shows that square o( n) × o( n) submatrices of an n × n matrix of i.i.d. Gaussians and an
n × n orthonormal matrix have o(1) variation distance.
Note that both distributions are rotationally invariant
and by the Marčenko-Pastur Law have constant factor
difference in Schatten-p norm w.h.p. We slightly generalize Jiang’s proof to show the variation distance is o(1)
for any r × s submatrix of these distributions provided
r · s < n1−ε for arbitrarily small ε > 0.
For our Ω(n2 ) bound for p = 0 and Ω(n2−4/p )
bound for p > 2, we propose the following rotationallyinvariant distributions with constant factor gap in
Schatten norm:
• For p = 0, L1 = U V T for n × n/2 i.i.d. Gaussian
U and V , while L2 = U V T + γG for the same U
and V and G an n × n i.i.d. Gaussian matrix with
variance γ ≤ 1/poly(n).
• For p > 2, L1 = G for an n × n i.i.d. Gaussian
1
uv T for the same
matrix, while L2 = G + n1/2−1/p
G and random n-dimensional Gaussian vectors u
and v.
The major difficulty is bounding dT V (L01 , L02 ). For p > 2
this amounts to distinguishing g from g + h, where g
is an r × s matrix of Gaussians and h is a random
r × s matrix with (i, j)-th entry equal to ui vj . The fact
that h is random makes the probability density function
of g + h intractable. Moreover, for each fixed h, the
variation distance of g and g+h is much larger than for a
random h, and the best lower bound we could obtain by
fixing h is Ω(n1−2/p ) which would just match the vector
lower bound (up to a log n factor). Instead of bounding
dT V (L01 , L02 ) directly, we bound the χ2 -divergence of L01
and L02 , which if o(1) implies dT V (L01 , L02 ) = o(1). This
idea was previously used in the context of sketching pnorms of vectors [5], improving the previous Ω(n1−2/p )
bound to Ω(n1−2/p log n). Surprisingly, for the Schatten
p-norm, it improves a simple Ω(n1−2/p ) bound by a
quadratic factor to Ω(n2−4/p ), which can similarly be
used to show an Ω(n2 /α4 ) bound for α-approximating
the spectral norm. One caveat is that if we were to
directly compute the χ2 -divergence between L01 and L02 ,
it would be infinite once r · s ≥ n1−2/p . We fix this by
conditioning L01 and L02 on a constant probability event,
resulting in distributions Le01 and Le02 for which the χ2 divergence is small.
For p = 0, the problem amounts to distinguishing
an r × c submatrix Q of U V T from an r × s submatrix
of U V T + γG. Working directly with the density
function of U V T is intractable. We instead provide
an algorithmic proof to bound the variation distance.
See Theorem 3.5 for details. The proof also works for
arbitrary Q of size O(n2 ), implying a lower bound of
Ω(n2 ) to decide if an n × n matrix is of rank at most
n/2 or ε-far from rank n/2 (for constant ε), showing an
algorithm of Krauthgamer and Sasson is optimal [29].
Our Algorithm. Due to these negative results,
a natural question is whether non-trivial sketching
is possible for any Schatten p-norm, other than the
Frobenius norm. To show this is possible, given an
n × n matrix A, we left multiply by an n × n matrix
G of i.i.d. Gaussians and right multiply by an n × n
matrix H of i.i.d. Gaussians, resulting in a matrix A0
of the form G0 ΣH 0 , where G0 , H 0 are i.i.d. Gaussian
and Σ is diagonal with the singular values of A on
the diagonal. We then look at cyclesPin a submatrix
n
0
of G0 ΣH 0 . The (i, j)-th entry of A0 is `=1 σ` G0i,` H`,j
.
Interestingly, for even p, for any distinct i1 , . . . , ip/2 and
distinct j1 , . . . , jp/2 ,
E[(A0i1 ,j1 A0i2 ,j1 ) · (A0i2 ,j2 A0i3 ,j2 ) · · · (A0ip/2 ,jp/2 A0i1 ,jp/2 )]
= kAkpp .
The row indices of A0 read from left to right form a
cycle (i1 , i2 , i2 , i3 , i3 , , ..., ip/2 , ip/2 , i1 ), which since also
each column index occurs twice, results in an unbiased
estimator. We need to average over many cycles to
reduce the variance, and one way to obtain these is to
store a submatrix of A0 and average over all cycles in it.
While some of the cycles are dependent, their covariance
is small, and we show that storing an n1−2/p × n1−2/p
submatrix of A0 suffices.
Our Techniques for General Linear Sketches:
We follow the same framework for bilinear sketches.
The crucial difference is that since the input A is now
viewed as an n2 -dimensional vector, we are not able to
design two rotationally invariant distributions L1 and
2
L2 , since unlike rotating A in Rn , rotating A in Rn
does not preserve its Schatten p-norm. Fortunately, for
both of our lower bounds (p > 2) and (p ∈ {0, 1}) we
can choose L1 and L2 so that L01 is the distribution
of a k-dimensional vector g of i.i.d. Gaussians. For
p > 2, L02 is the distribution of g + h, where g is as
1
(uT L1 v, . . . , uT Lk v) for random
before but h = n1/2−1/p
n-dimensional Gaussian u and v and where Li is the
i-th row of the the sketching matrix L, viewed as an
n × n matrix. We again use the χ2 -divergence to
bound dT V (L01 , L02 ) with appropriate conditioning. In
this case the problem reduces to finding tail bounds
for Gaussian
chaoses of degree 4, namely, sums of the
P
form a,b,c,d Aa,b,c,d ua u0b vc vd0 for a 4th-order array P of
n4 coefficients and independent n-dimensional Gaussian
vectors u, u0 , v, v 0 . We use a tail bound of Latała
[30], which generalizes the more familiar Hanson-Wright
inequality for second order arrays P .
For p ∈ {0, 1} we look at distinguishing an n × n
Gaussian matrix G from a matrix (G0 , G0 M ), where G0
is an n × n/2 Gaussian random matrix and M is a random n/2 × n/2 orthogonal matrix. For all constants
p 6= 2, the Schatten p-norms differ by a constant factor
in the two cases. Applying our sketching matrix L, we
have L01 distributed as N (0, Ik ), but L02 is the distribution of (Z1 , . . . , Zk ), where Zi = hAi , G0 i + hB i , G0 M i
and each Li is written as the adjoined matrix (Ai , B i )
for n × n/2 dimensional matrices Ai and B i . For each
fixed O, we can view Z as a k-dimensional Gaussian
vector formed from linear combinations of entries of G0 .
Thus the problem amounts to bounding the variation
distance between two zero-mean k-dimensional Gaussian vectors with different covariance matrices. For L01
the covariance matrix is the identity Ik , while for L02 it
is Ik +P for some perturbation matrix P . We show that
with constant probability over M , the Frobenius
norm
√
kP kF is small enough to give us an k = Ω( n) bound,
and so it suffices to fix M with this property. One may
worry that fixing M reduces the variation distance—
√
in this case one can show that with k = O( n), distributions L01 and L02 already have constant variation
distance.
We believe our work raises a number of intriguing
open questions.
Open Question 1: Is it possible that for every odd
integer p < ∞, the Schatten-p norm requires k = Ω(n2 )?
Interestingly, odd and even p behave very differently
since for even p, we have kAkp = kA2 kp/2 , where A2
is PSD. Note that estimating Schatten norms of PSD
matrices A can be much easier: in the extreme case of
p = 1 the Schatten norm kAk1 is equal to the trace of
A, which
√ can be computed with k = 1, while we show
k = Ω( n) for estimating kAk1 for non-PSD A.
Open Question 2: For general linear sketches our
lower bound for the operator norm is Ω(n3/2 /α3 ) for
α-approximation. Can this be improved to Ω(n2 /α4 ),
which would match our lower bound for bilinear sketches
and the upper bound of [4]? Using the tightness of
Latała’s bounds for Gaussian chaoses, this would either
require a new conditioning of distributions L01 and L02 ,
or bounding the variation distance without using the
χ2 -divergence.
2
Preliminaries
Notation. Let Rn×d be the set of n × d real matrices
and On the orthogonal group of degree n (i.e., the set
of n × n orthogonal matrices). Let N (µ, Σ) denote
the (multi-variate) normal distribution of mean µ and
covariance matrix Σ. We write X ∼ D for a random
variable X subject to a probability distribution D. We
also use On to denote the uniform distribution over the
orthogonal group of order n (i.e., endowed with the
normalized Haar measure). We denote by G(m, n) the
ensemble of random matrices with entries i.i.d. N (0, 1).
For two n × n matrices
P X and Y , we define hX, Y i
as hX, Y i = tr(X T Y ) = i,j Xij Yij , i.e., the entrywise
inner product of X and Y .
Singular values and Schatten norms. Consider
a matrix A ∈ Rn×d . Then AT A is√a positive semidefinite matrix. The eigenvalues of AT A are called
the singular values of A, denoted by σ1 (A) ≥ σ2 (A) ≥
· · · ≥ σd (A) in decreasing order. Let r = rank(A). It
is clear that σr+1 (A) = · · · = σd (A) = 0. The matrix
A also has the following singular value decomposition
(SVD) A = U ΣV T , where U ∈ On , V ∈ Od and Σ is an
n × d diagonal matrix with diagonal entries σ1 (A), . . . ,
σmin{n,d} (A). Define
dT V (µ, ν) ≤
p
χ2 (µ||ν).
Proposition 2.2. ([25, p97]) It
holds
that
0
χ2 (N (0, In ) ∗ µ||N (0, In )) ≤ Eehx,x i − 1, where
x, x0 ∼ µ are independent.
In the case of n = 1, if F (x) and G(x) are the cumulative distribution functions of µ and ν, respectively,
the Kolmogorov distance is defined as
dK (µ, ν) = sup |F (x) − G(x)|.
x
It follows easily that for continuous and bounded f ,
Z
Z
(2.1)
f dµ − f dν ≤ kf k∞ · dK (µ, ν).
If both µ and ν are compactly supported, it suffices
to have f continuous and bounded on the union of the
supports of µ and ν.
1
!
Hanson-Wright Inequality. Suppose that µ is a
p
r
X
distribution
over R. We say µ is subgaussian if there
p
, p>0
kAkp =
(σi (A))
exists
a
constant
c > 0 such that Prx∼µ {|x| > t} ≤
i=1
1−ct2
e
for
all
t
≥
0.
then kAkp is a norm over Rn×d , called the p-th Schatten
The following form of the Hanson-Wright inequality
norm, over Rn×d for p ≥ 1. When p = 1, it is also called on the tail bound of a quadratic form is contained in
the trace norm or Ky-Fan norm. When p = 2, it is the proof of [41, Theorem 1.1] due to Rudelson and
exactly the Frobenius norm kAkF , recalling that σi (A)2 Vershynin.
are the eigenvalues of AT A and thus kAk2F = tr(AT A).
Let kAk denote the operator norm of A when treating Theorem 2.1. (Hanson-Wright inequality) Let
A as a linear operator from `d2 to `n2 . Besides, it holds u = (u1 , . . . , un ), v = (v1 , . . . , vn ) ∈ Rn be random
that limp→∞ kAkp = σ1 (A) = kAk and limp→0+ kAkp = vectors such that u1 , . . . , un , v1 , . . . , vn are i.i.d. symrank(A). We define kAk∞ and kAk0 accordingly in this metric subgaussian variables. Let A ∈ Rn×n be a fixed
limit sense.
matrix, then
Finally note that A and AT have the same non-zero
t
t2
T
T
singular values, so kAkp = kAT kp for all p.
,
Pr{|u Av−Eu Av| > t} ≤ 2 exp −c min
kAk kAk2F
Distance between probability measures. Suppose µ and ν are two probability measures over some for some constant c > 0, which depends only on the
Borel algebra B on Rn such that µ is absolutely contin- constant of the subgaussian distribution.
uous with respect to ν. For a convex function φ : R → R
such that φ(1) = 0, we define the φ-divergence
Latała’s Tail Bound. Suppose that gi1 , . . . , gid
Z are i.i.d. N (0, 1) random variables. The following redµ
Dφ (µ||ν) = φ
dν.
sult, duePto Latała [30], bounds the tails of Gaussian
dν
chaoses
ai1 · · · aid gi1 · · · gid . The Hanson-Wright inIn general Dφ (µ||ν) is not a distance because it is not equality above, when restricted to Gaussian random
variables, is a special case (d = 2) of this tail bound.
symmetric.
The total variation distance between µ and ν, The proof of Latała’s tail bound was later simplified by
denoted by dT V (µ, ν), is defined as Dφ (µ||ν) for φ(x) = Lehec [32].
Suppose that A = (ai )1≤i1 ,...,id ≤n is a finite multi|x − 1|. It can be verified that this is indeed a distance.
2
indexed
matrix of order d. For i ∈ [n]d and I ⊆ [d],
The χ -divergence between µ and ν, denoted by
2
2
define
i
For disjoint nonempty subsets
χ (µ||ν), is defined as Dφ (µ||ν) for φ(x) = (x − 1) or
I = (ij )j∈I .
2
I
,
.
.
.
,
I
⊆
[d]
define
φ(x) = x − 1. It can be verified that these two choices 1
k
of φ give exactly the same value of Dφ (µ||ν).
(
X
(1)
(k)
kAkI1 ,...,Ik = sup
ai xiI · · · xiI :
1
k
Proposition 2.1. ([45, p90]) It
holds
that
i
X
iI1
(1)
xiI
1
2
≤ 1, . . . ,
X
iIk
(1)
xiI
k
2


≤1 .

Also denote by S(k, d) the set of all partitions of
{1, . . . , d} into k nonempty disjoint sets I1 , . . . , Ik . It
is easy to see that if a partition {I1 , . . . , Ik } is finer
than another partition {J1 , . . . , J` }, then kAkI1 ,...,Ik ≤
kAkJ1 ,...,J` .
Theorem 2.2. For any t > 0 and d ≥ 2,


d

X Y
(j)
ai
Pr gij ≥ t


j=1
i
(
k2 )
t
,
≤ Cd exp −cd min
min
1≤k≤d (I1 ,...,Ik )∈S(k,d) kAkI1 ,...,Ik
where Cd , cd > 0 are constants depending only on d.
Distribution of Singular Values We shall need
the following two lemmata.
Theorem 3.1. Suppose that p > 2, ζ ∈ (0, 1) and
rn , sn ≥ 4. Whenever rn sn ≤ cζ 2 n2(1−2/p) , it holds that
dT V (L1 , L2 ) ≤ ζ + 0.009, where c > 0 is an absolute
constant.
Proof. For simplicity we write rn and sn as r and
s, respectively. The distribution L1 is identical to
N (0, Irs ), and the distribution L2 is a Gaussian mixture
N (z, Irs ) with shifted mean z ∈ Rrs , where
zij = 5n−1/2+1/p ui vj
1 ≤ i ≤ r, 1 ≤ j ≤ s.
For a given t > 0, define the truncated Gaussian
distribution, denoted by Nt (0, In ), as the marginal
distribution of N (0, In ) conditioned on the event that
2
2
kxk ≤ t, where x ∼ N (0, In ). Since
√ kxk ∼ χ (n),
the probability pn := Pr{kxk > 2 n} < 0.004 by
evaluating an integral of the p.d.f. of χ2 (n) distribution.
Consider an auxiliary random vector z̃ ∈ Rrs defined as
z̃ij = 5n−1/2+1/p ũi ṽj
1 ≤ i ≤ r, 1 ≤ j ≤ s,
where (ũ1 , . . . , ũr ) is drawn from N2√r (0, Ir ) and
(ṽ1 , . . . , ṽs ) is drawn from N2√s (0, Is ). Define an auxiliary distribution Le2 as N (z̃, Irs ). It is not difficult to
Lemma 2.1. (Marčenko-Pastur Law [36, 18])
Suppose that X is a p × m matrix with i.i.d N (0, 1/p) see that
entries. Consider the probability distribution FX (x)
1
e
dT V (L2 , L2 ) ≤ max
− 1, pr + ps < 0.009.
associated with the spectrum of X T X as
1 − pr − ps
1 i : λi (X T X) ≤ x .
FX (x) =
So we need only bound dT V (L1 , Le2 ). It follows from
m
Proposition 2.1 and Proposition 2.2 that
For γ ∈ (0, 1], define a distribution Gγ (x) with density
p
dT V (L1 , Le2 ) ≤ Eehz̃1 ,z̃2 i − 1,
function pγ (x) as
p
where the expectation is taken over independent z̃1 and
(b − x)(x − a)
pγ (x) =
, x ∈ [a, b],
z̃2 , which are both identically distributed as z̃. Next we
2πγx
compute Eehz̃1 ,z̃2 i .
where
√
√
Ez̃1 ,z̃2 exp(hz̃1 , z̃2 i)
a = (1 − γ)2 , b = (1 + γ)2 .
X
−1+2/p
0 0
0 ,ṽ 0 exp
=
E
5n
ũ
ṽ
ũ
ṽ
ũ,ṽ,ũ
i
j
i j
Then when m → ∞, p → ∞ and p/m → γ ∈ (0, 1) it
i,j
X
X
holds that the expected Kolmogorov distance
(3.2) = Eũ,ṽ,ũ0 ,ṽ0 exp 5n−1+2/p
ũi ũ0i
ṽj ṽj0 .
i
j
E sup |FX (x) − Gγ (x)| = O(n−1/2 ).
P
P
x
Now let us bound E| i ũũ0 |2k . Note that | i ũũ0 | ≤
0
k2 ≤ 4r. By our assumption that r ≥ 4 it holds
Lemma 2.2. ([46]) Suppose that X ∼ G(p, m). Then kũk2 kũ2k−1
that r
≥ 4, so t2 /r ≤ t for t ≤ 4r. Applying the
−t2 /2
with probability
at least 1−e
, it holds that σ1 (X) ≤
√
√
Hanson-Wright inequality (Theorem 2.1) to the identity
p + m + t.
matrix Ir , we have that


2k Z
2k
3 Lower Bounds for Bilinear Sketches
X
 X

4r
3.1 The case of p > 2 Fix rn ≤ n and sn ≤ n.
E
ũũ0 ≤
Pr ũi ũ0i > t dt


0
Let L1 = G(rn , sn ) and L2 denote the distribution of
i
i
−1/2+1/p
T
k
Z 4r
the upper-left rn × sn block of G + 5n
uv ,
1/k
r
where G ∼ G(n, n), u, v ∼ N (0, In ) and G, u, v are
k!
≤2
e−c1 t /r dt ≤ 2
c1
0
independent.
for some absolute constant c1 > 0, where the integral
is evaluated by variable substitution. We continue
bounding (3.2) using the Taylor series as below:
Ez̃1 ,z̃2 exp(hz̃1 , z̃2 i)
P
P
2
∞
X
(5n−1+ p )2k E| i ũũ0 |2k E| i ṽṽ 0 |2k
≤
(2k)!
k=0
!k
2
∞
X
(k!)2
5n2(−1+ p ) rs
≤1+2
·
2
c1
(2k)!
k=1
∞
k
X
ζ2
≤ 1 + ζ 2,
≤1+2
3
k=1
provided that 5rs/c21 n2(1−2/p) ≤ ζ 2 /3. Therefore
kG + Xkp ≥ kXkp − kGkp ≥ (4.9 − 2.2)n1/p+1/2
≥ 1.2 · 2.2n1/p+1/2 ≥ 1.2kGkp
with high probability.
3.2 General p > 0 The following theorem is a
generalization of a result of Jiang [26]. Following his
notation, we let Zn denote the distribution of the upperleft rn × sn block of an orthonormal matrix chosen
uniformly at random from On and Gn ∼ √1n G(rn , sn ).
Theorem 3.3. If rn sn = o(n) and sn ≤ rn ≤
nd/(d+2) as n → ∞ for some integer d ≥ 1, then
limn→∞ dT V (Zn , Gn ) = 0.
dT V (L1 , L2 ) ≤ (ln Eehz̃1 ,z̃2 i )1/2 +dT V (L2 , Le2 ) ≤ ζ+0.009, Jiang’s original result restricts rn and sn to rn = o(√n)
√
as claimed. The absolute constant c in the statement and sn = o( n). We follow the general notation
can be taken as c = c21 /15.
and outline in his paper, making a few modifications
to remove this restriction. We postpone the proof to
It is not difficult to see that kGkp and kG + Appendix A.
5n1/p−1/2 uv T kp differ by a constant factor with high
The lower bound on bilinear sketches follows from
probability. The lower bound on the dimension of the Theorem 3.3, with the hard pair of rotationally invariant
bilinear sketch is immediate, using L1 and L2 as the distributions being a Gaussian random matrix versus a
hard pair of distributions and the observations that (i) random orthonormal matrix. The proof follows from
both distributions are rotationally invariant, and (ii) by Theorem 3.3 and is postponed to Appendix B.
increasing the dimension by at most a constant factor,
Theorem 3.4. Let A ∈ Rn×n and p > 0. Suppose that
one can assume that rn ≥ 4 and sn ≥ 4.
an algorithm takes a bilinear sketch SAT (S ∈ Rr×n
Theorem 3.2. Let A ∈ Rn×n and p > 2. Suppose that and T ∈ Rn×s ) and computes Y with (1 − c )kAkp ≤
p
p
an algorithm takes a bilinear sketch SAT (S ∈ Rr×n Y ≤ (1 + c )kAkp for any A ∈ Rn×n with probability
p
p
and T ∈ Rn×s ) and computes Y with (1 − cp )kAkpp ≤
|Ip −1|
3
1−η
) for
Y ≤ (1 + cp )kAkpp for any A ∈ Rn×n with probability at least 4 , where cp = 2(Ip +1) . Then rs = Ω(n
any
constant
η
>
0.
at least 3/4, where cp = (1.2p − 1)/(1.2p + 1). Then
rs = Ω(n2(1−2/p) ).
3.3 Rank (p = 0) Let S ⊂ [n]×[n] be a set of indices
Proof. It suffices to show that kGkp and kG + of an n × n matrix. For a distribution L over Rn×n , the
5n1/p−1/2 uv T kp differ by a constant factor with high entries of S induce a marginal distribution L(S) on R|S|
probability. Let X = 5n1/p−1/2 uv T . Since X is of rank as
one, the only non-zero singular value σ1 (X) = kXkF ≥
(Xp1 ,q1 , Xp2 ,q2 , . . . , Xp|S| ,q|S| ), X ∼ L.
4.9 · n1/p+1/2 with high probability, since kuv T k2F ∼
(χ2 (n))2 , which is tightly concentrated around n2 .
Theorem 3.5. Let U, V ∼ G(n, d), G ∼ γG(n, n) for
On the other hand, combining Lemma 2.1 and γ = n−14 . Consider two distributions L and L over
1
2
Lemma 2.2 as well as (2.1) with f (x) = xp on [0, 4], Rn×n defined by U V T and U V T + G respectively. Let
we can see that with probability 1 − o(1) it holds for S ⊂ [n] × [n]. When |S| ≤ d2 , it holds that
X ∼ √1n G(n, n) (note the normalization!) that
(3.5)
dT V (L1 (S), L2 (S)) ≤ C|S| n−2 + dcd ,
p
(3.3)
kXkp = (Ip + o(1)) n,
where C > 0 and 0 < c < 1 are absolute constants.
where
p
Z 4
Proof. (sketch, see Appendix D for the full proof.) We
p
(4 − x)x
(3.4)
Ip =
x2 ·
dx ≤ 2p .
give an algorithm which gives a bijection f : R|S| →
2πx
0
R|S| with the property that for all but a subset of
|S|
1/p
Hence kGkp ≤ 1.1 · Ip n1/2+1/p ≤ 1.1 · 2 · n1/2+1/p R of measure o(1) under both L1 (S) and L2 (S), the
probability density functions of the two distributions are
with high probability. By the triangle inequality,
equal up to a multiplicative factor of (1 ± 1/poly(n)).
The idea is to start with the row vectors U1 , . . . , Un
of U and V1 , . . . , Vn of V , and to iteratively perturb
them by adding γGi,j to U V T for each (i, j) ∈ S. We
find new vectors U10 , . . . , Un0 and V10 , . . . , Vn0 of n × d
matrices U 0 and V 0 so that (U 0 )(V 0 )T and U V T + γG
are equal on S. We do this in a way for which kUi k2 =
(1 ± 1/poly(n))kUi0 k2 and kVi k2 = (1 ± 1/poly(n))kVi0 k2
for all i, and so the marginal density function evaluated
on Ui (or Vj ) is close to that evaluated on Ui0 (or Vj0 ),
by definition. Moreover, our mapping is bijective, so the
joint distribution of (U10 , . . . , Un0 , V10 , . . . , Vn0 ) is the same
as that evaluated of (U1 , . . . , Un , V1 , . . . , Vn ) up to a
(1 ± 1/poly(n))-factor. The bijection we create depends
on properties of S, e.g., if the entry (U V T )i,j = hUi , Vj i
is perturbed, and more than d entries of the i-th row of
A appear in S, this places more than d constraints on
Ui , but Ui is only d-dimensional. Thus, we must also
change some of the vectors Vj . We change those Vj for
which (i, j) ∈ Q and there are fewer than d rows i0 6= i
for which (i0 , j) ∈ S; in this way there are fewer than
d constraints on Vj so it is not yet fixed. We can find
enough Vj with this property by the assumption that
|S| ≤ d2 .
Algorithm 1 The sketching algorithm for even p ≥ 4.
Input: n, > 0, even integer p ≥ 4 and A ∈ Rn×n .
1: N ← Ω(−2 )
2: Let {Gi } and {Hi } be independent n1−2/p × n
matrices with i.i.d. N (0, 1) entries
3: Maintain each Gi AHiT , i = 1, . . . , N
4: Compute Z as defined in (4.6)
5: return Z
As an aside, given that w.h.p. over A ∼ L2 in
Theorem 3.5, A requires modifying Θ(n2 ) of its entries
to reduce its rank to at most d if d ≤ n/2, this implies
that we obtain an Ω(d2 ) bound on the non-adaptive
query complexity of deciding if an n×n matrix is of rank
at most d or ε-far from rank d (for constant ε), showing
an algorithm of Krauthgamer and Sasson is optimal [29].
See Appendix E for the proof. A similar algorithm
works for odd p and PSD matrices A. See Appendix F
for details.
sketches, which can thus be implemented in the most
general turnstile data stream model (arbitrary number
of positive and negative additive updates to entries
given in an arbitrary order). The algorithm works for
arbitrary A when p ≥ 4 is an even integer.
Suppose that p = 2q. We define a cycle σ to
be an ordered pair of a sequence of length q: σ =
((i1 , . . . , iq ), (j1 , . . . , jq )) such that ir , jr ∈ [k] for all r,
ir 6= is and jr 6= js for r 6= s. Now we associate with σ
Aσ =
q
Y
Ai` ,j` Ai`+1 ,j`
`=1
where we adopt the convention that ik+1 = i1 . Let C
In the theorem above, choose d = n/2 and so denote the set of cycles. We define
rank(U V T ) ≤ n/2 while rank(G) = n with probability
N
1 X 1 X
1. Note that both distributions are rotationally invari(Gi AHiT )σ
(4.6)
Z=
ant, and so the lower bound on bilinear sketches follows
N i=1 |C|
σ∈C
immediately.
for even p.
Theorem 3.6. Let A ∈ Rn×n . Suppose that an alr×n
gorithm takes a bilinear sketch SAT (S ∈ R
and Theorem 4.1. With probability ≥ 3/4, the output Z
T ∈ Rn×s ) and computes Y with (1 − cp ) rank(A) ≤ returned by Algorithm 1 satisfies (1 − )kAkp ≤ Z ≤
p
Y ≤ (1 + cp ) rank(A) for any A ∈ Rn×n with probability (1 + )kAkp when p is even. The algorithm is a bilinear
p
at least 3/4, where cp ∈ (0, 1/3) is a constant. Then
sketch with r · s = Op (−2 n2−4/p ).
rs = Ω(n2 ).
4
Bilinear Sketch Algorithms
By the Johnson-Lindenstrauss Transform, or in fact,
any subspace embedding with near-optimal dimension
(which can lead to better time complexity [44, 12, 37,
39]), we can reduce the problem of general matrices
to square matrices (see Appendix C for details), and
henceforth we shall assume the input matrices are
square.
We present a sketching algorithm to compute a
(1+)-approximation of kAkpp for A ∈ Rn×n using linear
5
Lower Bounds for General Linear Sketches
5.1 Lower bound for p > 2 Let G ∼ G(n, n), u, v ∼
N (0, In ) be independent. Define two distributions:
L1 = G(n, n), while L2 is the distribution induced by
1
1
G + 5n− 2 + p uv T .
A unit linear sketch can be described using an n × n
matrix L. Applying this linear sketch to an n × n
matrix Q results in hL, Qi. More generally, consider
k orthonormal linear sketches (which as was argued
earlier, is equivalent to any k linear forms, since, given
a sketch, one can left-multiply it by an arbitrary matrix
to change its row space) corresponding to {Li }ki=1 with
tr(LTi Lj ) = δij .
Let X ∼ L2 and Zi = hLi , Xi. We define
1/2
k
X X
=
(Lia,b )2 (Lic,d )2 
a distribution Dn,k on Rk to be the distribution of
(Z1 , . . . , Zk ).

√
Theorem 5.1. Suppose that p > 2. For all suffi3
2
ciently large n, whenever k ≤ cn 2 (1− p ) , it holds that
dT V (N (0, Ik ), Dn,k ) ≤ 0.24, where c > 0 is an absolute
constant.
Proof. It is clear that Dn,k is a Gaussian mixture with
shifted mean
=
a,b,c,d i=1
k
Partition of size 2 and 3. The norms
√ are automatically upper-bounded by kAk{1,2,3,4} = k.
Partition of size 4.
The only partition is
{1}, {2}, {3}, {4}. We have
kAk{1},{2},{3},{4} =
k
X
sup
u,v,u0 ,v 0 ∈Sn−1
Xu,v = 5n−1/2+1/p (uT L1 v, uT L2 v, . . . , uT Lk v)T
=: 5n−1/2+1/p Yu,v .
≤
sup
u,v,u0 ,v 0
1
2
k
X
uT Li vu0T Li v 0
i=1
!
T
i 2
0 0T
i 2
huv , L i + hu v , L i
≤1
i=1
Without loss of generality we may assume √
that k ≥ 16.
follows the fact that uv T is a unit
Consider the event Eu,v = {kYu,v k2 ≤ 4 k}. Since The last inequality
2
n2
i
vector
in
R
and
L
’s are orthonormal vectors in Rn .
EkYu,v k22 = k, it follows from Markov’s inequality
Latała’s
inequality (Theorem 2.2) states that for
en,k be the marginal
√
that Pru,v {Eu,v } ≥ 15/16. Let D
t ∈ [ k, k 2 ],
distribution of Dn,k conditioned on Eu,v . Then
o
nX
0
0
A
u
u
v
v
>
t
Pr
a,b,c,d
a
c
e
b
d
dT V (Dn,k , Dn,k ) ≤ 1/8
a,b,c,d
(
)!
2
t t2 t 3 √
en,k ). Resortand it suffices to bound dT V (N (0, In ), D
≤ C1 exp −c min √ , , 1 , t
ing to χ2 divergence by invoking Proposition 2.1 and
k k k3
!
Proposition 2.2, we have that
2
t3
p
≤ C1 exp −c · 1
en,k ) ≤ EehXu,v ,Xu0 ,v0 i − 1,
dT V (N (0, In ), D
k3
where u, v, u0 , v 0 ∼ N (0, In ) conditioned on Eu,v and with no conditions imposed on u, v, u0 , v 0 . It follows that
Eu0 ,v0 . We first calculate that
Pr |hYu,v , Yu0 ,v0 i| > tEu,v Eu0 ,v0
n
X
X
2
Pr {|hYu,v , Yu0 ,v0 | > t}
(Li )ab (Li )cd ua u0b vc vd0
hXu,v , Xu0 ,v0 i = c2p n−1+ p
≤
Pr{Eu0 ,v0 } Pr{Eu,v }
a,b,c,d=1 i
!
2
X
√
3
t
0
0
=: D
Aa,b,c,d ua ub vc vd ,
≤ C2 exp −c · 1 , t ∈ [ k, k 2 ].
k3
a,b,c,d
2
where D = c2p n−1+ p and Aa,b,c,d is an array of order 4 Note that conditioned on Eu,v and Eu0 ,v0 ,
such that
|hYu,v , Yu0 ,v0 i| ≤ kYu,v k2 kYu0 ,v0 k2 ≤ 16k.
k
X
Let = 1/8, then for t ∈ [k 1/2+ , 16k], it holds that
Aa,b,c,d =
Liab Licd .
i=1
ct2/3
ct2/3
tD − 1/3 ≤ − 1/3 ,
We shall compute the partition norms of Aa,b,c,d as
k
2k
needed in Latała’s tail bound.
3
2
Partition of size 1. The only possible partition is provided that k ≤ c0 n 2 (1− p ) . Integrating the tail bound
{1, 2, 3, 4}. We have
gives that


1/2
!2
EeXu,v ,Xu0 ,v0
k
X X
Z 16k
kAk{1,2,3,4} = 
Lia,b Lic,d 
=
1
+
etD Pr{|hYu,v , Yu0 ,v0 i| > t}dt
a,b,c,d i=1
0

1/2
Z k1/2+
k
X X
j
j
=1+D
etD Pr{|hYu,v , Yu0 ,v0 i| > t}dt
=
Lia,b Lic,d La,b Lc,d 
a,b,c,d i,j=1
0
Z
16k
where N (0, Ik ) is the standard k-dimensional Gaussian
distribution.
etD Pr{|hYu,v , Yu0 ,v0 i| > t}dt
+D
k1/2+
Z
≤1+D
k1/2+
Z
etD dt + C2 D
≤e
2/3
etD−ct
/k1/3
dt
k1/2+
0
Dk1/2+
16k
Z
16k
+ C2 D
2/3
e
− ct 1/3
dt
ck 2/3
1/2+
≤ exp(Dk
) + 16C2 kD · exp −
2
!
2
cp
≤ exp
2
( 14 − 32 )(1− p
)
n
!
2
2
cc0 n(1− p )
)
0 2 12 (1− p
+ 16C2 c cp n
exp −
2
2k
k1/2+
≤ 1.01
when n is large enough. It follows immediately that
en,k ) ≤ 1/10 and thus
dT V (N (0, In ), D
Proof. The sketch can be written as a matrix Φg, where
2
Φ ∈ Rk×n /2 is a random matrix that depends on A(i,σ) ,
(i,σ)
B
and M , and g ∼ G(n2 /2, 1). Assume that Φ has
full row rank (we shall justify this assumption below).
Fix Φ (by fixing M ). Then Φg ∼ N (0, ΦΦT ). It is
known that ([27, Lemma 22])
q
dT V (N (0, ΦΦT ), N (0, Ik )) ≤ tr(ΦΦT ) − k − ln |ΦΦT |,
where λ1 , . . . , λk are the eigenvalues values of ΦΦT .
Write ΦΦT = I +
Define an event E =
n
o P.
12 k2
2
M : kP kF ≤ ζ · n . When E happens, the eigenvalq
√k
ues of P are bounded by 12
ζ · n ≤ 2/3. Let µ1 , . . . , µk
be the eigenvalues of P , then λi = 1+µi and |µi | ≤ 2/3.
Hence
v
u k
uX
T
The lower bound on the number of linear sketches
dT V (N (0, ΦΦ ), N (0, Ik )) ≤ t (µi − ln(1 + µi ))
follows immediately as a corollary.
i=1
v
u k
r
q
uX
2
12 k
Theorem 5.2. Let X ∈ Rn×n and p > 2. Suppose
t
2
2
≤
· √ ≤ ζ,
µi = kP kF ≤
that an algorithm takes k linear sketches of X and
ζ
3
n
i=1
computes Y with (1 − cp )kXkpp ≤ Y ≤ (1 + cp )kXkpp
for any X ∈ Rn×n with probability at least 3/4, where where we use that x − ln(1 + x) ≤ x2 for x ≥ −2/3.
cp = (1.2p − 1)/(1.2p + 1). Then k = Ω(n(3/2)(1−2/p) ). Therefore, when E happens, Φ is of full rank and we can
apply the total variation bound above. We claim that
5.2 General p ≥ 0 Consider a random matrix EPij2 ≤ 4/n for all i, j and thus EkP k2F ≤ 4k 2 /n, it then
(G, GM ), where G ∼ G(n, n/2) and M ∼ On/2 .
follows that Pr(E) ≥ 1−ζ/3 by Markov’s inequality and
A unit linear sketch can be described using an n × n
2
2
1
matrix L = (A, B), where A, B ∈ Rn×(n/2) such that
dT V (Dn,k , N (0, Ik )) ≤ ζ + Pr(E c ) ≤ ζ + ζ = ζ
3
3
3
kAk2F + kBk2F = 1. Applying this linear sketch to an
n×(n/2)
as
n × n matrix Q = (Q1 , Q2 ) (where Q1 , Q2 ∈ R
)
Now we show that EPij2 ≤ 4/n for all i, j. Suppose
results in hA, Q1 i + hB, Q2 i.
More generally, consider k orthonormal linear that M = (mij ). Notice that the r-th row of Φ is
X (r)
sketches (which as was argued earlier, is equivalent to
n
(r)
A
+
Bij m`j , i = 1, . . . , n, ` = 1, . . . , .
i`
any k linear forms, since, given a sketch, one can left2
j
multiply it by an arbitrary matrix to change its row
space) corresponding to {Li }ki=1 with tr(LTi Lj ) = δij .
Hence by a straightforward calculation, the inner prodNow we define a probability distribution Dn,k on uct of r-th and s-th row is
X (r) (s)
X (s) (r)
Rk . For each i write Li as Li = (A(i) , B (i) ). Then
hΦ
,
Φ
i
=
δ
+
A
B
m
+
Ai` Bij m`j
(i)
(j)
(i)
(j)
r·
s·
rs
`j
ij
i`
by orthonormality, hA , A i + hB , B i = δi,j .
(i)
(i)
i,j,`
i,j,`
Define Zi = hA , Gi + hB , GM i and Dn,k to be the
X (r) (s)
(s)
(r)
distribution of (Z1 , . . . , Zk ).
= δrs +
hA` , Bj i + hA` , Bj i m`j
dT V (N (0, In ), Dn,k ) ≤ 1/10 + 1/8 < 0.24.
Theorem 5.3. Let Dn,k be √
defined as above and ζ ∈
(0, 1). Then for k ≤ (ζ/3)3/2 n it holds that
dT V (Dn,k , N (0, Ik )) ≤ ζ,
j,`
(r)
where Ai
denotes the i-th column of A(r) . Then
Prs = tr(U M ),
where the matrix U is defined by
[4] A. Andoni and H. L. Nguyen. Eigenvalues of a matrix
in the streaming model. In SODA, 2013.
(r)
(s)
(s)
(r)
uj` = hA` , Bj i + hA` , Bj i.
[5] A. Andoni, H. L. Nguyen, Y. Polyansnkiy, and Y. Yu.
Tight lower bound for linear sketches of moments. In
Since
ICALP, 2013.
nX
X
[6] Z. Bar-Yossef, T. S. Jayram, R. Kumar, and D. Sivaku(r) 2
(s) 2
2
ujk ≤ 2
|Aik |
|Bij |
mar. An information statistics approach to data
i
X i
X
o
stream and communication complexity. J. Comput.
(s) 2
(r) 2
+
|Aik |
|Bij |
Syst. Sci., 68(4):702–732, 2004.
i
i
[7] L. Bhuvanagiri, S. Ganguly, D. Kesh, and C. Saha.
and thus
Simpler algorithm for estimating frequency moments
X
X nX
(r)
(s)
of data streams. In SODA, pages 708–713, 2006.
kU k2F ≤ 2
|Aik |2
|Bij |2 +
j,k
i
i
[8]
E. J. Candès and B. Recht. Exact matrix completion
X
X
o
(s) 2
(r) 2
via
convex optimization. Commun. ACM, 55(6):111–
|Aik |
|Bij |
i
i
119, 2012.
[9] A. Chakrabarti, S. Khot, and X. Sun. Near-optimal
≤ 2 kA(r) k2F kB (s) k2F + kA(s) k2F kB (r) k2F ≤ 2.
lower bounds on the multi-party communication complexity of set disjointness. In CCC, 2003.
We conclude that
[10] M. Charikar, K. Chen, and M. Farach-Colton. FindX
X
2
ing frequent items in data streams. In Proceedings
E[Prs
]=
u2jk E[m2kj ]+
ujk ui` E[mkj mi` ]
j,k
(j,k)6=(i,`)
of the 29th International Colloquium on Automata,
Languages and Programming (ICALP), pages 693–703,
2kU k2F
4
2X
u2jk =
≤ .
=
2002.
j,k
n
n
n
[11] H. Y. Cheung, L. C. Lau, and K. M. Leung. Graph
This completes the proof.
connectivities, network coding, and expander graphs.
In FOCS, pages 190–199, 2011.
Without loss of generality,
we can normalize our
√
[12] K. L. Clarkson and D. P. Woodruff. Low rank ap1
matrix by a factor of 1/ n. Let X ∼ √n G(n, n) and
proximation and regression in input sparsity time. In
Y = (G, GM ), where G ∼ G(n, n/2) and M ∼ On/2 . It
STOC, pages 81–90, 2013.
is not difficult to see that kXkpp and kY kpp differs by a [13] G. Cormode and S. Muthukrishnan. An improved
data stream summary: the count-min sketch and its
constant with high probability. The following theorem
applications. J. Algorithms, 55(1):58–75, 2005.
is now an immediate corollary of Theorem 5.3. The
[14]
M. S. Crouch and A. McGregor. Periodicity and cyclic
proof is postponed to Appendix H.
shifts via linear sketches. In APPROX-RANDOM,
pages 158–170, 2011.
Theorem 5.4. Let X ∈ Rn×n and p ≥ 0. Suppose that
an algorithm takes k linear sketches of X and computes [15] A. Deshpande, M. Tulsiani, and N. K. Vishnoi. Algorithms and hardness for subspace approximation. In
Y with
SODA, pages 482–496, 2011.
• (when p > 0) (1 − cp )kXkpp ≤ Y ≤ (1 + cp )kXkpp , [16] P. Flajolet and G. N. Martin. Probabilistic counting.
or
In Proceedings of the 24th Annual IEEE Symposium
on Foundations of Computer Science (FOCS), pages
• (when p = 0) (1 − cp )kXk0 ≤ Y ≤ (1 + cp )kXk0
76–82,
1983.
for any X ∈ Rn×n with probability at least 43 , where
√ cp [17] P. Flajolet and G. N. Martin. Probabilistic counting
is some constant depends only on p. Then k = Ω( n).
algorithms for data base applications. J. Comput. Syst.
Sci., 31(2):182–209, 1985.
[18] F. Götze and A. Tikhomirov. Rate of convergence in
References
probability to the marchenko-pastur law. Bernoulli,
10(3):pp. 503–548, 2004.
[1] N. Alon, Y. Matias, and M. Szegedy. The Space [19] S. Guha, P. Indyk, and A. McGregor. Sketching
information divergences. Machine Learning, 72(1-2):5–
Complexity of Approximating the Frequency Moments.
19, 2008.
J. Comput. Syst. Sci., 58(1):137–147, 1999.
[20]
M. Hardt, K. Ligett, and F. McSherry. A simple
[2] A. Andoni.
Nearest neighbor search in highand practical algorithm for differentially private data
dimensional spaces.
In the workshop:
Barrelease. In Advances in Neural Information Processing
riers in Computational Complexity II, 2010.
Systems 25, pages 2348–2356. 2012.
http://www.mit.edu/ andoni/nns-barriers.pdf.
[21]
P. Indyk. Stable distributions, pseudorandom gener[3] A. Andoni, R. Krauthgamer, and K. Onak. Streaming
ators, embeddings, and data stream computation. J.
algorithms from precision sampling. In FOCS, pages
ACM, 53(3):307–323, 2006.
363–372, 2011.
[22] P. Indyk. Sketching, streaming and sublinear-space
algorithms, 2007. Graduate course notes available at
http://stellar.mit.edu/S/course/6/fa07/6.895/.
[23] P. Indyk and A. McGregor. Declaring independence
via the sketching of sketches. In SODA, pages 737–
745, 2008.
[24] P. Indyk and D. P. Woodruff. Optimal approximations
of the frequency moments of data streams. In STOC,
pages 202–208, 2005.
[25] Y. Ingster and I. A. Suslina. Nonparametric Goodnessof-Fit Testing Under Gaussian Models. Springer, 1st
edition, 2002.
[26] T. Jiang. How many entries of a typical orthogonal
matrix can be approximated by independent normals?
The Annals of Probability, 34(4):pp. 1497–1529, 2006.
[27] A. T. Kalai, A. Moitra, and G. Valiant. Efficiently
learning mixtures of two gaussians. In Proceedings of
the 42nd STOC, pages 553–562, 2010.
[28] D. M. Kane, J. Nelson, E. Porat, and D. P. Woodruff.
Fast moment estimation in data streams in optimal
space. In Proceedings of the 43rd annual ACM symposium on Theory of computing, STOC ’11, pages 745–
754, 2011.
[29] R. Krauthgamer and O. Sasson. Property testing of
data dimensionality. In SODA, pages 18–27, 2003.
[30] R. Latała. Estimates of moments and tails of Gaussian
chaoses. Ann. Probab., 34(6):2315–2331, 2006.
[31] B. Laurent and P. Massart. Adaptive estimation of
a quadratic functional by model selection. Annals of
Statistics, 28(5):1302–1338, 2000.
[32] J. Lehec. Moments of the Gaussian chaos. In Séminaire de Probabilités XLIII, volume 2006 of Lecture
Notes in Math., pages 327–340. Springer, Berlin, 2011.
[33] C. Li and G. Miklau. Measuring the achievable
error of query sets under differential privacy. CoRR,
abs/1202.3399, 2012.
[34] Y. Li and D. P. Woodruff. A tight lower bound for
high frequency moment estimation with small error.
In RANDOM, 2013.
[35] M. W. Mahoney. Randomized algorithms for matrices
and data. Foundations and Trends in Machine Learning, 3(2):123–224, 2011.
[36] V. A. Marčenko and L. A. Pastur. Distribution of
eigenvalues in certain sets of random matrices. Mat.
Sb. (N.S.), 72 (114):507–536, 1967.
[37] X. Meng and M. W. Mahoney. Low-distortion subspace embeddings in input-sparsity time and applications to robust linear regression. In STOC, pages 91–
100, 2013.
[38] S. Muthukrishnan. Data Streams: Algorithms and
Applications. Foundations and Trends in Theoretical
Computer Science, 1(2):117–236, 2005.
[39] J. Nelson and H. L. Nguyen. OSNAP: Faster numerical
linear algebra algorithms via sparser subspace embeddings. CoRR, abs/1211.1002, 2012.
[40] E. Price and D. P. Woodruff. Applications of the
Shannon-Hartley theorem to data streams and sparse
recovery. In ISIT, pages 2446–2450, 2012.
[41] M. Rudelson and R. Vershynin. Hanson-Wright inequality and sub-gaussian concentration. Manuscript.
http://arxiv.org/abs/1306.2872v2.
[42] M. Rudelson and R. Vershynin. The LittlewoodOfford problem and invertibility of random matrices.
Advances in Mathematics, 218(2):600 – 633, 2008.
[43] M. E. Saks and X. Sun. Space lower bounds for
distance approximation in the data stream model. In
STOC, pages 360–369, 2002.
[44] T. Sarlós. Improved approximation algorithms for
large matrices via random projections. In FOCS, pages
143–152, 2006.
[45] A. B. Tsybakov. Introduction to Nonparametric Estimation. Springer, 1st edition, 2008.
[46] R. Vershynin. Introduction to the non-asymptotic
analysis of random matrices. In Y. C. Eldar and
G. Kutyniok, editors, Compressed Sensing: Theory and
Applications. Cambridge University Press, 2011.
Appendices
A
Proof of Theorem 3.3
Proof. First we strengthen Lemma 2.6 in [26] and in
this proof we shall follow Jiang’s notation that the we
read the upper-left p × q block despite the usage of rn
and sn in the statement of the theorem.
It holds that fs0 (s, t) = −2/(1 − 2s − t) = −2 +
−1/(d+2)
O(n
) and ft0 (s, t) = −1 + O(n1/(d+2) ). The
bounds for second-order derivatives of f (s, t) still hold.
Thus
ZZ
3kq
1
2
Bn = n
ln(1 − 2s − t)dsdt +
+O
2n
n1/(d+2)
ZZ
= n2
ln(1 − 2s − t)dsdt + o(1).
Now, we expand the Taylor series into r terms,
2
d ln(1+s+t)− (s+t)− (s+t) +· · ·+(−1)d+1 (s+t) 2
d
≤
(s + t)d+1
.
d+1
So
Z
v
Z
u
ln(1 + s + t)dsdt
0
=
0
d
X
k=1
(−1)k+1
((u + v)k+2 − uk+2 − v k+2 )
k(k + 1)(k + 2)
+ O((u + v)d+3 )
as n → ∞. Substituting u = −(p + 2)/n and v = −q/n
back into two integrals in (2.5), (2.7) becomes
Z Z
n2 v u
ln(1 + s + t)dsdt
2 0 0
d
=−
1X
1
(p+q+2)k+2 − (p+2)k+2 − q k+2
2
k(k+1)(k+2)
nk
k=1
(p + q + 2)d+3
+O
nd+1
Z
=−
v
Z
1
2
Bn +
q
X
f (λi ) → 0
i=1
−2/n
in probability as n → ∞. First, we look at the last
term. It is clear from
ln(1 + s + t)dsdt
0
tr((X T X)d+2 )
,
nd+2
where gn ∈ [0, 2) for n sufficiently large.
We want to show that
(A.1)
and
n2
2
+ g̃n
0
d
X
1
(q + 2)k+2 − 2k+2 − q k+2
k(k + 1)(k + 2)
nk
k=1
d+3 q
+O
nd+1
tr((XnT Xn )m ) =
q
X
X
X
xk1 ,i xk1 ,j1 xk2 ,j1 xk2 ,j2
i=1 j∈[q]m−1 k∈[p]m
· · · xkm−1 ,jm−2 xkm−1 ,jm−1 xkm ,jm−1 xkm ,i
that E tr((XnT Xn )m ) = O(pm
n qn ), where the constant in
!
O(·)
notation
depends
on
m
but not p and q. For any
d
k+1
X k+2 pi q k+2−i
1X
1
>
0,
Bn = −
+o(1).
2
k(k+1)(k+2) i=1
nk
i
k=1
E[(tr(XnT Xn )r+2 )2 ]
T
r+2
r+2
Pr
tr((X
X
)
>
n
≤
i k+2−i
k
n
n
Notice that for i ≥ 2, it holds that p q
/n ≤
2 n2r+4
2r+4 2 k 2
k
2r+4
T
p q /n → 0 as n → ∞. It follows that
]
E[q tr(Xn Xn )
p
q
≤
= O 2 2r+4 → 0
2 n2r+4
d
n
1 X pk+1 q
Bn = −
+ o(1).
as n → ∞. So the last term goes to 0 in probability.
2
k(k + 1)nk
k=1
For the other terms, fix > 0, we see that
This is our desired form of Lemma 2.6.
Pr | tr((XnT Xn )m ) − E tr((XnT Xn )m )| ≥ nm
Now, following the proof of Lemma 2.7, we expand
into more terms
E[q(tr((XnT Xn )2m )]
E[(tr((XnT Xn )m )2 ]
≤
≤
2
d+1
d+2
x
x
x
1
x x
2 n2m
2n2m
ln 1 −
−
·
.
= − − 2 −· · ·−
d+1
d+2
p2m q 2
n
n n
(d + 1)n
d + 2 (ξ − n)
= O 2 2m → 0
n
so
Hence
f (x) =
where
p+q+1
n−p−q−1
n−p−q−1 d+1 hence
x−
x−· · ·−
x
2
q
2n
4n
2(d+1)nd+1
X
p+q+1
d+2
E tr(X TX)
f (λi ) →
x
2n
+ gn (x) d+1 ,
i=1
n
d+1
X
n−p−q−1
E tr((X TX)j )
−
j
2jn
d+1
n (n − p − q − 1)
j=2
gn (x) = −
.
2(d + 2)(ξ − n)d+2
in probability. It suffices to show that
It is trivial to see that


d+1
X
p+q+1
n−p−q−1
1
B n +
E tr(X TX)−
E tr((X TX)j )
sup |gn (x)| ≤
.
j
2n
2jn
(1 − α)d+2
0≤x≤αn
j=2
√
√ 2
The eigenvalues are bounded by O(( p + q) ) = goes to 0. Rearrange the terms in the bracket as
O(n1−1/(d+2) ). Define Ω accordingly. Then
d X
p+q+1
1
T
j
T
j+1
q
E(tr(X
X)
)
−
E(tr(X
X)
)
X
jnj
(j +1)nj
f (λi )
j=1
i=1
=
d+1
X
n−p−q−1
p+q+1
tr(X T X) −
2n
j=2
2jnj
+
tr((X T X)j )
p+q+1
E(tr(X T X)j+1 )
(d + 1)nd+1
The last term goes to 0 as E(tr(X T X)j+1 ) = O(pj+1 q) O(d/2 ) rows while roughly maintaining the singular
and pd+1 /nd → 0. Hence it suffices to show that for values as follows. Call Φ a (d, δ)-subspace embedding
matrix if with probability ≥ 1 − δ it holds that
each k
−
p+q+1
pk+1 q
+
E(tr(X TX)k )
k(k + 1)nk
knk
1
−
E(tr(X T X)k+1 ) → 0,
(k + 1)nk
(1 − )kxk ≤ kΦxk ≤ (1 + )kxk
for all x in a fixed d-dimensional subspace. In [44], it is
proved that
Lemma C.1. Suppose that H ⊂ Rn is a d-dimensional
subspace. Let Φ be an r-by-n random matrix with
−pk+1q+(k+1)(p+q+1)E(tr(X TX)k )−kE(tr(X TX)k+1 )entries i.i.d N (0, 1/r), where r = Θ(d/2 log(1/δ)).
Then it holds with probability ≥ 1 − δ that
nk
→ 0,
(1 − )kxk ≤ kΦxk ≤ (1 + )kxk, ∀x ∈ H.
which
can
be
verified
easily
using
that
m
E tr((XnT Xn )m ) = pm
Therefore (A.1) In fact we can use more modern subspace embeddings
n qn + o(n ).
holds and the rest of the argument in Jiang’s paper [44, 12, 37, 39] to improve the time complexity, though
since our focus is on the sketching dimension, we defer a
follows.
thorough study of the time complexity to future work.
Now we are ready for the subspace embedding
B Proof of Theorem 3.4
transform on singular values, which follows from the
Proof. Choose A from D0 with probability 1/2 and from min-max principle for singular values.
On with probability 1/2. Let b ∈ {0, 1} indicate which
distribution A is chosen from, that is, b = 0 when A Lemma C.2. Let Φ be a (d, δ)-subspace embedding mais chosen from D0 and b = 1 otherwise. Note that trix. Then, with probability ≥ 1 − δ, it holds that
both D0 and O(n) are rotationally invariant, so without (1−)σi (ΦA) ≤ σi (A) ≤ (1+)σi (ΦA) for all 1 ≤ i ≤ d.
loss of generality,
we can assume that S = Ir 0
Is
Proof. [of Lemma C.2] The min-max principle for sinand T =
. Hence SAT ∼ Zn when b = 0 and
0
gular values says that
SAT ∼ Gn when b = 1. Notice that the singular
σi (A) = max min kAxk,
values of an othorgonal matrix are all 1s. Then with
x∈Si
Si
kxk2 =1
probability 1 − o(1) we have that
or
b = 0where Si runs through all i-dimensional subspace. Observe that the range of A is a subspace of dimension at
(1 − cp )n ≤ Y ≤ (1 + cp )n, b = 1
most d. It follows from Lemma C.1 that with probabilso we can recover b from Y with probability 1 − o(1), ity ≥ 1 − δ,
|I −1|
provided that cp ≤ 2(Ipp +1) .
(1 − )kAxk ≤ kΦAxk ≤ (1 + )kAxk, ∀x ∈ Rd .
Consider the event E that the algorithm’s output
The claimed result follows immediately from the minindicates b = 1. Then as in the proof of Theorem 5.4,
max principle for singular values.
1
dT V (Dn,k , N (0, Ik )) ≥ + o(1).
2
Let Ã = (ΦA, 0) with zero padding so that Ã is
On
the
other
hand,
by
Theorem
3.3, a square matrix of dimension O(d/2 ). Then by the
dT V (Dn,k , N (0, Ik )) = o(1) when rs ≤ n1−η for some preceding lemma, with probability ≥ 1 − δ, kÃkp =
η > 0. This contradiction implies that rs = Ω(n1−η ) kΦAkp is a (1 ± )-approximation of kAkp for all p > 0.
Therefore we have reduced the problem to the case of
for any η > 0.
square matrices.
C Reduction to Square Matrices for the upper
D Proof of Theorem 3.5
bound
n×d
2
Suppose that A ∈ R
(n > d). When n = O(d/ ), Proof. Suppose that |S| = k and S = {(pi , qi )}ki=1 . By
let Ã = (A, 0) with zero padding so that Ã is a square symmetry, without loss of generality, we can assume
matrix of dimension n. Then kÃkp = kAkp for all that S does not contain a pair of symmetric entries.
p > 0. Otherwise, we can sketch the matrix with Throughout this proof we rewrite L1 (S) as L1 and
L2 (S) as L2 . Now, using new notation, let us denote
(1 − cp )((Ip + o(1))n ≤ Y ≤ (1 + cp )((Ip + o(1))n,
the marginal distribution of L2 with fixed G by L2 (G).
Then
dT V (L1 , L2 ) =
=
sup
| Pr(A) − Pr(A)|
sup
Z
Pr(A) −
A⊂M(Rk ) L1
A⊂M(Rk ) L1
Z
≤
sup
A⊂M(Rk )
Rn2

≤
L2
sup
A⊂M(Rk )


Pr(A|G)p(G)dG
Rn2 L2
Pr(A) − Pr(A|G) p(G)dG
L1
L2
2. kU −Tt U kF ≤ t0 , kV −Tt V kF ≤ t0 with probability
1 − O(1/n2 + dcd ), over the randomness of U and
V . When this holds, we say that U and V are good,
otherwise we say that they are bad.
3. T−t · Tt = id.
Z Property 3 implies that Tt is bijective. Define
Pr(A) − Pr(A|G) p(G)dG+
L1
E(x) = {(U, V ) : U V T |S = x}
L2
F (δ)
Z
1. ((Tt U )(Tt V )T )pq = (U V )pq + t and for all
(p0 , q 0 ) ∈ S \ {(p, q)}, we have ((Tt U )(Tt V )T )p0 q0 =
(U V T )p0 q0 .

Pr(A) − Pr(A|G) p(G)dG

L1
L2
Egood (x) = {(U, V ) ∈ E(x) : (U, V ) is good},
Ebad (x) = {(U, V ) ∈ E(x) : (U, V ) is bad}
Then, using these three properties about Tt , as well as
the triangle inequality, and letting p(U ), p(V ) be the
(D.2)
≤ sup dT V (L1 , L2 (G)) + 2 Pr{F (δ)c },
p.d.f.’s of U and V that
G∈F (δ)
kU k2F
1
n×n
exp −
p(U ) =
where F (δ) = {G ∈ R
: |Gpi ,qi | ≤ δ, ∀i = 1, . . . , k} (D.6)
2
(2π)nd/2
and Pr{F (δ)c } is the probability of the complement
1
kV
k2F
of F (δ) under the distribution on G, and we choose
p(V ) =
exp −
2
(2π)nd/2
δ = n1/4 γ. Recalling the PDF of a Gaussian random
variable and that k ≤ n2 , it follows from a union bound we have that
that
2
2
1/2
c
−δ /(2γ )
−n
/2
−3
(A
+
te
)
(A)
−
Pr
Pr
i
(D.3) Pr{F (δ) } ≤ ke
= ke
≤n .
L1
L1
Z
Now we examine dT V (L1 , L2 (G)) with G
∈
Z
F (δ).
For notational convenience, let ξ = = p(U )p(V )dU dV −
p(U )p(V )dU dV T
T
(ξ1 , . . . , ξk )
= (Gp1 ,q1 , . . . , Gpk ,qk )
and define
E(x)
E(x−tei )
(i)
T
ξ
= (ξ1 , . . . , ξi , 0, . . . , 0) . Applying the triangle
Z
Z
inequality to a telescoping sum,
p(U )p(V )dU dV −
p(Tt U )p(Tt V )dU dV ≤
(D.4)
dT V (L1 , L2 (G))
Egood (A)
Egood (A)
Z
= sup Pr(A) − Pr (A)
L2 (G)
+
p(U )p(V ) + p(Tt U )p(Tt V )dU dV
A⊂M(Rk ) L1
Z
= sup Pr(A) − Pr(A − ξ)
≤
|p(U ) − p(Tt U )|p(V )dU dV
L1
A⊂M(Rk ) L1
Egood (A)
Z
k X
(i−1)
(i) +
p(Tt U )|p(V ) − p(Tt V )|dU dV
(D.5) ≤ sup
) − Pr(A − ξ )
Pr(A − ξ
F (δ)c
A⊂M(Rk ) i=1 L1
k
L1
where M(R ) denotes the canonical Borel algebra on
Rk .
To bound (D.5), we need a way of bounding
| PrL1 (A) − PrL1 (A + tei )| for a value t with |t| ≤ δ.
In this case, we say that we perturb a single entry
(p, q) := (pi , qi ) of U V T by t while fixing the remaining
k − 1 entries. We claim that there exists a mapping
Tt : Rn×d → Rn×d (we defer the construction to the
end of the proof) for which the following three properties hold:
Egood (A)
(D.7)
+O
1
d
+ dc .
n2
Using (D.6),
(D.8)
kU k2F − kTt U k2F |p(U ) − p(Tt U )| = p(U ) · 1 − exp
.
2
Notice that kU k2F ∼ χ2 (nd), where χ2 (nd) denotes the
χ-squared distribution with nd degrees of freedom, and
so by a tail bound for the χ2 -distribution [31, Lemma 1],
kU k2F ≤ 6nd − t0 (recall that t0 = 1/n4 ) with probability
at least 1 − e−nd ≥ 1 − n−3 . When this happens, for
good U it follows from the triangle inequality, the second
property of Tt above, and the fact that t0 = 1/n4 that
kU k2 − kTt U k2 = (kU k + kTt U k) kU k − kTt U k
√
≤ (2kU k+kU −Tt U k) kU −Tt U k ≤ 2 6nd·t0 ≤ 6n−3 .
With probability 1, the matrix Ṽ := (Vq1 , Vq2 , . . . , Vqd )
has rank d. Hence we can solve Ui0 − Ui uniquely, and
kUi0 − Ui k ≤
δ
k∆xk
≤
.
σd (Ṽ )
σd (Ṽ )
Using |1 − e|x| | ≤ 2|x| for |x| < 1 and combining with
(D.8), we have
Using the tail bound on the minimum
√ singular value
given in [42], we have Pr{σd (ṽ) ≤ / d} ≤ C + cd ,
where C > 0 and 0 < c < 1 are absolute constants.
Choosing = n−4 and recalling that d ≤ n, we see that
with probability at least 1 − Cn−4 − cd , it holds that
σd (Ṽ ) ≥ 1/n9/2 and thus
|p(U ) − p(Tt U )| ≤ p(U ) · 12n−3 .
kUi0 − Ui k ≤ n9/2 δ = n9/2+1/4 γ ≤ n−5 .
(D.9)
Similarly it holds that kV k2F ≤ 6nd−t0 with probability This proves property 2 of this case. To show property
≥ 1 − n−3 and when this happens,
3 in this case, we can show something stronger, which is
T−t ◦ Tt = id, namely, this step is invertible. Indeed, if
−3
|p(V ) − p(Tt V )| ≤ p(V ) · 12n .
we replace ∆x with −∆x in (D.13) the solution Ui0 − Ui
It then follows that
will be the opposite sign as well.
Case 1b. Suppose that the q-th column contains s ≤ d
(D.10)
Z
entries read. Similarly to Case 1a we have Ui0 = Ui and
1
|p(U ) − p(Tt U )|p(V )dU dV = O
,
kVi0 − Vi k ≤ n−5 with probability ≥ 1 − Cn−4 − cd . The
n3
Egood (A)
invertibility is similar as in Case 1a and therefore holds.
(D.11)
Case 2. Suppose that there are more than d entries
Z
1
read in both the p-th row and the q-th column. Define
p(Tt U )|p(V ) − p(Tt V )|dU dV = O
3
n
Egood (A)
J = {i ∈ [n] : i-th column has ≤ d entries contained in S}
Plugging (D.10) and (D.11) into (D.7) yields that
Cr = {i ∈ [n] : (r, i) ∈ S}, Rc = {i ∈ [n] : (i, c) ∈ S}
Call the columns with index in J good columns and
Pr(A) − Pr(A + tei ) = O 1 + dcd ,
(D.12)
L1
L1
n2
those with index in J c bad columns.
Note that |J c | ≤ d since the total number of entries
which, combined with (D.5), (D.2), and (D.3), finally
in S is at most d2 . Take the columns in J c ∩ Cp , and
notice that q ∈ J c ∩ Cp . As in Case 1, we can change
c
Up to Up0 such that U 0 V T agrees with the perturbed
dT V (L1 , L2 ) ≤ dT V (L1 , L2 (G)) + 2 Pr{F (δ) }
entry of U V T at (p, q) and keeps the entries of U V T
= O k(n−2 + dcd ) .
the same for all (p, q 0 ) for q 0 ∈ (J c ∩ Cp ) \ {q}, since
Construction of Tt . Now we construct Tt . Sup- |(J c ∩ Cp ) \ {q}| ≤ d − 1.
pose the entry to be perturbed is (p, q).
However, this new choice of U 0 possibly causes
0
Case 1a. Suppose that the p-th row contains s ≤ d (U V T )p,i 6= (U V T )p,i for i ∈ Cp ∩ J. For each
entries read, say at columns q1 , . . . , qs . Without loss of i ∈ Cp ∩ J, we also need to change Vi to a vector Vi0
generality, we can assume that s = d, as we can preserve without affecting the entries read in any bad column.
the entries in S, perturbing one of them, and preserve Now for each good column, the matrix Ũ used in Case
d − s arbitrary additional entries in the p-th row.
1 applied to each i in Cp ∩ J is no longer i.i.d. Gaussian
Then we have (Ui denotes the i-th row of U and Vi because one row of Ṽ has been changed, and this change
the i-th column of V )
has `2 norm at most n−5 (the guarantee on property
2 in Case 1) with probability at least 1 − 4n−3 − cd ,
Ui Vq1 Vq2 · · · Vqd = x
and since the minimum singular value is a 1-Lipschitz
0
function of matrices, the minimum singular value is
and we want to construct Ui such that
perturbed by at most n−5 . Hence for each good column
Ui0 Vq1 Vq2 · · · Vqd = x + ∆x
i, with probability at least 1 − 4n−3 − cd , we have
kVi − Vi0 k ≤ n−5 . Since there are at most d good
Hence property 1 automatically holds and
columns, by a union bound, with probability at least
(D.13)
(Ui0 − Ui ) Vq1 Vq2 · · · Vqd = ∆x
1 − 4/n2 − dcd we have kV − V 0 kF ≤ n−4 . This
concludes the proof of properties 1 and 2 in Case i.e.,
 


2.
0
Up1
The final step is to verify property 3. Suppose
 .. 
 .. 
.
 . 
that Tt (U, V ) = (U 0 , V 0 ) and T−t (U 0 , V 0 ) = (U 00 , V 00 ),
 
 00

00
00
 


we want to show that U = U and V = V . Observe
 Up  (Vi − Vi ) = 0 ,
0
00
c
.


.
that Vi = (V )i = (V )i for i ∈ J ∩ Cp = {q1 , . . . , qd },
 .. 
 .. 
we have
0
Upd
(Up0 − Up ) Vq1 Vq2 · · · Vqd = ∆x
whence it follows immediately that Vi00 = Vi provided
00
0
that (UpT1 , . . . , UpTd ) is invertible. Together with Vi =
(Up − Up ) Vq1 Vq2 · · · Vqd = −∆x
Vi0 = Vi00 for all i ∈ Cp ∩ J c , we conclude that V 00 = V .
Adding the two equation yields
E Proof of Theorem 4.1
(U 00 − Up ) Vq1 Vq2 · · · Vq = 0,
p
d
Up00
and thus
= Up provided that (Vq1 , Vq2 , . . . , Vqd ) is
invertible. Since Ui = Ui0 = Ui00 for all i 6= p, we have
U = U 00 . Next we show that Vi00 = Vi for each i ∈ Cp ∩J.
Suppose Ri = {p1 , . . . , pd } 3 p. Similarly to the above
we have that

 0 

Up1
0

 .. 

..

 . 

.

 0 0

 Up  (Vi − Vi ) = −hUp0 − Up , Vi i





 . 

..

 .. 

.
0
0
Upd
We say that two cycles σ = ({i}, {j}) and τ =
({i0 }, {j 0 }) are (m1 , m2 )-disjoint if |i∆i0 | = 2m1 and
|j∆j 0 | = 2m2 , denoted by |σ∆τ | = (m1 , m2 ).
Proof. Let X = U ΣV be the SVD of X. Let G and
H be random matrices with i.i.d. N (0, 1) entries. By
rotational invariance, GXH T is identically distributed
as GΣH T . Let k = n1−2/p and X̃ be the upper-left k×k
block of GΛH T =: X̃. It is clear that
X̃s,t =
n
X
σi Gi,s Hi,t
i=1
and



Up001
0


 .. 
..


 . 
.


 00  00
 Up  (Vi − Vi0 ) = −hUp00 − Up0 , Vi0 i .






 . 
..


 .. 
.
00
0
Upd

Up00
= Up
Adding the two equations and recalling that
and Ui = Ui0 = Ui00 for all i 6= p, we obtain that




0
U p1
 .. 


..
 . 



 00
 0 .

 Up  (Vi − Vi ) + Up − Up  (Vi0 − Vi )




 . 


..
 .. 


.
Upd
0


0


..


.
 0

0

= hUp − Up , Vi − Vi i
,


..


.
0
Define
Y =
1 X
X̃σ .
|C|
σ∈C
Suppose that σ = ({is }, {js }) then
X̃σ =
q
Y
X
σ `s σ m s
`1 ,...,`q s=1
m1 ,...,mq
q
Y
G`s ,is H`s ,js Gms ,is+1 Hms ,js
s=1
P 2q
It is easy to see that EX̃σ =
σi = kAkpp (all {`s }
and {ms } are the same) and thus EY = kXkpp . Now we
compute EY 2 .
EY 2 =
q
q
1 X X
|C|2 m =0 m =0
1
2
X
E(X̃σ X̃τ ).
σ,τ ∈C
|σ∆τ |=(m1 ,m2 )
Suppose that |σ∆τ | = (m1 , m2 ),
X Yq
E(X̃σ X̃τ ) =
σ`i σmi σ`0i σm0i ·
`1 ,...,`q
`01 ,...,`0q
m1 ,...,mq
m01 ,...,m0q
E
nYq
i=1
o
G`s ,is Gms ,is+1 G`0s ,i0s Gm0s ,i0s+1 ·
s=1
nYq
o
E
H`s ,js Hms ,js H`0s ,js0 Hm0s ,js0 .
s=1
Algorithm 2 The sketching algorithm for odd p ≥ 3.
Input: n, > 0, odd integer p ≥ 3 and PSD A ∈ Rn×n
E(X̃σ X̃τ ) .m,p
1: N ← Ω(−2 )

1−2/p
2(2q−2m1 −1)
2(m2 +1)
4(m1−m2 )
× n matrices with
kAk2
kAk2(m2 +1) kAk4
, m2 ≤ m1 ≤ q−1 2: Let {Gi } be independent n




i.i.d. N (0, 1) entries.
2(2q−2m2 −1)
2(m1 +1)
4(m2−m1 )

kAk2(m1 +1) kAk4
, m1 ≤ m2 ≤ q−1

kAk2
3: Maintain each Gi XGT
i , i = 1, . . . , N
4(q−m2 −1)
4(m2 +1)
kAk4
kAk2(m2 +1) ,
m1 = q, m2 ≤ q−1 4: Compute Z as defined in (F.14)


4(q−m1 −1)
4(m +1)


kAk4
kAk2(m11 +1) ,
m2 = q, m1 ≤ q−1 5: return Z



4q
m1 = m 2 = q
kAk2q ,
It is not difficult to see that (see Appendix G for details)
2
By Hölder’s inequality, kAk22 ≤ n1− p kAk2p and
2(m+1)
kAk2(m+1) ≤ n1−
2(m+1)
p
dent on p only. Since
kAk2(m+1)
p
Z=
Thus,
N
1 X
Yi ,
N i=1
then Yi ’s are i.i.d. copies of Y . It follows that
 p−m −m −2
2p
1
2
n
kAk
,
m
,
m
≤
q
−
1

1
2
p


EX = EY = kAkpp
nq−m1 −1 kAk2p ,
p2 = q, p1 ≤ q − 1
p
E(X̃σ X̃τ ) .m,p

p1 = q, p2 ≤ q − 1 and
nq−m2 −1 kAk2p

p ,


2p
EY 2
1
V ar(Y )
m1 = m2 = q.
kAkp ,
≤
= 2 kAkp2p .
V ar(X) =
N
N
4
There are Oq (k q+m1 k q+m2 ) = Op (k p+m1 +m2 ) pairs of
Finally, by Chebyshev inequality,
(m1 , m2 )-disjoint cycles and |C| = Θ(k p ),
V ar(X)
1
X
1
Pr X − kAkpp > kAkpp ≤
2p ≤ 4 ,
E(A
A
)
σ
τ
2
kAkp
|C|2
σ,τ ∈C
|σ∆τ |=(m1 ,m2 )
1
|C|2
np−m1 −m2 −2
≤ Cm1 +m2 ,p p−m1 −m2 kAk2p
p
k
2p
≤ Cm1 +m2 ,p kAkp , m1 , m2 ≤ q − 1,
X
E(Aσ Aτ )
σ,τ ∈C
|σ∆τ |=(m1 ,m2 )
nq−m1 −1
kAk2p
p
k q−m1
p
≤ m2 = q, m1 ≤ q − 1,
≤ Cm1 ,p kAk2p
p ,
2
X
1
E(Aσ Aτ )
2
|C|
≤ Cm1 +m2 ,p
F
Algorithms for Odd p
Given integers k and p < k, call a sequence (i1 , . . . , ip )
a cycle if ij ∈ [k] for all j and ij1 6= ij2 for all j1 6= j2 .
On a k × k matrix A, each cycle σ defines a product
Aσ =
p
Y
Aσi ,σi+1
i=1
≤ Cm1 +m2 ,p
where we adopt the convention that ip+1 = i1 . Let C
denote the set of cycles. Call two cycles σ, τ ∈ C kdisjoint if |σ∆τ | = 2k, where σ and τ are viewed as
multisets.
σ,τ ∈C
|σ∆τ |=(m1 ,m2 )
Theorem F.1. With probability ≥ 3/4, the output X
returned by Algorithm 2 satisfies (1 − )kAkpp ≤ X ≤
(1 + )kAkpp when A is positive semi-definite. The
algorithm is a bilinear sketch with r·s = Op (−2 n2−4/p ).
σ,τ ∈C
|σ∆τ |=(m1 ,m2 )
1
|C|2
which implies the correctness of the algorithm. It is easy
to observe that the algorithm only reads the upper-left
k × k block of each Gi XGTi , thus it can be maintained
in O(N k 2 ) = Op (−2 n2−4/p ) space.
np−2m2 −2
kAk2p
p
k q−m2
p
≤ m1 = q, m2 ≤ q − 1,
≤ Cm1 ,p kAk2p
p ,
2
X
E(Aσ Aτ ) ≤ kAk2p
m1 + m2 = p.
p ,
Therefore EY 2 ≤ Cp kAk2p
p for some constant Cp depen- Proof. Since X is symmetric it can be written as
X = OΛOT , where Λ is a diagonal matrix and O an
orthogonal matrix. Let G be a random matrix with
i.i.d. N (0, 1) entries. By rotational invariance, GXGT
is identically distributed as GΛGT . Let k = n1−2/p and
X̃ be the upper-left k × k block of GΛGT := X̃. It is
clear that
n
X
X̃s,t =
λi Gi,s Gi,t
i=1
Define
Y =
1 X
X̃σ .
C
σ∈C
1
|C|2
n
X
X̃σ =
λj1 · · · λjp
j1 ,...,jp =1
then Yi ’s are i.i.d. copies of Y . It follows that
Gj` ,i` Gj` ,i`+1
X
EX = EY = kAkpp
E(X̃σ X̃τ ).
2(p−m−1)
2(m+1)
kAk2(m+1)
2
By Hölder’s inequality, kAk22 ≤ n1− p kAk2p and
2(m+1)
p
kAk2(m+1)
,
p
2(m + 1) ≤ p
Thus,
E(X̃σ X̃τ ) .m,p
V ar(Y )
EY 2
1
≤
= 2 kAkp2p .
N
N
4
Finally, by Chebyshev inequality,
σ,τ ∈C
σ,τ are m-disjoint
E(X̃σ X̃τ ) .m,p kAk2
2(m+1)
N
N
1 X 1 X
1 X
(Gi XGTi )σ =:
Yi ,
N i=1 |C|
N i=1
σ∈C
It is not difficult to see that (see Appendix G for details),
when m ≤ p − 2, it holds for m-disjoint σ and τ that
kAk2(m+1) ≤ n1−
Z=
`=1
2
m = p.
σ,τ ∈C
|σ∆τ |=2m
(F.14)
P p
and
It is easy to see that EX̃σ =
λi = kAkpp (all j` ’s are
the same) and thus EY = kXkpp . Now we compute EY 2 .
p
1 X
EY =
|C|2 m=0
E(Aσ Aτ ) ≤ kAk2p
p ,
Therefore EY 2 ≤ Cp kAk2p
p for some constant Cp
dependent on p only. Hence if we take multiple copies
of this distribution as stated in Algorithm 2 and define
Suppose that σ = (i1 , . . . , ip ) then
p
Y
X
(
np−m−2 kAk2p
p , 2(m + 1) ≤ p
k p−m−1 kAk2p
p , otherwise.
V ar(X) =
V ar(X)
1
≤ ,
Pr X − kAkpp > kAkpp ≤
4
2 kAk2p
p
which implies the correctness of the algorithm. It is easy
to observe that the algorithm only reads the upper-left
k × k block of each Gi XGTi , thus it can be maintained
in O(N k 2 ) = Op (−2 n2−4/p ) space.
G
Omitted Details in the Proof of Theorem F.1
Suppose that σ = (i1 , . . . , ip ) and τ = (j1 , . . . , jp ) are
m-disjoint. Then
X
E(X̃σ X̃τ ) =
λ`1 · · · λ`p λ`01 · · · λ`0p ·
`1 ,...,`p
`01 ,...,`0p
And when σ, τ are (p − 1)- and p-disjoint,
E(Aσ Aτ ) ≤ kAk2p
p .
(
E
p
Y
)
G`s ,is G`s ,is+1 G`0s ,js G`0s ,js+1
.
s=1
There are Op (k p+m ) pairs of m-disjoint cycles and |C| = For the expectation to be non-zero, we must have
Θ(k p ),
each appeared entry repeated an even number of times.
Hence, if some is appears only once among the index
X
{is } and {js }, it must hold that `s = `s+1 . Thus for
1
np−m−2
E(Aσ Aτ ) ≤ Cm,p p−m kAk2p
p
each of the summation term the indices {`s } breaks
|C|2
k
σ,τ ∈C
into a few blocks, in each of which all `s takes the same
|σ∆τ |=2m
value, the same holds for {`0s }. We also need to piece
p
≤ Cm,p kAk2p
,
m
≤
−
1,
the blocks of {`s } which those of {`0s }. Hence the whole
p
2
sum breaks into sums corresponding to different block
X
1
k p−m−1
E(Aσ Aτ ) ≤ Cm,p p−m kAk2p
configurations. For a certain kind of configuration,
p
2
|C|
k
σ,τ ∈C
in which `1 , . . . , `w are free variables with multiplicity
|σ∆τ |=2m
r1 , . . . , rw respectively, the sum is bounded by
p
X
− 1 < m ≤ p − 2,
≤ Cm,p kAk2p
p ,
2
C·
λr`11 · · · λr`ww ≤ CkAkrr11 · · · kAkrrw
w
X
1
1
`1 ,...,`w
2p
E(Aσ Aτ ) ≤ kAkp , m = p − 1
|C|2
k
σ,τ ∈C
|σ∆τ |=2m
where the constant C depends on the configuration
only, and thus can be made on m and p by taking
the maximum constant over all possible block configurations.
P Notice that in a configuration, all rw ’s are
even,
rw = 2p and w ≤ p − m. Note that
s+1
kAkrr kAkss ≤ kAkr−1
r−1 kAks+1 ,
kAkr+s
r+s
≤
r>s
kAkrr kAkss ,
p ∈ (2, ∞), and Ip = Jp = 1 when p = 2.
Denote by D0 the distribution of X and by D1 that
of Y . Suppose that the k linear forms are a1 , . . . , ak .
Choose X from D0 with probability 1/2 and from D1
with probability 1/2. Let b ∈ {0, 1} indicate that
X ∼ Db . Without loss of generality, we can assume
that a1 , . . . , ak are orthonormal. Hence (a1 , . . . , ak ) ∼
N (0, Ik ) when b = 0 and (a1 , . . . , ak ) ∼ Dn,k when
b = 1. Then with probability 3/4 − o(1) we have that
it is easy to see that the worst case of configuration is
w = p − m and (r1 , . . . , rw ) = (2(m + 1), 2 . . . , 2), giving (1 − cp )((Ip + o(1))n ≤ Y ≤ (1 + cp )((Ip + o(1))n, b = 0
the bound
(1 − cp )((Jp + o(1))n ≤ Y ≤ (1 + cp )((Jp + o(1))n, b = 1
2(p−m−1)
CkAk2
2(m+1)
kAk2(m+1) .
so we can recover b from Y with probability 1 −
|Ip −Jp |
Finally, observe that the number of configurations is a o(1), provided that cp ≤ 2(Ip +Jp ) . Consider the
constant depends on p and m only, giving the variance event E that the algorithm’s output indicates b =
1.
Then Pr(E|X ∼ D0 ) ≤ 1/4 + o(1) while
claim.
Pr(E|X ∼ D1 ) ≥ 3/4 − o(1). By definition of total variation distance, dT V (Dn,k , N (0, Ik )) is at least
H Proof of Theorem 5.4
|Pr(E|X
∼ D0 ) − Pr(E|X ∼ D1 )|, which is at least
p
p
Proof. First we show that kXkp and kY kp differ by a 1
+
o(1).
On the other hand, √by Theorem 5.3,
p
constant factor. We have calculated kXkp = (Ip +o(1))n 2
d
(D
,
N
(0,
Ik )) ≤ 14 when k ≤ c n for some c > 0.
T
V
n,k
in (3.3). Similarly, it holds that
We meet
√ a contradiction. Therefore it must hold that
kY kpp = (Jp + o(1)) n,
k > c n.
where
(H.15)
p
Z
b
p
x2
Jp = 2 2
a
p
(b − x)(x − a)
dx,
πx
a, b =
3 √
∓ 2.
2
Extend the definition of Ip and Jp to p = 0 by Ip = 1
and Jp = 1/2. This agrees with kXk0 = rank(X). Now
it suffices to show that Ip 6= Jp when p 6= 2. Indeed, let
us consider the general integral
√
√
Z (1+√β)2 p
((1 + β)2 − x)(x − (1 − β)2 )
γ
x
dx
√
2πβx
(1− β)2
√
Change the variable y = (x − (1 + β))/ β, the integral
above becomes
Z 2 p
p
1
( βy + (1 + β))γ−1 4 − y 2 dy.
2π −2
Hence
1
Jp − Ip =
2π
Z
2
p
fp (x) 4 − x2 dx,
−2
where
√
p
p
fp (x) = ( 2x + 3) 2 −1 − (x + 2) 2 −1 .
One can verify that fp < 0 on (−2, 2) when 0 < p < 2,
fp > 0 on (−2, 2) when p > 2 and fp = 0 when p = 2.
It is easy to compute that I2 = 1. The case p = 0 is
trivial. Therefore, Ip > Jp for p ∈ [0, 2), Ip < Jp for
```