Spielman-Srivastava Sparsification à la Talagrand

This is the second in a series of posts explaining a result on hypergraph sparsification that uses Talagrand’s work on Gaussian processes.

In the previous post we talked about Gaussian and sub-Gaussian processes and generic chaining.

In this post we talk about the Spielman-Srivastava probabilistic construction of graph sparsifiers. Their analysis requires a bound on the largest eigenvalue of a certain random matrix, which can be derived from matrix Chernoff bounds.

We will then make our life harder and derive another analysis of the Spielman-Srivastava construction, by casting the largest eigenvalue of that random matrix as the sup of a sub-Gaussian process and then applying the machinery from the previous post.

This will be more complicated than it needs to be, but the payoff will be that, as will be shown in the next post, this more complicated proof will also apply, with some changes, to the setting of hypergraphs.

1. Graph Sparsifiers

For a graph {G} having vertex set {V}, if {S\subseteq V} is a subset of vertices, we denote by {cut_G(S)} the number of edges that cross the cut {(S,V-S)}, that is, that have one endpoint in {S} and one endpoint outside {S}. If {G} is a weighted graph, {cut_G(S)} is the sum of the weights of such edges.

A graph {G'} is a cut sparsifier of {G} with error at most {\epsilon} if {G} and {G'} have the same vertex set {V} and, for every {S\subseteq V}, we have

\displaystyle  1-\epsilon \leq \frac{cut_G(S)}{cut_{G'} (S) } \leq 1+\epsilon

that is, if all cuts in {G} and {G'} are the same, up to a multiplicative error {1\pm \epsilon}. (We use the convention that {0/0 = 1}.) This definition was introduced by Benczur and Karger who showed that every graph has a cut sparsifier with at most {O(\epsilon^{-2} n \log n)} edges, where {n} is the number of vertices, and that such a sparsifier can be constructed by an efficient probabilistic algorithm.

This is a useful construction because if one wants to run on {G} an algorithm that approximately solves a problem that depends on the cut structure of {G} (for example min cut, min st-cut, max flow, or an algorithm for a clustering problem), then one may alternatively run such algorithm on {G'}, and an approximate solution for {G'} will also be an approximate solution for {G}. The advantage is that one runs the algorithm on a graph that has fewer edges, and so one needs less time and less memory to run the algorithm.

The approach of Benczur and Karger is to assign to each edge {e} of {G} a probability {p_e}, chosen in a certain careful way that we will not describe. Then they generate {G'} by doing the following independently for every edge {e}: with probability {p_e} we put the edge {e} in {G'}, and give it weight {1/p_e}, and with probability {1-p_e} we do not put the edge {e} in {G'}. For every cut {S}, we have that {cut_{G'}(S)} is a random variable with expectation {cut_G(S)}, and Benczur and Karger are able to use Chernoff bounds and a union bound to show that there is a setting of those probabilities such that it is likely that all cuts will be preserved with multiplicative error {1\pm \epsilon} and such that {\sum_e p_e = O(\epsilon^{-2} n \log n)}. The union bound has to be done very carefully and, in particular, one has to use the fact that there can be only few sparse cuts.
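In code, the generic sample-and-reweight step looks as follows. This is only a minimal sketch with made-up helper names: the careful choice of the probabilities {p_e}, which is the heart of the Benczur-Karger result, is not shown, and every kept edge simply gets weight {1/p_e} so that each cut is preserved in expectation.

```python
import random


def sparsify_by_sampling(edges, probs, rng=random.Random(0)):
    """Keep each edge e independently with probability probs[e]; a kept
    edge gets weight 1/probs[e], so each cut value is preserved in
    expectation.  (How to choose probs[e] is the hard part, not shown.)"""
    sparsifier = {}
    for e, p in zip(edges, probs):
        if rng.random() < p:
            sparsifier[e] = 1.0 / p
    return sparsifier


def cut_value(weighted_edges, S):
    """Sum of the weights of edges with exactly one endpoint in S."""
    return sum(w for (u, v), w in weighted_edges.items()
               if (u in S) != (v in S))
```

For instance, on a triangle with all probabilities equal to 1 the "sparsifier" is the graph itself, and with smaller probabilities the cut values are correct on average over many runs.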

Spielman and Teng introduced the stronger definition of spectral sparsifier: according to their definition, a graph {G'} is a spectral sparsifier of {G} with error at most {\epsilon} if, for every vector {x\in {\mathbb R}^V} we have

\displaystyle  1-\epsilon \leq \frac{ x^T L_G x}{x^T L_{G'} x} \leq 1+ \epsilon

where {L_G} is the Laplacian matrix of {G} and, as before, we take the convention that {0/0 = 1}. This is a stronger condition because, if {x} is the indicator vector of a set {S}, then {x^T L_G x = cut_G(S)}, so the Benczur-Karger condition is equivalent to the special case of the Spielman-Teng condition in which we only quantify over Boolean vectors {x\in \{ 0,1\}^V}.

Spielman and Teng gave an efficient construction of spectral sparsifiers with {O(\epsilon^{-2} n \log^{O(1)} n)} edges.

Later, Spielman and Srivastava reduced the number of edges to {O(\epsilon^{-2} n \log n)}, with a proof similar to that of Benczur and Karger: they assign a probability {p_e} to each edge, and then sample each edge {e} with probability {p_e}, weighing it {1/p_e} if selected. The Laplacian {L_{G'}} of the resulting weighted graph satisfies {\mathop{\mathbb E} L_{G'} = L_G}, and so one has to study the concentration of the random matrix {L_{G'}} around its average, which is doable using matrix Chernoff bounds.

More recently, Batson, Spielman and Srivastava gave a construction of spectral sparsifiers with {O(\epsilon^{-2} n)} edges. Their construction is deterministic, and it proceeds by choosing one edge at a time, in a process that is driven by a certain potential function. Allen-Zhu, Liao and Orecchia have presented an interpretation of Batson-Spielman-Srivastava as an online optimization game played using a Follow-the-Regularized-Leader strategy. As my sequence of posts on online optimization will continue after the current hiatus, I plan to present the results of Allen-Zhu, Liao and Orecchia.

2. The Spielman-Srivastava Construction

We will assume that the graph {G} is connected, otherwise we can apply the construction below to each connected component.

In the Spielman-Srivastava construction, we assign a probability {p_e} to each edge, and then we want to say that the event

\displaystyle   \forall x \in {\mathbb R}^n: \ \ \ - \epsilon x^T L_G x \leq x^T L_{G'} x - x^T L_G x \leq \epsilon x^T L_G x \ \ \ \ \ (1)

holds with high probability, where {G'} is the random graph obtained by sampling each edge {e} of {G} independently with probability {p_e}, and weighing it {1/p_e} if selected, and {L_{G'}} is the Laplacian of {G'}.

Actually, Spielman and Srivastava sample from a slightly different distribution: they repeatedly sample from a distribution over all edges, in which edge {e} has probability proportional to {p_e}, because that distribution is easier to analyze with matrix Chernoff bounds. We will instead analyze the distribution described in the previous paragraph.

Matrix Chernoff bounds give upper bounds to the probability that a sum of independent random matrices deviates in operator norm from its average, and take the form of

\displaystyle   \Pr \left[ \left\| \sum_e (M_e - \mathop{\mathbb E} M_e ) \right\| \geq \ell \right] \leq \cdots \ \ \ \ \ (2)

where the {M_e} are independent random matrices. If the matrices are {n\times n} Hermitian, a general bound of this form is given by the matrix Bernstein inequality

\displaystyle   \Pr \left[ \left\| \sum_e (M_e - \mathop{\mathbb E} M_e ) \right\| \geq \ell \right] \leq 2n \cdot \exp \left( \frac{-\ell^2}{2\sigma^2 + \frac 23 B\ell} \right) \ \ \ \ \ (3)

where
\displaystyle  \sigma^2 := \left\| \sum_e \mathop{\mathbb E} (M_e - \mathop{\mathbb E} M_e )^2 \right\|

is the “variance” of {\sum_e M_e}, and {B} is an upper bound such that

\displaystyle  \forall e\ \ ||M_e|| \leq B

holds with probability one.

It remains to reformulate (1) in a form like (2). We first note that

\displaystyle  L_G = \sum_{e \in E} L_e

where {L_e} is the Laplacian matrix of the edge {e}. That is, if {e = (u,v)}, then {L_e} is the rank-1 matrix whose quadratic form is {x^T L_e x = (x_u - x_v)^2}.
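This decomposition is easy to check numerically; here is a small sketch (helper names are made up) that builds {L_e} as the rank-1 matrix {(e_u - e_v)(e_u - e_v)^T} and {L_G} as the sum of edge Laplacians.

```python
import numpy as np


def edge_laplacian(n, u, v):
    """Rank-1 Laplacian of the single edge (u, v): (e_u - e_v)(e_u - e_v)^T,
    whose quadratic form is x^T L_e x = (x_u - x_v)^2."""
    b = np.zeros(n)
    b[u], b[v] = 1.0, -1.0
    return np.outer(b, b)


def graph_laplacian(n, edges):
    """L_G as the sum of the edge Laplacians."""
    return sum(edge_laplacian(n, u, v) for u, v in edges)
```

As a sanity check, each {L_e} has rank 1, the quadratic form is the squared difference of the two endpoint coordinates, and the all-ones vector is in the kernel of {L_G}.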

So we can write

\displaystyle  x^T L_{G'} x - x^T L_G x = \sum_e z_e \ x^T L_e x

where {z_e} is a random variable that is equal to {1/p_e - 1} with probability {p_e} and to {-1} with probability {1-p_e}.

We also note that all the terms in (1) are invariant under shifts in {x}, so we can rewrite the event (1) as

\displaystyle  \forall x\in {\mathbb R}^n_\perp : \ \ - \epsilon x^T L_G x \leq \sum_e z_e x^T L_e x \leq \epsilon x^T L_G x \ \ \ \ \ (4)

where {{\mathbb R}^n _\perp} is the set of vectors {x\in {\mathbb R}^n} that are orthogonal to {(1\cdots 1)}. If we apply the change of variable {y = L_G^{1/2} x}, which is bijective on {{\mathbb R}^n_\perp}, the previous event becomes

\displaystyle  \forall y\in {\mathbb R}^n_\perp : \ \ - \epsilon ||y||^2 \leq \sum_e z_e y^T L_{G}^{-1/2} L_e L^{-1/2}_G y \leq \epsilon||y||^2 \ \ \ \ \ (5)

which is implied by (and actually equivalent to):

\displaystyle  \left\| \sum_e z_e L_{G}^{-1/2} L_e L^{-1/2}_G \right \| \leq \epsilon \ \ \ \ \ (6)

Looking at the matrix Bernstein bound, we want to choose the {p_e} so that, say,

\displaystyle   \left\| \sum_e \mathop{\mathbb E} (z_e L_{G}^{-1/2} L_e L^{-1/2}_G )^2\right \| < \frac {\epsilon^2}{4\log n} \ \ \ \ \ (7)

and also so that

\displaystyle   \forall e \ \ \ ||z_e L_{G}^{-1/2} L_e L^{-1/2}_G|| < \frac{\epsilon}{4\log n} \ \ \ \ \ (8)

holds with probability 1.

The variance term (7) is the spectral norm of

\displaystyle  \sum_e ( \mathop{\mathbb E} z_e^2 ) \cdot ( L_{G}^{-1/2} L_e L^{-1/2}_G )^2

\displaystyle  = \sum_e \left( \frac 1 {p_e} - 1 \right) \cdot ( L_{G}^{-1/2} L_e L^{-1/2}_G )^2

\displaystyle  = \sum_e \left( \frac 1 {p_e} - 1 \right) \cdot || L_{G}^{-1/2} L_e L^{-1/2}_G || \cdot L_{G}^{-1/2} L_e L^{-1/2}_G

where we computed {\mathop{\mathbb E} z_e^2} and we used the fact that if {M} is a rank-1 positive semidefinite matrix then {M^2 = ||M|| \cdot M}.

So we have

\displaystyle  \sum_e ( \mathop{\mathbb E} z_e^2 ) \cdot ( L_{G}^{-1/2} L_e L^{-1/2}_G )^2

\displaystyle  \preceq \left( \max_e \left( \frac 1 {p_e} - 1 \right) \cdot || L_{G}^{-1/2} L_e L^{-1/2}_G || \right) \cdot \sum_e L_{G}^{-1/2} L_e L^{-1/2}_G

\displaystyle  \preceq \left( \max_e \left( \frac 1 {p_e} - 1 \right) \cdot || L_{G}^{-1/2} L_e L^{-1/2}_G || \right) \cdot I

If we set

\displaystyle  p_e = \min \left\{ 4 \frac{\log n}{\epsilon^2} ||L_{G}^{-1/2} L_e L^{-1/2}_G || , 1 \right\}

then we see that we satisfy both (7) and (8).

The term { || L_{G}^{-1/2} L_e L^{-1/2}_G ||} is the effective resistance of the edge {e} and it is usually denoted by {R_e}. It is known that {\sum_e R_e = n-1} for connected graphs. Thus we have

\displaystyle  \sum_e p_e = O(\epsilon^{-2} n \log n)

as promised.
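The identity {\sum_e R_e = n-1} can be verified numerically. The sketch below (hypothetical helper) computes {R_e} as {b_e^T L_G^+ b_e} using the Moore-Penrose pseudoinverse, which agrees with {|| L_{G}^{-1/2} L_e L^{-1/2}_G ||} because {L_e = b_e b_e^T} has rank 1.

```python
import numpy as np


def effective_resistances(n, edges):
    """R_e = b_e^T L_G^+ b_e for each edge e = (u, v), where
    b_e = e_u - e_v and L_G^+ is the pseudoinverse of the Laplacian."""
    L = np.zeros((n, n))
    B = []
    for u, v in edges:
        b = np.zeros(n)
        b[u], b[v] = 1.0, -1.0
        B.append(b)
        L += np.outer(b, b)
    Lp = np.linalg.pinv(L)
    return [b @ Lp @ b for b in B]
```

On the triangle, each edge has effective resistance {2/3} (one unit resistor in parallel with a path of two), and the three resistances sum to {n-1 = 2}.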

3. Analysing the Construction as a Sub-Gaussian Process

Now we would like to provide an analysis of the Spielman-Srivastava construction in terms of bounding the sup of a sub-Gaussian process via the Talagrand comparison inequality. Indeed we can think of our goal as showing that the following two random variables are both {\leq \epsilon} with high probability:

\displaystyle  \sup_{x: x^T L_G x = 1 } \ \ \sum_e z_e x^T L_e x

\displaystyle  \sup_{x: x^T L_G x = 1 } \ \ \sum_e -z_e x^T L_e x

where the {z_e} are the random variables defined in the previous section. Unfortunately, while random processes that involve weighted sums of Rademacher random variables are well suited to such an analysis, random processes that involve highly biased Boolean random variables do not work very well in this framework.

To avoid this problem, we will think of performing the Spielman-Srivastava sampling in phases, such that each phase involves unbiased Boolean random variables.

To avoid having to deal with too many constants, we will think of setting {p_e = \min \{ 1, \epsilon^{-2} R_e \log n \}}, with a goal of achieving sparsification with error {O(\epsilon)}. We will want to think of the process of sampling an edge as a sequence of unbiased Boolean choices, so it will be convenient to round up each probability to a power of two of the form {2^{-j}} with {j \geq 0}. So we will set edge probabilities such that {p_e = 2^{-j_e}} for some integer {j_e \geq 0} and such that

\displaystyle  \min \{ 1, \epsilon^{-2} R_e \log n \} \leq p_e \leq \min \{ 1, 2 \epsilon^{-2} R_e \log n \}

If we let {\min_e p_e = 2^{-\ell}}, we will think of the process of sampling {G'} as proceeding in {\ell} rounds. If {p_e = 2^{-j_e}} then, in each of the last {j_e} rounds (that is, in rounds {\ell -j_e +1} through {\ell}), we choose with probability 1/2 to delete the edge {e} and with probability 1/2 to double its weight.

(Why do we do it in the last {j_e} rounds and not in the first {j_e} rounds? This issue confused me a lot at some point. Hold on to this question until later.)
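A sketch of this phased sampling as plain code (a toy illustration; `j[e]` plays the role of {j_e}): an edge survives all of its {j_e} coin flips with probability {2^{-j_e} = p_e}, and a surviving edge ends with weight {2^{j_e} = 1/p_e}, so the final distribution matches the sampling described earlier.

```python
import random


def phased_sample(j, ell, rng):
    """One run of the phased sampling for edges with p_e = 2^{-j[e]}.
    Every edge starts with weight 1; in each of the last j[e] rounds
    (rounds ell - j[e] + 1 through ell) its weight is, with probability
    1/2 each, either set to 0 or doubled."""
    w = {e: 1.0 for e in j}
    for k in range(1, ell + 1):
        for e, je in j.items():
            if k >= ell - je + 1:
                w[e] = 0.0 if rng.random() < 0.5 else 2.0 * w[e]
    return w
```

For an edge with {j_e = 2} and {\ell = 3}, the final weight is {4 = 1/p_e} with probability {1/4} and {0} otherwise.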

Let us call {G^{(k)}} the graph obtained after round {k}, so that {G^{(0)} = G} and {G^{(\ell)} = G'}. We see that

\displaystyle  \sup_{x: x^T L_G x = 1 } \ \ \left | \sum_e z_e x^T L_e x \right| \leq \sum_{k=1}^\ell \sup_{x: x^T L_G x = 1 } \ \ \left | x^T (L_{G^{(k)}} - L_{G^{(k-1)}} ) x \right|

Let us now understand the quadratic form in each of the above term. If we let {w^{(k)}_e} be the weight of edge {e} in graph {G^{(k)}}, we have

\displaystyle  x^T (L_{G^{(k)}} - L_{G^{(k-1)}} ) x = \sum_e (w^{(k)}_e - w^{(k-1)}_e ) \ x^T L_e x

Regarding the weights, if we consider an edge {e} such that {p_e = 2^{-j_e}}, we have that the edge {e} is left untouched in round {k} if {\ell - j_e +1 > k}, that is, if {j_e \leq \ell -k}. In that case, {w^{(k)}_e - w^{(k-1)}_e=0}. If {j_e \geq \ell -k+1}, then {w^{(k)}_e} is equally likely to be {0} or { 2w^{(k-1)}_e}. In other words, {w^{(k)}_e - w^{(k-1)}_e = r_e^{(k)} w^{(k-1)}_e} where {r_e^{(k)}} is a Rademacher random variable.

Putting everything together, we have

\displaystyle  x^T (L_{G^{(k)}} - L_{G^{(k-1)}} ) x = \sum_{e : p_e \leq 2^{k-1-\ell}} \ r_e^{(k)} w^{(k-1)}_e \ x^T L_e x

and now we are in good shape because Rademacher sums are well suited to be analyzed as sub-Gaussian processes. In particular, we will be able to prove the following lemma.

Lemma 1 There are bounds {\epsilon_k = O(\epsilon \cdot 2^{(k-\ell)/2})} such that for every outcome of the first {k-1} steps that satisfies {L_{G^{(k-1)}} \preceq 2 L_G}, we have

\displaystyle  \Pr \left[ \sup_{x: x^T L_G x = 1 } \ \ \left | x^T (L_{G^{(k)}} - L_{G^{(k-1)}} ) x \right| > \epsilon_k \right] \leq \frac 1n

Applying the lemma inductively, we see that we have probability at least {1-\ell/n} that

\displaystyle  \sup_{x: x^T L_G x = 1 } \ \ \left | \sum_e z_e x^T L_e x \right| \leq \sum_{k=1}^\ell \epsilon_k \leq O(\epsilon)

as desired, and it remains to observe that in every undirected graph every edge has effective resistance at least {1/n^{O(1)}} so that {\ell = O(\log n)}.

It will be convenient to do a change of variables and write

\displaystyle  \sup_{x: x^T L_G x = 1 } \ \ \left | x^T (L_{G^{(k)}} - L_{G^{(k-1)}} ) x \right|

\displaystyle  = \sup_{x: x^T L_G x = 1 } \left | \sum_{e : p_e \leq 2^{k-1-\ell}} \ r_e^{(k)} w^{(k-1)}_e \ x^T L_e x \right |

\displaystyle   = \sup_{y: ||y||=1 } \left | \sum_{e : p_e \leq 2^{k-1-\ell}} \ r_e^{(k)} w^{(k-1)}_e \ y^T L_G^{-1/2} L_e L_G^{-1/2} y \right | \ \ \ \ \ (9)

To lighten up the notation in our subsequent arguments, it will be convenient to give names to the matrices that we obtain after this change of basis, and call

\displaystyle  M_e := L_G^{-1/2} L_e L_G^{-1/2} \ \ \ \ \ (10)

\displaystyle  M := L_G^{-1/2} L_G L_G^{-1/2} = \sum_e M_e \preceq I \ \ \ \ \ (11)

\displaystyle  M^{(k)} := L_G^{-1/2} L_{G^{(k)}} L_G^{-1/2}= \sum_e w^{(k)} _e M_e \ \ \ \ \ (12)

In this notation, we have

\displaystyle  \min \{ 1, \epsilon^{-2} ||M_e|| \log n \} \leq p_e \leq \min \{ 1, 2 \epsilon^{-2} ||M_e|| \log n \} \ \ \ \ \ (13)

and recall that we assumed

\displaystyle  M^{(k-1)} \preceq 2 M \preceq 2I \ \ \ \ \ (14)

With this notation, the quantity in (9) that we want to bound becomes

\displaystyle  \sup_{y: ||y||=1 } | F(y) |

where

\displaystyle  F(y) = \sum_{e : p_e \leq 2^{k-1-\ell}} \ r_e^{(k)} w^{(k-1)}_e \ y^T M_e y

is a centered random process.

Define the centered Gaussian process

\displaystyle  \hat F(y) := \sum_{e : p_e \leq 2^{k-1-\ell}} \ g_e^{(k)} w^{(k-1)}_e \ y^T M_e y

where the {g_e^{(k)}} are independent standard normal random variables. Then we immediately see that both {-F(\cdot)} and {F(\cdot)} are {O(1)}-dominated by {\hat F(\cdot)}, because

\displaystyle  || F(x) - F(y) ||^2_{\Psi_2} = O(1) \cdot \sum_{e : p_e \leq 2^{k-1-\ell}} \left( w^{(k-1)}_e \ (x^T M_e x - y^T M_e y) \right)^2

and

\displaystyle  (d(x,y))^2 = \mathop{\mathbb E} (\hat F(x) - \hat F(y) )^2 = \sum_{e : p_e \leq 2^{k-1-\ell}} \left( w^{(k-1)}_e \ (x^T M_e x - y^T M_e y) \right)^2

In order to deduce the lemma from the Talagrand comparison inequality it suffices to show that

\displaystyle  \mathop{\mathbb E} \sup_{x: ||x||=1} \hat F(x) \leq O(\epsilon_k)

and to have a bound on the diameter

\displaystyle  \sup_{x,y: ||x||=||y||=1} d(x,y) \leq O \left( \frac {\epsilon_k} {\sqrt {\log n}} \right)

To bound the average supremum of the Gaussian process we could use generic chaining, but fortunately there is a matrix Chernoff bound that says that if {\{ A_e\}_{e\in E}} are real-valued symmetric matrices and {g_e} are independent standard normal random variables then

\displaystyle  \mathop{\mathbb E} \sup _{x : ||x||=1} \sum_e g_e x^T A_e x \leq \sqrt{2\left \| \sum_e A_e^2 \right \| \log 2n}

Applied to our setting,

\displaystyle  \mathop{\mathbb E} \sup_{x: ||x||=1} \hat F(x) \leq O(\sqrt {\log n}) \cdot \sqrt{\left \| \sum_{e : p_e \leq 2^{k-1-\ell}} \left( w^{(k-1)}_e \ M_e \right)^2 \right \|}

and

\displaystyle  \sum_{e : p_e \leq 2^{k-1-\ell}} \left( w^{(k-1)}_e \ M_e \right)^2

\displaystyle  \preceq \max_{e : p_e \leq 2^{k-1-\ell}} w^{(k-1)}_e \| M_e \| \cdot \sum_{e : p_e \leq 2^{k-1-\ell}} w^{(k-1)}_e \ M_e

\displaystyle  \preceq \left( \max_{e : p_e \leq 2^{k-1-\ell}} 2^{k-1-\ell} \frac 1 {p_e} \| M_e \| \right) \cdot M^{(k-1)}

\displaystyle  \preceq \left( \max_{e : p_e \leq 2^{k-1-\ell}} 2^{k-1-\ell} \cdot \frac{\epsilon^2}{||M_e|| \log n} \cdot ||M_e|| \right) \cdot M^{(k-1)}

\displaystyle  \preceq 2^{k-1-\ell} \cdot \frac{\epsilon^2}{\log n} \cdot 2 \ I

so indeed

\displaystyle  \mathop{\mathbb E} \sup_{x: ||x||=1} \hat F(x) \leq O(\epsilon_k)
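As a quick numerical sanity check of the matrix Chernoff bound used above (not part of the proof, and with made-up helper names), one can sample the left-hand side for small random rank-1 matrices and compare against the right-hand side {\sqrt{2 ||\sum_e A_e^2|| \log 2n}}.

```python
import numpy as np


def lambda_max_sample(As, rng):
    """One sample of sup_{||x||=1} sum_e g_e x^T A_e x, which for
    symmetric A_e is the largest eigenvalue of sum_e g_e A_e."""
    g = rng.standard_normal(len(As))
    return np.linalg.eigvalsh(sum(ge * A for ge, A in zip(g, As)))[-1]


def series_bound(As):
    """The right-hand side sqrt(2 ||sum_e A_e^2|| log 2n)."""
    n = As[0].shape[0]
    V = sum(A @ A for A in As)
    return np.sqrt(2 * np.linalg.norm(V, 2) * np.log(2 * n))
```

Averaging many samples of the supremum for a few random rank-1 pieces stays below the bound, as it should.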

Now we can return to the question about the order in which we process the edges. By starting from the lowest-probability edges, we are also starting from the edges for which {||M_e||} is smallest. When we bound

\displaystyle  \left\| \sum_e (w_e^{(k-1)} M_e)^2 \right\| \leq \left( \max_e w_e^{(k-1)} ||M_e|| \right) \cdot \left\| \sum_e w_e^{(k-1)} M_e \right\|

and the sum is over the edges {e} that are processed at round {k}, it is convenient to be able to say that {w_e^{(k-1)} ||M_e||} has a round-dependent upper bound. Indeed, if {e} is processed at round {k}, then {w_e^{(k-1)}} is either zero or {2^{k-1-\ell} /p_e}, so that {w_e^{(k-1)} ||M_e|| \leq 2^{k-1-\ell} \cdot ||M_e||/p_e}, and the factors {2^{k-1-\ell}} add up to {O(1)} when summed over rounds. If we had proceeded in the opposite direction, we would have only been able to bound {w_e^{(k-1)} ||M_e||} as an absolute constant times {||M_e||/p_e}, and we would have lost a factor of the order of the number of rounds in the analysis.

The diameter is bounded by analogous considerations. The square of the diameter

\displaystyle  \sup_{x,y : ||x||=||y||=1} (d(x,y))^2

\displaystyle  = \sup_{x,y : ||x||=||y||=1} \sum_{e: p_e \leq 2^{k-1-\ell}} \left(w^{(k-1)}_e \right)^2 \left( x^T M_e x - y^TM_ey \right)^2

\displaystyle  \leq \left( \sup_{x,y,e : ||x||=||y||=1,p_e \leq 2^{k-1-\ell} } w^{(k-1)}_e | x^T M_e x - y^TM_ey| \right)

\displaystyle \cdot \left( \sup_{x,y : ||x||=||y||=1} \sum_{e: p_e \leq 2^{k-1-\ell}} w^{(k-1)}_e | x^T M_e x - y^TM_ey | \right)

and we have

\displaystyle  \sup_{x,y,e : ||x||=||y||=1,p_e \leq 2^{k-1-\ell} } w^{(k-1)}_e | x^T M_e x - y^TM_ey |

\displaystyle  \leq 2^{k-\ell-1} \frac{\epsilon^2}{||M_e|| \log n} 2 ||M_e||

\displaystyle  \leq O\left ( \frac {\epsilon_k^2}{\log n} \right)


and

\displaystyle  \sup_{x,y : ||x||=||y||=1} \sum_{e: p_e \leq 2^{k-1-\ell}} w^{(k-1)}_e | x^T M_e x - y^TM_ey |

\displaystyle  \leq \sup_{x,y : ||x||=||y||=1} \sum_{e: p_e \leq 2^{k-1-\ell}} w^{(k-1)}_e (x^T M_e x + y^TM_ey )

\displaystyle  \leq \sup_{x,y: ||x||=||y||=1} x^T M^{(k-1)} x + y^T M^{(k-1)} y

\displaystyle  \leq 4

which gives the desired bound

\displaystyle  \sup_{x,y : ||x||=||y||=1} d(x,y) = O(\epsilon_k / \sqrt{\log n} )

Now the Talagrand comparison inequality gives us the lemma, and hence the analysis of the Spielman-Srivastava construction.

4. A Final Comment

There was a point that confused me for a while about this argument. Namely, we are not able to study

\displaystyle  \sup_{x : x^T L_G x=1} x^T(L_G - L_{G'}) x

by showing that

\displaystyle  x^T(L_G - L_{G'}) x

is dominated by a Gaussian process, because {L_{G'}} involves biased Boolean random variables which yield poor bounds when we try to dominate them by a Gaussian distribution. Instead we write

\displaystyle  x^T(L_G - L_{G'}) x = \sum_k x^T(L_{G^{(k-1)}} - L_{G^{(k)}}) x

and then we show how to dominate each term on the right-hand side by a Gaussian process. But then won’t the sum of those Gaussian processes dominate {x^T(L_G - L_{G'}) x}, which was not supposed to be possible?

But the point is that in the analysis of { x^T(L_{G^{(k-1)}} - L_{G^{(k)}}) x} we throw away the low-probability case that the previous rounds ended with {L_{G^{(k-1)}} \not\preceq 2 L_G}. These low-probability events that we remove from consideration are enough to cut off the problematic tail of the discrete distribution that does not have a good Gaussian domination.

To get a better sense of what is happening, suppose that we are looking at the real-valued random variable

\displaystyle  S = \sum_{i=1}^n s_i

where each {s_i} is an independent random variable that is equal to 1 with probability {p} and equal to zero with probability {1-p}. Suppose that {p = 2^{-\ell}} is a negative power of two and that {pn \geq n^{\Omega(1)}}, let's say {pn > \sqrt n}.

We would like to say that there is a {t = O(\sqrt{np \log n})} such that

\displaystyle  \Pr [ S -\mathop{\mathbb E} S > t ] \leq \frac 1{n^{\Omega(1)}}

Also, we have made a vow to only bound the tail of discrete random variables by showing that they are sub-Gaussian and then using the tail of the dominating Gaussian.

If we try to argue about the sub-Gaussianity of {S-\mathop{\mathbb E} S}, we are in trouble, because there is probability {p^n} that {S=n}, so that {S-\mathop{\mathbb E} S \geq \frac n2}. This is problematic because, in a Gaussian distribution, a deviation that occurs with probability {p^n} can only be {O(\sqrt {n \log 1/p})} times the standard deviation, so the standard deviation has to be {\Omega (\sqrt{ n / \log 1/p})}, and a deviation that is achieved with probability {1/n^{\Theta(1)}} is then of order at least {\Omega \left( \sqrt {\frac{n\log n} {\log 1/p}} \right)}, which is much more than the {O(\sqrt{np \log n})} that we were hoping for.

The problem is that a Gaussian distribution with standard deviation of the order of {\sqrt {pn}} dominates our distribution in the regime we are interested in, but not in all regimes.

To overcome this problem, we write

\displaystyle  S - \mathop{\mathbb E} S = \sum_{k=1}^\ell S^{(k)} - S^{(k-1)}


\displaystyle  S^{(k)} = \sum_i s^{(k)}_i

and the random variables {s^{(k)}_i} are constructed in the following way: {s_i^{(0)} = p}, and {s^{(k)}_i} is equally likely to be either 0 or {2 s_i^{(k-1)}}. In this way, {S^{(0)} = \mathop{\mathbb E} S} and {S^{(\ell)} = S}.

For every choice of the {s^{(k-1)}_i}, we can write

\displaystyle  S^{(k)} - S^{(k-1)} = \sum_i r^{(k)}_i s^{(k-1)}_i

where the {r^{(k)}_i } are independent Rademacher random variables. Over the randomness of the {r^{(k)}_i }, the random variable {S^{(k)} - S^{(k-1)} } has a sub-Gaussian distribution dominated by a Gaussian of variance

\displaystyle  \sum_i \left( s^{(k-1)}_i \right)^2 \leq 2^k p \cdot \sum_i s^{(k-1)}_i

Now, without breaking our vow, we can inductively get a high-probability estimate that for each {k}

\displaystyle  S^{(k)} - S^{(k-1)} \leq O(\sqrt{ 2^k p^2 n \log n})

while maintaining the invariant that, say, { \sum_i s^{(k-1)}_i \leq 2pn} for each {k}. When we sum up the error bounds, the whole sum is of the order of the last term, which is {O(\sqrt {pn \log n})}.
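This decomposition is easy to simulate. The toy sketch below (hypothetical helper) checks that the rounds interpolate between {S^{(0)} = \mathop{\mathbb E} S} and {S^{(\ell)} = S}: starting from {s_i^{(0)} = p}, each round independently zeroes out or doubles every coordinate with probability 1/2 each, and the final sum is distributed as a sum of {n} Bernoulli({p}) variables.

```python
import random


def phased_scalar(n, ell, rng):
    """The decomposition of S for p = 2^{-ell}: start from s_i^{(0)} = p
    and, in each of ell rounds, independently zero out or double every
    coordinate with probability 1/2 each.  Returns S^{(0)}, ..., S^{(ell)}."""
    s = [2.0 ** (-ell)] * n
    sums = [sum(s)]
    for _ in range(ell):
        s = [0.0 if rng.random() < 0.5 else 2.0 * si for si in s]
        sums.append(sum(s))
    return sums
```

For {n = 64} and {\ell = 3}, every run starts at {S^{(0)} = pn = 8}, ends at an integer (each surviving coordinate has been doubled back up to 1), and the final sums average to {pn} over many runs.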

5 thoughts on “Spielman-Srivastava Sparsification à la Talagrand”

  1. Hi Luca. I think you want L_e instead of L_G at the beginning of Section 3.

    Rudelson’s original proof of concentration for i.i.d. sums of rank one matrices (applied by Spielman-Srivastava) actually used a chaining argument (see here: https://arxiv.org/abs/math/9608208).

    His passage to Rademacher sum is similar to yours, but perhaps more elegant (one of those symmetry tricks). If X is a random variable, and X,X' are i.i.d. copies of X, then \mathbb{E} \|X-\mathbb{E} X\| \leq \mathbb{E} \|X-X'\|. See the first sequence of inequalities in Section 2, where his random variables {y_i} are sampled from: Choose y_e = L_G^{-1/2} z_e/\sqrt{p_e} with probability p_e=R_e/(n-1).

  2. About your “Final comment:” What’s going on here, I think, is that you are not measuring deviations in a convex way, which means that you can select a “band” of the tail on which your random variable is not dominated by the corresponding Gaussian. If you are willing to measure using only convex tests, you still get domination once you symmetrize.

    Here is Lemma 6.3 from the Ledoux-Talagrand book:

    Suppose F : \mathbb{R}_+ \to \mathbb{R}_+ is convex, \{X_i\} is a sequence of independent, mean 0 random variables taking values in a normed space, and \{\varepsilon_i\} are independent random signs. Then:
    \mathbb E F\left(\frac12 \| \sum_i \varepsilon_i X_i \|\right) \leq \mathbb E F\left(\|\sum_i X_i\|\right) \leq \mathbb E F\left(2 \|\sum_i \varepsilon_i X_i\|\right).

    In this case, your norm is the operator norm A \mapsto \max_{x^T L_G x=1} x^T A x.

  3. Can’t edit comments on wordpress… there should be a norm inside the first F. Note that when you try to use a test F(y)=y^q to detect an event of measure p^n, you have to take q large, and then the factor 2^q kills you.
