This is the second in a series of posts explaining a result on hypergraph sparsification that uses Talagrand’s work on Gaussian processes.
In the previous post we talked about Gaussian and sub-Gaussian processes and generic chaining.
In this post we talk about the Spielman-Srivastava probabilistic construction of graph sparsifiers. Their analysis requires a bound on the largest eigenvalue of a certain random matrix, that can be derived from matrix Chernoff bounds.
We will then make our life harder and we will also derive an analysis of the Spielman-Srivastava construction by casting the largest eigenvalue of that random matrix as the sup of a sub-Gaussian process, and then we will apply the machinery from the previous post.
This will be more complicated than it needs to be, but the payoff will be that, as will be shown in the next post, this more complicated proof will also apply, with some changes, to the setting of hypergraphs.
1. Graph Sparsifiers
For a graph having vertex set , if is a subset of vertices, we denote by the number of edges that cross the cut , that is, that have one endpoint in and one endpoint outside . If is a weighted graph, is the sum of the weights of such edges.
A graph is a cut sparsifier of with error at most if and have the same vertex set and, for every , we have
that is, if all cuts in and are the same, up to a multiplicative error . (We use the convention that .) This definition was introduced by Benczur and Karger who showed that every graph has a cut sparsifier with at most edges, where is the number of vertices, and that such a sparsifier can be constructed by an efficient probabilistic algorithm.
This is a useful construction because if one wants to run on an algorithm that approximately solves a problem that depends on the cut structure of (for example min cut, min st-cut, max flow, or an algorithm for a clustering problem), then one may alternatively run such algorithm on , and an approximate solution for will also be an approximate solution for . The advantage is that one runs the algorithm on a graph that has fewer edges, and so one needs less time and less memory to run the algorithm.
The approach of Benczur and Karger is to assign to each edge of a probability , in a certain careful way that we will not describe. Then they generate by doing the following independently for every edge : with probability we put the edge in , and give it weight , and with probability we do not put the edge in . For every cut , we have that is a random variable with expectation , and Benczur and Karger are able to use Chernoff bounds and a union bound to show that there is a setting of those probabilities such that it is likely that all cuts will be preserved with multiplicative error at and such that . The union bound has to be done very carefully and, in particular, one has to use the fact that there can be few sparse cuts.
Spielman and Teng introduced the stronger definition of spectral sparsifier: according to their definition, a graph is a spectral sparsifier of with error at most if, for every vector we have
where is the Laplacian matrix of and, as before, we take the convention that . This is a stronger condition because, if is the indicator vector of a set , then , so the Benczur-Karger condition is equivalent to the special case of the Spielman-Teng condition in which we only quantify over Boolean vectors .
Spielman and Teng gave an efficient construction of spectral sparsifiers with edges.
Later, Spielman and Srivastava reduced the number of edges to , with a proof similar to that of Benczur and Karger’s: they attribute a probability to each edge, and then sample each edge with probability , weighing it if selected. The Laplacian of the resulting weighted graph satisfies , and so one has to study the concentration of the random matrix around its average, which is doable using matrix Chernoff bounds.
More recently, Batson, Spielman and Srivastava gave a construction of spectral sparsifiers with edges. Their construction is deterministic, and it proceeds by choosing one edge at a time, in a process that is driven by a certain potential function. Allen-Zhu, Liao and Orecchia have presented an interpretation of Batson-Spielman-Srivastava as an online optimization game played using a Follow-the-Regularized-Leader strategy. As my sequence of posts on online optimization will continue after the current hiatus, I plan to present the results of Allen-Zhu, Liao and Orecchia.
2. The Spielman-Srivastava Construction
We will assume that the graph is connected, otherwise we can apply the construction below to each connected component.
In the Spielman-Srivastava construction, we assign a probability to each edge, and then we want to say that the event
holds with high probability, where is the random graph obtained by sampling each edge of independently with probability , and weighing it if selected, and is the Laplacian of .
Actually, Spielman and Srivastava sample from a slightly different distribution. Namely they repeatedly sample from a distribution on all edges, in which edge has probability proportional to , because this distribution is easier to analyze with matrix Chernoff bounds, but we will proceed to analyze the distribution described in the above paragraph.
Matrix Chernoff bounds give upper bounds to the probability that a sum of independent random matrices deviates in operator norm from its average, and take the form of
where the are independent random matrices. If the matrices are Hermitian, the most general case is given by the matrix Bernstein bound
is the “variance” of , and is an upper bound such that
holds with probability one.
where is the Laplacian matrix of the edge . That is, if , then is the rank-1 matrix whose quadratic form is .
So we can write
where is a random variable that is equal to with probability and to with probability .
Where is the set of vectors that are orthogonal to . If we apply the change of variable , which is bijective on , the previous event becomes
which is implied by (and actually equivalent to):
Looking at the matrix Bernstein bound, we want to choose the so that, say,
holds with probability 1.
The variance term (7) is the spectral norm of
where we computed and we used the fact that if is a rank-1 real-valued symmetric matrix then .
So we have
If we set
The term is the effective resistance of the edge and it is usually denoted as . It is known that for connected graphs. Thus we have
3. Analysing the Construction as a Sub-Gaussian Process
Now we would like to provide an analysis of the Spielman-Srivastava construction in terms of bounding the sup of a sub-Gaussian process via the Talagrand comparison inequality. Indeed we can think of our goal as showing that the following two random variables are both with high probability:
where the are the random variables defined in the previous section. Unfortunately, while random processes that involve weighted sums of Rademacher random variables are well suited to such an analysis, random processes that involve highly biased Boolean random variables do not work very well in this framework.
To avoid this problem, we will think of performing the Spielman-Srivastava sampling in phases, such that each phase involves unbiased Boolean random variables.
To avoid having to deal with too many constants, we will think of setting , with a goal of achieving sparsification with error . We will want to think of the process of sampling an edge as a sequence of unbiased Boolean choices, so it will be convenient to round up probabilities to non-positive powers of 2. So we will set edge probabilities such that for some integer and such that
If we let , we will think of the process of sampling as proceeding in rounds. If then, in each of the last round (that is, in round through ), we choose with probability 1/2 to delete the edge and with probability 1/2 to double its weight.
(Why do we do it in the last rounds and not in the first rounds? This issue confused me a lot at some point. Hold on to this question until later.)
Let us call the graph obtained after round , so that and . We see that
Let us now understand the quadratic form in each of the above term. If we let be the weight of edge in graph , we have
Regarding the weights, if we consider an edge such that , we have that the edge is left untouched in round if , that is, if . In that case, . If , then is equally likely to be or . In other words, where is a Rademacher random variable.
Putting everything together, we have
and now we are in good shape because Rademacher sums are well suited to be analyzed as sub-Gaussian processes. In particular, we will able to prove the following lemma.
Lemma 1 There are bounds such that for every outcome of the first steps that satisfies , we have
Applying the lemma inductively, we see that we have probability at least that
as desired, and it remains to observe that in every undirected graph every edge has effective resistance at least so that .
It will be convenient to do a change of variables and write
To lighten up the notation in our subsequent arguments, it will be convenient to give names to the matrices that we obtain after this change of basis, and call
In this notation, we have
and recall that we assumed
With this notation, the quantity in (9) that we want to bound becomes
is a centered random process.
Define the centered Gaussian process
where are independent standard normal random variables. Then we immediately see that both and are -dominated by , because
In order to deduce the lemma from the Talagrand comparison inequality it suffices to show that
and to have a bound on the diameter
To bound the average supremum of the Gaussian process we could use generic chaining, but fortunately there is a a matrix Chernoff bound that say that if are real-valued symmetric matrices and are independent standard normal random variables then
Applied to our setting,
Now we can return to the question about the order in which we process the edges. By starting from the lowest-probability edges, we are also starting from the edges for which is smallest. When we bound
and the sum is over the edges that are processed at round , it is convenient to be able to say that has a round-dependent upper bound. Indeed, if is processed at round , then is either zero or , so that , and the terms add up to when summed over rounds. If we had proceeded in the opposite direction, we would have only been able to bound as an absolute constant times , and we would have lost a factor of the order of the number of rounds in the analysis.
The diameter is bounded by analogous considerations. The square of the diameter
and we have
Which gives the desired bound
Now the Talagrand comparison inequality gives us the lemma, and hence the analysis of the Spielman-Srivastava construction.
4. A Final Comment
There was a point that confused me for a while about this argument. Namely, we are not able to study
by showing that
is dominated by a Gaussian process, because involves biased Boolean random variables which yield poor bounds when we try to dominate them by a Gaussian distribution. Instead we write
and then we show how to dominate each term on the right-hand side by a Gaussian process. But then won’t the sum of those Gaussian processes dominate , which was not supposed to be possible?
But the point is that in the analysis of we throw away the low-probability case that, in the previous processes, we ended with . These low-probability events that we remove from consideration are enough to cut off the problematic tail of the discrete distribution that does not have a good Gaussian domination.
To get a better sense of what is happening, suppose that we are looking at the real-valued random variable
where each is a independent random variable that is equal to 1 with probability and equal to zero with probability . Suppose that is a negative power of two and that , let’s say .
We would like to say that there is a such that
Also, we have made a vow to only bound the tail of discrete random variables by showing that they are sub-Gaussian and then using the tail of the dominating Gaussian.
If we try to argue about the sub-Gaussianity of , we are in trouble, because there is probability that , so that . This is problematic because, in a Gaussian distribution, a deviation that occurs with probability can only be times the standard deviation, so the standard deviation has to be and a deviation that is achieved with probability is of the order at least which is much more than the that we were hoping for.
The problem is that a Gaussian distribution with standard deviation of the order of dominates our distribution in the regime we are interested in, but not in all regimes.
To overcome this problem, we write
and the random variables are constructed in the following way: , and is equally likely to be either 0 or . In this way, and .
For every choice of the , we can write
where the are independent Rademacher random variable. Over the randomness of the , the random variable has a sub-Gaussian distribution dominated by a Gaussian of variance
Now, without breaking our vow, we can inductively get a high-probability estimate that for each
while maintaining the invariant that, say, for each . When we sum up the error bounds, the whole sum is of the order of the last term, which is .