This is the second in a series of posts explaining a result on hypergraph sparsification that uses Talagrand’s work on Gaussian processes.
In the previous post we talked about Gaussian and sub-Gaussian processes and generic chaining.
In this post we talk about the Spielman-Srivastava probabilistic construction of graph sparsifiers. Their analysis requires a bound on the largest eigenvalue of a certain random matrix, which can be derived from matrix Chernoff bounds.
We will then make our life harder: we will also derive an analysis of the Spielman-Srivastava construction by casting the largest eigenvalue of that random matrix as the sup of a sub-Gaussian process, and then applying the machinery from the previous post.
This will be more complicated than it needs to be, but the payoff will be that, as will be shown in the next post, this more complicated proof will also apply, with some changes, to the setting of hypergraphs.
1. Graph Sparsifiers
For a graph $G$ having vertex set $V$, if $S \subseteq V$ is a subset of vertices, we denote by $cut_G(S)$ the number of edges that cross the cut $(S, V \setminus S)$, that is, that have one endpoint in $S$ and one endpoint outside $S$. If $G$ is a weighted graph, $cut_G(S)$ is the sum of the weights of such edges.
A graph $G'$ is a cut sparsifier of $G$ with error at most $\epsilon$ if $G$ and $G'$ have the same vertex set $V$ and, for every $S \subseteq V$, we have

$$ cut_{G'}(S) = (1 \pm \epsilon) \cdot cut_G(S) $$

that is, if all cuts in $G$ and $G'$ are the same, up to a multiplicative error $1+\epsilon$. (We use the convention that $a = (1\pm\epsilon)\cdot b$ means $(1-\epsilon)\cdot b \leq a \leq (1+\epsilon)\cdot b$.) This definition was introduced by Benczur and Karger who showed that every graph has a cut sparsifier with at most $O(\epsilon^{-2} n \log n)$ edges, where $n$ is the number of vertices, and that such a sparsifier can be constructed by an efficient probabilistic algorithm.
This is a useful construction because if one wants to run an algorithm that approximately solves a problem that depends on the cut structure of $G$ (for example min cut, min st-cut, max flow, or an algorithm for a clustering problem), then one may alternatively run such an algorithm on $G'$, and an approximate solution for $G'$ will also be an approximate solution for $G$. The advantage is that one runs the algorithm on a graph that has fewer edges, and so one needs less time and less memory to run the algorithm.
The approach of Benczur and Karger is to assign to each edge $e$ of $G$ a probability $p_e$, in a certain careful way that we will not describe. Then they generate $G'$ by doing the following independently for every edge $e$: with probability $p_e$ we put the edge $e$ in $G'$, and give it weight $\frac 1{p_e}$, and with probability $1-p_e$ we do not put the edge $e$ in $G'$. For every cut $S$, we have that $cut_{G'}(S)$ is a random variable with expectation $cut_G(S)$, and Benczur and Karger are able to use Chernoff bounds and a union bound to show that there is a setting of those probabilities such that it is likely that all cuts will be preserved with multiplicative error at most $\epsilon$ and such that ${\mathbb E}[ \mbox{number of edges of } G' ] = O(\epsilon^{-2} n \log n)$. The union bound has to be done very carefully and, in particular, one has to use the fact that there can be few sparse cuts.
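To make the sampling step concrete, here is a minimal Python sketch of the reweighted edge sampling (with arbitrary placeholder probabilities, not the carefully chosen Benczur-Karger ones), checking that each cut is preserved in expectation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_sparsifier(edges, probs, rng):
    """Keep each edge independently with probability p_e, reweighted by 1/p_e."""
    kept = []
    for (u, v), p in zip(edges, probs):
        if rng.random() < p:
            kept.append((u, v, 1.0 / p))
    return kept

def cut_value(weighted_edges, S):
    """Total weight of edges with exactly one endpoint in the vertex set S."""
    return sum(w for u, v, w in weighted_edges if (u in S) != (v in S))

# Toy example: a 6-cycle, with arbitrary placeholder probabilities
# (NOT the carefully chosen Benczur-Karger probabilities).
n = 6
edges = [(i, (i + 1) % n) for i in range(n)]
probs = [0.5] * len(edges)
S = {0, 1, 2}

true_cut = cut_value([(u, v, 1.0) for u, v in edges], S)
estimates = [cut_value(sample_sparsifier(edges, probs, rng), S) for _ in range(20000)]
print(true_cut, np.mean(estimates))  # the sampled cut matches the true cut in expectation
```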
Spielman and Teng introduced the stronger definition of spectral sparsifier: according to their definition, a graph $G'$ is a spectral sparsifier of $G$ with error at most $\epsilon$ if, for every vector $x \in {\mathbb R}^V$, we have

$$ x^T L_{G'}\, x = (1 \pm \epsilon)\, x^T L_G\, x $$

where $L_G$ is the Laplacian matrix of $G$ and, as before, we take the convention that $a = (1\pm\epsilon)\cdot b$ means $(1-\epsilon)\cdot b \leq a \leq (1+\epsilon)\cdot b$. This is a stronger condition because, if $x = {\bf 1}_S$ is the indicator vector of a set $S$, then $x^T L_G\, x = cut_G(S)$, so the Benczur-Karger condition is equivalent to the special case of the Spielman-Teng condition in which we only quantify over Boolean vectors $x \in \{0,1\}^V$.
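The identity $x^T L_G\, x = cut_G(S)$ for the indicator vector $x = {\bf 1}_S$ is easy to check numerically; here is a small Python sketch on a toy graph (the graph is arbitrary, just for illustration).

```python
import numpy as np

def laplacian(n, edges):
    """Laplacian of an unweighted graph on vertices 0, ..., n-1."""
    L = np.zeros((n, n))
    for u, v in edges:
        L[u, u] += 1; L[v, v] += 1
        L[u, v] -= 1; L[v, u] -= 1
    return L

n = 5
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (1, 3)]
S = {0, 1}
x = np.zeros(n)
x[list(S)] = 1.0                              # indicator vector of S
cut = sum((u in S) != (v in S) for u, v in edges)
print(x @ laplacian(n, edges) @ x, cut)       # both equal the number of crossing edges
```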
Spielman and Teng gave an efficient construction of spectral sparsifiers with $O(\epsilon^{-2}\, n\, {\rm polylog}\, n)$ edges.
Later, Spielman and Srivastava reduced the number of edges to $O(\epsilon^{-2} n \log n)$, with a proof similar to that of Benczur and Karger's: they assign a probability $p_e$ to each edge, and then sample each edge $e$ with probability $p_e$, weighing it $\frac 1{p_e}$ if selected. The Laplacian $L_{G'}$ of the resulting weighted graph satisfies ${\mathbb E}[ L_{G'} ] = L_G$, and so one has to study the concentration of the random matrix $L_{G'}$ around its average, which is doable using matrix Chernoff bounds.
More recently, Batson, Spielman and Srivastava gave a construction of spectral sparsifiers with $O(\epsilon^{-2} n)$ edges. Their construction is deterministic, and it proceeds by choosing one edge at a time, in a process that is driven by a certain potential function. Allen-Zhu, Liao and Orecchia have presented an interpretation of Batson-Spielman-Srivastava as an online optimization game played using a Follow-the-Regularized-Leader strategy. As my sequence of posts on online optimization will continue after the current hiatus, I plan to present the results of Allen-Zhu, Liao and Orecchia.
2. The Spielman-Srivastava Construction
We will assume that the graph $G$ is connected; otherwise, we can apply the construction below to each connected component.
In the Spielman-Srivastava construction, we assign a probability $p_e$ to each edge, and then we want to say that the event

$$ \forall x \in {\mathbb R}^V : \ \ x^T L_{G'}\, x = (1 \pm \epsilon)\, x^T L_G\, x \ \ \ \ \ \ (1) $$

holds with high probability, where $G'$ is the random graph obtained by sampling each edge $e$ of $G$ independently with probability $p_e$, and weighing it $\frac 1{p_e}$ if selected, and $L_{G'}$ is the Laplacian of $G'$.
Actually, Spielman and Srivastava sample from a slightly different distribution. Namely, they repeatedly sample from a distribution over all edges, in which each edge $e$ has probability proportional to $p_e$, because this distribution is easier to analyze with matrix Chernoff bounds, but we will proceed to analyze the distribution described in the above paragraph.
Matrix Chernoff bounds give upper bounds to the probability that a sum of independent random matrices deviates in operator norm from its average, and take the form of

$$ \Pr \left[ \left\| \sum_i X_i - {\mathbb E} \sum_i X_i \right\| \geq t \right] \leq (\mbox{something small}) \ \ \ \ \ \ (2) $$

where the $X_i$ are independent random matrices. If the matrices are $n \times n$ Hermitian, the most general case is given by the matrix Bernstein bound

$$ \Pr \left[ \left\| \sum_i X_i - {\mathbb E} \sum_i X_i \right\| \geq t \right] \leq 2n \cdot e^{- \frac{t^2/2}{\sigma^2 + Mt/3}} \ \ \ \ \ \ (3) $$

where

$$ \sigma^2 = \left\| \sum_i {\mathbb E} \left( X_i - {\mathbb E} X_i \right)^2 \right\| $$

is the "variance" of $\sum_i X_i$, and $M$ is an upper bound such that

$$ \left\| X_i - {\mathbb E} X_i \right\| \leq M $$

holds with probability one.
It remains to reformulate (1) in a form like (2). We first note that

$$ L_G = \sum_{e \in E} L_e $$

where $L_e$ is the Laplacian matrix of the edge $e$. That is, if $e = \{u,v\}$, then $L_e$ is the rank-1 matrix whose quadratic form is

$$ x^T L_e\, x = (x_u - x_v)^2 $$
So we can write

$$ L_{G'} = \sum_{e \in E} X_e $$

where $X_e$ is a random variable that is equal to $\frac 1{p_e} L_e$ with probability $p_e$ and to $0$ with probability $1 - p_e$.
We also note that all the terms in (1) are invariant under shifts of $x$ in the direction of the all-ones vector ${\bf 1}$, so we can rewrite the event (1) as

$$ \forall x \in V_0 : \ \ x^T L_{G'}\, x = (1 \pm \epsilon)\, x^T L_G\, x \ \ \ \ \ \ (4) $$

where $V_0$ is the set of vectors $x$ that are orthogonal to ${\bf 1}$. If we apply the change of variable $y = L_G^{1/2} x$, which is bijective on $V_0$, the previous event becomes

$$ \forall y \in V_0 : \ \ y^T L_G^{+/2} L_{G'} L_G^{+/2}\, y = (1 \pm \epsilon)\, \| y \|^2 \ \ \ \ \ \ (5) $$

which is implied by (and actually equivalent to):

$$ \left\| L_G^{+/2} L_{G'} L_G^{+/2} - L_G^{+/2} L_G L_G^{+/2} \right\| \leq \epsilon \ \ \ \ \ \ (6) $$

where $L_G^{+/2}$ denotes the square root of the pseudoinverse of $L_G$.
Note that $L_G^{+/2} L_{G'} L_G^{+/2} = \sum_e \tilde X_e$, where the matrices $\tilde X_e := L_G^{+/2} X_e L_G^{+/2}$ are independent, and that $L_G^{+/2} L_G L_G^{+/2} = \sum_e \tilde L_e$, where $\tilde L_e := L_G^{+/2} L_e L_G^{+/2} = {\mathbb E}\, \tilde X_e$, so (6) is indeed an event of the form (2). Looking at the matrix Bernstein bound with $t = \epsilon$, we want to choose the $p_e$ so that, say,

$$ \left\| \sum_e {\mathbb E} \left( \tilde X_e - {\mathbb E}\, \tilde X_e \right)^2 \right\| \leq \frac{\epsilon^2}{c \log n} \ \ \ \ \ \ (7) $$

for a sufficiently large absolute constant $c$, and so that

$$ \left\| \tilde X_e - {\mathbb E}\, \tilde X_e \right\| \leq \frac{\epsilon}{c \log n} \ \ \ \ \ \ (8) $$

holds with probability 1; then the right-hand side of (3) is $1/{\rm poly}(n)$.
The variance term in (7) is the spectral norm of

$$ \sum_e \left( \frac 1{p_e} - 1 \right) \tilde L_e^2 = \sum_e \left( \frac 1{p_e} - 1 \right) {\rm tr}(\tilde L_e) \cdot \tilde L_e $$

where we computed

$$ {\mathbb E} \left( \tilde X_e - {\mathbb E}\, \tilde X_e \right)^2 = \frac 1{p_e} \tilde L_e^2 - \tilde L_e^2 = \left( \frac 1{p_e} - 1 \right) \tilde L_e^2 $$

and we used the fact that if $A$ is a rank-1 real-valued symmetric matrix then $A^2 = {\rm tr}(A) \cdot A$.

So we have

$$ \left\| \sum_e {\mathbb E} \left( \tilde X_e - {\mathbb E}\, \tilde X_e \right)^2 \right\| \leq \max_e \frac{{\rm tr}(\tilde L_e)}{p_e} \cdot \left\| \sum_e \tilde L_e \right\| = \max_e \frac{{\rm tr}(\tilde L_e)}{p_e} $$

because $\sum_e \tilde L_e = L_G^{+/2} L_G L_G^{+/2}$ is a projection matrix, and, similarly, $\left\| \tilde X_e - {\mathbb E}\, \tilde X_e \right\| \leq \frac{{\rm tr}(\tilde L_e)}{p_e}$ with probability 1.

If we set

$$ p_e = \min \left\{ 1 , \ \frac{c \log n}{\epsilon^2} \cdot {\rm tr}(\tilde L_e) \right\} $$

then we see that we satisfy both (7) and (8).
The term ${\rm tr}(\tilde L_e) = {\rm tr}(L_e L_G^+) = (1_u - 1_v)^T L_G^+ (1_u - 1_v)$ is the effective resistance of the edge $e$ and it is usually denoted as $R_e$. It is known that

$$ \sum_{e \in E} R_e = n - 1 $$

for connected graphs. Thus we have

$$ {\mathbb E} \left[ \mbox{number of edges of } G' \right] = \sum_e p_e \leq \frac{c \log n}{\epsilon^2} \sum_e R_e = O\left( \frac{n \log n}{\epsilon^2} \right) $$

as promised.
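Here is a minimal Python sketch of the whole construction, under the caveat that the constant $c$ below is a placeholder rather than the constant coming out of the matrix Bernstein bound: it computes the effective resistances from the pseudoinverse of the Laplacian, samples the sparsifier, and measures the spectral error as in (6).

```python
import numpy as np

rng = np.random.default_rng(1)

def laplacian(n, edges, weights=None):
    if weights is None:
        weights = [1.0] * len(edges)
    L = np.zeros((n, n))
    for (u, v), w in zip(edges, weights):
        L[u, u] += w; L[v, v] += w
        L[u, v] -= w; L[v, u] -= w
    return L

# A random connected graph: a spanning path plus random extra edges.
n = 60
edges = [(i, i + 1) for i in range(n - 1)]
edges += [(i, j) for i in range(n) for j in range(i + 2, n) if rng.random() < 0.15]

L = laplacian(n, edges)
Lpinv = np.linalg.pinv(L)

# Effective resistances R_e = (1_u - 1_v)^T L^+ (1_u - 1_v).
R = np.array([Lpinv[u, u] + Lpinv[v, v] - 2 * Lpinv[u, v] for u, v in edges])

eps, c = 0.5, 4.0                        # c is a placeholder constant, not tuned
p = np.minimum(1.0, c * np.log(n) * R / eps**2)

keep = rng.random(len(edges)) < p
kept_edges = [e for e, k in zip(edges, keep) if k]
kept_weights = list((1.0 / p)[keep])
Lprime = laplacian(n, kept_edges, kept_weights)

# Spectral error || L^{+/2} L' L^{+/2} - L^{+/2} L L^{+/2} ||.
w, V = np.linalg.eigh(L)
inv_sqrt = np.where(w > 1e-9, 1.0 / np.sqrt(np.maximum(w, 1e-12)), 0.0)
Lpinv_sqrt = V @ np.diag(inv_sqrt) @ V.T
Pi = Lpinv_sqrt @ L @ Lpinv_sqrt
err = np.linalg.norm(Lpinv_sqrt @ Lprime @ Lpinv_sqrt - Pi, 2)
print(f"kept {keep.sum()} of {len(edges)} edges, spectral error {err:.3f}")
```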
3. Analysing the Construction as a Sub-Gaussian Process
Now we would like to provide an analysis of the Spielman-Srivastava construction in terms of bounding the sup of a sub-Gaussian process via the Talagrand comparison inequality. Indeed we can think of our goal as showing that the following two random variables are both at most $\epsilon$ with high probability:

$$ \sup_{x \, : \, x^T L_G x = 1} \ \sum_e x^T \left( X_e - L_e \right) x \ \ \ \ \mbox{and} \ \ \ \ \sup_{x \, : \, x^T L_G x = 1} \ \sum_e x^T \left( L_e - X_e \right) x $$

where the $X_e$ are the random variables defined in the previous section. Unfortunately, while random processes that involve weighted sums of Rademacher random variables are well suited to such an analysis, random processes that involve highly biased Boolean random variables do not work very well in this framework.
To avoid this problem, we will think of performing the Spielman-Srivastava sampling in phases, such that each phase involves unbiased Boolean random variables.
To avoid having to deal with too many constants, we will think of setting $\epsilon$ to be a fixed constant, with a goal of achieving sparsification with error $O(\epsilon)$. We will want to think of the process of sampling an edge as a sequence of unbiased Boolean choices, so it will be convenient to round up probabilities to non-positive powers of 2. So we will set edge probabilities such that $p_e = 2^{-k_e}$ for some integer $k_e \geq 0$ and such that

$$ \min \left\{ 1, \ \frac{c \log n}{\epsilon^2} R_e \right\} \ \leq \ p_e \ \leq \ 2 \min \left\{ 1, \ \frac{c \log n}{\epsilon^2} R_e \right\} $$
If we let $\ell := \max_e k_e$, we will think of the process of sampling $G'$ as proceeding in $\ell$ rounds. If $k_e \geq 1$ then, in each of the last $k_e$ rounds (that is, in rounds $\ell - k_e + 1$ through $\ell$), we choose with probability 1/2 to delete the edge $e$ and with probability 1/2 to double its weight.
(Why do we do it in the last $k_e$ rounds and not in the first $k_e$ rounds? This issue confused me a lot at some point. Hold on to this question until later.)
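As a sanity check (not part of the argument), here is a short Python sketch verifying that $k_e$ rounds of "delete with probability 1/2, double with probability 1/2" reproduce the original distribution: the edge survives with probability $2^{-k_e} = p_e$, with weight $2^{k_e} = 1/p_e$, and its expected weight is 1.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_by_rounds(k, rng):
    """k rounds: with prob 1/2 delete the edge (weight 0), with prob 1/2 double its weight."""
    w = 1.0
    for _ in range(k):
        if rng.random() < 0.5:
            return 0.0
        w *= 2.0
    return w

k = 3                                   # so p_e = 2^{-k} = 1/8 and 1/p_e = 8
samples = np.array([sample_by_rounds(k, rng) for _ in range(100000)])
print((samples == 2.0**k).mean())       # close to 1/8: edge kept with weight 8
print(samples.mean())                   # close to 1: the expected weight is preserved
```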
Let us call $H_j$ the graph obtained after round $j$, so that $H_0 = G$ and $H_\ell = G'$. We see that

$$ x^T L_{G'}\, x - x^T L_G\, x = \sum_{j=1}^{\ell} \left( x^T L_{H_j}\, x - x^T L_{H_{j-1}}\, x \right) $$

Let us now understand the quadratic form in each of the above terms. If we let $w^{(j)}_e$ be the weight of edge $e$ in graph $H_j$, we have

$$ x^T L_{H_j}\, x - x^T L_{H_{j-1}}\, x = \sum_{e = \{u,v\}} \left( w^{(j)}_e - w^{(j-1)}_e \right) (x_u - x_v)^2 $$
Regarding the weights, if we consider an edge $e$ such that $k_e \geq 1$, we have that the edge $e$ is left untouched in round $j$ if $j \leq \ell - k_e$, that is, if round $j$ is not one of the last $k_e$ rounds. In that case, $w^{(j)}_e = w^{(j-1)}_e$. If $j \geq \ell - k_e + 1$, then $w^{(j)}_e$ is equally likely to be $0$ or $2 w^{(j-1)}_e$. In other words,

$$ w^{(j)}_e - w^{(j-1)}_e = \sigma_{e,j}\, w^{(j-1)}_e $$

where $\sigma_{e,j}$ is a Rademacher random variable.
Putting everything together, we have

$$ x^T L_{H_j}\, x - x^T L_{H_{j-1}}\, x = \sum_{e = \{u,v\} \, : \, k_e \geq \ell - j + 1} \sigma_{e,j}\, w^{(j-1)}_e\, (x_u - x_v)^2 \ \ \ \ \ \ (9) $$

and now we are in good shape because Rademacher sums are well suited to be analyzed as sub-Gaussian processes. In particular, we will be able to prove the following lemma.
Lemma 1 There are bounds $\beta_1, \ldots, \beta_\ell$ with $\sum_j \beta_j \leq O(\epsilon)$ such that for every outcome of the first $j-1$ steps that satisfies $x^T L_{H_{j-1}}\, x \leq 2\, x^T L_G\, x$ for all $x$, we have

$$ \Pr \left[ \ \sup_{x \, : \, x^T L_G x = 1} \ \left| x^T L_{H_j}\, x - x^T L_{H_{j-1}}\, x \right| > \beta_j \ \right] \ \leq \ \frac 1{{\rm poly}(n)} $$

Applying the lemma inductively, we see that we have probability at least $1 - \frac \ell{{\rm poly}(n)}$ that

$$ \forall x : \ \ x^T L_{G'}\, x = \left( 1 \pm O(\epsilon) \right) x^T L_G\, x $$

as desired, and it remains to observe that in every undirected graph every edge has effective resistance at least $\frac 1n$, so that $\ell = O(\log n)$.
It will be convenient to do a change of variables and write

$$ \sup_{x \, : \, x^T L_G x = 1} \left| x^T L_{H_j}\, x - x^T L_{H_{j-1}}\, x \right| = \sup_{y \in V_0 \, : \, \| y \| = 1} \left| y^T L_G^{+/2} \left( L_{H_j} - L_{H_{j-1}} \right) L_G^{+/2}\, y \right| $$

To lighten up the notation in our subsequent arguments, it will be convenient to give names to the matrices that we obtain after this change of basis, and call

$$ M_j := L_G^{+/2} L_{H_j} L_G^{+/2}, \ \ \ \ \ \tilde L_e := L_G^{+/2} L_e L_G^{+/2} $$

In this notation, we have

$$ M_j - M_{j-1} = \sum_{e \, : \, k_e \geq \ell - j + 1} \sigma_{e,j}\, w^{(j-1)}_e\, \tilde L_e $$

and recall that we assumed

$$ \| M_{j-1} \| \leq 2 $$

With this notation, the quantity in (9) that we want to bound becomes $\sup_{y \in V_0 \, : \, \| y \| = 1} | X_y |$, where

$$ X_y := \sum_{e \, : \, k_e \geq \ell - j + 1} \sigma_{e,j}\, w^{(j-1)}_e\, y^T \tilde L_e\, y $$

is a centered random process.
Define the centered Gaussian process

$$ Y_y := \sum_{e \, : \, k_e \geq \ell - j + 1} g_e\, w^{(j-1)}_e\, y^T \tilde L_e\, y $$

where the $g_e$ are independent standard normal random variables. Then we immediately see that both $\{ X_y \}$ and $\{ -X_y \}$ are $O(1)$-dominated by $\{ Y_y \}$, because

$$ X_y - X_{y'} = \sum_{e \, : \, k_e \geq \ell - j + 1} \sigma_{e,j}\, w^{(j-1)}_e \left( y^T \tilde L_e\, y - y'^T \tilde L_e\, y' \right) $$

is a Rademacher sum, and hence sub-Gaussian with parameter

$$ d^2(y,y') := \sum_{e \, : \, k_e \geq \ell - j + 1} \left( w^{(j-1)}_e \right)^2 \left( y^T \tilde L_e\, y - y'^T \tilde L_e\, y' \right)^2 = {\mathbb E} \left( Y_y - Y_{y'} \right)^2 $$
In order to deduce the lemma from the Talagrand comparison inequality it suffices to show that

$$ {\mathbb E} \sup_{y \in V_0 \, : \, \| y \| = 1} Y_y \ \leq \ O(\beta_j) $$

and to have a bound on the diameter

$$ {\rm diam} := \sup_{y,y'} d(y,y') \ \leq \ O\left( \frac{\beta_j}{\sqrt{\log n}} \right) $$

(the comparison inequality bounds the probability that the sup of the process exceeds a constant times ${\mathbb E} \sup_y Y_y + t \cdot {\rm diam}$ by $e^{-\Omega(t^2)}$, and we will take $t$ of the order of $\sqrt{\log n}$ to get failure probability $1/{\rm poly}(n)$).
To bound the average supremum of the Gaussian process we could use generic chaining, but fortunately there is a matrix Chernoff bound that says that if the $A_e$ are real-valued symmetric $n \times n$ matrices and the $g_e$ are independent standard normal random variables then

$$ {\mathbb E} \left\| \sum_e g_e A_e \right\| \ \leq \ O\left( \sqrt{\log n} \right) \cdot \left\| \sum_e A_e^2 \right\|^{1/2} $$

Applied to our setting,

$$ {\mathbb E} \sup_y Y_y \ \leq \ {\mathbb E} \left\| \sum_{e \, : \, k_e \geq \ell - j + 1} g_e\, w^{(j-1)}_e\, \tilde L_e \right\| \ \leq \ O\left( \sqrt{\log n} \right) \cdot \left\| \sum_{e \, : \, k_e \geq \ell - j + 1} \left( w^{(j-1)}_e \right)^2 \tilde L_e^2 \right\|^{1/2} $$

where

$$ \left\| \sum_{e \, : \, k_e \geq \ell - j + 1} \left( w^{(j-1)}_e \right)^2 \tilde L_e^2 \right\| \ = \ \left\| \sum_{e \, : \, k_e \geq \ell - j + 1} \left( w^{(j-1)}_e \right)^2 R_e\, \tilde L_e \right\| \ \leq \ \max_{e \, : \, k_e \geq \ell - j + 1} w^{(j-1)}_e R_e \cdot \| M_{j-1} \| $$

so indeed

$$ {\mathbb E} \sup_y Y_y \ \leq \ O\left( \sqrt{\log n} \right) \cdot \sqrt{ 2 \max_{e \, : \, k_e \geq \ell - j + 1} w^{(j-1)}_e R_e } $$
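The bound on the expected norm of a Gaussian matrix series can be illustrated numerically; in the following Python sketch the rank-one matrices $A_e$ are random and merely stand in for the matrices $w^{(j-1)}_e \tilde L_e$ of our setting.

```python
import numpy as np

rng = np.random.default_rng(3)

n, m = 40, 200
# Random rank-one PSD matrices A_e = v_e v_e^T (placeholders for w_e^{(j-1)} * tilde(L)_e).
V = rng.normal(size=(m, n)) / np.sqrt(n)
A = [np.outer(v, v) for v in V]

sigma = np.linalg.norm(sum(a @ a for a in A), 2) ** 0.5   # || sum_e A_e^2 ||^{1/2}

norms = []
for _ in range(200):
    g = rng.normal(size=m)
    norms.append(np.linalg.norm(sum(gi * a for gi, a in zip(g, A)), 2))

# The empirical average of || sum_e g_e A_e || should be within a constant factor
# of sqrt(log n) * sigma.
print(np.mean(norms), np.sqrt(np.log(n)) * sigma)
```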
Now we can return to the question about the order in which we process the edges. By starting from the lowest-probability edges, we are also starting from the edges for which the effective resistance $R_e$ is smallest. When we bound

$$ \left\| \sum_{e \, : \, k_e \geq \ell - j + 1} \left( w^{(j-1)}_e \right)^2 \tilde L_e^2 \right\| $$

and the sum is over the edges that are processed at round $j$, it is convenient to be able to say that $w^{(j-1)}_e R_e$ has a round-dependent upper bound. Indeed, if $e$ is processed at round $j$, then $w^{(j-1)}_e$ is either zero or $2^{j-1-(\ell - k_e)}$, so that

$$ w^{(j-1)}_e R_e \ \leq \ 2^{j-1-(\ell - k_e)} \cdot 2^{-k_e} \cdot \frac{\epsilon^2}{c \log n} \ = \ \frac{\epsilon^2}{c \log n} \cdot 2^{-(\ell - j + 1)} $$

and the terms $\beta_j = O(\epsilon) \cdot 2^{-(\ell - j + 1)/2}$ add up to $O(\epsilon)$ when summed over rounds. If we had proceeded in the opposite direction, we would have only been able to bound $w^{(j-1)}_e R_e$ as an absolute constant times $\frac{\epsilon^2}{c \log n}$, and we would have lost a factor of the order of the number of rounds in the analysis.
The diameter is bounded by analogous considerations. The square of the diameter is

$$ \sup_{y,y'} \ \sum_{e \, : \, k_e \geq \ell - j + 1} \left( w^{(j-1)}_e \right)^2 \left( y^T \tilde L_e\, y - y'^T \tilde L_e\, y' \right)^2 \ \leq \ 4 \sup_{y} \ \sum_{e \, : \, k_e \geq \ell - j + 1} \left( w^{(j-1)}_e \right)^2 \left( y^T \tilde L_e\, y \right)^2 $$

and we have

$$ \left( y^T \tilde L_e\, y \right)^2 \ \leq \ R_e \cdot y^T \tilde L_e\, y $$

and

$$ \sum_{e \, : \, k_e \geq \ell - j + 1} \left( w^{(j-1)}_e \right)^2 \left( y^T \tilde L_e\, y \right)^2 \ \leq \ \max_{e \, : \, k_e \geq \ell - j + 1} w^{(j-1)}_e R_e \cdot y^T M_{j-1}\, y \ \leq \ \frac{2 \epsilon^2}{c \log n} \cdot 2^{-(\ell - j + 1)} $$

which gives the desired bound

$$ {\rm diam} \ \leq \ O\left( \frac{\epsilon}{\sqrt{\log n}} \right) \cdot 2^{-(\ell - j + 1)/2} $$
Now the Talagrand comparison inequality gives us the lemma, and hence the analysis of the Spielman-Srivastava construction.
4. A Final Comment
There was a point that confused me for a while about this argument. Namely, we are not able to study

$$ \sup_{x \, : \, x^T L_G x = 1} \ \left| x^T L_{G'}\, x - x^T L_G\, x \right| $$

by showing that the process

$$ x \ \mapsto \ x^T L_{G'}\, x - x^T L_G\, x $$

is dominated by a Gaussian process, because $L_{G'}$ involves biased Boolean random variables which yield poor bounds when we try to dominate them by a Gaussian distribution. Instead we write

$$ x^T L_{G'}\, x - x^T L_G\, x = \sum_{j=1}^{\ell} \left( x^T L_{H_j}\, x - x^T L_{H_{j-1}}\, x \right) $$

and then we show how to dominate each term on the right-hand side by a Gaussian process. But then won't the sum of those Gaussian processes dominate $x^T L_{G'}\, x - x^T L_G\, x$, which was not supposed to be possible?

But the point is that in the analysis of $x^T L_{H_j}\, x - x^T L_{H_{j-1}}\, x$ we throw away the low-probability case that, in the previous rounds of the process, we ended up with $\| M_{j-1} \| > 2$. These low-probability events that we remove from consideration are enough to cut off the problematic tail of the discrete distribution that does not have a good Gaussian domination.
To get a better sense of what is happening, suppose that we are looking at the real-valued random variable

$$ X = \sum_{i=1}^n X_i $$

where each $X_i$ is an independent random variable that is equal to 1 with probability $p$ and equal to zero with probability $1-p$. Suppose that $p$ is a negative power of two and that $p$ is fairly small, let's say $p = \frac 1{\sqrt n}$.
We would like to say that there is a $t = O\left( \sqrt{pn \log n} \right)$ such that

$$ \Pr \left[ \ | X - pn | > t \ \right] \ \leq \ \frac 1{{\rm poly}(n)} $$

Also, we have made a vow to only bound the tail of discrete random variables by showing that they are sub-Gaussian and then using the tail of the dominating Gaussian.
If we try to argue about the sub-Gaussianity of $X - pn$, we are in trouble, because there is probability $p^n$ that $X = n$, so that $|X - pn| \geq n/2$. This is problematic because, in a Gaussian distribution, a deviation that occurs with probability $p^n = 2^{-n \log_2 \frac 1p}$ can only be $O\left( \sqrt{ n \log \frac 1p } \right)$ times the standard deviation, so the standard deviation has to be $\Omega\left( \sqrt{ \frac n{\log 1/p} } \right)$, and a deviation that is achieved with probability $\frac 1{{\rm poly}(n)}$ is of the order of at least $\sqrt{ \frac{n \log n}{\log 1/p} }$, which is much more than the $O\left( \sqrt{pn \log n} \right)$ that we were hoping for.
The problem is that a Gaussian distribution with standard deviation of the order of $\sqrt{pn}$ dominates our distribution in the regime we are interested in, but not in all regimes.
To overcome this problem, we write

$$ X = pn + \sum_{j=1}^{k} \left( S_j - S_{j-1} \right) $$

where $k = \log_2 \frac 1p$ and the random variables $S_j := \sum_{i=1}^n w^{(j)}_i$ are constructed in the following way: $w^{(0)}_i = p$, and $w^{(j)}_i$ is equally likely to be either 0 or $2 w^{(j-1)}_i$. In this way, $S_0 = pn$ and $S_k = X$.
For every choice of the $w^{(j-1)}_i$, we can write

$$ S_j - S_{j-1} = \sum_{i=1}^n \sigma_{i,j}\, w^{(j-1)}_i $$

where the $\sigma_{i,j}$ are independent Rademacher random variables. Over the randomness of the $\sigma_{i,j}$, the random variable $S_j - S_{j-1}$ has a sub-Gaussian distribution dominated by a Gaussian of variance

$$ \sum_{i=1}^n \left( w^{(j-1)}_i \right)^2 $$
Now, without breaking our vow, we can inductively get a high-probability estimate that for each $j$

$$ | S_j - S_{j-1} | \ \leq \ O\left( \sqrt{\log n} \right) \cdot \sqrt{ \sum_{i=1}^n \left( w^{(j-1)}_i \right)^2 } \ \leq \ O\left( \sqrt{pn \log n} \right) \cdot \sqrt{ p\, 2^{j-1} } $$

while maintaining the invariant that, say, the number of indices $i$ such that $w^{(j)}_i \neq 0$ is at most $2 n 2^{-j}$, for each $j$. When we sum up the error bounds, the whole sum is of the order of the last term, which is $O\left( \sqrt{pn \log n} \right)$.
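Here is a minimal Python simulation (again, not part of the argument) of the decomposition just described: it runs the $k$ rounds of "zero out or double", records the increments $S_j - S_{j-1}$, and compares the last increment with its predicted scale $\sqrt{n p^2 2^{k-1}}$, which is of the order of $\sqrt{np}$.

```python
import numpy as np

rng = np.random.default_rng(4)

n, k = 4096, 6                      # p = 2^{-k} = 1/64
p = 2.0 ** -k

# One run of the multi-scale process: w_i^{(0)} = p, then k rounds of "zero out or double".
w = np.full(n, p)
increments = []
for j in range(1, k + 1):
    coins = rng.random(n) < 0.5
    w_new = np.where(coins, 2.0 * w, 0.0)
    increments.append((w_new - w).sum())      # S_j - S_{j-1}, a Rademacher sum of the w_i^{(j-1)}
    w = w_new

S_k = w.sum()                                 # equals X, the number of surviving indices
print("X =", S_k, "  n*p =", n * p)
print("per-round increments:", np.round(increments, 2))
# The increment at round j has standard deviation about sqrt(n * p^2 * 2^{j-1}),
# so the last round dominates, at scale about sqrt(n*p).
print("predicted std of last increment:", np.sqrt(n * p**2 * 2.0 ** (k - 1)))
```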
Thank you!
Hi Luca. I think you want L_e instead of L_G at the beginning of Section 3.
Rudelson’s original proof of concentration for i.i.d. sums of rank one matrices (applied by Spielman-Srivastava) actually used a chaining argument (see here: https://arxiv.org/abs/math/9608208).
His passage to Rademacher sums is similar to yours, but perhaps more elegant (one of those symmetry tricks). If $Y$ is a random variable, and $Y_1, \ldots, Y_m$ are i.i.d. copies of $Y$, then

$$ {\mathbb E} \left\| \sum_i Y_i - {\mathbb E} \sum_i Y_i \right\| \ \leq \ 2\, {\mathbb E} \left\| \sum_i \epsilon_i Y_i \right\| $$

where the $\epsilon_i$ are independent random signs. See the first sequence of inequalities in Section 2, where his random variables {y_i} are sampled from: Choose $y = \frac 1{\sqrt{p_e}}\, L_G^{+/2} (1_u - 1_v)$ with probability $p_e$.
About your “Final comment:” What’s going on here, I think, is that you are not measuring deviations in a convex way, which means that you can select a “band” of the tail on which your random variable is not dominated by the corresponding Gaussian. If you are willing to measure using only convex tests, you still get domination once you symmetrize.
Here is Lemma 6.3 from the Ledoux-Talagrand book:
Suppose $F : [0,\infty) \to [0,\infty)$ is convex, $(\eta_i)$ is a sequence of independent, mean 0 random variables taking values in a normed space, and $(\epsilon_i)$ are independent random signs. Then:

$$ {\mathbb E}\, F\left( \tfrac 12 \left\| \sum_i \epsilon_i \eta_i \right\| \right) \ \leq \ {\mathbb E}\, F\left( \left\| \sum_i \eta_i \right\| \right) \ \leq \ {\mathbb E}\, F\left( 2 \left\| \sum_i \epsilon_i \eta_i \right\| \right) $$

In this case, your norm is the operator norm $\| \cdot \|$ on symmetric matrices.
Can’t edit comments on wordpress… there should be a norm inside the first F. Note that when you try to use a test $F(t) = e^{\lambda t}$ to detect an event of measure $\delta$, you have to take $\lambda$ large, and then the factor $2$ kills you.
Thanks a lot for writing those, Luca!