Spectral Sparsification of Hypergraphs

In this post we will construct a “spectral sparsifier” of a given hypergraph in a way that is similar to how Spielman and Srivastava construct spectral graph sparsifiers. We will assign a probability {p_e} to each hyperedge, we will sample each hyperedge {e} with probability {p_e}, and we will weigh it by {1/p_e} if selected. We will then bound the “spectral error” of this construction in terms of the supremum of a Gaussian process using Talagrand’s comparison inequality and finally bound the supremum of the Gaussian process (which will involve matrices) using matrix Chernoff bounds. This is joint work with Nikhil Bansal and Ola Svensson.

1. Hypergraph Sparsifiers

An (undirected) hypergraph {H= (V,E)} is a collection {E} of subsets {e \subseteq V}, called hyperedges. A graph is the special case in which each set {e} has cardinality 2. For simplicity we will talk only about {r}-uniform hypergraphs, that is hypergraphs in which all hyperedges have the same cardinality {r} (the arguments below would work also in the non-uniform case in which all hyperedges have cardinality at most {r}). Hyperedges may have weights.

If {S} is a subset of vertices, a hyperedge {e} is cut by {S} if {e} has non-empty intersection with both {S} and {V-S}. We call {cut_H(S)} the number of hyperedges of {H} cut by {S}, or the total weight of such hyperedges if {H} is a weighted hypergraph.

We can then generalize the notion of Benczur-Karger sparsification, and say that {H'} is a cut sparsifier of {H} with error {\epsilon} if {H} and {H'} have the same vertex set {V} and

\displaystyle  \forall S\subseteq V: \ \ \ 1-\epsilon \leq \frac{cut_{H'}(S)}{cut_{H}(S)} \leq 1 + \epsilon

Kogan and Krauthgamer show that every {r}-uniform hypergraph {H} admits a cut sparsifier of error {\epsilon} with only {O(\epsilon^{-2} n (r + \log n))} weighted hyperedges. They are able to extend the argument of Benczur and Karger to hypergraphs, including arguing that there are few sparse cuts. They assign a probability {p_e} to each hyperedge, sample it with that probability, weighing it {1/p_e} if selected, and then use a union bound and Chernoff bounds to argue that all cuts are preserved.

Anand Louis has introduced a notion of hypergraph Laplacian in the following way. The Laplacian quadratic form of a hypergraph {H} is a function {Q_H : {\mathbb R}^V \rightarrow {\mathbb R}} defined as

\displaystyle  Q_H(x) = \sum_{e\in E} w(e) \ \max_{a,b\in e} \ (x_a - x_b)^2

where {w(e)} is the weight of the hyperedge {e}, or 1 if the hypergraph is unweighted.

This definition has the motivation that it recovers the Laplacian quadratic form {Q_G(x) = x^T L_Gx} in the case of graphs, and that if {x} is the indicator of a set {S} then {Q_H(x) = cut_H(S)}. Furthermore, one can define “eigenvalues” and “eigenvectors” of this hypergraph Laplacian and recover a Cheeger inequality and even higher-order Cheeger inequalities.
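
To make these definitions concrete, here is a minimal numpy sketch (the toy hypergraph and the helper names Q_H and cut_H are purely illustrative) that evaluates the quadratic form and checks that, on the indicator vector of a set {S}, it agrees with {cut_H(S)}.

    import numpy as np

    def Q_H(hyperedges, weights, x):
        # Laplacian quadratic form: sum over hyperedges of w(e) * max_{a,b in e} (x_a - x_b)^2
        return sum(w * (x[list(e)].max() - x[list(e)].min()) ** 2
                   for e, w in zip(hyperedges, weights))

    def cut_H(hyperedges, weights, S):
        # total weight of the hyperedges with vertices on both sides of (S, V - S)
        S = set(S)
        return sum(w for e, w in zip(hyperedges, weights) if set(e) & S and set(e) - S)

    # toy 3-uniform hypergraph on 5 vertices
    hyperedges = [(0, 1, 2), (1, 2, 3), (2, 3, 4)]
    weights = [1.0, 1.0, 1.0]
    S = {0, 1}
    x = np.array([1.0 if v in S else 0.0 for v in range(5)])
    assert np.isclose(Q_H(hyperedges, weights, x), cut_H(hyperedges, weights, S))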

So it seems interesting to consider the following notion of sparsification: a hypergraph {H'} is a spectral sparsifier of {H} with error {\epsilon} if {H} and {H'} have the same vertex set and

\displaystyle  \forall x\in {\mathbb R}^V \ \ \ 1-\epsilon \leq \frac{Q_{H'}(x)}{Q_{H}(x)} \leq 1 + \epsilon

where, as before, the convention is that {0/0 = 1}.

Soma and Yoshida studied this question and gave a construction with {O(\epsilon^{-2} n^3\log n)} hyperedges. In the rest of this post we will discuss the construction by Nikhil Bansal, Ola Svensson and me, which uses {O(\epsilon^{-2} r^3 n \log n)} hyperedges.

2. Choosing Probabilities

Given a hypergraph {H=(V,E)}, we can construct a multigraph {G} by taking each hyperedge {e} of {H} and then constructing, in {G}, a clique between the vertices of {e}. Thus, in {G}, the edge {(a,b)} is repeated as many times (possibly, zero times) as the number of hyperedges {e} of {H} that contain both {a} and {b}. Another way to think about it is that the Laplacian of {G} is given by

\displaystyle  L_G = \sum_{e\in E} \sum_{a,b\in e} L_{a,b}

where the inner sum is over the {{r \choose 2}} unordered pairs {\{a,b\}}. This graph relates in several interesting ways with {H}. For example, if {S} is a subset of vertices and {e} is a hyperedge that is cut by {S}, then between {r-1} and {{r \choose 2}} of the edges of {G} derived from {e} are cut by {S}. If {e} is not cut by {S}, then none of the edges derived from {e} is cut by {S}, so we have

\displaystyle  \forall S\subseteq V: \ \ \ (r-1) \ cut_H(S) \leq cut_G(S) \leq {r \choose 2} cut_H(S)

and, with a bit more work, it is possible to prove similar bounds for the Laplacian quadratic forms

\displaystyle  \forall x\in {\mathbb R}^ V: \ \ \ \frac r2 \ Q_H(x) \leq x^T L_G x \leq {r \choose 2} Q_H(x)

This means that if we sparsify {G}, say with the Spielman-Srivastava construction, we obtain information about {Q_H}, up to multiplicative error {O(r)}. Now suppose that, as we sample edges of {G} to sparsify it in the Spielman-Srivastava way, we also pick hyperedges of {H} (for example, we pick {e} if at least one of its corresponding edges in {G} is picked), and we weigh each selected hyperedge by the inverse of its probability of being selected. Then we may hope that if the Spielman-Srivastava sparsification of {G} is tuned to achieve error {\epsilon/r^2}, the hypergraph that we obtain will have error at most {\epsilon}. Indeed, this is roughly what happens, and we will be able to prove it by showing that the error in the hypergraph sparsification is dominated by the error in the “Gaussian version” of Spielman-Srivastava described in the previous post.
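
Here is a small numpy sketch of this reduction (the helper clique_laplacian and the random test hypergraph are purely illustrative, and Q_H is as in the earlier sketch): it builds {L_G} from the hyperedges and checks the two-sided comparison between {x^T L_G x} and {Q_H(x)} on random test vectors.

    import itertools
    import numpy as np

    def clique_laplacian(hyperedges, weights, n):
        # L_G = sum_e w(e) * sum over unordered pairs {a,b} of e of L_{a,b}
        L = np.zeros((n, n))
        for e, w in zip(hyperedges, weights):
            for a, b in itertools.combinations(e, 2):
                L[a, a] += w; L[b, b] += w
                L[a, b] -= w; L[b, a] -= w
        return L

    def Q_H(hyperedges, weights, x):   # same as in the earlier sketch
        return sum(w * (x[list(e)].max() - x[list(e)].min()) ** 2
                   for e, w in zip(hyperedges, weights))

    rng = np.random.default_rng(0)
    n, r = 8, 4
    hyperedges = [tuple(rng.choice(n, size=r, replace=False)) for _ in range(20)]
    weights = [1.0] * len(hyperedges)
    L_G = clique_laplacian(hyperedges, weights, n)
    for _ in range(100):
        x = rng.standard_normal(n)
        q, g = Q_H(hyperedges, weights, x), x @ L_G @ x
        assert r / 2 * q <= g + 1e-9 and g <= r * (r - 1) / 2 * q + 1e-9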

So we are going to assign to each hyperedge {e} a probability

\displaystyle  p_e = \min \left\{ 1 , B \cdot \sum_{a,b \in e} || L^{-1/2}_G L_{a,b} L^{-1/2}_G || \right\}

where the factor {B} will be chosen later to get the construction to work and {R_{a,b} := || L^{-1/2}_G L_{a,b} L^{-1/2}_G ||} is the effective resistance of the edge {(a,b)} of {G}.

In fact, as before, it will be helpful to have probabilities that are non-positive powers of two, so we will choose {p_e} to be a power of two such that

\displaystyle  \min \left\{ 1 , B \cdot \sum_{a,b \in e} R_{a,b} \right\} \leq p_e \leq \min \left\{ 1 , 2B \cdot \sum_{a,b \in e} R_{a,b} \right\}

We have

\displaystyle  \sum_e p_e \leq 2 B \sum_{e\in E} \sum_{a,b\in e} R_{a,b} \leq 2B \cdot (n-1)

Another fact that will become useful later (it will save us a factor of the order of {r} in the number of hyperedges in the construction) is that

\displaystyle  \max_{a,b \in e} R_{a,b} \leq \frac 2r \sum_{a,b \in e} R_{a,b}
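
The following sketch of the probability assignment assumes {G} is connected and reuses clique_laplacian and the toy instance from the previous sketch (the name sampling_probabilities and the tiny demo value of {B} are purely illustrative); it uses the fact that, since {L_{a,b}} has rank one, {R_{a,b}} equals {(e_a - e_b)^T L_G^+ (e_a - e_b)}.

    import itertools
    import numpy as np

    def sampling_probabilities(hyperedges, L_G, B):
        # p_e is a power of two with min(1, B*sum R_{a,b}) <= p_e <= min(1, 2B*sum R_{a,b}),
        # where R_{a,b} is the effective resistance of the pair (a,b) in the clique graph G
        Lpinv = np.linalg.pinv(L_G)
        probs = []
        for e in hyperedges:
            total_R = sum(Lpinv[a, a] + Lpinv[b, b] - 2 * Lpinv[a, b]
                          for a, b in itertools.combinations(e, 2))
            p = min(1.0, B * total_R)
            probs.append(2.0 ** np.ceil(np.log2(p)))   # round up to a power of two; still <= 1 since p <= 1
        return np.array(probs)

    # continuing the toy instance above; note that sum_e p_e <= 2B(n-1)
    probs = sampling_probabilities(hyperedges, clique_laplacian(hyperedges, weights, n), B=0.5)
    print(probs, probs.sum())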

3. A Discrete Random Process

Our construction of a hypergraph sparsifier {H'} will be to select each {e} independently with probability {p_e} and, if selected, weigh it by {1/p_e}. Our goal is to find an upper bound in probability, or even in expectation, on

\displaystyle  \sup_{x \in {\mathbb R}^n} \ \ \left | 1 - \frac{Q_{H'}(x)}{Q_H(x)} \right |

Recalling that {x^T L_G x \leq \frac{r^2}2 Q_H(x)}, we will study

\displaystyle  \sup_{x : x^T L_G x = 1} | Q_H(x) - Q_{H'} (x) |

This is sufficient because, if we can show that the above quantity is at most {2\epsilon /r^2}, then, for every {x},

\displaystyle  | Q_H(x) - Q_{H'} (x) | \leq \frac{2\epsilon}{r^2} \ x^T L_G x \leq \epsilon Q_H(x)

as desired.

If we define

\displaystyle  Q_e(x) = \max_{a,b \in e} x^T L_{a,b} x

then we are interested in the supremum in {T = \{ x : x^T L_G x = 1 \}} of

\displaystyle  F(x) = Q_H(x) - Q_{H'} (x) = \sum_e z_e Q_e (x)

where {z_e} is a random variable that is equal to {1/p_e-1} with probability {p_e} and is equal to {-1} with probability {1-p_e}.

As before, we will do the construction in rounds. If the smallest probability {p_e} is {2^{-\ell}}, then we will have {\ell} rounds. We start with {H^{(0)} = H} and then, at round {k}, we take all hyperedges {e} such that {p_e \leq 2^{k-1-\ell}} and, independently for each such hyperedge, we either delete it or we double its weight (with probability 1/2 in each case). The final hypergraph {H^{(\ell)}} is distributed like {H'}. We have the processes

\displaystyle  F^{(k)}(x) = Q_{H^{(k-1)}}(x) - Q_{H^{(k)}} (x) = \sum_{e\in E_k} r^{(k)}_e w^{(k-1)}_e Q_e (x)

where the random variables {r^{(k)}_e} are Rademacher, the weights {w^{(k-1)}_e} are either 0 or {2^{k-1-\ell} / p_e}, and {E_k} is the set of hyperedges that are “active” in round {k}, that is, the set of hyperedges {e} such that {p_e \leq 2^{k-1-\ell}}.
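
Here is a minimal simulation of this round structure (the name sparsify_in_rounds and the toy probabilities are purely illustrative). The printed frequencies confirm that each hyperedge ends up with weight {1/p_e} with frequency close to {p_e}, and with weight zero otherwise, so the final weights are unbiased.

    import numpy as np

    def sparsify_in_rounds(probs, rng):
        # probs[i] is a power of two in (0,1]; hyperedges with p_e = 1 are never touched
        probs = np.asarray(probs)
        ell = int(round(-np.log2(probs.min())))       # number of rounds
        w = np.ones_like(probs)                       # weights w^{(0)}
        for k in range(1, ell + 1):
            active = probs <= 2.0 ** (k - 1 - ell)    # hyperedges active in round k
            coins = rng.random(len(probs)) < 0.5
            # every active hyperedge is deleted or has its weight doubled, with probability 1/2 each
            w = np.where(active, np.where(coins, 2 * w, 0.0), w)
        return w   # distributed as: 1/p_e with probability p_e, and 0 otherwise

    rng = np.random.default_rng(1)
    probs = np.array([1.0, 0.5, 0.25, 0.125])
    trials = np.stack([sparsify_in_rounds(probs, rng) for _ in range(20000)])
    print(trials.mean(axis=0))                 # close to 1 for every hyperedge: unbiased
    print((trials > 0).mean(axis=0), probs)    # survival frequencies, close to p_e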

For each hypergraph {H^{(k)}}, we will also consider its associated graph {G^{(k)}}, obtained by replacing each hyperedge with a clique. The Laplacian of {G^{(k)}} is

\displaystyle  L_{G^{(k)}} = \sum_e w^{(k)}_e \sum_{a,b\in e} L_{a,b}

We have the following lemma

Lemma 1 For every outcome of the first {k-1} rounds such that {L_{G^{(k-1)}} \preceq 2 L_G}, there is a probability at least {1-1/n} over the randomness of the {k}-th round that

\displaystyle  \sup_{x: x^T L_G x = 1} | F^{(k)} (x) | \leq O \left( \sqrt{2^{k-\ell} \frac { \log n}{Br} } \right)

and that

\displaystyle  L_{G^{(k)}} - L_{G^{(k-1)} } \preceq O \left( \sqrt{2^{k-\ell} \frac { \log n}{B} } \ \right) \cdot L_G

This means that we can take {B = O(\epsilon^{-2} r^3\log n)} and apply the above lemma inductively to argue that we have a high probability that {\sup_{x\in T} |F(x) | \leq 2\epsilon /r^2}, and so we get a hypergraph spectral sparsifier with error {\epsilon} and {O(\epsilon^{-2} r^3 n \log n)} hyperedges.

To prove the Lemma we will define the Gaussian process

\displaystyle  \hat F^{(k)} (x) = \sum_{e\in E_k} w^{(k-1)}_e \sum_{a,b\in e} g^{(k)}_{e,a,b} \ x^T L_{a,b} x

where the {g^{(k)}_{e,a,b}} are independent standard Gaussians. Notice the two differences between {F^{(k)}} and {\hat F^{(k)}}: we replaced the Rademacher choices with Gaussian choices, and we replaced

\displaystyle  Q_e(x) = \max _{a,b\in e} \ (x_a - x_b)^2

with

\displaystyle  \sum_{a,b\in e} x^T L_{a,b} x = \sum_{a,b \in e} (x_a - x_b)^2

Furthermore, we are doing a random choice for each pair within each hyperedge instead of just one random choice per hyperedge.

Fact 2 The random processes {F^{(k)} (x)} and {-F^{(k)} (x)} are {O(1)}-dominated by {\hat F^{(k)}(x)}.

Proof: For every {x,y} we have

\displaystyle  || F^{(k)} (x) - F^{(k)}(y) ||^2_{\Psi_2} = O(1) \cdot \sum_{e\in E_k} \left( w^{(k-1)}_e \ (Q_e (x) - Q_e (y)) \right)^2

and

\displaystyle  (d(x,y))^2 = \mathop{\mathbb E} (\hat F^{(k)} (x) - \hat F^{(k)} (y))^2

\displaystyle  = \sum_{e\in E_k} (w^{(k-1)}_e)^2 \sum_{a,b\in e} ((x_a-x_b)^2 - (y_a-y_b)^2)^2

To complete the proof, we argue that

\displaystyle  (Q_e (x) - Q_e (y))^2 \leq \sum_{a,b\in e} ((x_a-x_b)^2 - (y_a-y_b)^2)^2

To verify the above inequality, assume {Q_e(x) \geq Q_e(y)}, otherwise the argument will be symmetric, and call {a',b'} the maximizer for {x} and {a'',b''} the maximizer for {y}. Then

\displaystyle  (Q_e (x) - Q_e (y))^2

\displaystyle  = ((x_{a'} - x_{b'})^2 - (y_{a''} - y_{b''})^2)^2

\displaystyle  \leq ((x_{a'} - x_{b'})^2 - (y_{a'} - y_{b'})^2)^2

\displaystyle  \leq \sum_{a,b\in e} ((x_a-x_b)^2 - (y_a-y_b)^2)^2

\Box

It remains to estimate the expected sup

\displaystyle  \mathop{\mathbb E} \ \sup_{x\in T} \ \hat F^{(k)} (x)

and the diameter

\displaystyle  \sup_{x,y \in T} \sqrt{ \mathop{\mathbb E} ( \hat F^{(k)} (x) - \hat F^{(k)} (y))^2 }

where, recall, {T = \{ x : x^T L_G x = 1\}}.

With the usual change of variable we have

\displaystyle  \sup_{x \in T} \ \hat F^{(k)} (x)

\displaystyle  = \sup_{y: ||y||= 1} \ \hat F^{(k)} (L_G^{-1/2} y)

\displaystyle  = \left\| \sum_{e\in E_k} w^{(k-1)}_e \sum_{a,b\in e} g^{(k)}_{e,a,b} \ L_{G}^{-1/2} L_{a,b} L_G^{-1/2} \right \|

\displaystyle  = \left\| \sum_{e\in E_k} w^{(k-1)}_e \sum_{a,b\in e} g^{(k)}_{e,a,b} \ M_{a,b} \right \|

where, as before, we use the notation {M_{a,b} := L_{G}^{-1/2} L_{a,b} L_G^{-1/2} }, {M^{(k)} = L_{G}^{-1/2} L_{G^{(k)} } L_G^{-1/2} } and {M := L_{G}^{-1/2} L_G L_G^{-1/2} }.

By matrix Chernoff bounds for Gaussian sums of matrices,

\displaystyle  \mathop{\mathbb E} \left\| \sum_{e\in E_k} w^{(k-1)}_e \sum_{a,b\in e} g^{(k)}_{e,a,b} \ M_{a,b} \right \|

\displaystyle   \leq O(\sqrt{\log n}) \cdot \sqrt{\max_{e\in E_k, a,b\in e} \left\| w^{(k-1)}_e M_{a,b} \right\|} \sqrt{ \left\| \sum_{e\in E_k} \sum_{a,b\in e} w^{(k-1)}_e \ M_{a,b} \right \| } \ \ \ \ \ (1)

Recall that if {e\in E_k} is a hyperedge that is active at round {k}, then {p_e \leq 2^{k-1-\ell}} and {w^{(k-1)}_e \leq 2^{k-1-\ell} / p_e}, and we also have

\displaystyle  p_e \geq B \sum_{a,b\in e} ||M_{a,b}|| \geq B \frac r2 \max_{a,b\in e} ||M_{a,b}||

so that we have

\displaystyle  \max_{e\in E_k, a,b\in e} w^{(k-1)}_e ||M_{a,b} || \leq \frac{2^{k-\ell}} {Br}

The quantity under the last square root in (1) is at most the spectral norm of {M^{(k-1)}}, which is at most 2 by the assumption of the Lemma.

Collecting all the pieces, we have proved that

\displaystyle  \mathop{\mathbb E} \ \sup_{x\in T} \ \hat F^{(k)} (x) \leq O \left( \sqrt{\log n \cdot \frac {2^{k-\ell}}{B r}}\right)
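
As a purely numerical illustration of the kind of bound just used (the random rank-one matrices below are stand-ins for the {w^{(k-1)}_e M_{a,b}}, not the actual construction), one can compare the norm of a Gaussian combination against {\sqrt{\log n} \cdot \sqrt{\max_i || A_i ||} \cdot \sqrt{|| \sum_i A_i ||}}.

    import numpy as np

    rng = np.random.default_rng(2)
    n, m = 30, 200
    vs = rng.standard_normal((m, n)) / np.sqrt(n)
    A = np.einsum('ki,kj->kij', vs, vs)            # m random PSD rank-one matrices A_i = v_i v_i^T

    max_norm = max(np.linalg.norm(A_i, 2) for A_i in A)
    total_norm = np.linalg.norm(A.sum(axis=0), 2)
    bound = np.sqrt(np.log(n)) * np.sqrt(max_norm) * np.sqrt(total_norm)

    norms = [np.linalg.norm(np.tensordot(rng.standard_normal(m), A, axes=1), 2)
             for _ in range(50)]
    print(np.mean(norms), bound)   # the empirical mean is within a constant factor of the bound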

We also need to bound the diameter of {T}, that is

\displaystyle  \sup_{x,y \in T} \sqrt{\mathop{\mathbb E} (\hat F^{(k)}(x) -\hat F^{(k)} (y) )^2 }

Under the usual change of basis, for every {x,y} of length 1 we want to bound the square root of

\displaystyle  \mathop{\mathbb E}\left ( \sum_{e\in E_k} w^{(k-1)}_e \sum_{a,b\in e} g^{(k)}_{e,a,b} \ (x^T M_{a,b} x - y^T M_{a,b} y)\right)^2

\displaystyle  = \sum_{e\in E_k} \left( w^{(k-1)}_e\right)^2 \sum_{a,b\in e} (x^T M_{a,b} x - y^T M_{a,b} y)^2

\displaystyle  \leq \sum_{e\in E_k} \left( w^{(k-1)}_e\right)^2 \sum_{a,b\in e} \left( (x^T M_{a,b} x)^2 + (y^T M_{a,b} y)^2 \right)

\displaystyle  \leq \sum_{e\in E_k} \left( w^{(k-1)}_e\right)^2 \max_{a,b \in e} ||M_{a,b} || \sum_{a,b\in e} (x^T M_{a,b} x + y^T M_{a,b} y)

where {w^{(k-1)}_e} is either zero or at most {\frac{2^{k-\ell-1}}{B \sum_{a,b\in e} ||M_{a,b}||}}, so we can continue our chain of inequalities with

\displaystyle  \leq \frac{2^{k-\ell}}{Br} \sum_{e\in E_k} w^{(k-1)}_e \sum_{a,b\in e} (x^T M_{a,b} x + y^T M_{a,b} y )

\displaystyle  = \frac{2^{k-\ell}}{Br} (x^T M^{(k-1)} x + y^T M^{(k-1)} y )

\displaystyle  \leq 4 \cdot \frac{2^{k-\ell}}{Br}

where we used again the assumption of the Lemma {|| M^{(k-1)} || \leq 2}.

To prove the last part of the lemma, we have to prove that with high probability

\displaystyle  || M^{(k)} - M^{(k-1)} || \leq O\left( \sqrt{ \frac {2^{k-\ell} \log n}{B}} \right)

We see that

\displaystyle  M^{(k)} - M^{(k-1)} = \sum_{e\in E_k} w^{(k-1)}_e r^{(k)}_e \sum_{a,b \in e} M_{a,b}

as noted before, each {w^{(k-1)}_e} is either zero or at most {\frac {2^{k-1-\ell}}{B\sum_{a,b\in e} ||M_{a,b}||}}, so

\displaystyle  \left \| w^{(k-1)}_e \sum_{a,b \in e} M_{a,b} \right \| \leq \frac {2^{k-1-\ell}}{B}

and the final claim follows from matrix Chernoff bounds.

Now we can put everything together. Applying our lemma inductively, we can say that with high probability

\displaystyle  \sup_{x: x^T L_G x = 1} |F(x)| \leq \sum_{k=1}^\ell \sup_{x: x^T L_G x = 1} |F^{(k)} (x)| \leq \sum_{k=1}^\ell O\left( \sqrt{ 2^{k-\ell} \frac{\log n}{Br} } \right) \leq O \left( \sqrt{ \frac{\log n}{Br} } \right)

and

\displaystyle  \forall k: \ \ L_{G^{(k)}} \preceq \left( 1 + O \left( \sum_{i=1}^k \sqrt{ 2^{i-\ell} \frac{\log n}{B} } \right) \right) \cdot L_G \preceq 2 L_G

provided that we choose {B} to be at least an absolute constant times {\log n}.

In particular, we can choose {B} to be an absolute constant times {\epsilon^{-2} r^3 \log n} and have

\displaystyle  \sup_{x: x^T L_G x = 1} |F(x)| \leq \frac{2}{r^2} \epsilon

which is the same as

\displaystyle  \forall x: \ \ |Q_{H} (x) - Q_{H'} (x) | \leq \frac{2}{r^2} \epsilon \cdot x^T L_G x \leq \epsilon\ Q_H(x)

So, with high probability, {H'} is a spectral sparsifier of {H} with error {\epsilon}, and it has {O(Bn) = O(\epsilon^{-2} r^3 n \log n)} hyperedges.
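
Putting the pieces together in code (reusing Q_H, clique_laplacian and sampling_probabilities from the sketches above; the value of {B} below is deliberately much smaller than the analysis requires, just so that some hyperedges actually get dropped, and testing on random vectors is of course far weaker than bounding the supremum over all {x}):

    import numpy as np

    rng = np.random.default_rng(3)
    n, r = 40, 3
    hyperedges = [tuple(rng.choice(n, size=r, replace=False)) for _ in range(2000)]
    weights = np.ones(len(hyperedges))

    L_G = clique_laplacian(hyperedges, weights, n)
    probs = sampling_probabilities(hyperedges, L_G, B=2.0 * np.log(n))
    keep = rng.random(len(probs)) < probs
    new_weights = np.where(keep, 1.0 / probs, 0.0)   # selected hyperedges get weight 1/p_e

    xs = rng.standard_normal((200, n))
    errs = [abs(1.0 - Q_H(hyperedges, new_weights, x) / Q_H(hyperedges, weights, x)) for x in xs]
    print(f"kept {keep.sum()} of {len(hyperedges)} hyperedges; "
          f"max relative error on test vectors: {max(errs):.3f}")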

4. Some Additional Thoughts

The dependence on {r} is definitely not optimal, particularly because when {r} is order of {n} we know from the work of Soma and Yoshida that we can do better. One place where we seem to lose is that, although we know

\displaystyle  Q_e(x) \leq O \left( \frac 1r \right) \ \sum_{a,b\in e} x^T L_{a,b} x

we are only able to show that

\displaystyle  \sum_{e\in E_k} r^{(k)}_e w^{(k)}_e Q_e(x)

is {O(1)}-dominated by

\displaystyle  \sum_{e\in E_k} g^{(k)}_e w^{(k)}_e \sum_{a,b \in e } x^T L_{ab} x

rather than, as we could have hoped, {O(1/r)}-dominated. The difficulty is that the bound

\displaystyle  (Q_e(x) - Q_e(y))^2 \leq \sum_{a,b\in e} (x^T L_{ab} x - y^T L_{ab} y)^2

is sometimes tight. In order to do better, it seems necessary to do something a bit differently from what we do here.

The effective resistance of an edge {(a,b)} in a graph {G} can be written as

\displaystyle  \sup_{x\in {\mathbb R}^n} \ \ \frac{x^T L_{a,b} x}{x^T L_G x}

with the convention that {0/0 = 0}. So it seems reasonable that a good definition of effective resistance for a hyperedge {e} in a hypergraph {H} would be

\displaystyle  \sup_{x\in {\mathbb R}^n} \ \ \frac{Q_e(x)}{Q_H(x) }

One can argue that these “effective resistances” add up to {O(nr)}, but perhaps they add up to {O(n)}? If we sample according to those “effective resistances,” can we apply generic chaining directly to {\sum_{e\in E_k} w^{(k)}_e g^{(k)}_e Q_e(x)} without having to rely on a Gaussian process on matrices, for which we have Chernoff bounds?

2 thoughts on “Spectral Sparsification of Hypergraphs”

  1. Apologies if I missed this, but is it clear why the matrix Chernoff bound approach that works for regular matrices fails for hypergraphs? I see that at one point you use the fact that certain matrices in question are rank-1 in the case of regular graphs, and this would not be true for the case of hypergraphs, but maybe there’s something else also?

  2. Matrix Chernoff bounds allow you to bound the spectral norm of a random matrix.

    After a change of basis, the condition that a randomly subsampled graph sparsifies another becomes the question of bounding the spectral norm of a random matrix, to which matrix Chernoff bounds apply (this is Spielman-Srivastava).

    The condition that a randomly subsampled hypergraph sparsifies another is the question of how large the supremum of a certain quantity can be, and that supremum is not the spectral norm of a matrix. Talagrand’s comparison inequality allows us to say that the supremum of that quantity is upper bounded by the spectral norm of a certain random matrix, provided that certain conditions hold. We verify the conditions and then use matrix Chernoff bounds to upper bound the spectral norm of the random matrix.
