The Spectral Partitioning Algorithm

[In which we prove the “difficult part” of Cheeger’s inequality by analyzing a randomized rounding algorithm for a continuous relaxation of sparsest cut.]

We return to this month’s question: if G=(V,E) is a d-regular graph, how well can we approximate its edge expansion h(G) defined as

h(G) := \min_{S\subseteq V} \frac{edges(S,V-S)} {\min \{ |S|, \ |V-S| \} }

and its sparsest cut \phi(G) defined as

\phi(G) := \min_{S\subseteq V} \frac{edges(S,V-S)} { \frac 1n \cdot |S| \cdot |V-S|} =  \min_{x\in \{0,1\}^n } \frac{ \sum_{i,j} A(i,j) \cdot |x(i)-x(j)|}{\frac 1n  \sum_{i,j}  |x(i)-x(j)|},

where A is the adjacency matrix of G.

We have looked at three continuous relaxations of \phi(G): the spectral gap, the Leighton-Rao linear program, and the Arora-Rao-Vazirani semidefinite program.

As we saw, the spectral gap of G, defined as the difference between the largest and the second largest eigenvalue of A, can be seen as the solution to a continuous optimization problem (in which the minimum is taken over non-constant vectors x):

d-\lambda_2 = \min_{x\in {\mathbb R}^n} \frac {\sum_{i,j} A(i,j) \cdot |x(i)-x(j)|^2}{\frac 1n  \sum_{i,j}  |x(i)-x(j)|^2}.

It follows from the definitions that

d-\lambda_2 \leq \phi \leq 2h

which is the “easy direction” of Cheeger’s inequality; the interesting thing is that d-\lambda_2 is never much smaller than these quantities, and it obeys

(1) \ \ \ d-\lambda_2 \geq \frac {h^2}{2d} \geq \frac {\phi^2}{8d},

which is the difficult part of Cheeger’s inequality. When we normalize all quantities by the degree, the inequality reads as

\frac 1 8 \cdot \left( \frac \phi d \right)^2 \leq \frac 1 2 \cdot \left( \frac h d \right)^2 \leq \frac{d-\lambda_2}{d} \leq \frac \phi d  \leq 2 \frac h d .

I have taught (1) in three courses and used it in two papers, but I had never really felt that I understood it; I consider a mathematical proof to be understood if one can see it as a series of inevitable steps. Many steps in the proofs of (1) that I had read, however, looked like magic tricks with no explanation. Finally, I have found a way to describe the proof that makes sense to me. (I note that everything I will say in this post will be completely obvious to the experts, but I hope some non-expert will read it and find it helpful.)

We prove (1), as usual, by showing that given any x\in {\mathbb R}^n such that

(2) \ \ \ \frac {\sum_{i,j} A(i,j) \cdot |x(i)-x(j)|^2}{\frac 1n  \sum_{i,j}  |x(i)-x(j)|^2} = \epsilon

we can find a threshold t such that the cut (S,V-S) defined by S:= \{ i: x(i) \geq t \} satisfies

(3)  \ \ \ \frac{edges(S,V-S)} { \min \{ |S|,\ |V-S| \} } \leq \sqrt{2 d \epsilon}

and

\frac{edges(S,V-S)} { \frac 1n \cdot |S| \cdot |V-S| } \leq \sqrt{8 d \epsilon}.

This not only gives us a proof of (1), but also an algorithm for finding sparse cuts when they exist: take a vector x which is an eigenvector of the second largest eigenvalue (or simply a vector for which the Rayleigh quotient in (2) is small), sort the vertices i according to the value of x(i), and find the best cut among the “threshold cuts.” This is the “spectral partitioning” algorithm.
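For concreteness, here is a minimal sketch of this procedure in Python with numpy, assuming the graph is given as a dense 0/1 adjacency matrix; the function name spectral_partition and the brute-force sweep are only for illustration, not an optimized implementation.

import numpy as np

def spectral_partition(A):
    # A: symmetric 0/1 adjacency matrix (numpy array) of a d-regular graph.
    n = A.shape[0]
    # Eigendecomposition of A; eigenvalues come back in increasing order,
    # so the second column from the right is an eigenvector of lambda_2.
    _, eigvecs = np.linalg.eigh(A)
    x = eigvecs[:, -2]
    order = np.argsort(x)                    # sort vertices by x(i)
    best_S, best_ratio = None, float("inf")
    # Sweep over all threshold cuts S = {k vertices with smallest x(i)}.
    for k in range(1, n):
        S, rest = order[:k], order[k:]
        cut = A[np.ix_(S, rest)].sum()       # edges(S, V-S)
        ratio = cut / min(k, n - k)          # edge expansion of this cut
        if ratio < best_ratio:
            best_S, best_ratio = set(S.tolist()), ratio
    return best_S, best_ratio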

This means that proving (1) amounts to studying an algorithm that “rounds” the solution of a continuous relaxation to a combinatorial solution, and there is a standard pattern to such arguments in computer science: we describe a randomized rounding algorithm, study its average performance, and then argue that there is a fixed choice that is at least as good as the average one. Here, in particular, we would like to find a distribution over thresholds T such that, if we define S:= \{ i: x(i) \geq T\} as a random variable in terms of T, we have

{\mathbb E} [ edges(S,V-S) ] \leq {\mathbb E}  [\min \{ |S|,\ |V-S| \} ] \cdot  \sqrt{2 d \epsilon}

and so, using linearity of expectation,

{\mathbb E} [ edges(S,V-S)  - \min \{ |S|,\ |V-S| \}  \cdot  \sqrt{2 d \epsilon}] \leq 0

from which we see that there must be a threshold in our sample space such that (3) holds. I shall present a proof that explicitly follows this pattern. It would be nice if we could choose T uniformly at random in the interval \left[\min_i x(i), \max_i x(i)\right], but I don’t think it would work. (Can any reader see a counterexample? I couldn’t, but we’ll come back to it.) Instead, the following works: assuming with no loss of generality that the median of \{ x(1),\ldots,x(n) \} is zero, T can be chosen so that T\cdot |T| is distributed uniformly in the interval \left[\min_i x(i)\cdot |x(i)|, \ \max_i x(i) \cdot |x(i)| \right]. (This means that thresholds near the median are less likely to be picked than thresholds far from the median.) This choice seems to be a magic trick in itself, voiding the point I made above, but I hope it will become natural as we unfold our argument.
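Concretely, sampling such a T amounts to drawing a uniform point s in the range of the values x(i)\cdot |x(i)| and inverting the map t \mapsto t\cdot |t|; here is a small numpy sketch (the helper sample_threshold is mine, just for illustration).

import numpy as np

def sample_threshold(x, rng=None):
    # x: numpy array of coordinates, normalized so that its median is zero.
    rng = np.random.default_rng() if rng is None else rng
    y = x * np.abs(x)                     # y(i) = x(i) * |x(i)|
    s = rng.uniform(y.min(), y.max())     # uniform point in [min_i y(i), max_i y(i)]
    return np.sign(s) * np.sqrt(np.abs(s))   # the threshold T, so that T * |T| = s

# The random cut used in the proof is then S = {i : x(i) >= T}.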

First, we shall see that there are continuous relaxations of h and \phi that exactly match the combinatorial optimum, and that this can be established via random rounding.

Our first claim is that

(4) \ \ \ \phi(G) = \min_{x \in {\mathbb R}^n} \frac{ \sum_{ij} A(i,j) | x(i)-x(j) |}{\frac 1n \sum_{ij}  | x(i)-x(j) |}

To prove (4), take an optimal solution x\in {\mathbb R}^n, pick a random threshold t in the interval \left[\min_i x(i),\max_i x(i)\right], and define S:= \{ i: x(i) \geq t\}. Then it should be clear that the probability that an edge (i,j) is cut by (S,V-S) is proportional to |x(i)-x(j)|. Normalizing x so that \max_i x(i) - \min_i x(i) = 1, which does not affect the distribution of S, nor the optimality of x, the probability is exactly equal to |x(i)-x(j)|, and we have

{\mathbb E} [edges (S,V-S)] = \frac 12 \sum_{ij} A(i,j) | x(i)-x(j) |

and

{\mathbb E} [|S| \cdot |V-S| ] = \frac 12 \sum_{ij}  | x(i)-x(j) |

so that the support of our distribution includes a cut (S,V-S) such that

\frac{edges(S,V-S)} {\frac 1n |S| \cdot |V-S|} \leq \frac{ \sum_{ij} A(i,j) | x(i)-x(j) |}{\frac 1n  \sum_{ij}  | x(i)-x(j) |}
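As a quick numerical sanity check (not needed for the argument), one can estimate the two expectations above by sampling thresholds and compare them with the closed-form expressions. Here is a small numpy sketch, assuming the normalization \max_i x(i) - \min_i x(i) = 1, with the function name check_threshold_rounding chosen just for illustration.

import numpy as np

def check_threshold_rounding(A, x, trials=20000, rng=None):
    # A: adjacency matrix; x: numpy vector with max(x) - min(x) = 1.
    rng = np.random.default_rng() if rng is None else rng
    cut_edges, sep_pairs = 0.0, 0.0
    for _ in range(trials):
        t = rng.uniform(x.min(), x.max())
        S = (x >= t)
        cut_edges += A[np.ix_(S, ~S)].sum()   # edges(S, V-S)
        sep_pairs += S.sum() * (~S).sum()     # |S| * |V-S|
    diffs = np.abs(x[:, None] - x[None, :])
    # Empirical averages vs. (1/2) sum_{ij} A(i,j)|x(i)-x(j)| and (1/2) sum_{ij} |x(i)-x(j)|.
    print(cut_edges / trials, 0.5 * (A * diffs).sum())
    print(sep_pairs / trials, 0.5 * diffs.sum())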

Next, we want to argue that

(5) \ \ \ h(G) = \min_{x \in {\mathbb R}^n,\ med(x)=0} \frac{ \sum_{ij} A(i,j) | x(i)-x(j) |}{2\sum_i |x(i)|}

where med(x) denotes the median of the values \{ x(1),\ldots,x(n)\}. It may take a minute to see that the right-hand side is a relaxation of h(G). Suppose that (S,V-S) is the optimal cut in the definition of h, and that |S| \leq n/2. Then define x(i)=1 if i\in S and x(i)=0 otherwise. The median of x is zero, |S| = \sum_i |x(i)|, and \sum_{ij} A(i,j) | x(i)-x(j) | counts twice every edge crossing the cut. So the right-hand side of (5) is a relaxation, and its value can be no larger than h(G).

As before, normalize x so that \max_i x(i) - \min_i x(i) = 1, pick a threshold t uniformly at random in this interval, and define S = \{ i: x(i) \geq t\}. We already argued that

{\mathbb E} [ edges (S,V-S) ] = \frac 12 \sum_{ij} A(i,j) | x(i)-x(j)|

and it remains to show that

{\mathbb E} [\min \{ |S| , \ |V-S| \} ] = \sum_i |x(i)| .

This is just a matter of writing things down. If we name the vertices so that x(1)\leq x(2) \leq \cdots \leq x(n), and we use the fact that for a non-negative integer random variable N we have {\mathbb E} [N] = \sum_{k\geq 1} {\mathbb P} [N\geq k], then we just have to observe that, for 1 \leq k \leq n/2,

{\mathbb P} [\min \{ |S| , \ |V-S| \}  \geq k] = x(n+1-k) - x(k) = |x(n+1-k)| + |x(k)|,

where the last equality holds because the median is zero, so that x(k) \leq 0 \leq x(n+1-k).
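Summing over k (the probability above is zero for k > n/2, and, when n is odd, the only missing entry is the median, which contributes |x(\lceil n/2 \rceil)| = 0), we get

{\mathbb E} [\min \{ |S| , \ |V-S| \} ] = \sum_{k=1}^{\lfloor n/2 \rfloor} \left( |x(n+1-k)| + |x(k)| \right) = \sum_i |x(i)|.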

Now that we have (4) and (5), if someone gives us an x \in {\mathbb R}^n such that

(6) \ \ \ \frac { \sum_{i,j} A(i,j) \cdot | x(i) - x(j)|^2 }{ \frac 1n \sum_{ij} |x(i)-x(j)|^2} = \epsilon

our first thought would be to plug x into the right-hand sides of (4) and (5) and hope for the best. Unfortunately, I suspect that such hope would be unwarranted. We can see, however, that

\sum_{i,j} A(i,j) \cdot | x(i) - x(j)|
\leq \sqrt{\sum_{i,j} A(i,j) \cdot | x(i) - x(j)|^2 } \cdot \sqrt{\sum_{i,j} A(i,j)}
= \sqrt{nd \cdot \sum_{i,j} A(i,j) \cdot | x(i) - x(j)|^2}

and

\sum_{ij} |x(i) - x(j)| \geq \frac 1 {\max_{ij} |x(i)-x(j)|} \cdot \sum_{ij} |x(i) - x(j)|^2

so, if (6) holds, we have

\frac{\sum_{ij} A(i,j) |x(i) - x(j)|}{\frac 1n \sum_{ij} |x(i)-x(j)|} \leq \sqrt{\epsilon d} \cdot \frac{\max_{ij} |x(i)-x(j)|}{\sqrt{ \frac 1 {n^2} \sum_{ij} |x(i)-x(j)|^2}}
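In more detail: bounding the numerator by the first inequality combined with (6), and the denominator by the second inequality, we get

\frac{\sum_{ij} A(i,j) |x(i) - x(j)|}{\frac 1n \sum_{ij} |x(i)-x(j)|} \leq \frac{\sqrt{ d \epsilon \cdot \sum_{ij} |x(i)-x(j)|^2 }}{\frac 1n \cdot \frac{1}{\max_{ij} |x(i)-x(j)|} \cdot \sum_{ij} |x(i)-x(j)|^2} = \sqrt{\epsilon d} \cdot \frac{\max_{ij} |x(i)-x(j)|}{\sqrt{ \frac 1 {n^2} \sum_{ij} |x(i)-x(j)|^2}}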

and, at least in the special case in which \sum_{ij} |x(i)-x(j)| = \Omega(n^2 \cdot \max_{ij}|x(i)-x(j)|), this gives us a cut of expansion O(\sqrt{\epsilon d}). I don’t think, however, that just using x would work in the general case.

(Note that this special case in which the very simple rounding works is also the case in which the simple rounding gives, on average, a nearly balanced partition. Interestingly, both the Leighton-Rao and the Arora-Rao-Vazirani analyses reduce the general case to a case in which the respective rounding algorithms produce nearly balanced partitions.)

The next try would be to find a y such that |y(i)-y(j)| is an approximation to |x(i)-x(j)|^2, and such that |y(i)| is an approximation to |x(i)|^2. My first guess was to choose y(i) := |x(i)|\cdot x(i), and this works!

Let us see what the right-hand sides of (4) and (5) look like for such a solution y. First, we may assume without loss of generality that we are given an x such that the median of x is zero. (Otherwise we can subtract the median from every entry without changing (6).) Since the map t \mapsto t\cdot |t| is monotone, the median of y is then also zero, so y is a feasible solution for both (4) and (5).

In order to relate the numerators of (4) and (5), evaluated at y, to the numerator of (6), we need to find a way to bound |y(i)-y(j)| in terms of |x(i)-x(j)|. A couple of attempts give

|y(i) - y(j)| \leq |x(i)-x(j)| \cdot ( |x(i)| + |x(j)|)
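To verify this inequality, write y(i)-y(j) = |x(i)|\cdot x(i) - |x(j)|\cdot x(j) and check the two cases: if x(i) and x(j) have the same sign, then |y(i)-y(j)| = |x(i)^2 - x(j)^2| = |x(i)-x(j)| \cdot |x(i)+x(j)| \leq |x(i)-x(j)| \cdot (|x(i)|+|x(j)|); if they have opposite signs, then |y(i)-y(j)| = x(i)^2 + x(j)^2 \leq (|x(i)|+|x(j)|)^2 = |x(i)-x(j)| \cdot (|x(i)|+|x(j)|), because in that case |x(i)-x(j)| = |x(i)|+|x(j)|.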

and it’s clear that the next step has to be Cauchy-Schwarz

\quad \sum_{ij} A(i,j) \cdot |y(i) - y(j)|
\leq \sqrt{ \sum_{ij} A(i,j) |x(i)-x(j)|^2} \cdot \sqrt{ \sum_{ij} A(i,j) ( |x(i)| + |x(j)|)^2}

So we have related the numerator of (6) to the numerators of (4) and (5), up to the extra term that is easy to bound (recall the graph is d-regular) as

\ \ \sum_{ij} A(i,j) ( |x(i)| + |x(j)|)^2
\leq \sum_{ij} A(i,j) ( 2\cdot |x(i)|^2 + 2\cdot |x(j)|^2)
= 4 d \sum_i |x(i)|^2
= 4d \sum_i |y(i)|

Now we use (6) to get

\ \ \ \sum_{ij} A(i,j) |x(i)-x(j)|^2
\leq \epsilon \cdot \frac 1n \sum_{i,j} |x(i)-x(j)|^2
= 2 \epsilon \sum_i x(i)^2 - \frac{2 \epsilon}{n} \left( \sum_i x(i) \right)^2
\leq 2\epsilon \sum_i |y(i)|

and, combining everything,

\frac{\sum_{ij} A(i,j) |y(i)-y(j)|} {2\sum_i |y(i)|} \leq \sqrt{2d\epsilon}.
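Explicitly, chaining the Cauchy-Schwarz step with the two bounds just derived,

\sum_{ij} A(i,j) |y(i)-y(j)| \leq \sqrt{\sum_{ij} A(i,j) |x(i)-x(j)|^2} \cdot \sqrt{\sum_{ij} A(i,j) (|x(i)|+|x(j)|)^2} \leq \sqrt{2\epsilon \sum_i |y(i)|} \cdot \sqrt{4d \sum_i |y(i)|} = 2 \sqrt{2d\epsilon} \cdot \sum_i |y(i)|,

and dividing by 2\sum_i |y(i)| gives the bound above.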

Although it is unnecessary to complete the proof, we can also see that, because the median of y is zero, we have

\sum_{i,j} |y(i)-y(j)| \geq n \sum_i |y(i)|
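To see this, note that, because the median of y is zero, every vertex i has at least n/2 vertices j such that y(j) lies (weakly) on the opposite side of zero, and for every such ordered pair |y(i)-y(j)| = |y(i)| + |y(j)|; each vertex appears as the first coordinate of at least n/2 such pairs and as the second coordinate of at least n/2 such pairs, so restricting the sum to these pairs already gives n \sum_i |y(i)|.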

and so

\frac{\sum_{ij} A(i,j) |y(i)-y(j)|} {\frac 1n \sum_{ij} |y(i)-y(j)|} \leq 2\cdot \sqrt{2d\epsilon}.

There is another way to look at this “rounding algorithm:” we see the solution x that we are given as defining a “distance function” dist(i,j):= |x(i)-x(j)|^2 among the vertices (written dist(\cdot,\cdot) to avoid confusion with the degree d), and we find a combinatorial solution by finding an r such that the cut is defined by the “ball of radius r” around the vertex i_{\min}, where x(i_{\min}) = \min_i x(i). (If t is the threshold picked by the algorithm, then r= |t-x(i_{\min})|.) Indeed, we pick a radius at random, with a certain distribution, and we are able to relate the probability that an edge (i,j) is cut to the value dist(i,j). This view is not very natural for the spectral partitioning algorithm, but it is in the spirit of what happens in the rounding of the Leighton-Rao and the Arora-Rao-Vazirani relaxations.

6 thoughts on “The Spectral Partitioning Algorithm”

  1. hi luca:

    in addition to the wonderful post(s), one thing that might be worth mentioning is a corresponding “cheating-cheeger’s inequality” in the balanced separator world. the advantage of this is that it is straightforward to prove and the quadratic dependence comes out, perhaps, in a much cleaner way. also, it might be the way to introduce cheeger’s inequality in a course.

    i am sure most of the readers of your blog have seen this, but for the others, here is a sketch of how it goes: for the sake of this discussion, assume that the graph is d-regular.

    let \lambda_2 denote the normalized (by d) second smallest eigenvalue of the laplacian.

    then \lambda_2 can be seen as the optimum of the following optimization problem (which arises as a relaxation to the sparsest cut problem in the natural way):

    assign a vector x_i in R^n to each vertex so as to minimize the following ratio:

    E_{e=ij} [||x_i-x_j||^2] / E_{i<j}[||x_i-x_j||^2]

    (the ratio of average l_2^2 distance across the edges to the average l_2^2 distance across all pairs of vertices).

    on the other hand, the corresponding relaxation to the balanced separator problem can be written as:

    assign a “unit” vector x_i in R^n to each vertex so as to minimize the following quantity:

    E_{e=ij} [||x_i-x_j||^2]

    such that E_{i<j}[||x_i-x_j||^2] = \Omega(1).

    the optimum of this program is the equivalent of \lambda_2 for the balanced separator problem.

    now suppose that this optimum is less than \epsilon. then it is easy to show that there is a balanced cut in the graph which cuts <= O(\sqrt{\epsilon}) fraction of edges: simply pick a random hyperplane cut passing through the origin.

    since E_{e=ij} [||x_i-x_j||^2] <= \epsilon, the probability that a typical edge is cut is O(\sqrt{\epsilon}). here we use the fact that two unit vectors x_i, x_j with inner product 1-\epsilon/2 are separated by a random hyperplane with probability at most O(\sqrt{\epsilon}).

    the fact that the cut is balanced comes from the fact that, on average, two vertices are a constant distance apart.

    -nisheeth

