The Limitations of the Spectral Partitioning Algorithm

We continue to talk about the problem of estimating the expansion of a graph G=(V,E), focusing on the closely related sparsest cut, defined as

\displaystyle \phi(G):= \min_{S} \frac {edges(S,V-S) } { \frac 1n \cdot |S| \cdot |V-S| }

The spectral paritioning algorithm first finds a vector x \in {\mathbb R}^n minimizing

(1) \ \ \  \displaystyle \frac{ \sum_{i,j} A(i,j) |x(i)-x(j)|^2 } { \frac 1n \cdot \sum_{i,j} |x(i)-x(j)|^2}

(where A is the adjacency matrix of G) and then finds the best cut (S,V-S) where S is of the form \{ i \in V: \ x(i) \geq t \} for a threshold t.

We proved that if the quantity in (1) is \epsilon and G is d-regular, then the algorithm will find a cut of sparsity at most \sqrt{8 \epsilon d}, and that if x is the eigenvector of the second eigenvalue, then it is an optimal solution to (1), and the cost of an optimal solution to (1) is a lower bound to \phi(G). This means that the algorithm finds a cut of sparsity at most \sqrt{8 \phi(G) d}.

Can the analysis be improved? Continue reading


Giving American Flags and Conceptual Contributions to Orphans

I am worried that, lately, I am agreeing too much with James Lee. I hope my friends will stage an intervention if I start saying things like “…this group acts by isometries on Hilbert space…”

I agree with his choice of favorite STOC’08 talks, but also with his comments on the statement on the importance of conceptual contributions in theoretical computer science, which was recently written by a group of distinguished theoreticians, was briefly discussed at the STOC’08 business meeting, and was commented upon here, here and here.

In Mr. Spritz Goes to Washington, Lisa and Homer help pass a bill to divert flights from over their house by tacking their air traffic bill on a “giving orphans American flags” bill.

I felt that the statement (and the subsequent public and private discussions) similarly mixes the unobjectionable with the potentially controversial. There have been two themes that are entirely agreeable.

One is the importance, and value, of simplicity in proofs. This goes without saying, and I think that the community does a good job in recognizing simple resolutions of long-standing open questions, as well as new simplified proofs of known results. In fact, I think it would be hard to find someone who suggests that, all other things being equal, a harder proof is preferable to a simple proof.

Another is the importance of introducing new concepts and models. Indeed, there would be no theoretical computer science without its “concepts,” and the field would die if it would stop innovating. Again, nobody could possibly argue against the importance of new definitions and models.

Then, there is something which is said in the statement in a way that is not self-evident:

Once understood, conceptual aspects tend to be viewed as obvious, which actually means that they have become fully incorporated in the worldview of the expert. This positive effect is actually a source of trouble in the evaluation process, because the evaluators forget that these contributions were not obvious at all before being made.

Indeed, our community should be warned of dismissing such contributions by saying “yes, but that’s obvious”; when somebody says such a thing, one should ask “was it obvious to you before reading this article?”

I think (and I echo things that were better said in comments that I linked to above) that if something looks obvious after reading a new paper for the first time (as opposed to looking back at a classical paper from twenty or more years ago), then the paper is not making a valuable conceptual contribution. The time that it takes to read a paper cannot be sufficient to “alter the worldview” of the reader: the concept must have been “in the air.”

To be sure, usually even the greatest discoveries are in the air when they are made, a point which I want to write about in a different post. The fundamental difference, however, is that reading about a good conceptual discovery for the first time is startling and exciting, like hearing a good joke for the first time. If an expert feels nothing when reading about a conceptual discovery, it is usually a very bad sign, although one can come up with various exceptions.

Often quoted exceptions are some early papers in the great foundations-of-cryptography revolution of 1982, especially the Goldwasser-Micali-Rackoff paper on Zero Knowledge, which ended up appearing in 1985. What was special about that line of work is that it was not in the air, but, rather, ahead of its time. Obviously, I wasn’t there, so I don’t know, but I guess that people at the time were not saying “this work is obvious,” but rather “what the hell is this?,” which is quite different.

A final remark is that when a paper presents a new model or problem (and I understand that these are the kind of papers that the statement refers to), there can be no “intrinsic” value attributed the model or the problem; the value will be in the understanding and general progress that will come by studying the model and finding solutions to the problem, and how this will connect with different problems, different models, and applications.

There seem to be only two ways to validate a new conceptual proposal: one is to present preliminary evidence that one can do interesting things with it. (I think, in this case, too much emphasis is put on achieving quantitative improvement; I am impressed enough to see known, hard, results, recovered in a completely different way.) The other is the “gut feeling” of the expert, which feels that the new ideas are the “right” way to think about the issue at hand.

It seems right to reject a paper that fails both tests.

Not Legislated Any More

Just before Valentine’s day 2004, San Francisco Mayor Gavin Newsom instructed city clerks to issue marriage licenses to same-sex couples. For the following month, couples from the City, and from all over California and beyond, lined up for blocks to get married, until a court order ruled that the mayor did not have the authority to issue the order.

The city and some of the couples counter-sued, and a complicated case (which eventually consolidated six different suits) made its way up the chain of the California court system.

The California legislature, meanwhile, twice voted, in 2005 and in 2007, to allow same-sex marriage (the only legislature in the country to have done so), and twice Governor Schwarzenegger vetoed the bill, claiming that the court case should first run its course. To complicate matters, in 2000 the people of California voted for Proposition 22, which reads “Only marriage between a man and a woman is valid or recognized in California.”

Today, the California Supreme Court ruled that prohibiting same-sex couples from marrying is against the California constitution, and this overrules Proposition 22.

So the punchline of this great cartoon no longer applies to California, at least until November. (When a constitutional amendment against same-sex marriage will be on the ballot. Schwarzenegger has stated his opposition to the proposed amendment.)

Update: some pictures Continue reading

The Spectral Partitioning Algorithm

[In which we prove the “difficult part” of Cheeger’s inequality by analyzing a randomized rounding algorithm for a continuous relaxation of sparsest cut.]

We return to this month’s question: if G=(V,E) is a d-regular graph, how well can we approximate its edge expansion h(G) defined as

h(G) := \min_{S\subseteq V} \frac{edges(S,V-S)} {\min \{ |S|, \ |V-S| \} }

and its sparsest cut \phi(G) defined as

\phi(G) := \min_{S\subseteq V} \frac{edges(S,V-S)} { \frac 1n \cdot |S| \cdot |V-S|} =  \min_{x\in \{0,1\}^n } \frac{ \sum_{i,j} A(i,j) \cdot |x(i)-x(j)|}{\frac 1n  \sum_{i,j}  |x(i)-x(j)|},

where A is the adjacency matrix of G.

We have looked at three continuous relaxations of \phi(G), the spectral gap, the Leighton-Rao linear program, and the Arora-Rao-Vazirani semidefinite program.

As we saw, the spectral gap of G, defined as the difference between largest and second largest eigenvalue of A, can be seen as the solution to a continuous optimization problem:

d-\lambda_2 = \min_{x\in {\mathbb R}^n} \frac {\sum_{i,j} A(i,j) \cdot |x(i)-x(j)|^2}{\frac 1n  \sum_{i,j}  |x(i)-x(j)|^2}.

It follows from the definitions that

d-\lambda_2 \leq \phi \leq 2h

which is the “easy direction” of Cheeger’s inequality, and the interesting thing is that d-\lambda_2 is never much smaller, and it obeys

(1) \ \ \ d-\lambda_2 \geq \frac {h^2}{2d} \geq \frac {\phi^2}{8d},

which is the difficult part of Cheeger’s inequality. When we normalize all quantities by the degree, the inequality reads as

\frac 1 8 \cdot \left( \frac \phi d \right)^2 \leq \frac 1 2 \cdot \left( \frac h d \right)^2 \leq \frac{d-\lambda_2}{d} \leq \frac \phi d  \leq 2 \frac h d .

I have taught (1) in three courses and used it in two papers, but I had never really understood it, where I consider a mathematical proof to be understood if one can see it as a series of inevitable steps. Many steps in the proofs of (1) I had read, however, looked like magic tricks with no explanation. Finally, however, I have found a way to describe the proof that makes sense to me. (I note that everything I will say in this post will be completely obvious to the experts, but I hope some non-expert will read it and find it helpful.)

We prove (1), as usual, by showing that given any x\in {\mathbb R}^n such that

(2) \ \ \ \frac {\sum_{i,j} A(i,j) \cdot |x(i)-x(j)|^2}{\frac 1n  \sum_{i,j}  |x(i)-x(j)|^2} = \epsilon

we can find a threshold t such that the cut (S,V-S) defined by S:= \{ i: x_i \geq t \} satisfies

(3)  \ \ \ \frac{edges(S,V-S)} { \min \{ |S|,\ |V-S| \} } \leq \sqrt{2 d \epsilon}


\frac{edges(S,V-S)} { \frac 1n \cdot |S| \cdot |V-S| } \leq \sqrt{8 d \epsilon}.

This not only gives us a proof of (1), but also an algorithm for finding sparse cuts when they exist: take a vector x which is an eigenvector of the second eigenvalue (or simply a vector for which the Rayleigh quotient in (2) is small), sort the vertices i according to the value of x(i), and find the best cut among the “threshold cuts.” This is the “spectral partitioning” algorithm.

This means that proving (1) amounts to studying an algorithm that “rounds” the solution of a continuous relaxation to a combinatorial solution, and there is a standard pattern to such arguments in computer science: we describe a randomized rounding algorithm, study its average performance, and then argue that there is a fixed choice that is at least as good as an average one. Here, in particular, we would like to find a distribution T over threshold, such that if we define S:= \{ i: x(i) \geq T\} as a random variable in terms of T we have

{\mathbb E} [ edges(S,V-S) ] \leq {\mathbb E}  [\min \{ |S|,\ |V-S| \} ] \cdot  \sqrt{2 d \epsilon}

and so, using linearity of expectation,

{\mathbb E} [ edges(S,V-S)  - \min \{ |S|,\ |V-S| \}  \cdot  \sqrt{2 d \epsilon}] \leq 0

from which we see that there must be a threshold in our sample space such that (3) holds. I shall present a proof that explicitly follows this pattern. It would be nice if we could choose T uniformly at random in the interval \left[\min_i x(i), \max_i x(i)\right], but I don’t think it would work. (Any reader can see a counterexample? I couldn’t, but we’ll come back to it.) Instead, the following works: assuming with no loss of generality that the median of \{ x(1),\ldots,x(n) \} is zero, T can be chosen so that |T|\cdot T is distributed uniformly in the interval \left[\min_i x(i), \max_i x(i)\right]. (This means that thresholds near the median are less likely to be picked than thresholds far from the median.) This choice seems to be a magic trick in itself, voiding the point I made above, but I hope it will become natural as we unfold our argument. Continue reading


I eventually found my way through the intimidatingly large program of the San Francisco International Film Festival, and to the festival itself.

The movie I most enjoyed was Sleep Dealer. It is set in a near-future where the kind of low-paying jobs now often held by immigrants from developing countries (driving taxis, constructions work, waiting tables) are done by machines that are controlled remotely via a virtual reality technology. And the machines are operated, of course, by low-paid workers in developing countries, so that the richer countries can have all the benefits of immigration, without the immigrants. During the Q&A, the filmmaker Alex Rivera made the point that (as far as he knows) this is the first movie to look into the future of developing countries. One often sees SF movies that imagine the future of New York, or London, but not of Tijuana or Delhi. One woman asked him how he reconciles the political message of his movie with the corporate sponsorship. (She noticed an acknowledgment to Coca Cola in the closing credits.) I liked his response: “we are all steeped in inescapable horror,” he started, “the Gap clothes I am wearing were made in a sweatshop, and the meat I ate for dinner came from animals that were treated inhumanely.” And he went on to say that we cannot fight everything, we have to keep on living, but for the big things, then it is worth trying to push back as much as possible. (I completely agree.)

Big Man Japan, about a middle-aged Japanese super-hero was also a lot of fun.

I also saw two movies from Chinese “sixth generation” directors. Still Life won the golden lion (the top prize) at the 2006 Venice film festival, and I should have known better than to go see it: European film festival juries are populated by the worst kind of film snobs, and watchable movies are not their thing. In the spirit of Italian neo-realismo, the movie is interested in seeing great societal change through the eyes of the “little people” that are affected by them, and via small, disjointed, stories. I don’t get it. I concede, however, that the scenes about the towns being demolished in preparation for the rising level of water after the Three Gorges Dam is completed, are incomparably more moving, if considerably less polished, than the same images in Manufactured Landscapes Umbrella was a fascinating documentary shot as part of a large project that will produce ten documentaries a year for ten years. Following umbrellas from the places where they are produced, to the places where they are traded, to the places where they are used, the movie looks into five classes of Chinese society, factory workers, merchants, college students, soldiers, and farmers. The shots inside a PLA training facility are fascinating, and the final segment was very moving, with a peasant complaining about the rising costs of farming, the lack of welfare, and then reminiscing about the various farming policies from the 1950s on, and finally weeping by just thinking about the time of the cultural revolution.

Relaxations of Edge Expansion and Their Duals (Part 2)

[In which we finally get to Leighton-Rao and ARV.]

Continuing our edge expansion marathon, we are going to see more ways in which the sparsest cut problem can be relaxed to polynomial-time solvable continuous problems. As before, our goal is to understand how this implies certificates of expansion.

We are talking about the sparsest cut \phi(G) of a graph G(V,E), which is

\phi(G) := \min_{S\subseteq V} \frac{edges(S,V-S)}{\frac 1n |S|\cdot V-S|} = \min_{x\in \{0,1\}^n} \frac { \sum_{i,j} A(i,j) |x(i)-x(j)|} { \frac 1n \sum_{i,j} |x(i)-x(j)|}

We have looked for a while at the eigenvalues gap of G. If G is d-regular and the second largest eigenvalue of G is \lambda_2, then the difference between largest and second largest eigenvalue obeys

d-\lambda_2 = \min_{x\in {\mathbb R}^n} \frac { \sum_{i,j} A(i,j) |x(i)-x(j)|^2} {\frac 1n \sum_{i,j} |x(i)-x(j)|^2} = \min_{m,x_1,\ldots,x_n\in {\mathbb R}^m} \frac {  \sum_{i,j} A(i,j) ||x_-x_j||^2} {\frac 1n \sum_{i,j} ||x_i-x_j||^2}

and, noting that |x(i)-x(j)|^2 = |x(i)-x(j)| when x\in \{0,1\}^n, we saw that d-\lambda_2 is a relaxation of \phi, and \phi \geq d-\lambda_2.

Leighton and Rao derive a continuous relaxation of \phi in quite a different way. They note that when x\in \{ 0,1\}^n, then the “distance”

d(i,j) := |x(i)-x(j)|

induces a semi-metric on V, that is, d(i,i)=0, d(i,j) = d(j,i), and the triangle inequality holds

d(i,j) \leq d(i,k) + d(k,j) \ \ \ \forall i,j,k

One can then consider the relaxation

(1) \ \ \ min_d \frac { \sum_{i,j} A(i,j) d(i,j)}  {\frac 1n \sum_{i,j} d(i,j)}

where the minimum is taken over all distance functions d: V\times V \rightarrow {\mathbb R} that define a semi-metric on V.

We can rewrite (1) as the linear program

(2) \ \ \ \begin{array}{l} \min \sum_{i,j} A(i,j) d(i,j)\\ subject\ to \\ \sum_{i,j} d(i,j)=n\\ d(i,j) \leq d(i,k) + d(k,j) \ \ \ \forall i,j,k\\ d(i,j) \geq 0 \ \ \ \forall i,j \end{array}

where we only write the triangle inequality constraints by having one variable d(i,j) for every unordered pair i,j.

The minimum of (2) gives a lower bound to the sparsest cut of G, and so every feasible solution for the dual of (2) gives a lower bound to \phi(G).

Before trying to write down the dual of (2), it is convenient to write (2) in a less efficient way. (This is akin to the fact that sometimes, when using induction, it is easier to prove a more general statement.) Instead of writing the triangle inequality for every three vertices, we write it for every arbitrary sequence of vertices.

(3) \ \ \ \begin{array}{l} \min \sum_{i,j} A(i,j) d(i,j)\\ subject\ to \\ \sum_{i,j} d(i,j)=n\\ d(i,j) \leq \sum_{(u,v) \in p} d(u,v) \ \ \ \forall i,j, \forall p \in P_{i,j} \\  d(i,j) \geq 0 \ \ \ \forall i,j\end{array}

where we denote by P_{i,j} the set of “paths” in the complete graph between i and j. (That is, P_{i,j} is the set of all sequences of vertices that start at i and end at j.)

Clearly (3) is the same as (2), in the sense that the extra inequalities present in (3) are implied by the triangle inequalities in (2). The dual of (3), however, is cleaner, and is given below

(4) \ \ \ \begin{array}{l} \max n\cdot y\\ \ y - \sum_{p\in P_{i,j}} y_p + \sum_{p: (i,j) \in p} y_p \leq A(i,j)\ \ \ \forall i,j\\ y,y_p \geq 0\end{array}

A feasible solution to (4) is thus a weighted set of paths and a parameter y. Before reasoning about the meaning of a solution, suppose that y,y_p satisfy the stronger constraints

(5) \ \ \ \begin{array}{l} \sum_{p\in P_{i,j}} y_p \geq y\\\sum_{p: (i,j) \in p} y_p \leq A(i,j) \end{array}

Then we can view the y_p : p \in P_{i,j} as defining a flow between i,j that carries at least y units of flow; furthermore, the union of all such flows uses at most a total capacity A(i,j) on each edge (i,j). We should think of it as an embedding of the complete graph scaled by y into G. I want to claim that this means that the sparsest cut of G is at least y times the sparsest cut of the complete graph, that is, at least y\cdot n.

To verify the claim, take any cut S,V-S. We know that such a cut separates |S| \cdot |V-S| pairs of vertices i,j, and that at least y units of flow are routed between each such pair, so that at least y\cdot |S| \cdot |V-S| units of flow cross the cut. Since every edge has capacity 1, we see that at least y\cdot  |S|\cdot  |V-S| edges of G must cross the cut.

It remains to bridge the difference between (4) or (5). One possibility is to show that a solution to (4) can be modified to a solution satisfying (5) with no loss in y. Another approach is to directly use (4) and a general weak duality argument to show directly that a solution satisfying (4) implies that G has at least the edge expansion of a complete graph with edge weights y.

There is, in fact, a broader principle at work, which we can verify by a similar argument: if \{ y_p\} is an assignment of weights to all possible paths (which we think of as defining a multicommodity flow), G is a graph with adjacency matrix A, H is a graph with adjacency matrix B, and we have

B(i,j) - \sum_{p\in P_{i,j}} y_p + \sum_{p: (i,j) \in p} y_p \leq A(i,j)\ \ \ \forall i,j

then we must have \phi(G) \geq \phi(H).

The fact that a feasible solution to (4) gives a lower bound to \phi(G) is a special case that applies to the case in which H is a scaled version of the complete graph.

This is a good point to pause, look back again at the bound coming from spectral considerations, and compare it with what we have here. Continue reading

Relaxations of Edge Expansion and Their Duals

In the previous post, we saw that the sparsest cut of a graph G=(V,E) (which is within a factor of 2 of the edge expansion of the graph), is defined as

\phi := \min_{S\subseteq V} \frac{edges(S,V-S)}{\frac 1n |S|\cdot |V-S|}

and can also be written as

(1) \ \ \ \phi = \min_{x\in \{0,1\}^V} \frac{\sum_{i,j} A(i,j) | x(i)-x(j)|}{\frac 1n \sum_{i,j} |x(i)-x(j)|}

while the eigenvalue gap of G, can be written as

(2) \ \ \ d-\lambda_2 = \min_{x\in {\mathbb R}^V} \frac{\sum_{i,j} A(i,j) | x(i)-x(j)|^2}{\frac 1n \sum_{i,j} |x(i)-x(j)|^2}

and so it can be seen as a relaxation of \phi, considering that
|x(i)-x(j)|=|x(i)-x(j)|^2 when x(i),x(j) are boolean.

Let us now take a broader look at efficiently computable continuous relaxations of sparsest cut, and see that the eigenvalue gap is a semidefinite programming relaxation of sparsest cut, that by adding the triangle inequality to it we derive the Arora-Rao-Vazirani relaxation, and that by removing the semidefinite constraint from the ARV relaxation we get the Leighton-Rao relaxation. Continue reading