
Dieter van Melkebeek, the conference chair of the Computational Complexity Conference (CCC), has started a discussion on whether to end the IEEE “sponsorship” of CCC. This is, I realize, a supremely boring topic, which seemingly affects only the handful of people who are tasked with organizing the conference. It does, however, affect to some extent all who ever paid a registration fee or published a paper in CCC, and I would like to discuss why I am, in the strongest possible way, in favor of leaving IEEE.

First of all, the term “sponsoring” here is a bit misleading. IEEE sponsors CCC not in the way Coca Cola sponsors the Olympics, but more in the way pirates sponsor navigation, or Elsevier sponsors scholarly publications. That is, it’s not that IEEE subsidizes the conference in exchange for exposure and good will; IEEE takes a huge chunk of our registration money, in exchange for making the lives of the local organizers harder. (I don’t mean to single out IEEE; when a professional society sponsors a conference it always works this way, but IEEE is particularly skilled at making the lives of the organizers hard.)

Last year I was an organizer of the complexity conference at Stanford. We had 108 attendees, who paid a total of $37,425 in registration fees, or $346.5 each on average. How did we spend this money?

IEEE charged a 20% overhead to all expenses; that’s about $60 per person and about $6,500 for the whole conference. This, I should emphasize, is for doing literally nothing, except creating problems. For example, our conference could not be “approved” until all the paperwork of the previous conference was closed; it wasn’t closed because IEEE was expecting some banking documents from the previous organizers, and I don’t remember if the issue was that the document does not exist in Portugal, or that they had already received the document and had lost it. We were not allowed to make the conference web site go live until this was resolved.

One of the perks of organizing the conference with IEEE is that they provide free banking in the United States. (If the conference were organized by a created-for-this-purpose non-profit, it could have its own permanent bank account for little or no fee.) Three weeks after we sent all the paperwork to have our IEEE bank account set up, we got an email from the person we paid $6,500 to “assist” us, saying “URGENT, URGENT, you have to send us the paperwork for the bank account”. After we reminded them that they had had it for three weeks, they replied, “oh yes, we do,” no apology. (And it still took a while to get the checks.)

The free banking does not include free registration handling, however. Here I accept responsibility for being foolish enough, after all this, to trust IEEE with their registration service. At a cost of more than $26 per person (that’s almost $3,000) they produced a registration website that seemed put together by a high school intern, and which required filling out five pages of… well, if you attended the conference you remember it.

Finally, IEEE press charged almost $2,000 (almost $18 per person) for the proceedings. Which proceedings, you may ask, since the conference had no proceedings? That’s a very good question. The charge of $1,925 was to take our papers, take the copyright, and then put the papers where it is impossible to find them, even if you subscribe to the IEEE digital library. (Seriously, try to google a paper that appeared in an ACM conference and one that appeared in an IEEE conference and see if you are able to get the IEEE paper. Say what you want about the ACM, at least they know how to build a web site.)

In summary, IEEE took $6,500 for doing nothing but causing delays, $3,000 for a terrible registration site, and $2,000 to hide our papers. That’s more than $100 per attendee, and about 30% of how we spent the registration fee. The rest went on food and room rent.
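For those who want to check the arithmetic, here is a small Python sketch that redoes the sums using the rounded figures quoted above. The variable names are made up for this example, and the share is computed against registration income, which is an assumption on my part (the text speaks of money spent, which was slightly higher because of the deficit).

```python
# Rounded figures quoted in the post; names are illustrative only.
attendees = 108
registration_income = 37_425

ieee_charges = {
    "overhead": 6_500,
    "registration site": 3_000,
    "proceedings": 2_000,
}

total_to_ieee = sum(ieee_charges.values())
per_attendee = total_to_ieee / attendees
share_of_income = total_to_ieee / registration_income

print(f"total paid to IEEE: ${total_to_ieee:,}")
print(f"per attendee: ${per_attendee:.2f}")
print(f"share of registration income: {share_of_income:.1%}")
```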

Why are we still doing this? One reason is that the only people that are really affected by this system are the local organizers, and once one is done with the job, one doesn’t want to hear of or talk about IEEE any more. The other reason is that there is a big initial effort that is needed to make the conference independent. One needs to start some kind of entity to own the conference (the IACR, for example, was founded pretty much for the purpose of running CRYPTO and the other crypto conferences), which needs a statute, officers, and a bank account, and all kinds of paperwork need to be done. After that, however, it’s smooth sailing and considerable savings.

Here is one thing that IEEE does: if the conference runs a deficit, they cover it. We did, in fact, run a deficit last year, of about $1,000; so IEEE “covered” it, but that just means that instead of $6,500 for the “sponsorship” they got $5,500. If, going forward, we budget with the intention of putting away $5,000 to $7,000 per year, the registration costs will go down slightly and over a few years we can put away, say, $20,000 that can be a cushion for any kind of unforeseen event, and from that point forward budget to a balanced budget, with notably lower registration fees, and with some years running a surplus and some years running a deficit. Plus, we own our papers, and people can find them!

Ok, this is as boring as could be expected. I thank all those that have read so far, and, for the sanity of future local organizers and the principle of keeping our hard-earned taxpayer money and our papers, please support making CCC an independent conference.

Link to CCC forum

I would like to pass along the following piece of information that I received from Tal Rabin: the next bi-annual Women in Theory Workshop will be this May 28-30, 2014, in New York City, immediately preceding STOC 2014, which will also take place there.

Applications are due by January 20, 2014.

The relevant information is at womenintheory.wordpress.com

The Simons Institute for the Theory of Computing at Berkeley started operations a couple of months ago, and it has been off to a great start. This semester there is a program on applications of real analysis to computer science, in which I am involved, and one on “big data,” whose workshops have been having such high attendance that they had to be organized offsite.

The institute itself is housed in a beautiful circular three-story building, half of whose second floor is an open space with sofas, whiteboards, tall ceilings, big windows, and exposed pipes on the ceiling, for an added loft-like look. If it had a ball pit, a foosball table, and free sushi it would look like the offices of a startup.

Next year, there will be programs on spectral graph theory, on applications of algebraic geometry, and on information theory.

Junior people (senior graduate students, postdocs, and junior assistant professors who received their PhD no more than six years ago) can participate in the programs of the institute as “fellows.” Information on the fellowships is at simons.berkeley.edu/fellows2014; the deadline to apply is December 15.

The theoretical computer science traveling circus returns to the San Francisco Bay Area: FOCS 2013 will be in Berkeley in about 3 weeks. The early registration deadline is this Friday, October 4.

The week before FOCS, the Simons Institute will have a workshop on parallel and distributed algorithms for learning and optimization, as part of its program on big data.

The conference will be held in the same hotel in which FOCS 2006 was held, in the Berkeley marina. It is not a bad idea to rent a car, which will make it easier to go to downtown Berkeley and to San Francisco for dinner. You will get to drive on the new and improved Eastern span of the Bay Bridge, which is totally safe, even if the bolts are defective and will need to be replaced, or so we are being told.

There is plenty to do around San Francisco, and this list of links maintained by the Simons institute is very helpful. A couple of suggestions specific to the weekend of the 26th: open studios will run in the Mission/Castro/Noe Valley area, and there are good coffee, food and drink options in the area after seeing the art. On Friday evening, the De Young museum is open until 9pm, and it has a full bar and live music in the lobby; downstairs there is a temporary exhibition of Bulgari jewels. The De Young was designed by Herzog and de Meuron, who designed the Tate Modern in London and the National Stadium in Beijing; it is one of the few beautiful modern buildings in San Francisco. On the 27th, there is a pre-Halloween costume 5k run in Golden Gate park. Also on the 27th, James Franco will be at the Castro theater to talk about his new book and about how awesome he is.

This week, the topic of my online course on graph partitioning and expanders is the computation of approximate eigenvalues and eigenvectors with the power method.

If $M$ is a positive semidefinite matrix (a symmetric matrix all of whose eigenvalues are nonnegative), then the power method is simply to pick a random vector $x\in \{ -1,+1 \}^n$, and compute $y:= M^k x$. If $k$ is of the order of $\frac 1 \epsilon \log \frac n \epsilon$, then one has a constant probability that

$\frac {y^T M y}{y^T y} \geq (1-\epsilon) \max_{x} \frac {x^T M x}{x^T x} = (1-\epsilon) \lambda_1$

where $\lambda_1$ is the largest eigenvalue of $M$. If we are interested in the Laplacian matrix $L = I - \frac 1d A$ of a $d$-regular graph, where $A$ is the adjacency matrix of the graph, this gives a way to compute an approximation of the largest eigenvalue, and a vector of approximately maximum Rayleigh quotient, which is useful to approximate Max Cut, but not to apply spectral partitioning algorithms. For those, we need a vector that approximates the eigenvector of the second smallest eigenvalue.

Equivalently, we want to approximate the second largest eigenvalue of the adjacency matrix $A$. The power method is easy to adjust to compute the second largest eigenvalue instead of the largest (if we know an eigenvector of the largest eigenvalue): after you pick the random vector, subtract the component of the vector that is parallel to the eigenvector of the largest eigenvalue. In the case of the adjacency matrix of a regular graph, subtract from every coordinate of the random vector the average of the coordinates.

The adjacency matrix is not positive semidefinite, but we can adjust it to be by adding a multiple of the identity matrix. For example we can work with $\frac 12 I + \frac 1{2d} A$. Then the power method reduces to the following procedure: pick randomly $x \sim \{ -1,1\}^n$, then subtract $\sum_i x_i/n$ from every entry of $x$, then repeat the following process $k = O\left( \frac 1 \epsilon \log \frac n \epsilon \right)$ times: for every entry $i$, assign $x_i := \frac 12 x_i + \frac 1 {2d} \sum_{j: (i,j) \in E} x_j$, that is, replace the value that the vector assigns to vertex $i$ with a convex combination of the current value and the current value of the neighbors. (Note that one iteration can be executed in time $O(|V|+|E|)$.)
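The procedure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the graph is given as adjacency lists `adj` of a simple $d$-regular graph, the function names are made up for this example, and the rescaling inside the loop is an addition that keeps the numbers in floating-point range (it does not change the Rayleigh quotient).

```python
import math
import random

def lazy_walk_power_method(adj, k, seed=0):
    """Run k steps of x := (1/2) x + (1/(2d)) A x from a random +-1 start.

    adj is a list of adjacency lists of a simple d-regular graph
    (a hypothetical input format chosen for this sketch).
    """
    rng = random.Random(seed)
    n, d = len(adj), len(adj[0])
    x = [rng.choice((-1.0, 1.0)) for _ in range(n)]
    mean = sum(x) / n
    x = [xi - mean for xi in x]          # remove the component along (1,...,1)
    for _ in range(k):
        # one step costs O(|V| + |E|): every edge is touched twice
        x = [0.5 * x[i] + sum(x[j] for j in adj[i]) / (2 * d) for i in range(n)]
        scale = max(abs(v) for v in x)   # rescaling only, not part of the procedure
        x = [v / scale for v in x]
    return x

def rayleigh_quotient(adj, x):
    """Rayleigh quotient of x for the Laplacian L = I - A/d."""
    d = len(adj[0])
    num = sum((x[u] - x[v]) ** 2 for u in range(len(adj)) for v in adj[u] if u < v)
    return num / (d * sum(v * v for v in x))

# Example: the cycle with 31 vertices, where lambda_2 = 1 - cos(2*pi/31).
n = 31
cycle = [[(i - 1) % n, (i + 1) % n] for i in range(n)]
print("exact lambda_2:", 1 - math.cos(2 * math.pi / n))
for k in (10, 100, 500):
    print(k, rayleigh_quotient(cycle, lazy_walk_power_method(cycle, k)))
```

The printed Rayleigh quotients decrease toward $\lambda_2$ as $k$ grows, but only slowly, which is the issue discussed next.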

The problem is that if we started from a graph whose Laplacian matrix has a second smallest eigenvalue $\lambda_2$, the matrix $\frac 12 I + \frac 1{2d} A$ has second largest eigenvalue $1- \frac {\lambda_2}2$, and if the power method finds a vector of Rayleigh quotient at least $(1-\epsilon) \cdot \left( 1- \frac {\lambda_2}2 \right)$ for $\frac 12 I + \frac 1{2d} A$, then that vector has Rayleigh quotient at most about $\lambda_2 + 2\epsilon$ for $L$, and unless we choose $\epsilon$ of the same order as $\lambda_2$ we get nothing. This means that the number of iterations has to be about $1/\lambda_2$, which can be quite large.

The video below (taken from this week’s lecture) shows how slowly the power method progresses on a small cycle with 31 vertices. It goes faster on the hypercube, which has a much larger $\lambda_2$.
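To put numbers on the comparison, here is a small Python sketch that evaluates $\lambda_2$ and $1/\lambda_2$ from the standard closed forms for the normalized Laplacian of the cycle and of the hypercube. The choice of a 5-dimensional hypercube (32 vertices, close to the 31 of the cycle) is an assumption made for this example, and the function names are illustrative.

```python
import math

def lambda2_cycle(n):
    # I - A/2 for the n-cycle has eigenvalues 1 - cos(2*pi*j/n), j = 0..n-1
    return 1.0 - math.cos(2.0 * math.pi / n)

def lambda2_hypercube(dim):
    # I - A/dim for the dim-dimensional hypercube has eigenvalues 2*j/dim, j = 0..dim
    return 2.0 / dim

examples = [
    ("cycle with 31 vertices", lambda2_cycle(31)),
    ("5-dimensional hypercube (32 vertices)", lambda2_hypercube(5)),
]
for name, lam2 in examples:
    print(f"{name}: lambda_2 = {lam2:.4f}, 1/lambda_2 = {1 / lam2:.1f}")
```

Since the number of iterations scales like $1/\lambda_2$, the gap between the two graphs is more than an order of magnitude even at this small size.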

A better way to apply the power method to find small eigenvalues of the Laplacian is to apply the power method to the pseudoinverse $L^+$ of the Laplacian. If the Laplacian of a connected graph has eigenvalues $0 = \lambda_1 < \lambda_2 \leq \cdots \leq \lambda_n$, then the pseudoinverse $L^+$ has eigenvalues $0, \frac 1 {\lambda_2}, \cdots, \frac 1 {\lambda_n}$ with the same eigenvectors, so approximately finding the largest eigenvalue of $L^+$ is the same problem as approximately finding the second smallest eigenvalue of $L$.

Although we do not have fast algorithms to compute $L^+$, what we need to run the power method is, for a given $x$, to find the $y$ such that $L y = x$, that is, to solve the linear system $Ly = x$ in $y$ given $L$ and $x$.

For this problem, Spielman and Teng gave an algorithm nearly linear in the number of nonzero entries of $L$, and new algorithms have been developed more recently (and with some promise of being practical) by Koutis, Miller and Peng and by Kelner, Orecchia, Sidford and Zhu.
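Here is a minimal Python sketch of the power method applied to $L^+$, where each application of $L^+$ is replaced by a solve of $Ly = x$. To keep the example self-contained, the solve uses plain conjugate gradient, which is a stand-in and not one of the nearly-linear-time solvers mentioned above; the function names and the adjacency-list input format are assumptions of this example.

```python
import math
import random

def apply_laplacian(adj, x):
    """Return L x for L = I - A/d, with adj the adjacency lists of a d-regular graph."""
    d = len(adj[0])
    return [x[i] - sum(x[j] for j in adj[i]) / d for i in range(len(adj))]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def remove_mean(x):
    m = sum(x) / len(x)
    return [v - m for v in x]

def solve_laplacian(adj, b, max_iter=1000):
    """Conjugate gradient for L y = b, assuming b is orthogonal to (1,...,1)."""
    y = [0.0] * len(b)
    r = list(b)
    p = list(b)
    rs = dot(r, r)
    for _ in range(max_iter):
        if rs < 1e-20:                   # squared residual norm small enough
            break
        Lp = apply_laplacian(adj, p)
        alpha = rs / dot(p, Lp)
        y = [yi + alpha * pi for yi, pi in zip(y, p)]
        r = [ri - alpha * li for ri, li in zip(r, Lp)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return remove_mean(y)

def inverse_power_method(adj, k, seed=0):
    """k steps of x := L^+ x (each step is one linear solve), with normalization."""
    rng = random.Random(seed)
    x = remove_mean([rng.choice((-1.0, 1.0)) for _ in adj])
    for _ in range(k):
        x = solve_laplacian(adj, x)
        norm = math.sqrt(dot(x, x))
        x = [v / norm for v in x]
    return x

def rayleigh_quotient(adj, x):
    return dot(x, apply_laplacian(adj, x)) / dot(x, x)

n = 31
cycle = [[(i - 1) % n, (i + 1) % n] for i in range(n)]
x = inverse_power_method(cycle, 10)
print("after 10 solves:", rayleigh_quotient(cycle, x))
print("exact lambda_2: ", 1 - math.cos(2 * math.pi / n))
```

On the 31-cycle, ten solves already give a Rayleigh quotient very close to $\lambda_2$, because the relevant ratio is now $\lambda_2/\lambda_3$ rather than $(1-\lambda_3/2)/(1-\lambda_2/2)$.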

Coincidentally, just this week, Nisheeth Vishnoi has completed his monograph Lx=b on algorithms to solve such linear systems and their applications. It’s going to be great summer reading for those long days at the beach.

Having (non-rigorously) defined the Laplacian operator in manifolds in the previous post, we turn to the proof of the Cheeger inequality in manifolds, which we restate below.

Theorem 1 (Cheeger’s inequality) Let ${M}$ be an ${n}$-dimensional smooth, compact, Riemannian manifold without boundary with metric ${g}$, let ${L:= - {\rm div} \nabla}$ be the Laplace-Beltrami operator on ${M}$, let ${0=\lambda_1 \leq \lambda_2 \leq \cdots }$ be the eigenvalues of ${L}$, and define the Cheeger constant of ${M}$ to be

$\displaystyle h(M):= \inf_{S\subseteq M : \ 0 < \mu(S) \leq \frac 12 \mu(M)} \ \frac{\mu_{n-1}(\partial(S))}{\mu(S)}$

where ${\partial (S)}$ is the boundary of ${S}$, ${\mu}$ is the ${n}$-dimensional measure, and ${\mu_{n-1}}$ is the ${(n-1)}$-dimensional measure defined using ${g}$. Then

$\displaystyle h(M) \leq 2 \sqrt{\lambda_2} \ \ \ \ \ (1)$

We begin by recalling the proof of the analogous result in graphs, and then we will repeat the same steps in the context of manifolds.

Theorem 2 (Cheeger’s inequality in graphs) Let ${G=(V,E)}$ be a ${d}$-regular graph, ${A}$ be its adjacency matrix, ${L:= I - \frac 1d A}$ be its normalized Laplacian matrix, ${0 = \lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_{|V|}}$ be the eigenvalues of ${L}$, and define ${\mu(S):= d \cdot |S|}$ for every subset of vertices ${S\subseteq V}$. Define the conductance of ${G}$ as

$\displaystyle \phi(G) := \min_{S\subseteq V: \ 0 < \mu(S) \leq \frac 12 \mu(V) } \frac{| \partial S|}{\mu(S)}$

where ${\partial S}$ is the set of edges with one endpoint in ${S}$ and one endpoint in ${\bar S}$; we also write ${\phi(S) := |\partial S|/\mu(S)}$ for the conductance of a single set ${S}$. Then

$\displaystyle \phi(G) \leq \sqrt{2 \lambda_2} \ \ \ \ \ (2)$
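As a sanity check of (2), here is a small Python sketch that computes the conductance of a tiny graph by brute force over all subsets and compares it with $\sqrt{2\lambda_2}$. It assumes the graph is the cycle with 8 vertices, for which $\lambda_2 = 1 - \cos(2\pi/8)$ is known in closed form; the function name and the adjacency-list format are choices made for this example, and the enumeration is exponential in $|V|$, so it is only meant for toy inputs.

```python
import math
from itertools import combinations

def conductance_brute_force(adj):
    """Minimum of |boundary(S)| / (d |S|) over nonempty S with |S| <= |V|/2."""
    n, d = len(adj), len(adj[0])
    best = math.inf
    for size in range(1, n // 2 + 1):
        for subset in combinations(range(n), size):
            inside = set(subset)
            boundary = sum(1 for u in subset for v in adj[u] if v not in inside)
            best = min(best, boundary / (d * size))
    return best

n = 8
cycle = [[(i - 1) % n, (i + 1) % n] for i in range(n)]
phi = conductance_brute_force(cycle)
lam2 = 1 - math.cos(2 * math.pi / n)
print("phi(G) =", phi)
print("sqrt(2 * lambda_2) =", math.sqrt(2 * lam2))
```

For the 8-cycle the best set is an arc of four consecutive vertices, cut by two edges, so the conductance is $2/(2\cdot 4) = 1/4$.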

1. Proof of the Cheeger inequality in graphs

We will use the variational characterization of the eigenvalues of the Laplacian ${L}$ of a graph ${G}$.

$\displaystyle \lambda_1 = \min_{f \in {\mathbb R}^V} \frac {f^T Lf}{f^T f} \ \ \ \ \ (3)$

and if ${f_1}$ is a minimizer in the above expression then

$\displaystyle \lambda_2 = \min_{f \in {\mathbb R}^V: \ f \perp f_1} \frac {f^T Lf}{f^T f}$

Following the definition of ${L}$ we see that

$\displaystyle f^T L f = \frac 1d \sum_{(u,v)\in E} |f_u - f_v |^2$

and so the minimum in (3) is 0, and it is achieved for ${f = (1,\cdots,1)}$. This means that

$\displaystyle \lambda_2 = \min_{f \in {\mathbb R}^V: \ \sum_v f_v = 0} \frac {\sum_{(u,v) \in E} |f_u-f_v|^2}{d \sum_v f_v^2} \ \ \ \ \ (4)$

The expression in the right-hand-side of (4) is an important one, and it is called the Rayleigh quotient of ${f}$, which we will denote by ${R(f)}$:

$\displaystyle R(f):= \frac {\sum_{(u,v) \in E} |f_u-f_v|^2}{d \sum_v f_v^2}$

It is also useful to consider the variant of the Rayleigh quotient where there are no squares; this does not have a standard name, so let us call it the ${\ell_1}$ Rayleigh quotient and denote it by ${R_1}$:

$\displaystyle R_1(g):= \frac {\sum_{(u,v) \in E} |g_u-g_v|}{d \sum_v |g_v|}$

The proof of the graph Cheeger inequality now continues with the proof of the following three facts.

Lemma 3 (Rounding of ${\ell_1}$ embeddings) For every non-negative vector ${g\in {\mathbb R}^V_{\geq 0}}$ there is a value ${t\geq 0}$ such that

$\displaystyle \phi(\{ v: g(v) > t \}) \leq R_1(g)$

Lemma 4 (Embedding of ${\ell_2^2}$ into ${\ell_1}$) For every non-negative vector ${f\in {\mathbb R}^V_{\geq 0}}$, we have

$\displaystyle R_1(f^2) \leq \sqrt{2 R(f)}$

Lemma 5 (From an eigenvector to a non-negative vector) For every ${f\in {\mathbb R}^V}$ such that ${\sum_v f_v =0}$ there is a non-negative ${f'\in {\mathbb R}^V_{\geq 0}}$ such that ${\mu(\{ v: f'(v) >0 \}) \leq \frac 12 \mu(V)}$ and such that

$\displaystyle R(f') \leq R(f)$

Now let us start from a function ${f}$ that optimizes (4), so that ${\sum_v f_v = 0}$ and ${R(f) = \lambda_2}$, then apply Lemma 5 to find a function ${f'}$ such that the volume of the vertices having positive coordinates in ${f'}$ is at most ${\frac 12 \mu(V)}$ and such that ${R(f') \leq R(f) = \lambda_2}$. Then consider the vector ${g\in {\mathbb R}^V}$ such that ${g_v := f'^2_v}$; by Lemma 4, we have ${R_1(g) \leq \sqrt{2 R(f')} \leq \sqrt{2\lambda_2}}$, and by Lemma 3 there is a threshold ${t}$ such that the set ${S:= \{ v: g(v) > t \}}$ has conductance ${\phi(S) \leq \sqrt {2\lambda_2}}$. Since ${S}$ is a subset of the vertices having positive coordinates in ${g}$, we have ${\mu(S) \leq \frac 12 \mu(V)}$, and so

$\displaystyle \phi(G) \leq \sqrt{2 \lambda_2 }$

which is the Cheeger inequality for graphs. It remains to prove the three lemmas.

Proof: of Lemma 3. For each threshold ${t}$, define the set

$\displaystyle S_t := \{ v: g_v > t \}$

The idea of the proof is that if we pick ${t}$ at random then the probability that an edge ${(u,v)}$ belongs to ${\partial S_t}$ is proportional to ${|g_u - g_v|}$ and the probability that ${v\in S_t}$ is proportional to ${|g_v|}$, so that the expected number of edges in ${\partial S_t}$ is proportional to the numerator of ${R_1(g)}$ and the expected number of vertices in ${S_t}$ is proportional to the denominator of ${R_1(g)}$; if ${R_1(g)}$ is small, it is not possible for ${|\partial S_t|/\mu(S_t)}$ to be large for every ${t}$.

To avoid having to normalize the range of ${t}$ to be between ${0}$ and ${1}$, instead of taking averages over a random choice of ${t}$, we will consider the integral over all values of ${t}$. We have

$\displaystyle \int_0^\infty |\partial (S_t) | {\rm d} t = \sum_{(u,v)\in E} |g_u - g_v |$

because we can write ${|\partial (S_t)| = \sum_{(u,v) \in E} I_{u,v} (t)}$, where ${I_{u,v} (t) = 1}$ if ${(u,v) \in \partial S_t}$ and ${I_{u,v}(t) = 0}$ otherwise, and we see that only the values of ${t}$ between ${g_u}$ and ${g_v}$ make ${I_{u,v}(t) =1}$, so ${\int_{0}^\infty I_{u,v} (t) {\rm d} t = |g_u - g_v|}$.

We also have

$\displaystyle \int_0^\infty d |S_t| {\rm d} t = d \sum_v |g_v|$

and if we denote by ${t^*}$ the threshold such that ${\phi(S_{t^*})}$ is smallest among all the ${\phi(S_t)}$, then

$\displaystyle \sum_{(u,v)\in E} |g_u - g_v | = \int_0^\infty |\partial (S_t) | {\rm d} t$

$\displaystyle \geq \int_0^\infty \phi(S_{t^*}) d|S_t| {\rm d} t$

$\displaystyle = \phi(S_{t^*}) d \sum_v |g_v|$

so that

$\displaystyle \phi(S_{t^*}) \leq \frac { \sum_{(u,v)\in E} |g_u - g_v | }{d \sum_v |g_v| } = R_1(g)$

$\Box$
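The proof is constructive: since the set $S_t$ only changes when $t$ crosses one of the values of $g$, it is enough to try those thresholds. Here is a minimal Python sketch of this threshold search, assuming a simple $d$-regular graph given as adjacency lists and a non-negative vector `g` that is not identically zero; the helper names are made up for this example.

```python
def l1_rayleigh(adj, g):
    """R_1(g) = sum over edges of |g_u - g_v|, divided by d * sum_v |g_v|."""
    d = len(adj[0])
    num = sum(abs(g[u] - g[v]) for u in range(len(adj)) for v in adj[u] if u < v)
    return num / (d * sum(abs(x) for x in g))

def conductance_of_set(adj, S):
    """|boundary(S)| / (d |S|) for a nonempty set of vertices S."""
    d = len(adj[0])
    boundary = sum(1 for u in S for v in adj[u] if v not in S)
    return boundary / (d * len(S))

def best_threshold_cut(adj, g):
    """Try every threshold at which {v : g_v > t} changes; return (phi, S)."""
    best = None
    for t in sorted(set(g) | {0.0}):
        S = {v for v in range(len(g)) if g[v] > t}
        if not S:
            continue                      # t is the maximum of g
        phi = conductance_of_set(adj, S)
        if best is None or phi < best[0]:
            best = (phi, S)
    return best

n = 8
cycle = [[(i - 1) % n, (i + 1) % n] for i in range(n)]
g = [0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 2.0, 1.0]
phi, S = best_threshold_cut(cycle, g)
print("R_1(g) =", l1_rayleigh(cycle, g))
print("best threshold set:", sorted(S), "with conductance", phi)
```

In this example $R_1(g) = 4/12 = 1/3$, and the threshold $t=0$ gives the set $\{4,5,6,7\}$ of conductance $1/4$, as the lemma promises.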

Proof: of Lemma 4. Let us consider the numerator of ${R_1(f^2)}$; it is:

$\displaystyle \sum_{(u,v)\in E} |f^2_u - f^2_v|$

$\displaystyle = \sum_{(u,v)\in E} |f_u - f_v| \cdot (f_u + f_v)$

$\displaystyle \leq \sqrt{\sum_{(u,v)\in E} |f_u - f_v|^2} \sqrt{ \sum_{(u,v)\in E}(f_u + f_v)^2 }$

(we used Cauchy-Schwarz)

$\displaystyle \leq \sqrt{R(f) \cdot d\sum_v f_v^2} \sqrt{ \sum_{(u,v)\in E} 2f_u^2 + 2f_v^2 }$

(we used the definition of ${R}$ and the inequality ${(a+b)^2 \leq 2a^2 + 2b^2}$, which is Cauchy-Schwarz again)

$\displaystyle = \sqrt{R(f) \cdot d\sum_v |f_v|^2} \sqrt{ 2 d\sum_{v} f_v^2 }$

$\displaystyle = \sqrt{2 R(f) } \cdot d \sum_v f_v^2$

And so

$\displaystyle R_1(f^2) = \frac {\sum_{(u,v)\in E} |f^2_u - f^2_v|}{d \sum_v f_v^2} \leq \sqrt{2 R(f) }$

$\Box$
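A quick numerical check of Lemma 4 can be reassuring. The Python sketch below evaluates $\sqrt{2R(f)} - R_1(f^2)$ on random non-negative vectors; the 12-cycle, the random seed, and the helper names are arbitrary choices made for this example, and of course a test on random inputs is an illustration, not a proof.

```python
import math
import random

def edges(adj):
    return [(u, v) for u in range(len(adj)) for v in adj[u] if u < v]

def rayleigh(adj, f):
    d = len(adj[0])
    return sum((f[u] - f[v]) ** 2 for u, v in edges(adj)) / (d * sum(x * x for x in f))

def l1_rayleigh(adj, g):
    d = len(adj[0])
    return sum(abs(g[u] - g[v]) for u, v in edges(adj)) / (d * sum(abs(x) for x in g))

def lemma4_gap(adj, f):
    """sqrt(2 R(f)) - R_1(f^2); Lemma 4 says this is >= 0 for non-negative f."""
    squared = [x * x for x in f]
    return math.sqrt(2 * rayleigh(adj, f)) - l1_rayleigh(adj, squared)

n = 12
cycle = [[(i - 1) % n, (i + 1) % n] for i in range(n)]
rng = random.Random(1)
worst = min(lemma4_gap(cycle, [rng.random() for _ in range(n)]) for _ in range(1000))
print("smallest gap over 1000 random vectors:", worst)
```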

Proof: of Lemma 5. Let ${m}$ be the median of ${f}$, and consider ${\bar f}$ defined as ${\bar f_v := f_v - m}$. We have

$\displaystyle R(\bar f) \leq R(f)$

because the numerators of ${R(\bar f)}$ and ${R(f)}$ are the same (the additive term ${-m}$ cancels). The denominators are such that

$\displaystyle \sum_v \bar f_v^2 = || \bar f||^2 = || f||^2 + || - m \cdot {\bf 1} ||^2 \geq || f||^2 = \sum_v f_v^2$

because ${f}$ and the vector ${{\bf 1} = (1,\ldots,1)}$ are orthogonal, and so by Pythagoras’s theorem the length-squared of ${\bar f = f - m {\bf 1}}$ equals the length-squared of ${f}$ plus the length-squared of ${-m {\bf 1}}$.

Let us define ${f^+_v := \max\{ 0, \bar f_v\}}$ and ${f^-_v := \max \{ 0, -\bar f_v\}}$ so that ${\bar f = f^+ - f^-}$. We use the following fact:

Fact 6 Let ${a,b \in {\mathbb R}^V_{\geq 0}}$ be disjointly supported non-negative vectors (“disjointly supported” means that they are non-zero on disjoint subsets of coordinates), then

$\displaystyle \min\{ R(a) , R(b) \} \leq R(a-b)$

Proof: The numerator of ${R(a-b)}$ is

$\displaystyle \sum_{(u,v)\in E} |a_v - b_v + b_u - a_u|^2 \geq \sum_{(u,v)\in E} |a_v - a_u|^2 + |b_v - b_u|^2$

and, using orthogonality and Pythagoras’s theorem, the denominator of ${R(a-b)}$ is

$\displaystyle d || a- b||^2 = d||a ||^2 + d|| b||^2$

The fact now follows from the inequality

$\displaystyle \min \left \{ \frac {n_1}{d_1} , \frac{n_2}{d_2} \right \} \leq \frac{n_1+n_2}{d_1+d_2}$

$\Box$

The lemma now follows by observing that ${f^+}$ and ${f^-}$ are non-negative and disjointly supported, so

$\displaystyle \min\{ R(f^+) , R(f^-) \} \leq R(\bar f) \leq R(f)$

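and that both ${f^+}$ and ${f^-}$ have at most ${|V|/2}$ non-zero coordinates, because at most half of the entries of ${f}$ are strictly larger than the median and at most half are strictly smaller. $\Box$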
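The construction in the proof of Lemma 5 translates directly into code. Here is a minimal Python sketch, assuming a simple $d$-regular graph given as adjacency lists and a non-constant vector `f` with zero sum; the function names are illustrative, and `statistics.median` is used as the median (for an even number of entries it returns the average of the two middle values, which is still a valid median for this argument).

```python
import random
import statistics

def rayleigh(adj, f):
    """R(f) = sum over edges of (f_u - f_v)^2, divided by d * sum_v f_v^2."""
    d = len(adj[0])
    num = sum((f[u] - f[v]) ** 2 for u in range(len(adj)) for v in adj[u] if u < v)
    return num / (d * sum(x * x for x in f))

def nonnegative_restriction(adj, f):
    """Shift f by its median, split into positive and negative parts,
    and return whichever nonzero part has the smaller Rayleigh quotient."""
    m = statistics.median(f)
    shifted = [x - m for x in f]
    f_plus = [max(0.0, x) for x in shifted]
    f_minus = [max(0.0, -x) for x in shifted]
    candidates = [c for c in (f_plus, f_minus) if any(c)]
    return min(candidates, key=lambda c: rayleigh(adj, c))

n = 10
ring = [[(i - 1) % n, (i + 1) % n] for i in range(n)]
rng = random.Random(7)
f = [rng.uniform(-1.0, 1.0) for _ in range(n)]
mean = sum(f) / n
f = [x - mean for x in f]                 # enforce sum_v f_v = 0
f_prime = nonnegative_restriction(ring, f)
print("R(f)  =", rayleigh(ring, f))
print("R(f') =", rayleigh(ring, f_prime))
print("support size of f':", sum(1 for x in f_prime if x > 0))
```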

2. Proof of the Cheeger inequality in manifolds

We will now translate the proof of the graph Cheeger inequality to the setting of manifolds.

As you may remember, we started off by saying that ${L}$ is symmetric and so all its eigenvalues are real and they are given by the variational characterization. Now we are already in trouble because the operator ${L}$ on manifolds cannot be thought of as a matrix, so what does it mean for it to be symmetric? The consequence of symmetry that is exploited in the analysis of the spectrum of symmetric matrices is the fact that if ${A}$ is symmetric, then for every ${x,y}$ we have

$\displaystyle \langle x,Ay \rangle = x^T Ay = (A^Tx)^Ty = (Ax)^Ty = \langle Ax,y \rangle$

and the property ${\langle x, Ay \rangle = \langle Ax , y \rangle}$ makes no reference to coordinates, and it is well defined even for linear operators over infinite-dimensional spaces, provided that there is a notion of inner product. If we define the inner product

$\displaystyle \langle f,g \rangle := \int_M f(x)\cdot g(x) \ {\rm d} \mu(x)$

on functions ${f: M \rightarrow {\mathbb R}}$, and more generally

$\displaystyle \langle f,g \rangle := \int_M \langle f(x), g(x)\rangle_X \ {\rm d} \mu(x)$

for functions ${f: M \rightarrow X}$, where ${X}$ is a vector space with inner product ${\langle \cdot,\cdot\rangle_X}$, then we can say that an operator ${A}$ is self-adjoint if

$\displaystyle \langle f, Ag \rangle = \langle Af ,g \rangle$

for all (appropriately restricted) functions ${f,g}$. If ${M}$ is compact, this property is true for the Laplacian, and, in particular, ${-{\rm div}}$ and ${\nabla}$ are adjoints of each other, that is,

$\displaystyle \langle \nabla f, g \rangle = \langle f, - {\rm div} g \rangle$

(The discrete analog would be that ${C^T}$ is the transpose of ${C}$.)

Self-adjointness (and appropriate conditions on ${M}$) imply a version of the spectral theorem and of the variational characterization. In particular, all eigenvalues of ${L}$ are real, and if there is a minimum one then it is

$\displaystyle \lambda_1 = \min_{f: M \rightarrow {\mathbb R}} \frac {\langle f, Lf \rangle}{\langle f,f\rangle}$

and if ${f_1}$ is a minimizer of the above, then

$\displaystyle \lambda_2 = \min_{f: M \rightarrow {\mathbb R},\ \langle f, f_1 \rangle = 0} \frac {\langle f, Lf \rangle}{\langle f,f\rangle}$

(The minimization is quantified over all functions that are square-integrable, and the minimum is achieved because if ${M}$ is compact then the space of such functions is also compact and the cost function that we are minimizing is continuous. In this post, whenever we talk about “all functions,” it should be understood that we are restricting to whatever space of functions makes sense in the context.)

From the property that ${-{\rm div}}$ and ${\nabla}$ are adjoint, we have

$\displaystyle \langle f, Lf \rangle = \langle f, -{\rm div} \nabla f \rangle = \langle \nabla f,\nabla f \rangle$

so

$\displaystyle \lambda_1 = \min_{f:M \rightarrow {\mathbb R}} \frac{ \int_M \langle \nabla f(x),\nabla f(x) \rangle \ {\rm d} \mu(x) } {\int_M f^2(x) \ {\rm d} \mu (x)}$

where the Rayleigh quotient

$\displaystyle R(f):= \frac{ \int_M \langle \nabla f(x),\nabla f(x) \rangle \ {\rm d} \mu(x) } {\int_M f^2(x) \ {\rm d} \mu (x)} = \frac {\int_M ||\nabla f(x)||^2 \ {\rm d} \mu (x)}{\int_M f^2(x) \ {\rm d} \mu (x)}$

is always non-negative, and it is zero for constant ${f\equiv 1}$, so we see that ${\lambda_1=0}$ and

$\displaystyle \lambda_2 = \min_{f:M \rightarrow {\mathbb R} : \ \int_M f \ {\rm d} \mu = 0} R(f)$

By analogy with the graph case, we define the “${\ell_1}$ Rayleigh quotient”

$\displaystyle R_1(g) := \frac {\int_M ||\nabla g(x)|| \ {\rm d} \mu (x)}{\int_M |g(x)| \ {\rm d} \mu (x)}$

And we can prove the analogs of the lemmas that we proved for graphs.

Lemma 7 (Rounding of ${\ell_1}$ embeddings) For every non-negative function ${g: M \rightarrow {\mathbb R}_{\geq 0}}$ there is a value ${t\geq 0}$ such that

$\displaystyle h(\{ x: g(x) > t \}) \leq R_1(g)$

where the Cheeger constant ${h(S)}$ of a subset ${S\subseteq M}$ of the manifold is

$\displaystyle h(S) := \frac{\mu_{n-1} (\partial S)}{\mu(S)}$

Lemma 8 (Embedding of ${\ell_2^2}$ into ${\ell_1}$) For every non-negative function ${f: M \rightarrow {\mathbb R}_{\geq 0}}$, we have

$\displaystyle R_1(f^2) \leq 2 \sqrt{R(f)}$

Lemma 9 (From an eigenfunction to a non-negative function) For every function ${f: M \rightarrow {\mathbb R}}$ such that ${\int_M f \ {\rm d} \mu =0}$ there is a non-negative ${f': M \rightarrow {\mathbb R}_{\geq 0}}$ such that ${\mu(\{ x: f'(x) >0 \}) \leq \frac 12 \mu(M)}$ and such that

$\displaystyle R(f') \leq R(f)$

Let us see the proof of these lemmas.

Proof: of Lemma 7. For each threshold ${t}$, define the set

$\displaystyle S_t := \{ x: g(x) > t \}$

Let ${t^*}$ be a threshold for which ${h(S_{t^*})}$ is minimized.

We will integrate the numerator and denominator of ${h(S_t)}$ over all ${t}$. The coarea formula for nonnegative functions is

$\displaystyle \int_M || \nabla g || {\rm d} \mu = \int_{0}^\infty \mu_{n-1} ( \partial \{ x: g(x) > t \}) {\rm d} t$

and we also have

$\displaystyle \int_M |g| {\rm d} \mu = \int_{0}^\infty \mu ( \{ x: g(x) > t \}) {\rm d} t$

which combine to

$\displaystyle \int_M || \nabla g || {\rm d} \mu = \int_0^\infty \mu_{n-1} ( \partial S_t) {\rm d} t$

$\displaystyle = \int_0^\infty h(S_t) \mu(S_t) {\rm d} t$

$\displaystyle \geq h(S_{t^*}) \int_0^\infty \mu(S_t) {\rm d} t$

$\displaystyle = h(S_{t^*}) \int_M g {\rm d} \mu$

so that

$\displaystyle h(S_{t^*}) \leq \frac{\int_M || \nabla g || {\rm d} \mu }{ \int_M g {\rm d} \mu} = R_1(g)$

$\Box$
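The coarea formula used in this proof can be checked numerically in the simplest possible manifold. The Python sketch below assumes that $M$ is the circle of length $2\pi$ and that $g(x) = 1 + \cos x$ (both are choices made for this example, as are the grid sizes): the left-hand side becomes $\int |g'(x)| \, {\rm d}x$, and $\mu_0(\partial S_t)$ is just the number of endpoints of the arcs that make up $S_t$.

```python
import math

N = 2000                                  # grid points on the circle
xs = [2 * math.pi * i / N for i in range(N)]
g = [1 + math.cos(x) for x in xs]

# Left-hand side: the integral of |g'| is the total variation of g around the circle.
lhs = sum(abs(g[(i + 1) % N] - g[i]) for i in range(N))

# Right-hand side: integrate over t the number of boundary points of {x : g(x) > t}.
T = 200                                   # number of thresholds in (0, 2)
dt = 2.0 / T
rhs = 0.0
for j in range(T):
    t = (j + 0.5) * dt
    boundary_points = sum(1 for i in range(N) if (g[i] > t) != (g[(i + 1) % N] > t))
    rhs += boundary_points * dt

print("integral of |grad g|:", lhs)
print("integral over t of the boundary size:", rhs)
```

Both quantities come out equal to $4$: the integral of $|\sin x|$ over the circle on one side, and two boundary points for each $t$ in $(0,2)$ on the other.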

Proof: of Lemma 8. Let us consider the numerator of ${R_1(f^2)}$; it is:

$\displaystyle \int_M || \nabla f^2|| {\rm d} \mu$

We can apply the chain rule, and see that

$\displaystyle \nabla f^2(x) = 2f(x) \cdot \nabla f(x)$

which implies

$\displaystyle \int_M || \nabla f^2 || {\rm d} \mu$

$\displaystyle = \int_M || 2f(x) \nabla f(x)|| {\rm d} \mu(x)$

$\displaystyle = \int_M 2|f(x)| \cdot ||\nabla f(x)|| {\rm d} \mu (x)$

and, after applying Cauchy-Schwarz,

$\displaystyle \leq \sqrt{\int_M 4 f^2(x) {\rm d} \mu(x)} \cdot \sqrt{\int_M ||\nabla f(x)||^2 {\rm d} \mu(x)}$

$\displaystyle = 2 \cdot \left( \int_M f^2 {\rm d} \mu \right) \cdot \sqrt{ R(f)}$

And so

$\displaystyle R_1(f^2) \leq 2 \sqrt{R(f) }$

$\Box$
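The bound derived in this proof, $R_1(f^2) \leq 2\sqrt{R(f)}$, can be checked by hand on the circle of length $2\pi$ with $f(x) = 1 + \cos x$: there $R(f) = \pi/(3\pi) = 1/3$ and $R_1(f^2) = 8/(3\pi)$. The Python sketch below redoes the computation numerically; the choice of manifold and of $f$, the grid size, and the use of centered finite differences for the derivative are all assumptions of this example.

```python
import math

def circle_quotients(f, N=20000):
    """Return (R(f), R_1(f^2)) for a function f on the circle of length 2*pi,
    using a uniform grid and centered finite differences for f'."""
    h = 2 * math.pi / N
    vals = [f(i * h) for i in range(N)]
    deriv = [(vals[(i + 1) % N] - vals[i - 1]) / (2 * h) for i in range(N)]
    int_f2 = sum(v * v for v in vals) * h
    R = sum(dv * dv for dv in deriv) * h / int_f2
    # by the chain rule the gradient of f^2 is 2 f f'
    R1_of_square = sum(abs(2 * v * dv) for v, dv in zip(vals, deriv)) * h / int_f2
    return R, R1_of_square

R, R1 = circle_quotients(lambda x: 1 + math.cos(x))
print("R(f) =", R)
print("R_1(f^2) =", R1)
print("2 sqrt(R(f)) =", 2 * math.sqrt(R))
```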

Proof: of Lemma 9. Let ${m}$ be a median of ${f}$, and consider ${\bar f}$ defined as ${\bar f(x) := f(x) - m}$. We have

$\displaystyle R(\bar f) \leq R(f)$

because the numerators of ${R(\bar f)}$ and ${R(f)}$ are the same (the derivatives of functions that differ by a constant are identical) and the denominators are such that

$\displaystyle \int_M \bar f^2{\rm d} \mu$

$\displaystyle = \int_M f^2 - 2mf + m^2 {\rm d} \mu$

$\displaystyle = \int_M f^2 + m^2 {\rm d} \mu$

$\displaystyle \geq \int_M f^2 {\rm d} \mu$

where we used the fact the integral of ${f}$ is zero.

Let us define ${f^+(x) := \max\{ 0, \bar f(x)\}}$ and ${f^-(x) := \max \{ 0, -\bar f(x)\}}$ so that ${\bar f(x) = f^+(x) - f^-(x)}$. We use the following fact:

Fact 10 Let ${a,b : M \rightarrow {\mathbb R}_{\geq 0}}$ be disjointly supported non-negative functions (“disjointly supported” means that they are non-zero on disjoint subsets of inputs), then

$\displaystyle \min\{ R(a) , R(b) \} \leq R(a-b)$

Proof: We begin with the following observation: if ${a}$ is a non-negative function, and ${a(x)=0}$, then ${\nabla a(x) = {\bf 0}}$, because ${x}$ has to be a local minimum.

Consider the expression ${||\nabla (a-b)||^2}$ occurring in the numerator of ${R(a-b)}$. We have

$\displaystyle ||\nabla (a-b)||^2$

$\displaystyle = || \nabla a - \nabla b ||^2$

$\displaystyle = || \nabla a||^2 + || \nabla b||^2 - 2 \langle \nabla a,\nabla b \rangle$

But

$\displaystyle \langle \nabla a,\nabla b \rangle = 0$

because for every ${x}$ at least one of ${a(x)}$ or ${b(x)}$ is zero, and so at least one of ${\nabla a(x)}$ or ${\nabla b(x)}$ is zero.

Using this fact, we have that the numerator of ${R(a-b)}$ is equal to the sum of the numerators of ${R(a)}$ and ${R(b)}$:

$\displaystyle \int_M ||\nabla (a-b)||^2 {\rm d} \mu = \int_M ||\nabla a||^2 {\rm d} \mu + \int_M ||\nabla b||^2 {\rm d} \mu$

and the denominator of ${R(a-b)}$ is also the sum of the denominators of ${R(a)}$ and ${R(b)}$:

$\displaystyle \int_M (a-b)^2 {\rm d} \mu$

$\displaystyle =\int_M a^2 {\rm d} \mu + \int_M b^2 {\rm d} \mu - 2 \int_M ab {\rm d} \mu$

$\displaystyle =\int_M a^2 {\rm d} \mu + \int_M b^2 {\rm d} \mu$

because ${a(x)b(x)=0}$ for every ${x}$. The fact now follows from the inequality

$\displaystyle \min \left \{ \frac {n_1}{d_1} , \frac{n_2}{d_2} \right \} \leq \frac{n_1+n_2}{d_1+d_2}$

$\Box$

The lemma now follows by observing that ${f^+}$ and ${f^-}$ are non-negative and disjointly supported, so

$\displaystyle \min\{ R(f^+) , R(f^-) \} \leq R(\bar f) \leq R(f)$

and that both ${f^+}$ and ${f^-}$ have a support of volume at most ${\frac 12 \mu(M)}$. $\Box$

If anybody is still reading, it is worth observing a couple of differences between the discrete proof and the continuous proof.

The ${\ell_1}$ Rayleigh quotient is defined slightly differently in the continuous case. It would correspond to defining it as

$\displaystyle \frac {\sum_{u\in V} \sqrt{\sum_{v: (u,v)\in E} (f_u-f_v)^2 } }{d \sum_v |f_v|}$

in the discrete case.

If ${a,b \in {\mathbb R}^V}$ are disjointly supported and nonnegative, the sum of the numerators of the Rayleigh quotients ${R(a)}$ and ${R(-b)}$ can be strictly smaller than the numerator of ${R(a-b)}$, while we always have equality in the continuous case. In the discrete case, the numerator of ${R(a-b)}$ can be up to twice the sum of the numerators of ${R(a)}$ and ${R(b)}$ (this fact is useful, but it did not come up in this proof), while again we have exact equality in the continuous case.
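A single edge is enough to see the gap in the discrete case. In the Python sketch below (a toy example, with names chosen for illustration), $a$ is supported on one endpoint and $b$ on the other; the sum of the numerators of $R(a)$ and $R(b)$ is $2$, while the numerator of $R(a-b)$ is $4$.

```python
def numerator(edge_list, f):
    """Sum over edges of (f_u - f_v)^2, the numerator of the Rayleigh quotient."""
    return sum((f[u] - f[v]) ** 2 for u, v in edge_list)

edge_list = [(0, 1)]                      # a graph with a single edge
a = [1.0, 0.0]                            # supported on vertex 0
b = [0.0, 1.0]                            # supported on vertex 1
a_minus_b = [x - y for x, y in zip(a, b)]

num_a = numerator(edge_list, a)
num_b = numerator(edge_list, b)
num_diff = numerator(edge_list, a_minus_b)
print("numerators of R(a) and R(b):", num_a, num_b)
print("numerator of R(a-b):", num_diff)
```

The extra contribution comes from the cross term $2 a_u b_v$ on the edge that joins the two supports, which has no analog in the continuous case.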

The chain rule calculation

$\displaystyle \nabla f^2 = 2f \nabla f$

corresponds to the step

$\displaystyle f_v^2 - f_u^2 = (f_v + f_u) \cdot (f_v - f_u)$

In the continuous case, ${f_v}$ and ${f_u}$ are “infinitesimally close”, so we can approximate ${f_v + f_u}$ by ${2f_v}$.

Readers of in theory have heard about Cheeger’s inequality a lot. It is a relation between the edge expansion (or, in graphs that are not regular, the conductance) of a graph and the second smallest eigenvalue of its Laplacian (a normalized version of the adjacency matrix). The inequality gives a worst-case analysis of the “sweep” algorithm for finding sparse cuts, it shows a necessary and sufficient condition for a graph to be an expander, and it relates the mixing time of a graph to its conductance.

Readers who have heard this story before will recall that a version of this result for vertex expansion was first proved by Alon and Milman, and the result for edge expansion appeared in a paper of Dodziuk, all from the mid-1980s. The result, however, is not called Cheeger’s inequality just because of Stigler’s rule: Cheeger proved in the 1970s a very related result on manifolds, of which the result on graphs is the discrete analog.

So, what is the actual Cheeger’s inequality?

Theorem 1 (Cheeger’s inequality) Let ${M}$ be an ${n}$-dimensional smooth, compact, Riemannian manifold without boundary with metric ${g}$, let ${L:= - {\rm div} \nabla}$ be the Laplace-Beltrami operator on ${M}$, let ${0=\lambda_1 \leq \lambda_2 \leq \cdots }$ be the eigenvalues of ${L}$, and define the Cheeger constant of ${M}$ to be

$\displaystyle h(M):= \inf_{S\subseteq M : \ 0 < \mu(S) \leq \frac 12 \mu(M)} \ \frac{\mu_{n-1}(\partial(S))}{\mu(S)}$

where ${\partial (S)}$ is the boundary of ${S}$, ${\mu}$ is the ${n}$-dimensional measure, and ${\mu_{n-1}}$ is the ${(n-1)}$-dimensional measure defined using ${g}$. Then

$\displaystyle h(M) \leq 2 \sqrt{\lambda_2} \ \ \ \ \ (1)$

The purpose of this post is to describe to the reader who knows nothing about differential geometry and who does not remember much multivariate calculus (that is, the reader who is in the position I was in a few weeks ago) what the above statement means, to describe the proof, and to see that it is in fact the same proof as the proof of the statement about graphs.

In this post we will define the terms appearing in the above theorem, and see their relation with analogous notions in graphs. In the next post we will see the proof.

The foundations of cryptography were laid down in 1982, the annus mirabilis that saw the publication of the work of Blum and Micali on pseudorandom generators, of Goldwasser and Micali on rigorous definitions of encryption, and of Yao, who gave a more general definitional approach. The paper of Shafi Goldwasser and Silvio Micali, in particular, introduced the incredibly influential concept of indistinguishability of distributions, and the idea of defining security in terms of simulation of an ideal model in which the security requirements are self-evident. (For example, because in the ideal model an adversary is not able to access the channel that we use to send encrypted data.) Almost every definition of security in cryptography follows the simulation approach, which also guides proofs of security. Shafi and Silvio both went on to do foundational work in cryptography, complexity theory, and algorithms, including their work on zero knowledge, secure multiparty computation, and property testing.

So it was with much joy, this early morning in Japan, that I heard the news that Shafi Goldwasser and Silvio Micali have been named as recipients of this year’s Turing award.

Omer Reingold has more information about their work. With no offense to colleagues around my age and younger, Shafi and Silvio are also representative of a time when leading theoretical computer scientists were more interesting people. They both have incredible charisma.

My favorite memory of Shafi and Silvio is from the time I interviewed for a faculty job at MIT. Shafi was in the last weeks of her pregnancy and did not make an appointment to see me, but then the day of my interview she changed her mind and showed up in Silvio’s office halfway through my meeting with him.

Silvio had been looking at my schedule and was giving me advice on how to talk to various people. Shafi asked what we were talking about, and then proceeded to give the opposite of the advice that Silvio had been giving me. The two of them spent the rest of the meeting arguing with each other.

Long in the planning, my online course on graph partitioning algorithms, expanders, and random walks, will start next month.

The course page is up for people to sign up. A friend of mine has compared my camera presence to Sheldon Cooper’s in “Fun with Flags,” which is sadly apt, but hopefully the material will speak for itself.

Meanwhile, I will be posting about some material that I have finally understood for the first time: the analysis of the Arora-Rao-Vazirani approximation algorithm, the Cheeger inequality in manifolds, and the use of the Selberg “3/16 theorem” to analyze expander constructions.

If you are not a fan of recorded performances, there will be a live show in Princeton at the end of June.

Oh man, not another election! Why do we have to choose our leaders? Isn’t that what we have the Supreme Court for?
– Homer Simpson

Nate Silver is now putting Barack Obama’s chance of reelection at around 85%, and he has been on the receiving end of considerable criticism from supporters of Mitt Romney. Some have criticized his statistical analysis by pointing out that he has a soft voice and he is not fat (wait, what? read for yourself – presumably the point is that Silver is gay and that gay people cannot be trusted with such manly pursuits as statistics), but the main point seems to be: if Romney wins the election then Silver and his models are completely discredited. (E.g. here.) This is like someone saying that a die has approximately an 83% probability of not turning up a 2, and others saying, if I roll a die and it turns up a 2, this whole “probability” thing that you speak of is discredited.

But still, when someone offers predictions in terms of probability, rather than simply stating that a certain outcome is more likely, how can we evaluate the quality of such predictions?

In the following let us assume that we have a sequence of binary events, and that each event $i$ has a probability $p_i$ of occurring as a $1$ and $1-p_i$ of occurring as $0$. A predictor gives out predicted probabilities $q_i$, and then events $E_i$ happen. Now what? How would we score the predictions? Equivalently, how would we fairly compensate the predictor?

A simple way to “score” the prediction is to say that for each event we have a “penalty” that is $|E_i - q_i|$, or a score that is $1- |E_i - q_i|$. For example, the prediction that the correct event happens with 100% probability gets a score of 1, but the prediction that the correct event happens with 85% probability gets a score of .85.

Unfortunately this scoring system is not “truthful,” that is, it does not encourage the predictor to tell us the true probabilities. For example, suppose that a predictor has computed the probability of an event as 85% and is very confident in the accuracy of the model. Then, if he publishes the accurate prediction, he is going to get a score of .85 with probability .85 and a score of .15 with probability .15, for an expected score of $.85^2 + .15^2 = .745$. So he is worse off than if he had published the prediction of the event happening with probability 100%, in which case the expected score is .85. In general, the scheme makes it always advantageous to round the probability to 0% or 100%.
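
The computation can be sketched in a few lines of Python. The function name and the grid of candidate predictions are my own assumptions; the only thing taken from the discussion above is the score $1 - |E - q|$, evaluated for an event that happens with true probability $p$ when the published prediction is $q$.

```python
def expected_linear_score(p, q):
    """Expected value of 1 - |E - q| when E = 1 with probability p and E = 0 otherwise."""
    return p * (1 - abs(1 - q)) + (1 - p) * (1 - abs(0 - q))

# The predictor believes the event has probability 85%.
truthful = expected_linear_score(0.85, 0.85)  # publishes the true probability
rounded = expected_linear_score(0.85, 1.0)    # rounds the prediction up to 100%

# The expected score equals (1 - p) + q * (2p - 1), which is linear in q,
# so over any grid of candidate predictions the best choice is an endpoint.
grid = [i / 100 for i in range(101)]
best_q = max(grid, key=lambda q: expected_linear_score(0.85, q))
```

Since the expected score is linear in the published probability, the same computation with a true probability below 50% would push the best published prediction all the way down to 0%.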

Is there a truthful scoring system? I am not sure what the answer is.

If one is scoring multiple predictions of independent events, one can look at all the cases in which the prediction was, say, in the range of 80% to 90%, and see if indeed the event happened, say, a fraction between 75% and 95% of the time, and so on.
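
A minimal sketch of such a bucketed check follows. The function name, the choice of ten equal-width buckets, and the sample data are all assumptions made for illustration, not a standard procedure that I am aware of.

```python
from collections import defaultdict

def calibration_table(predictions, outcomes, num_buckets=10):
    """Group predictions into equal-width buckets and compare with observed frequencies.

    Returns a dict: bucket index -> (count, mean predicted probability, observed frequency).
    """
    buckets = defaultdict(list)
    for q, e in zip(predictions, outcomes):
        # A prediction of exactly 1.0 is placed in the last bucket.
        index = min(int(q * num_buckets), num_buckets - 1)
        buckets[index].append((q, e))
    table = {}
    for index, pairs in sorted(buckets.items()):
        count = len(pairs)
        mean_q = sum(q for q, _ in pairs) / count
        frequency = sum(e for _, e in pairs) / count
        table[index] = (count, mean_q, frequency)
    return table

# Hypothetical data: five predictions in the 80%-90% range, four of which came true,
# plus one prediction of 25% for an event that did not happen.
predictions = [0.85, 0.82, 0.88, 0.85, 0.81, 0.25]
outcomes = [1, 1, 0, 1, 1, 0]
table = calibration_table(predictions, outcomes)
```

Here bucket 8 collects the predictions between 80% and 90%, and one would then ask whether its observed frequency falls within whatever tolerance around the predicted range one has decided to accept.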

One disadvantage of this approach is that it seems to require a discretization of the probabilities, which seems like an arbitrary choice and one that could affect the final score quite substantially. Is there a more elegant way to score multiple independent events without resorting to discretization? Can it be proved to be truthful?

Another observation is that such an approach is still not entirely truthful if it is applied to events that happen sequentially. Indeed, suppose that we have a series of, say, 10 events for which we predicted a 60% probability of a 1, and the event 1 happened 7 out of 10 times. Now we have to make a prediction of a new event, for which our model predicts a 10% probability. We may then want to publish a 60% prediction, because this will help even out the “bucket” of 60% predictions.
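
The arithmetic behind this example can be sketched as follows. The variable names are hypothetical, and the sketch measures the quality of a bucket simply by the gap between its label and its observed frequency, which is an assumption about how the buckets would be scored.

```python
def bucket_frequency(ones, total):
    """Fraction of the events in a bucket that turned out to be 1."""
    return ones / total

target = 0.60        # the probability attached to this bucket
ones, total = 7, 10  # 7 of the 10 events predicted at 60% happened
true_p = 0.10        # what the model says about the new event

before = bucket_frequency(ones, total)
# Expected frequency of the 60% bucket if the new event is filed there anyway:
# the new event contributes true_p ones in expectation.
after = bucket_frequency(ones + true_p, total + 1)

gap_before = abs(before - target)
gap_after = abs(after - target)
```

In expectation the gap of the 60% bucket drops from about 0.1 to about 0.045, which is the incentive to misreport described above.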

I don’t think that there is any way around the previous problem, though it seems clear that it would affect only a small fraction of the predictions. (The complexity theorists among the readers may remember similar ideas being used in a paper of Feigenbaum and Fortnow.)

Surely the task of scoring predictions must have been studied in countless papers, and the answers to the above questions must be well known, although I am not sure what the right keywords are to search for such work. In computer science, there are a lot of interesting results about using expert advice, but they are all concerned with how you score your own way of picking which expert to trust rather than the experts themselves. (This means that the predictions of the experts are not affected by the scoring system, unlike the setting discussed in this post.)

Please contribute ideas and references in the comments.