You are currently browsing the category archive for the ‘math’ category.

A few weeks ago, the Proceedings of the National Academy of Science published an article on a study conducted by a group of Cornell researchers at Facebook. They picked about 600,000 users and then, for a week, a subset of them saw fewer “negative” posts (up to 90% were filtered) than they would otherwise see, a subset saw fewer “positive” posts (same), and a control group got a random subset.

After the week, the users in the “negative” group posted fewer, and more negative, posts, and those in the “positive” group posted more, and more positive, posts.

Posts were classified according to an algorithm called LIWC2007.

The study run contrary to a conventional wisdom that people find it depressing to see on Facebook good things happening to their friends.

The paper has caused considerable controversy for being a study with human subjects conducted without explicit consent. Every university, including of course Cornell, requires experiments involving people to be approved by a special committee, and participants must sign informed consent forms. Facebook maintains that the study is consistent with its terms of service. The highly respected privacy organization EPIC has filed a complaint with the FTC. (And they have been concerned with Facebook’s term of service for a long time.)

Here I would like to explore a different angle: almost everybody thinks that observational studies about human behavior can be done without informed consent. This means that if the Cornell scientists had run an analysis on old Facebook data, with no manipulation of the feed generation algorithm, there would not have been such a concern.

At the same time, the number of posts that are fit for the feed of a typical user vastly exceed what can fit in one screen, and so there are algorithms that pick a rather small subset of posts that are evaluated to be of higher relevance, according to some scoring function. Now suppose that, if N posts fit on the screen, the algorithm picks the 2N highest scoring posts, and then randomly picks half of them. This seems rather reasonable because the scoring function is going to be an approximation of relevance anyway.

The United States has roughly 130 million Facebook subscriber. Suppose that the typical user looks, in a week, at 200 posts, which seems reasonable (in our case, those would be a random subset of roughly 400 posts). According to the PNAS study, roughly 50% of the posts are positive and 25% are negative, so of the initial 400, roughly 200 are positive and 100 are negative. Let’s look at the 100,000 users for which the random sampling picked the fewest positive posts: we would be expecting roughly 3 standard deviations below the mean, so about 80 positive posts instead of the expected 100; the 100,000 users with the fewest negative posts would get about 35 instead of the expected 50.

This is much less variance than in the PNAS study, where they would have got, respectively, only 10 positive and only 5 negative, but it may have been enough to pick up a signal.

Apart from the calculations, which I probably got wrong anyway, what we have is that in the PNAS study they picked a subset of people and then they varied the distribution of posts, while in the second case you pick random posts for everybody and then you select the users with the most variance.

If you could arrange distributions so that the distributions of posts seen by each users are the same, would it really be correct to view one study as experimental and one as observational? If the PNAS study had filtered 20% instead of 90% of the positive/negative posts, would it have been ethical? Does it matter what is the intention when designing the randomized algorithm that selects posts? If Facebook were to introduce randomness in the scoring algorithm with the goal of later running observational studies would it be ethical? Would they need to let people opt out? I genuinely don’t know the answer to these questions, but I haven’t seen them discussed elsewhere.

I have always known that I don’t completely understand how fiat money works, however I have recently realized that I don’t understand it at all! Maybe some of my readers can clarify my confusion.

So let’s start a national economy from scratch, without international trade: so a group of people move to a deserted island, declare independence, the people have all kind of skills, they bring all kind of materials and machineries with them, the island has all kinds of natural resources and maybe a lottery system gives some people ownership of various plots of lands and mining rights.

Now some would-be entrepreneurs would like to begin hiring people with the right skills, buying or renting various stuff, and start some businesses. It seems that there is no loss of generality if I think of the entrepreneurs as just one person for the sake of what I want to think about. She is going to need to get a loan to start her business(es). Meanwhile, the new island government created a central bank, which issues theory dollars, or thollars, which are the currency of the island. The central bank “creates” thollars and lends them to banks, then the banks keep a fractional reserve and lend to the entrepreneur. Again, for the sake of what I want to say, there is no loss of generality if I identify the central bank and the other banks as just one entity.

So the central bank lends thollars to the entrepreneur, and she uses the money to start the business and being to pay people. At this point money starts circulating in the economy, and people will hire gardeners and babysitters, and lawyers, they will give to charities, they will hire computer science theory tutors for their kids, they will buy and sell houses to each others, and, crucially, will buy whatever goods and services the entrepreneur is selling. Now she is making a profit and she can pay back the loan to the central bank and invest in more … wait, she can never pay back the loan!

That’s because all the thollars in circulation are the ones that the central bank lent her, so there is no way that, as the money circulates, she can ever make more money that she owes!

Ok, so maybe people will also take loans to buy houses and stuff, so that’s more money that circulates, but is it really the case that, overall, it is impossible for everybody to be debt-free? That all the cash that any debt-free person has needs to be compensated by an equivalent amount of debt from other people? This is not how things seem to be in practice.

Now, clearly it is possible for everybody to have positive net worth, because, at the start, people have stuff and that stuff is worth money, but it seems strange that not everybody can be debt-free.

Maybe the problem is deflation? That if the entrepreneur borrowed money to create a business that creates new wealth (because it makes stuff that people find more valuable than the value of the raw material and the value of the work that went into it), but the amount of circulating money stays the same, then there is deflation, and her debt is spiraling out of control in deflation-adjusted terms, even if the interest rate is zero?

It seems that there are only two ways in which you can have everybody be debt-free and have a positive amount of cash: (i) the central bank starts buying stocks of private companies, (ii) the government runs a deficit, and the central bank buys government debt.

Is this correct? Normally, we think of people and companies being debt-free (or having more cash than debt) as ideal, and a government running no deficit as ideal, and usually central banks don’t buy stocks (they only buy bonds, which is formally equivalent to lending), so are these three conditions contradictory?

Where you least expect them:

a common [definiton] for “population” is a geographical cluster of people who mate more within the cluster than outside of it

The Simons Institute for the Theory of Computing at Berkeley has started operations a couple of months ago, it has been off to a great start. This semester there is a program on applications of real analysis to computer science, in which I am involved, and one on “big data,” whose workshops have been having such high attendance that they had to be organized offsite.

The institute itself is housed in a beautiful circular three-story building, half of whose second floor is an open space with sofas, whiteboards, tall ceilings, big windows, and exposed pipes on the ceiling, for an added loft-like look. If it had a ball pit, a foosball table, and free sushi it would look like the offices of a startup.

Next year, there will be programs on spectral graph theory, on applications of algebraic geometry, and on information theory.

Junior people (senior graduate students, postdocs, and junior assistant professors who have received their PhD no longer than six years ago) can participate in the program of the institute as “fellows.” Information on the fellowships is at simons.berkeley.edu/fellows2014; the deadline to apply is December 15.

This year is the centennial of Paul Erdős’s birth. Erdős lived most of his adult life as a traveling mathematician, “couchsurfing,” as we would say now, from place to place and from mathematical conference to mathematical conference. He wrote more than 1,500 papers with more than 500 different coauthors, introduced the probabilistic method and was the defining figure of the “Hungarian approach” to combinatorics. He died at age 83 while attending a mathematical conference.

Last year, we celebrated the centennial of Alan Turing’s birth. Turing and Erdős have become such iconic figures both for the impact of their work and for the fascinating facts of their lives. I would like to argue that the cultural archetype through which we interpret their lives is that of the saint. It is clearly that of the martyr saint in the case of Turing, while Erdős gave up material possessions and devoted his life to others, traveling everywhere and “preaching” to everybody, much in the mold of Saint Francis.

(A comparison of the Turing centennial celebration and Erdős’s, and a look at the frescoes of Medieval Catholic churches will show which kind of saint people are more interested in.)

The first step to become a saint of the Catholic church is to establish that the person exhibited “heroic virtues,” which is a great expression. This is an archetype that is not restricted to religion: you see it occurring in communist propaganda (Stakhanov, Lei Feng) and in every civil rights movement.

Saints were the “celebrities” of the Middle Ages, those whose life people liked to talk about. But contemporary celebrities come from a totally different archetype, that of the Greek God. Greek (and Roman) gods were petty and jealous, they cheated on their spouses, they were terrible parents, but there were good stories to be told about them. We don’t want (at least, I don’t) to live the life of a saint, but thinking about them is certainly inspirational and it makes us think that if someone can be so much better than us, maybe we can be a little better ourself in the practice of “virtues”, whatever this may mean to us. And we don’t admire gods, but, well, it’s probably fun to be one.

As usual, I have lost track of what I was trying to say, but I think that it speaks well of the academic community that we are more interested in saints than in gods, I will close by saying that my favorite saint of complexity theory is Avi Wigderson, I will keep to myself who my favorite god of complexity theory is, and I will leave it to the readers to contribute their picks.

This week, the topic of my online course on graph partitioning and expanders is the computation of approximate eigenvalues and eigenvectors with the power method.

If $M$ is a positive semidefinite matrix (a symmetric matrix all whose eigenvalues are nonnegative), then the power method is simply to pick a random vector $x\in \{ -1,+1 \}^n$, and compute $y:= M^k x$. If $k$ is of the order of $\frac 1 \epsilon \log \frac n \epsilon$, then one has a constant probability that

$\frac {y^T M y}{y^T y} \geq (1-\epsilon) \max_{x} \frac {x^T M x}{x^T x} = (1-\epsilon) \lambda_1$

where $\lambda_1$ is the largest eigenvalue of $M$. If we are interested in the Laplacian matrix $L = I - \frac 1d A$ of a $d$-regular graph, where $A$ is the adjacency matrix of the graph, this gives a way to compute an approximation of the largest eigenvalue, and a vector of approximately maximum Rayleigh quotient, which is useful to approximate Max Cut, but not to apply spectral partitioning algorithms. For those, we need a vector that approximates the eigenvector of the second smallest eigenvalue.

Equivalently, we want to approximate the second largest eigenvalue of the adjacency matrix $A$. The power method is easy to adjust to compute the second largest eigenvalue instead of the largest (if we know an eigenvector of the largest eigenvalue): after you pick the random vector, subtract the component of the vector that is parallel to the eigenvector of the largest eigenvalue. In the case of the adjacency matrix of a regular graph, subtract from every coordinate of the random vector the average of the coordinates.

The adjacency matrix is not positive semidefinite, but we can adjust it to be by adding a multiple of the identity matrix. For example we can work with $\frac 12 I + \frac 1{2d} A$. Then the power method reduces to the following procedure: pick randomly $x \sim \{ -1,1\}$, then subtract $\sum_i x_i/n$ from every entry of $x$, then repeat the following process $k = O\left( \frac 1 \epsilon \log \frac n \epsilon \right)$ times: for every entry $i$, assign $x_i := \frac 12 x_i + \frac 1 {2d} \sum_{j: (i,j) \in E} x_j$, that is, replace the value that the vector assigns to vertex $i$ with a convex combination of the current value and the current value of the neighbors. (Note that one iteration can be executed in time $O(|V|+|E|)$.

The problem is that if we started from a graph whose Laplacian matrix has a second smallest eigenvalue $\lambda_2$, the matrix $\frac 12 I + \frac 1{2d} A$ has second largest eigenvalue $1- \frac {\lambda_2}2$, and if the power method finds a vector of Rayleigh quotient at least $(1-\epsilon) \cdot \left( 1- \frac {\lambda_2}2 \right)$ for $\frac 12 I + \frac 1{2d} A$, then that vector has Rayleigh quotient about $\lambda_2 - 2\epsilon$ for $L$, and unless we choose $\epsilon$ of the same order as $\lambda_2$ we get nothing. This means that the number of iterations has to be about $1/\lambda_2$, which can be quite large.

The video below (taken from this week’s lecture) shows how slowly the power method progresses on a small cycle with 31 vertices. It goes faster on the hypercube, which has a much larger $\lambda_2$.

A better way to apply the power method to find small eigenvalues of the Laplacian is to apply the power method to the pseudoinverse $L^+$ of the Laplacian. If the Laplacian of a connected graph has eigenvalues $0 = \lambda_1 < \lambda_2 \leq \cdots \leq \lambda_n$, then the pseudoinverse $L^+$ has eigenvalues $0, \frac 1 {\lambda_2}, \cdots, \frac 1 {\lambda_n}$ with the same eigenvectors, so approximately finding the largest eigenvalue of $L^+$ is the same problem as approximately finding the second smallest eigenvalue of $L$.

Although we do not have fast algorithms to compute $L^+$, what we need to run the power method is, for a given $x$, to find the $y$ such that $L y = x$, that is, to solve the linear system $Ly = x$ in $y$ given $L$ and $x$.

For this problem, Spielman and Teng gave an algorithm nearly linear in the number of nonzero of $L$, and new algorithms have been developed more recently (and with some promise of being practical) by Koutis, Miller and Peng and by Kelner, Orecchia, Sidford and Zhu.

Coincidentally, just this week, Nisheeth Vishnoi has completed his monograph Lx=b on algorithms to solve such linear systems and their applications. It’s going to be great summer reading for those long days at the beach.

Long in the making, the online course on expanders starts today.

In the first week of class: what are the topics of the course, and how to prove that the eigenvalues of the adjacency matrix of a regular graph tell you how many connected components there are in the graph.

[I have been asked by the office of public affairs of the Institute for Advanced Study to publicize the following press release. L.T.]

April 1, 2013. For immediate release.

Cofounders Jean Bourgain and Peter Sarnak announce today the launch of eXpandr, a new venture that aims to become the world’s leading provider of expander graphs.

“We are excited about our mission to change the way the world uses expanders.” said CEO Guli Mars, who joined eXpandr after a distinguished career in several leading technology companies. “Expanders are vital to revenue-generating logarithms, and our technology will revolutionize a multi-billion dollar market.”

“Big data, disruption”, said Juan Raman, senior vice president for marketing. “Innovation, cloud computing”, Mr. Raman continued.

“Let p be a prime congruent to 1 modulo 4″ said Jean Bourgain, cofounder and senior vice-president for analytic number theory, “and consider the irreducible representations of PSL(2,p).”

About the Institute for Advanced Study. The Institute for Advanced Study is one of the world’s leading centers for theoretical research and intellectual inquiry. The Institute exists to encourage and support fundamental research in the sciences and humanities—the original, often speculative thinking that produces advances in knowledge that change the way we understand the world. Work at the Institute takes place in four Schools: Historical Studies, Mathematics, Natural Sciences and Social Science. It provides for the mentoring of scholars by a permanent Faculty of no more than 28, and it offers all who work there the freedom to undertake research that will make significant contributions in any of the broad range of fields in the sciences and humanities studied at the Institute.

The Institute, founded in 1930, is a private, independent academic institution located in Princeton, New Jersey. Its more than 6,000 former Members hold positions of intellectual and scientific leadership throughout the academic world. Some 33 Nobel Laureates and 38 out of 52 Fields Medalists, as well as many winners of the Wolf or MacArthur prizes, have been affiliated with the Institute.

About eXpandr. eXpandr aims to disrupt the way the world uses expander graphs, and to become the leading commercial provider of expanders. eXpandr received \$2 million in angel investing and will launch its first product by Summer 2013.

Having (non-rigorously) defined the Laplacian operator in manifolds in the previous post, we turn to the proof of the Cheeger inequality in manifolds, which we restate below.

Theorem 1 (Cheeger’s inequality) Let ${M}$ be an ${n}$-dimensional smooth, compact, Riemann manifold without boundary with metric ${g}$, let ${L:= - {\rm div} \nabla}$ be the Laplace-Beltrami operator on ${M}$, let ${0=\lambda_1 \leq \lambda_2 \leq \cdots }$ be the eigenvalues of ${L}$, and define the Cheeger constant of ${M}$ to be

$\displaystyle h(M):= \inf_{S\subseteq M : \ 0 < \mu(S) \leq \frac 12 \mu(M)} \ \frac{\mu_{n-1}(\partial(S))}{\mu(S)}$

where the ${\partial (S)}$ is the boundary of ${S}$, ${\mu}$ is the ${n}$-dimensional measure, and ${\mu_{n-1}}$ is ${(n-1)}$-th dimensional measure defined using ${g}$. Then

$\displaystyle h(M) \leq 2 \sqrt{\lambda_2} \ \ \ \ \ (1)$

We begin by recalling the proof of the analogous result in graphs, and then we will repeat the same steps in the context of manifolds.

Theorem 2 (Cheeger’s inequality in graphs) Let ${G=(V,E)}$ be a ${d}$-regular graph, ${A}$ be its adjacency matrix, ${L:= I - \frac 1d A}$ be its normalized Laplacian matrix, ${0 = \lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_{|V|}}$ be the eigenvalues of ${L}$, and define ${\mu(S):= d \cdot |S|}$ for every subset of vertices ${S\subseteq V}$. Define the conductance of ${G}$ as

$\displaystyle \phi(G) := \min_{S\subseteq V: \ 0 < \mu(S) \leq \frac 12 \mu(V) } \frac{| \partial S|}{\mu(S)}$

where ${\partial S}$ is the number of edges with one endpoint in ${S}$ and one endpoint in ${\bar S}$. Then

$\displaystyle \phi(G) \leq \sqrt{2 \lambda_2} \ \ \ \ \ (2)$

1. Proof the Cheeger inequality in graphs

We will use the variational characterization of the eigenvalues of the Laplacian ${L}$ of a graph ${G}$.

$\displaystyle \lambda_1 = \min_{f \in {\mathbb R}^v} \frac {f^T Lf}{f^T f} \ \ \ \ \ (3)$

and if ${f_1}$ is a minimizer in the above expression then

$\displaystyle \lambda_2 = \min_{f \in {\mathbb R}^V: \ f \perp f_1} \frac {f^T Lf}{f^T f}$

Following the definition of ${L}$ we see that

$\displaystyle f^T L f = \frac 1d \sum_{(u,v)\in E} |f_u - f_v |^2$

and so the minimum in (3) is 0, and it is achieved for ${f = (1,\cdots,1)}$. This means that

$\displaystyle \lambda_2 = \min_{f \in {\mathbb R}^V: \ \sum_v f_v = 0} \frac {\sum_{(u,v) \in E} |f_u-f_v|^2}{d \sum_v f_v^2} \ \ \ \ \ (4)$

The expression in the right-hand-side of (4) is an important one, and it is called the Rayleigh quotient of ${f}$, which we will denote by ${R(f)}$:

$\displaystyle R(f):= \frac {\sum_{(u,v) \in E} |f_u-f_v|^2}{d \sum_v f_v^2}$

It is also useful to consider the variant of the Rayleigh quotient where there are no squares; this does not have a standard name, so let us call it the ${\ell_1}$ Rayleigh quotient and denote it by ${R_1}$:

$\displaystyle R_1(g):= \frac {\sum_{(u,v) \in E} |g_u-g_v|}{d \sum_v |g_v|}$

The proof of the graph Cheeger inequality now continues with the proof of the following three facts.

Lemma 3 (Rounding of ${\ell_1}$ embeddings) For every non-negative vector ${g\in {\mathbb R}^V_{\geq 0}}$ there is a value ${t\geq 0}$ such that

$\displaystyle \phi(\{ v: g(v) > t \}) \leq R_1(g)$

Lemma 4 (Embedding of ${\ell_2^2}$ into ${\ell_1}$) For every non-negative vector ${f\in {\mathbb R}^V_{\geq 0}}$, we have

$\displaystyle R_1(f^2) \leq \sqrt{2 R(f)}$

Lemma 5 (From an eigenvector to a non-negative vector) For every ${f\in {\mathbb R}^V}$ such that ${\sum_v f_v =0}$ there is a non-negative ${f'\in {\mathbb R}^V_{\geq 0}}$ such that ${\mu(\{ v: f'(v) >0 \}) \leq \frac 12 \mu(V)}$ and such that

$\displaystyle R(f') \leq R(f)$

Now let us start from a function ${f}$ that optimizes (4), so that ${\sum_v f_v = 0}$ and ${R(f) = \lambda_2}$, then apply Lemma 5 to find a function ${f'}$ such that the volume of the vertices having positive coordinates in ${f'}$ is at most ${\frac 12 \mu(V)}$ and such that ${R(f') \leq R(f) = \lambda_2}$. Then consider the vector ${g\in {\mathbb R}^V}$ such that ${g_v := f'^2_v}$; by Lemma 4, we have ${R_1(g) \leq \sqrt{2 R(f')} \leq \sqrt{2\lambda_2}}$, and by Lemma 3 there is a threshold ${t}$ such that the set ${S:= \{ v: g(v) > t \}}$ has conductance ${h(S) \leq \sqrt {2\lambda_2}}$. Since ${S}$ is a subset of the vertices having positive coordinates in ${g}$, we have ${\mu(S) \leq \frac 12 \mu(V)}$, and so

$\displaystyle \phi(G) \leq \sqrt{2 \lambda_2 }$

which is the Cheeger inequality for graphs. It remains to prove the three lemmas.

Proof: of Lemma 3. For each threshold ${t}$, define the set

$\displaystyle S_t := \{ v: g_v > t \}$

The idea of the proof is that if we pick ${t}$ at random then the probability that an edge belongs to ${\partial S_t}$ is proportional to ${|g_u - g_v|}$ and the probability that ${v\in S_t}$ is proportional to ${|g_v|}$, so that the expected number of edges in ${\partial S_t}$ is proportional to the numerator of ${R_1(g)}$ and the expected number of vertices in ${S_t}$ is proportional to the denominator of ${R_1(g)}$; if ${R_1(g)}$ is small, it is not possible for ${\partial S_t/\mu(S_t)}$ to always be large for every ${t}$.

To avoid having to normalize the range of ${t}$ to be between ${0}$ and ${1}$, instead of taking averages over a random choice of ${t}$, we will consider the integral over all values of ${t}$. We have

$\displaystyle \int_0^\infty |\partial (S_t) | {\rm d} t = \sum_{u,v} |g_u - g_v |$

because we can write ${|\partial (S_t)| = \sum_{(u,v) \in E} I_{u,v} (t)}$, where ${I_{u,v} (t) = 1}$ if ${(u,v) \in \partial S_t}$ and ${I_{u,v}(t) = 0}$ otherwise, and we see that only the values of ${t}$ between ${g_u}$ and ${g_v}$ make ${I_{u,v}(t) =1}$, so ${\int_{0}^\infty I_{u,v} (t) {\rm d} t = |g_u - g_v|}$.

We also have

$\displaystyle \int_0^\infty d |S_t| {\rm d} t = d \sum_v |g_v|$

and if we denote by ${t^*}$ the threshold such that ${\phi(S_{t^*})}$ is smallest among all the ${\phi(S)}$, then

$\displaystyle \sum_{u,v} |g_u - g_v | = \int_0^\infty |\partial (S_t) | {\rm d} t$

$\displaystyle \geq \int_0^\infty h(S_{t^*}) d|S_t| {\rm d} t$

$\displaystyle = h(S_{t^*}) d \sum_v |g_v|$

so that

$\displaystyle h(S_{t^*}) \leq \frac { \sum_{u,v} |g_u - g_v | }{d \sum_v |g_v| } = R_1(g)$

$\Box$

Proof: of Lemma 3. Let us consider the numerator of ${R_1(f^2)}$; it is:

$\displaystyle \sum_{(u,v)\in E} |f^2_u - f^2_v|$

$\displaystyle = \sum_{(u,v)\in E} |f_u - f_v| \cdot (f_u + f_v)$

$\displaystyle \leq \sqrt{\sum_{(u,v)\in E} |f_u - f_v|^2} \sqrt{ \sum_{(u,v)\in E}(f_u + f_v)^2 }$

(we used Cauchy-Swarz)

$\displaystyle \leq \sqrt{R(f) \cdot d\sum_v f_v^2} \sqrt{ \sum_{(u,v)\in E} 2f_u^2 + 2f_v^2 }$

(we used the definition of ${R_2}$ and Cauchy-Swarz again)

$\displaystyle = \sqrt{R(f) \cdot d\sum_v |f_v|^2} \sqrt{ 2 d\sum_{v} f_v^2 }$

$\displaystyle = \sqrt{2 R(f) } \cdot d \sum_v f_v^2$

And so

$\displaystyle R_1(f^2) = \frac {\sum_{(u,v)\in E} |f^2_u - f^2_v|}{d \sum_v f_v^2} \leq \sqrt{2 R_2(f) }$

$\Box$

Proof: of Lemma 5. Let ${m}$ be the median of ${f}$, and consider ${\bar f}$ defined as ${\bar f_v := f_v - m}$. We have

$\displaystyle R(\bar f) \leq R(f)$

because the numerators of ${R(\bar f)}$ and ${R(f)}$ are the same (the additive term ${-m}$ cancels). The denominators are such that

$\displaystyle \sum_v \bar f_v^2 = || \bar f||^2 = || f||^2 + || - m \cdot {\bf 1} ||^2 \geq || f||^2 = \sum_v f_v^2$

because ${f}$ and the vector ${{\bf 1} = (1,\ldots,1)}$ are orthogonal, and so by Pythagoras’s theorem the length-squared of ${\bar f = f - m {\bf 1}}$ equals the length-squared of ${f}$ plus the length-squared of ${-m {\bf 1}}$.

Let us define ${f^+_v := \min\{ 0, \bar f_v\}}$ and ${f^-_v := \min \{ 0, -\bar f_v\}}$ so that ${\bar f = f^+ - f^-}$. We use the following fact:

Fact 6 Let ${a,b \in {\mathbb R}^V_{\geq 0}}$ be disjointly supported non-negative vectors (“disjointly supported” means that they are non-zero on disjoint subsets of coordinates), then

$\displaystyle \min\{ R(a) , R(b) \} \leq R(a-b)$

Proof: The numerator of ${R(a-b)}$ is

$\displaystyle \sum_{(u,v)\in E} |a_v - b_v + b_u - a_v|^2 \geq \sum_{(u,v)\in E} |a_v - a_u|^2 + |b_v - b_u|^2$

and, using orthogonality and Pythagoras’s theorem, the denominator of ${R(a-b)}$ is

$\displaystyle d || a- b||^2 = d||a ||^2 + d|| b||^2$

The fact now follows from the inequality

$\displaystyle \min \left \{ \frac {n_1}{d_1} , \frac{n_2}{d_2} \right \} \leq \frac{n_1+n_2}{d_1+d_2}$

$\Box$

The lemma now follows by observing that ${f^+}$ and ${f^-}$ are non-negative and disjointly supported, so

$\displaystyle \min\{ R(f^+) , R(f^-) \} \leq R(\bar f) \leq R(f)$

and that both ${f^+}$ and ${f^-}$ have at most ${n/2}$ non-zero coordinate. $\Box$

2. Proof of the Cheeger inequality in manifolds

We will now translate the proof of the graph Cheeger inequality to the setting of manifolds.

As you may remember, we started off by saying that ${L}$ is symmetric and so all its eigenvalues are real and they are given by the variational characterization. Now we are already in trouble because the operator ${L}$ on manifolds cannot be thought of as a matrix, so what does it mean for it to be symmetric? The consequence of symmetry that is exploited in the analysis of the spectrum of symmetric matrices is the fact that if ${A}$ is symmetric, then for every ${x,y}$ we have

$\displaystyle \langle x,Ay \rangle = x^T Ay = (A^Tx)^Ty = (Ax)^Ty = \langle Ax,y \rangle$

and the property ${\langle x, Ay \rangle = \langle Ax , y \rangle}$ makes no references to coordinates, and it is well defined even for linear operators over infinite-dimensional spaces, provided that there is a notion of inner product. If we the define the inner product

$\displaystyle \langle f,g \rangle := \int_M f(x)\cdot g(x) \ {\rm d} \mu(x)$

on functions ${f: M \rightarrow {\mathbb R}}$, and more generally

$\displaystyle \langle f,g \rangle := \int_M \langle f(x), g(x)\rangle_X \ {\rm d} \mu(x)$

for functions ${f: M \rightarrow X}$, where ${X}$ is a vector space with inner product ${\langle \cdot,\cdot,\rangle_X}$, then we can say that an operator ${A}$ is self-adjoint if

$\displaystyle \langle f, Ag \rangle = \langle Af ,g \rangle$

for all (appropriately restricted) functions ${f,g}$. If ${M}$ is compact, this property is true for the Laplacian, and, in particular, ${-{\rm div}}$ and ${\nabla}$ are adjoints of each others, that is,

$\displaystyle \langle \nabla f, g \rangle = \langle f, - {\rm div} g \rangle$

(The discrete analog would be that ${C^T}$ is the transpose of ${C}$.)

Self-adjointness (and appropriate conditions on ${M}$) imply a version of the spectral theorem and of the variational characterization. In particular, all eigenvalues of ${L}$ are real, and if there is a minimum one then it is

$\displaystyle \lambda_1 = \min_{f: M \rightarrow {\mathbb R}} \frac {\langle f, Lf \rangle}{\langle f,f\rangle}$

and if ${f_1}$ is a minimizer of the above, then

$\displaystyle \lambda_2 = \min_{f: M \rightarrow {\mathbb R},\ \langle f, f_1 \rangle = 0} \frac {\langle f, Lf \rangle}{\langle f,f\rangle}$

(The minimization is quantified over all functions that are square-integrable, and the minimum is achieved because if ${M}$ is compact then the space of such functions is also compact and the cost function that we are minimizing is continuous. In this post, whenever we talk about “all functions,” it should be understood that we are restricting to whatever space of functions makes sense in the context.)

From the property that ${-{\rm div}}$ and ${\nabla}$ are adjoint, we have

$\displaystyle \langle f, Lf \rangle = \langle f, -{\rm div} \nabla f \rangle = \langle \nabla f,\nabla f \rangle$

so

$\displaystyle \lambda_1 = \min_{f:M \rightarrow {\mathbb R}} \frac{ \int_M \langle \nabla f(x),\nabla f(x) \rangle \ {\rm d} \mu(x) } {\int M f^2(x) \ {\rm d} \mu (x)}$

where the Rayleigh quotient

$\displaystyle R(f):= \frac{ \int_M \langle \nabla f(x),\nabla f(x) \rangle \ {\rm d} \mu(x) } {\int M f^2(x) \ {\rm d} \mu (x)} = \frac {\int_M ||\nabla f(x)||^2 \ {\rm d} \mu (x)}{\int_M f^2(x) \ {\rm d} \mu (x)}$

is always non-negative, and it is zero for constant ${f\equiv 1}$, so we see that ${\lambda_1=0}$ and

$\displaystyle \lambda_2 = \min_{f:M \rightarrow {\mathbb R} : \int_M f=0 \ {\rm d} \mu} R(f)$

By analogy with the graph case, we define the “${\ell_1}$ Rayleigh quotient”

$\displaystyle R_1(g) := \frac {\int_M ||\nabla g(x)|| \ {\rm d} \mu (x)}{\int_M |g(x)| \ {\rm d} \mu (x)}$

And we can prove the analogs of the lemmas that we proved for graphs.

Lemma 7 (Rounding of ${\ell_1}$ embeddings) For every non-negative function ${g: V \rightarrow {\mathbb R}_{\geq 0}}$ there is a value ${t\geq 0}$ such that

$\displaystyle h(\{ x: g(x) > t \}) \leq R_1(g)$

where the Cheeger constant ${h(S)}$ of a subset ${S\subseteq M}$ of the manifold is

$\displaystyle h(S) := \frac{\mu_{n-1} (\partial S)}{\mu(S)}$

Lemma 8 (Embedding of ${\ell_2^2}$ into ${\ell_1}$) For every non-negative function ${f; m \rightarrow {\mathbb R}_{\geq 0}}$, we have

$\displaystyle R_1(f^2) \leq \sqrt{2 R(f)}$

Lemma 9 (From an eigenfunction to a non-negative function) For every function ${f: M \rightarrow R}$ such that ${\int_M f \ {\rm d} \mu =0}$ there is a non-negative ${f': M \rightarrow {\mathbb R}_{\geq 0}}$ such that ${\mu(\{ x: f'(x) >0 \}) \leq \frac 12 \mu(V)}$ and such that

$\displaystyle R(f') \leq R_2(f)$

Let us see the proof of these lemmas.

Proof: of Lemma 7. For each threshold ${t}$, define the set

$\displaystyle S_t := \{ x: g(x) > t \}$

Let ${t*}$ be a threshold for which ${h(S_{t*})}$ is minimized

We will integrate the numerator and denominator of ${h(S_t)}$ over all ${t}$. The coarea formula for nonnegative functions is

$\displaystyle \int_M || \nabla g || {\rm d} \mu = \int_{0}^\infty \mu_{n-1} ( \partial \{ x: g(x) > t \}) {\rm d} t$

and we also have

$\displaystyle \int_M |g| {\rm d} \mu = \int_{0}^\infty \mu ( \{ x: g(x) > t \}) {\rm d} t$

which combine to

$\displaystyle \int_M || \nabla g || {\rm d} t = \int_0^\infty \mu_{n-1} ( \partial S_t) {\rm d} t$

$\displaystyle = \int_0^\infty h(S_t) \mu(S_t) {\rm d} t$

$\displaystyle \geq h(S_{t^*}) \int_0^\infty \mu(S_t) {\rm d} t$

$\displaystyle = h(S_{t^*}) \int_M g {\rm d} t$

so that

$\displaystyle h(S_{t^*}) \leq \frac{\int_M || \nabla g || {\rm d} t }{ \int_M g {\rm d} t} = R_1(g)$

$\Box$

Proof: of Lemma 8. Let us consider the numerator of ${R_1(f^2)}$; it is:

$\displaystyle \int_M || \nabla f^2|| {\rm d} \mu$

We can apply the chain rule, and see that

$\displaystyle \nabla f^2(x) = 2f(x) \cdot \nabla f(x)$

which implies

$\displaystyle \int_M || \nabla f^2 || {\rm d} \mu$

$\displaystyle = \int_M || 2f(x) \nabla f(x)|| {\rm d} \mu(x)$

$\displaystyle = \int_M 2|f(x)| \cdot ||\nabla f(x)|| {\rm d} \mu (x)$

and, after applying Caucy-Swarz,

$\displaystyle \leq \sqrt{\int_M 4 f^2(x) {\rm d} \mu(x)} \cdot \sqrt{\int_M ||\nabla f(x)||^2 {\rm d} \mu(x)}$

$\displaystyle = 2 \cdot \left( \int_M f^2 {\rm d} \mu \right) \cdot \sqrt{ R(f)}$

And so

$\displaystyle R_1(f^2) \leq 2 \sqrt{R_2(f) }$

$\Box$

Proof: of Lemma 9. Let ${m}$ be a median of ${f}$, and consider ${\bar f}$ defined as ${\bar f(x) := f(x) - m}$. We have

$\displaystyle R(\bar f) \leq R(f)$

because the numerators of ${R(\bar f)}$ and ${R(f)}$ are the same (the derivatives of functions that differ by a constant are identical) and the denominators are such that

$\displaystyle \int_M \bar f^2{\rm d} \mu$

$\displaystyle = \int_M f^2 - 2mf + m^2 {\rm d} \mu$

$\displaystyle = \int_M f^2 + m^2 {\rm d} \mu$

$\displaystyle \geq \int_M f^2 {\rm d} \mu$

where we used the fact the integral of ${f}$ is zero.

Let us define ${f^+(x) := \min\{ 0, \bar f(x)\}}$ and ${f^-_v := \min \{ 0, -\bar f(x)\}}$ so that ${\bar f(x) = f^+(x) - f^-(x)}$. We use the following fact:

Fact 10 Let ${a,b : M \rightarrow {\mathbb R}_{\geq 0}}$ be disjointly supported non-negative functions (“disjointly supported” means that they are non-zero on disjoint subsets of inputs), then

$\displaystyle \min\{ R(a) , R(b) \} \leq R(a-b)$

Proof: We begin with the following observation: if ${a}$ is a non-negative function, and ${a(x)=0}$, then ${\nabla a(x) = \{ \bf 0 \}}$, because ${x}$ has to be a local minimum.

Consider the expression ${||\nabla (a-b)||^2}$ occurring in the numerator of ${R(a-b)}$. We have

$\displaystyle ||\nabla (a-b)||^2$

$\displaystyle = || \nabla a - \nabla b ||^2$

$\displaystyle = || \nabla a||^2 + || \nabla b||^2 - 2 \langle \nabla a,\nabla b \rangle$

But

$\displaystyle \langle \nabla a,\nabla b \rangle = 0$

because for every ${x}$ at least one of ${a(x)}$ or ${b(x)}$ is zero, and so at least one of ${\nabla a(x)}$ or ${\nabla b(x)}$ is zero.

Using this fact, we have that the numerator of ${R(a-b)}$ is equal to the sum of the numerators of ${R(a)}$ and ${R(b)}$:

$\displaystyle \int_M ||\nabla a-b||^2 {\rm d} \mu = \int_M ||\nabla a-b||^2 {\rm d} \mu + \int_M ||\nabla a-b||^2 {\rm d}$

and the denominator of ${R(a-b)}$ is also the sum of the denominators of ${R(a)}$ and ${R(b)}$:

$\displaystyle =\int_M (a-b)^2 {\rm d} \mu$

$\displaystyle =\int_M a^2 {\rm d} \mu + \int_M b^2 {\rm d} \mu - 2 \int_M ab {\rm d} \mu$

$\displaystyle =\int_M a^2 {\rm d} \mu + \int_M b^2 {\rm d} \mu$

because ${a(x)b(x)=0}$ for every ${x}$. The fact now follows from the inequality

$\displaystyle \min \left \{ \frac {n_1}{d_1} , \frac{n_2}{d_2} \right \} \leq \frac{n_1+n_2}{d_1+d_2}$

$\Box$

The lemma now follows by observing that ${f^+}$ and ${f^-}$ are non-negative and disjointly supported, so

$\displaystyle \min\{ R(f^+) , R(f^-) \} \leq R(\bar f) \leq R(f)$

and that both ${f^+}$ and ${f^-}$ have a support of volume at most ${\frac 12 \mu(M)}$. $\Box$

If anybody is still reading, it is worth observing a couple of differences between the discrete proof and the continuous proof.

The ${\ell_1}$ Rayleigh quotient is defined slightly differently in the continuous case. It would correspond to defining it as

$\displaystyle \frac {\sum_{u\in V} \sqrt{\sum_{v: (u,v)\in E} (f_u-f_v)^2 } }{d \sum_v |f_v|}$

in the discrete case.

If ${a,b \in {\mathbb R}^V}$ are disjointly supported and nonnegative, the sum of the numerators of the Rayleigh quotients ${R(a)}$ and ${R(-b)}$ can be strictly smaller than the numerator of ${R(a-b)}$, while we always have equality in the continuous case. In the discrete case, the sum of the numerators of ${R(a)}$ and ${R(b)}$ can be up to twice the numerator of ${R(a+b)}$ (this fact is useful, but it did not come up in this proof), while again we have exact equality in the continuous case.

The chain rule calculation

$\displaystyle \nabla f^2 = 2f \nabla f$

corresponds to the step

$\displaystyle f_v^2 - f_u^2 = (f_v + f_u) \cdot (f_v - f_u)$

In the continuous case, ${f_v}$ and ${f_u}$ are “infinitesimally close”, so we can approximate ${f_v + f_u}$ by ${2f_v}$.

Readers of in theory have heard about Cheeger’s inequality a lot. It is a relation between the edge expansion (or, in graphs that are not regular, the conductance) of a graph and the second smallest eigenvalue of its Laplacian (a normalized version of the adjacency matrix). The inequality gives a worst-case analysis of the “sweep” algorithm for finding sparse cuts, it shows a necessary and sufficient for a graph to be an expander, and it relates the mixing time of a graph to its conductance.

Readers who have heard this story before will recall that a version of this result for vertex expansion was first proved by Alon and Milman, and the result for edge expansion appeared in a paper of Dodzuik, all from the mid-1980s. The result, however, is not called Cheeger’s inequality just because of Stigler’s rule: Cheeger proved in the 1970s a very related result on manifolds, of which the result on graphs is the discrete analog.

So, what is the actual Cheeger’s inequality?

Theorem 1 (Cheeger’s inequality) Let ${M}$ be an ${n}$-dimensional smooth, compact, Riemann manifold without boundary with metric ${g}$, let ${L:= - {\rm div} \nabla}$ be the Laplace-Beltrami operator on ${M}$, let ${0=\lambda_1 \leq \lambda_2 \leq \cdots }$ be the eigenvalues of ${L}$, and define the Cheeger constant of ${M}$ to be

$\displaystyle h(M):= \inf_{S\subseteq M : \ 0 < \mu(S) \leq \frac 12 \mu(M)} \ \frac{\mu_{n-1}(\partial(S))}{\mu(S)}$

where the ${\partial (S)}$ is the boundary of ${S}$, ${\mu}$ is the ${n}$-dimensional measure, and ${\mu_{n-1}}$ is ${(n-1)}$-th dimensional measure defined using ${g}$. Then

$\displaystyle h(M) \leq 2 \sqrt{\lambda_2} \ \ \ \ \ (1)$

The purpose of this post is to describe to the reader who knows nothing about differential geometry and who does not remember much multivariate calculus (that is, the reader who is in the position I was in a few weeks ago) what the above statement means, to describe the proof, and to see that it is in fact the same proof as the proof of the statement about graphs.

In this post we will define the terms appearing in the above theorem, and see their relation with analogous notions in graphs. In the next post we will see the proof.