**1. Matrix Multiplicative Weights Update **

In this post we consider the following generalization, introduced and studied by Arora and Kale, of the “learning from expert advice” setting and the multiplicative weights update method. In the “experts” model, we have a repeated game in which, at each time step , we have the option of following the advice of one of experts; if we follow the advice of expert at time , we incur a loss of , which is unknown to us (although, at time we know the loss functions ). We are allowed to choose a probabilistic strategy, whereby we follow the advice of expert with probability , so that our expected loss at time is .

In the matrix version, instead of choosing an expert we are allowed to choose a unit -dimensional vector , and the loss incurred in choosing the vector is , where is an unknown symmetric matrix. We are also allowed to choose a probabilistic strategy, so that with probability we choose the unit vector , and we incur the expected loss

The above expression can also be written as

where and we used the Frobenius inner product among square matrices defined as . The matrices that can be obtained as convex combinations of rank-1 matrices of the form where is a unit vector are called *density matrices* and can be characterized as the set of positive semidefinite matrices whose trace is 1.
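These facts are easy to check numerically. Here is a small numpy sketch (the variable names and parameters are only illustrative) that builds a density matrix as a convex combination of rank-1 projectors, verifies the trace-1 PSD characterization, and checks that the expected loss equals the Frobenius inner product with the loss matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 4, 3

# A density matrix: a convex combination of rank-1 projectors v v^T, v unit vectors
probs = rng.dirichlet(np.ones(k))                     # convex-combination weights
vecs = rng.standard_normal((k, n))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)   # make each row a unit vector
X = sum(p * np.outer(v, v) for p, v in zip(probs, vecs))

# Characterization: X is positive semidefinite with trace 1
assert np.all(np.linalg.eigvalsh(X) >= -1e-9)
assert abs(np.trace(X) - 1) < 1e-9

# Expected loss as a Frobenius inner product: sum_ij L_ij X_ij = sum_i p_i v_i^T L v_i
L = rng.standard_normal((n, n))
L = (L + L.T) / 2                                      # a symmetric loss matrix
frob = np.sum(L * X)
expected_loss = sum(p * v @ L @ v for p, v in zip(probs, vecs))
assert abs(frob - expected_loss) < 1e-8
```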

It is possible to see the above game as the “quantum version” of the experts settings. A choice of a unit vector is a *pure quantum state*, a probability distribution of pure quantum states, described by a density matrix, is a *mixed quantum state*. If is a density matrix describing a mixed quantum state, is a symmetric matrix, and is the spectral decomposition of in terms of its eigenvalues and orthonormal eigenvectors , then is the expected outcome of a measurement of in the basis , and such that is the value of the measurement if the outcome is .

If you have no idea what the above paragraph means, that is perfectly ok because this view will not be particularly helpful in motivating the algorithm and analysis that we will describe. (Here I am reminded of the joke about the way people from Naples give directions: “How do I get to the post office?”, “Well, you see that road over there? After a couple of blocks there is a pharmacy, where my uncle used to work, though now he is retired.” “Ok?” “Now, if you turn left after the pharmacy, after a while you get to a square with a big fountain and the church of St. Anthony where my niece got married. It was a beautiful ceremony, but the food at the reception was not great.” “Yes, I know that square”, “Good, don’t go there, the post office is not that way. Now, if you instead take that other road over there …”)


The main point of the above game, and of the Matrix Multiplicative Weights Update (MMWU) algorithm that plays it with bounded regret, is that it provides useful generalizations of the standard “experts” game and of the Multiplicative Weights Update (MWU) algorithm. For example, as we have already seen, MWU can provide a “derandomization” of the Chernoff bound; we will see that MMWU provides a derandomization of the *matrix* Chernoff bound. MWU can be used to approximate certain Linear Programming problems; MMWU can be used to approximate certain *Semidefinite Programming* problems.

To define and analyze the MMWU algorithm, we need to introduce certain operations on matrices. We will always work with real-valued symmetric matrices, but everything generalizes to complex-valued Hermitian matrices. If is a symmetric matrix, are the eigenvalues of , and are corresponding orthonormal eigenvectors, then we will define a number of operations and functions on that operate on the eigenvalues while leaving the eigenvectors unchanged.

The first operation is *matrix exponentiation*: we define

The operation always defines a positive definite matrix, and the resulting matrix satisfies a “Taylor expansion”

Indeed, it is more common to use the above expansion as the definition of the matrix exponential, and then derive the expression in terms of eigenvalues.
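To make the equivalence of the two definitions concrete, the following numpy/scipy sketch computes the matrix exponential both ways, via the eigendecomposition (exponentiate the eigenvalues, keep the eigenvectors) and via scipy's `expm` (which computes the same matrix function), and checks that the result is positive definite:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(1)
n = 5
A = rng.standard_normal((n, n))
A = (A + A.T) / 2                          # a symmetric matrix

# Spectral definition: exponentiate eigenvalues, keep eigenvectors
lam, V = np.linalg.eigh(A)
expA_spec = V @ np.diag(np.exp(lam)) @ V.T

# scipy computes the same matrix function (e.g. via the series definition)
expA = expm(A)
assert np.allclose(expA_spec, expA)

# e^A is always positive definite
assert np.all(np.linalg.eigvalsh(expA_spec) > 0)
```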

We also have the useful bounds

which is true for every and

which is true for all such that .

Analogously, if is positive definite, we can define

and we have a number of identities like , , , where is a scalar. We should be careful, however, not to take the analogy with real numbers too far: for example, if and are two symmetric matrices, in general it is not true that ; in fact, the above expression is always false except when and commute, in which case it is trivially true. We have, however, the following extremely useful fact.

Theorem 1 (Golden-Thompson Inequality)

The Golden-Thompson inequality will be all we need to generalize to this matrix setting everything we have proved about multiplicative weights. See this post by Terry Tao for a proof.
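The Golden-Thompson inequality says that the trace of e^(A+B) is at most the trace of e^A e^B for all symmetric A and B. A numerical spot check against random symmetric matrices (not a proof, of course, but reassuring):

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(2)
n = 4
for _ in range(100):
    A = rng.standard_normal((n, n))
    A = (A + A.T) / 2
    B = rng.standard_normal((n, n))
    B = (B + B.T) / 2
    lhs = np.trace(expm(A + B))
    rhs = np.trace(expm(A) @ expm(B))
    # Golden-Thompson: tr e^{A+B} <= tr(e^A e^B); equality essentially
    # only when A and B commute
    assert lhs <= rhs + 1e-6 * abs(rhs)
```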

The *Von Neumann entropy* of a density matrix with eigenvalues is defined as

that is, if we view as the mixed quantum state in which the pure state has probability , then is the entropy of the distribution over the pure states. Again, this is not a particularly helpful point of view, and in fact we will be interested in defining not just for density matrices but for arbitrary positive definite matrices, and even positive semidefinite ones (with the convention that , which is used also in the standard definition of entropy of a distribution).

We will be interested in using Von Neumann entropy as a regularizer, and hence we will want to know what is its Bregman divergence. Some calculations show that the Bregman divergence of the Von Neumann entropy, which is called the quantum relative entropy, is

If and are density matrices, the terms cancel out; the above definition is valid for arbitrary positive definite matrices.
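Both quantities are straightforward to compute from the spectral decomposition. Here is a sketch (using scipy's `logm`; the helper names are ours) that checks that the uniform density matrix has the maximum entropy, and that the quantum relative entropy is nonnegative (Klein's inequality) and vanishes on equal arguments:

```python
import numpy as np
from scipy.linalg import logm

def vn_entropy(X):
    lam = np.linalg.eigvalsh(X)
    lam = lam[lam > 1e-15]              # convention: 0 log 0 = 0
    return -np.sum(lam * np.log(lam))

def quantum_rel_entropy(X, Y):
    # Bregman divergence of negative Von Neumann entropy, for positive definite X, Y:
    # Tr(X log X) - Tr(X log Y) - Tr(X) + Tr(Y); the trace terms cancel
    # when X and Y are density matrices
    return np.trace(X @ logm(X) - X @ logm(Y) - X + Y).real

def random_density(rng, n):
    M = rng.standard_normal((n, n))
    P = M @ M.T + 1e-6 * np.eye(n)      # positive definite
    return P / np.trace(P)              # normalize the trace to 1

rng = np.random.default_rng(3)
n = 4
X, Y = random_density(rng, n), random_density(rng, n)

# The maximum-entropy density matrix is I/n, with entropy log n
assert abs(vn_entropy(np.eye(n) / n) - np.log(n)) < 1e-9
assert vn_entropy(X) <= np.log(n) + 1e-9

# Klein's inequality: the relative entropy is nonnegative, zero iff X == Y
assert quantum_rel_entropy(X, Y) >= -1e-9
assert abs(quantum_rel_entropy(X, X)) < 1e-7
```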

We will have to study the minima of various functions that take a matrix as an input, so it is good to understand how to compute the gradient of such functions. For example, what is the gradient of the function ? Working through the definition we see that , and indeed the gradient of the function is everywhere. Somewhat less obvious is the calculation of the gradient of the Von Neumann entropy, which is

**2. Analysis in the Constrained FTRL Framework **

Suppose that we play the game described above using agile mirror descent and negative Von Neumann entropy (appropriately scaled) as a regularizer. That is, for some that we will choose later, we use the regularizer

which has the Bregman divergence

and our feasible set is the set of density matrices

To bound the regret, we just have to plug the above definitions into the machinery that we developed in our fifth post.

At time 1, we play the identity matrix divided by n, which is the density matrix of maximum Von Neumann entropy :

At time , we play the matrix obtained as

and recall that we proved that, after steps,

If is a density matrix with eigenvalues , then the first term is

To complete the analysis we have to understand . We need to compute the gradient and set it to zero. The gradient of is just . The gradient of is

This means that we want to solve for

and satisfies

and we can write

Then we can use Golden-Thompson and the fact that , which holds if , to write

Combining everything together we have

and so, provided ,

This is the best bound we can hope for, and it matches Theorem 1 in our first post about the Multiplicative Weights Update algorithm.

If we have , we can simplify it to

where the last step comes from optimizing .
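Putting the pieces together, the algorithm itself is short: at each step, play the density matrix proportional to the matrix exponential of minus epsilon times the sum of the past losses. The following sketch (over the full set of density matrices, where the lazy and agile updates coincide; all parameters are illustrative) runs it on random losses of spectral norm at most 1 and checks the regret bound, with the comparator being the projector on the bottom eigenvector of the cumulative loss:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(4)
n, T = 4, 200
eps = np.sqrt(np.log(n) / T)        # the optimized learning rate

losses = []
for _ in range(T):
    L = rng.standard_normal((n, n))
    L = (L + L.T) / 2
    losses.append(L / np.linalg.norm(L, 2))   # spectral norm at most 1

cum = np.zeros((n, n))
alg_loss = 0.0
for L in losses:
    W = expm(-eps * cum)            # MMWU: exponentiate minus the cumulative loss
    X = W / np.trace(W)             # normalize the trace: a density matrix
    alg_loss += np.sum(L * X)       # Frobenius inner product loss
    cum += L

# The best fixed density matrix in hindsight puts all its mass on the
# bottom eigenvector of the cumulative loss
best = np.linalg.eigvalsh(cum)[0]
regret = alg_loss - best
assert regret <= 2 * np.sqrt(T * np.log(n)) + 1.0
```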

We can also write, under the condition ,

where is the “absolute value” of the matrix defined in the following way: if is a symmetric matrix, then its absolute value is . Allen-Zhu, Liao and Orecchia state the analysis in this way in their paper on generalizations of Matrix Multiplicative Weights.

Our next post will discuss applications at length, but for now let us gain a bit of intuition about the usefulness of these regret bounds. Recall that, for every symmetric matrix , we have

and so the regret bound can be reinterpreted in the following way: if we let be the loss functions used in a game played against a MMWU algorithm, and the algorithm selects density matrices , then

that is,

provided that . For example, switching with , we have

provided that , which means that if we can choose a sequence of loss matrices that make the MMWU have small loss at each step, then we are guaranteed that the sum of such matrices cannot have any large eigenvalue.

The negotiable start date is September 1st, 2022. Each position is for one year, renewable for a second. The positions offer an internationally competitive salary (up to 65,000 Euro per year, tax-free, plus relocation assistance and travel allowance), in a wonderful location that, at long last, is back to more or less normal life. The application deadline is **December 17, 2021**.

Among the topics that I am interested in are spectral graph theory, average-case complexity, “applications” of semidefinite programming, random processes on networks, approximation algorithms, pseudorandomness and combinatorial constructions.

Bocconi Computer Science is building up a theory group: besides me, we have Alon Rosen, Marek Elias, a tenured person that will join next Fall, and more hires are on the horizon. Now that traveling is ok again, and considering that Alon and I both have ERC grants, we should expect a big stream of theory visitors coming and going through Bocconi from week-long visits to semester or year long sabbaticals.

To apply, go to https://www.unibocconi.eu/faculty-postdoc and look for the position advertised as “BIDSA Informatics”, which looks like this:

and click on “apply online”. Currently it is the second position from the top in the list


**1. The Impagliazzo Hard-Core Lemma **

The Impagliazzo Hard-Core Lemma is a striking result in the theory of average-case complexity. Roughly speaking, it says that if is a function that is “weakly” hard on average for a class of “efficiently computable” functions , that is, if, for some , we have that

then there is a subset of cardinality such that is “strongly” hard-on-average on , meaning that

for a small . Thus, the reason why functions from make a mistake in predicting at least a fraction of the times is that there is a “hard-core” set of inputs such that every function from makes a mistake about 1/2 of the times for the fraction of inputs coming from .

The result is actually not literally true as stated above, and it is useful to understand a counterexample, in order to motivate the correct statement. Suppose that contains just functions, and that each function differs from in exactly a fraction of inputs from , and that the sets of mistakes are *disjoint*. Thus, for every set , no matter its size, there is a function that agrees with on at least a fraction of inputs from . The reason is that the sets of inputs on which the functions of differ from form a partition of , and so their intersections with form a partition of . By an averaging argument, one of those intersections must then contain at most elements of .

In the above example, however, if we choose any three distinct functions from , we have

So, although is weakly hard on average with respect to , we have that is not even worst-case hard for a slight extension of in which we allow functions obtained by simple compositions of a small number of functions of .

Theorem 1 (Impagliazzo Hard-Core Lemma)Let be a collection of functions , let a function, and let and be positive reals. Then at least one of the following conditions is true:

- ( is not weakly hard-on-average over with respect to a slight extension of ) There is a , an integer , and functions , such that
satisfies

- ( is strongly hard-on-average over a set of density ) There is a set such that and

where is equal to or depending on whether the boolean expression is true or false (the letter “” stands for the “indicator” function of the truth of the expression).

**2. Proving the Lemma **

Impagliazzo’s proof had polynomial in both and , and an alternative proof discovered by Nisan has a stronger bound on of the order of . The proofs of Impagliazzo and Nisan did not immediately give a set of size (the set had size ), although this could be achieved by iterating their argument. An idea of Holenstein makes it possible to prove the above statement in a more direct way.

Today we will see how to obtain the Impagliazzo Hard-Core Lemma from online optimization, as done by Barak, Hardt and Kale. Their proof achieves all the parameters claimed above, once combined with Holenstein’s ideas.

We say that a distribution (here “” stands for probability *measure*; we use this letter since we have already used last time to denote the Bregman divergence) has min-entropy at least if, for every , . In other words, the min-entropy of a distribution over a sample space is defined as

The uniform distribution over a set has min-entropy , and all distributions of min-entropy can be realized as a convex combination of distributions that are each uniform over a set of size , thus uniform distributions over large sets and large-min-entropy distributions are closely-related concepts. We will prove the following version of the hard-core lemma:

Theorem 2 (Impagliazzo Hard-Core Lemma — Min-Entropy Version)Let be a finite set, be a collection of functions , let a function, and let and be positive reals. Then at least one of the following conditions is true:

- ( is not weakly hard-on-average over with respect to ) There is a , an integer , and functions , such that
satisfies

- ( is strongly hard-on-average on a distribution of min-entropy ) There is a distribution of min-entropy such that

Under minimal assumptions on (that it contains functions), the min-entropy version implies the set version, and the min-entropy version can be used as-is to derive most of the interesting consequences of the set version.
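The correspondence between uniform distributions over large sets and high-min-entropy distributions is easy to see numerically (the parameters below are illustrative): the uniform distribution over a set of size S has min-entropy log2(S), and any convex combination of such flat distributions again has min-entropy at least log2(S):

```python
import numpy as np

def min_entropy(p):
    # H_min(P) = -log2 of the largest probability
    return -np.log2(np.max(p))

N, S = 64, 16

# Uniform distribution over a set of size S has min-entropy log2(S)
uniform_S = np.zeros(N)
uniform_S[:S] = 1.0 / S
assert abs(min_entropy(uniform_S) - np.log2(S)) < 1e-12

# A mixture of flat distributions over size-S sets still has min-entropy >= log2(S):
# every entry of the mixture is at most 1/S
rng = np.random.default_rng(6)
mix = np.zeros(N)
for w in rng.dirichlet(np.ones(3)):
    idx = rng.choice(N, size=S, replace=False)
    q = np.zeros(N)
    q[idx] = 1.0 / S
    mix += w * q
assert min_entropy(mix) >= np.log2(S) - 1e-9
```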

Let us restate it one more time.

Theorem 3 (Impagliazzo Hard-Core Lemma — Min-Entropy Version)Let be a finite set, be a collection of functions , let a function, and let and be positive reals. Suppose that for every distribution of min-entropy we haveThen there is a , an integer , and functions , such that

satisfies

As in previous posts, we are going to think about a game between a “builder” that works toward the construction of and an “inspector” that looks for defects in the construction. More specifically, at every round , the inspector is going to pick a distribution of min-entropy and the builder is going to pick a function . The loss function, that the inspector wants to minimize, is

The inspector runs the agile online mirror descent algorithm with the constraint of picking distributions of the required min-entropy, and using the entropy regularizer; the builder always chooses a function such that

which is always a possible choice given the assumptions of our theorem.

Just by plugging the above setting into the analysis from the previous post, we get that if we play this online game for steps, the builder picks functions such that, *for every distribution* of min-entropy , we have

We will prove that (1) holds in the next section, but we emphasize again that it is just a matter of mechanically using the analysis from the previous post. Impagliazzo’s proof relies, basically, on playing the game using lazy mirror descent with regularization, and he obtains a guarantee like the one above after steps.

What do we do with (1)? Impagliazzo’s original reasoning was to define

and to consider the set of “bad” inputs such that . We have

and so

The min-entropy of the uniform distribution over is , and this needs to be less than , so we conclude that happens for at most a fraction of elements of .

This is qualitatively what we promised, but it is off by a factor of 2 from what we stated above. Removing the factor of 2 requires a subsequent idea of Holenstein. In Holenstein’s analysis, we sort the elements of according to

and he lets be the set of elements of for which the above quantity is smallest, and he shows that if we properly pick an integer and define

then will be equal to for all and also for at least half the , meaning that for at least a fraction of the input. Since this is a bit outside the scope of this series of posts, we will not give an exposition of Holenstein’s argument.

**3. Analysis of the Online Game **

It remains to show that we can achieve (1) with of the order of . As we said, we play a game in which, at every step

- The “inspector” player picks a distribution of min-entropy at least , that is, it picks a number for each such that .
- The “builder” player picks a function , whose existence is guaranteed by the assumption of the theorem, such that
and defines the loss function

- The “inspector” is charged the loss .

We analyze what happens if the inspector plays the strategy defined by agile mirror descent with negative entropy regularizer. Namely, we define the regularizer

for a choice of that we will fix later. The corresponding Bregman divergence is

and we work over the space of distributions of min-entropy .
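The projection step of agile mirror descent onto this set has a convenient closed form: the KL (negative-entropy Bregman) projection of a weight vector onto the distributions with all probabilities at most c is obtained by scaling and capping, that is, p_x = min(c, beta * w_x) for the right scaling factor beta. A sketch (the bisection is just one simple way to find beta; names and parameters are ours):

```python
import numpy as np

def kl_project_capped(w, c):
    """KL projection of positive weights w onto the distributions p with
    p_x <= c for all x, i.e. min-entropy at least log(1/c).
    The projection has the form p_x = min(c, beta * w_x)."""
    lo, hi = 0.0, 1e12
    for _ in range(200):                  # bisection on the scaling factor beta
        beta = (lo + hi) / 2
        if np.minimum(c, beta * w).sum() < 1.0:
            lo = beta
        else:
            hi = beta
    return np.minimum(c, beta * w)

rng = np.random.default_rng(7)
N, S = 32, 8
w = rng.random(N) + 1e-3                  # unnormalized multiplicative weights
p = kl_project_capped(w, 1.0 / S)
assert abs(p.sum() - 1) < 1e-6            # a distribution...
assert p.max() <= 1.0 / S + 1e-9          # ...with min-entropy at least log2(S)
```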

The agile online mirror descent algorithm is

so that is the uniform distribution, and for

Solving the first step of agile online mirror descent, we have

Using the analysis from the previous post, for every distribution in , and every number of steps, we have the regret bound

and we can bound

and

where, in the last step, we used the fact that the quantity in parentheses is either 0 or , which is , and that because is a distribution.

Overall, the regret is bounded by

where the last inequality comes from an optimized choice of .

Recall that we choose the functions so that for every , so for every

and by choosing of the order of we get

It remains to observe that

so we have that for every distribution of min-entropy at least it holds that

which is the statement that we promised and from which the Impagliazzo Hard-Core Lemma follows.

**4. Some Final Remarks **

After Impagliazzo circulated a preliminary version of his paper, Nisan had the following idea: consider the game defined above, in which a builder picks an , an inspector picks a distribution of the prescribed min-entropy, and the loss for the inspector is given by . We can think of it as a zero-sum game if we also assign a gain to the builder.

If the builder plays second, there is a strategy that guarantees a gain that is at least , and so there must be a mixed strategy, that is, a distribution over functions in , that guarantees such a gain even if the builder plays first. In other words, for all distributions of the prescribed min-entropy we have

Nisan then observes that we can sample functions and have, with high probability

and the sampling bound on can be improved to order of with the same conclusion.

Basically, what we have been doing today is to come up with an algorithm that finds an approximate solution for the LP that defines the optimal mixed strategy for the game, and to design the algorithm in such a way that the solution is very sparse.

This is a common feature of other applications of online optimization techniques to find “sparse approximations”: one sets up an optimization problem whose objective function measures the “approximation error” of a given solution. The object we want to approximate is the optimum of the optimization problem, and we use variants of mirror descent to prove the existence of a sparse solution that is a good approximation.


In 3-XOR, we have a system of linear equations modulo 2, with three variables per equation, that might look something like

The above system is not satisfiable (if we add up the left-hand sides we get 0, if we add up the right-hand sides we get 1), but it is possible to satisfy of the equations, for example by setting all the variables to 1. In the Max 3-XOR problem (which we will simply refer to as “3-XOR” from now on), given a system of equations mod 2 with three variables per equation, we want to find an assignment that satisfies as many equations as possible.

Either setting all the variables to zero or setting all the variables to one will satisfy half of the equations, and the interesting question is how much better than 1/2 it is possible to do on a given instance. Khot and Naor provide an algorithm that, given an instance in which it is possible to satisfy an fraction of equations, finds a solution that satisfies at least fraction of equations, where is the number of variables. The algorithm is randomized, it runs in polynomial time, and it succeeds with high probability. I believe that it is still the state of the art in terms of worst-case approximation guarantee.
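As a baseline, it is easy to play with small instances by brute force. The sketch below (a random instance with made-up parameters) checks that the all-zeros assignment satisfies exactly the equations with right-hand side 0, and that the best assignment always satisfies at least half the equations:

```python
import itertools
import numpy as np

rng = np.random.default_rng(8)
n, m = 8, 40
# A random 3-XOR instance: each equation is x_i + x_j + x_k = b (mod 2)
eqs = [(tuple(rng.choice(n, 3, replace=False)), int(rng.integers(0, 2)))
       for _ in range(m)]

def frac_satisfied(x):
    return np.mean([(x[i] ^ x[j] ^ x[k]) == b for (i, j, k), b in eqs])

# All-zeros satisfies exactly the equations with right-hand side 0
zeros = frac_satisfied([0] * n)
assert zeros == np.mean([b == 0 for _, b in eqs])

# Brute force over all 2^n assignments: the optimum is at least 1/2,
# since all-zeros and all-ones together cover every equation
best = max(frac_satisfied(x) for x in itertools.product([0, 1], repeat=n))
assert best >= 0.5
```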

Like the approximation algorithm for sparsest cut in Abelian Cayley graphs implied by the result of Bauer et al. that was the subject of the last two posts, the result of Khot and Naor *does not* prove a bound on the integrality gap of a relaxation of the problem.

I will describe the Khot-Naor algorithm and explain how it manages to use convex optimization to provide an approximation algorithm, but without establishing an integrality gap bound. I thank my student Lucas Pesenti for explaining the algorithm to me and for thinking about this problem together.

If our 3-XOR instance has equations and variables, then the problem of maximizing the number of satisfied equations can be rewritten as

so that our goal is to approximate the combinatorial optimization problem

Up to a constant factor loss in the approximation guarantee, Khot and Naor show that the above is equivalent to

where is a symmetric 3-tensor with entries in and with non-zero entries.

Before continuing, let us recall that if is an matrix, then its -to- operator norm has the characterization

We could also define the “Grothendieck norm” of a matrix as the following natural semidefinite programming relaxation of the -to- norm:

where the and are arbitrary vectors. The Grothendieck inequality is

where the is an absolute constant, known to be less than . Furthermore, the above inequality has a constructive proof, and it leads to a polynomial time constant factor approximation for the problem of finding values and in that maximize (2).

Basically, we can see problem (1) as the natural generalization of (2) to tensors, and one would like to see a semidefinite programming relaxation of (1) achieving something resembling the Grothendieck inequality, but with a loss of something like . As I mentioned above, this remains an open question, as far as I know.

The idea of Khot and Naor is the following. Suppose that we are given an instance of problem (1), and suppose that is an optimal solution, and let us call

the value of the optimum (the algorithm will not need to know or guess ).

The key step is now to see that if we pick a *random* , there is at least a probability that

This is a bit difficult, but it is really easy to see that with probability we have

and we can do that by defining a vector such that , so that

So we have

which, using Cauchy-Schwarz, gives

Now, for a random , we have

and

so by Paley–Zygmund we have, let’s say

which, together with the definition of and the fact that the distribution of is symmetric around zero, gives us the claim (3).
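The moment computation behind this use of Paley-Zygmund is easy to check empirically: for random signs z and a fixed vector c, the square of the inner product z.c has mean equal to the squared norm of c and second moment at most 3 times its square, so Paley-Zygmund gives a constant probability that the square exceeds half its mean. A quick simulation with illustrative parameters:

```python
import numpy as np

rng = np.random.default_rng(11)
n, trials = 6, 200000
c = rng.standard_normal(n)                  # a fixed coefficient vector
Z = rng.choice([-1, 1], size=(trials, n))   # random signs
q = (Z @ c) ** 2

# E[(z.c)^2] = ||c||^2 and E[(z.c)^4] <= 3 ||c||^4 for Rademacher z
norm2 = np.dot(c, c)
assert abs(q.mean() - norm2) < 0.05 * norm2
assert (q ** 2).mean() <= 3 * norm2 ** 2 * 1.05

# Paley-Zygmund: Pr[(z.c)^2 >= ||c||^2 / 2] >= (1/4) E[q]^2 / E[q^2] >= 1/12
frac = np.mean(q >= norm2 / 2)
assert frac >= 1 / 12
```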

Now suppose that we have been lucky and that we have found such an . We define the matrix

and we see that our claim can be written as

At this point we just apply the algorithm implied by the Grothendieck inequality to the matrix , and we find and in such that

meaning that

Summarizing, our algorithm is to pick a random vector and to find a constant-factor approximation for the problem

using semidefinite programming. We do that times, and take the best solution.
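On instances small enough to brute-force, we can see the reduction at work: fixing z turns the trilinear problem into a bilinear (matrix) one, and maximizing over all choices of z recovers the trilinear optimum exactly. In the sketch below (a random symmetric tensor with made-up parameters) we enumerate z, which the actual algorithm replaces by random sampling plus the Grothendieck SDP, at the cost of the square-root-of-n factor:

```python
import itertools
import numpy as np

rng = np.random.default_rng(10)
n = 4
# A random symmetric 3-tensor (symmetrized by averaging over permutations)
T = rng.integers(-1, 2, (n, n, n)).astype(float)
T = (T + T.transpose(0, 2, 1) + T.transpose(1, 0, 2) +
     T.transpose(1, 2, 0) + T.transpose(2, 0, 1) + T.transpose(2, 1, 0)) / 6

signs = [np.array(s) for s in itertools.product([-1, 1], repeat=n)]

# The trilinear optimum: max over x, y, z in {-1,1}^n of sum_ijk T_ijk z_i x_j y_k
opt = max(np.einsum('ijk,i,j,k->', T, z, x, y)
          for z in signs for x in signs for y in signs)

# Fixing z gives the matrix B_z[j,k] = sum_i z_i T_ijk, and the bilinear
# problem max x^T B_z y (here solved by brute force instead of the SDP);
# maximizing over z recovers the trilinear optimum
best = max(max(x @ np.einsum('ijk,i->jk', T, z) @ y
               for x in signs for y in signs)
           for z in signs)
assert abs(best - opt) < 1e-9
```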

The analysis can be turned into an upper bound certificate in the following way. For the (suboptimal) analysis using Paley-Zygmund, we only need the entries of the random to be 4-wise independent, and there are distributions on where the entries are unbiased and 4-wise independent, and such that the sample space is of size polynomial in . Thus, one could write an SDP relaxation of (4) for each in the support of such a distribution, and then take the maximum of these SDPs, multiply it by , and it would be a certified upper bound. Such an upper bound, however, would not come from a relaxation of the 3-XOR problem, and I find it really strange that it is not clear how to turn these ideas into a proof that, say, the standard degree-4 sum-of-squares semidefinite programming relaxation of 3-XOR has an integrality gap at most .

which, together with the fact that and , implies the Buser inequality

The proof of (1), due to Shayan Oveis Gharan and myself, is very similar to the proof by Bauer et al. of (2).

**1. Ideas **

For a positive integer parameter , call the multigraph whose adjacency matrix is , where is the adjacency matrix of . So is a -regular graph, and each edge in corresponds to a length- walk in . Our proof boils down to showing

which gives (1) after we note that

and

and we combine the above inequalities with :

The reader will see that our argument could also prove (roughly as done by Bauer et al.) that , simply by reasoning about distributions of cuts instead of reasoning about ARV solutions, which would give (2) more directly. By reasoning about ARV solutions, however, we are also able to establish (3), which we think is independently interesting.
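Concretely, if A is the adjacency matrix of a d-regular graph G, then A raised to the power k is the adjacency matrix of the multigraph whose edges are the length-k walks of G, and that multigraph is d^k-regular, as a two-line computation confirms (on the 6-cycle, for illustration):

```python
import numpy as np

# Adjacency matrix of the 6-cycle, a 2-regular graph
n, d, k = 6, 2, 3
A = np.zeros((n, n), dtype=int)
for v in range(n):
    A[v, (v + 1) % n] = 1
    A[(v + 1) % n, v] = 1

# Entry (u, v) of A^k counts the walks of length k from u to v
Ak = np.linalg.matrix_power(A, k)
assert np.all(Ak.sum(axis=1) == d ** k)   # the power multigraph is d^k-regular
```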

It remains to prove (4). I will provide a completely self-contained proof, including the definition of Cayley graphs and of the ARV relaxation, but, first, here is a summary for the reader already familiar with this material: we take an optimal solution of ARV for , and bound how well it does for . We need to understand how much bigger is the fraction of edges cut by the solution in compared to the fraction of edges cut in ; a random edge in is obtained by randomly sampling generators and adding them together, which is roughly like sampling times each generator, each time with a random sign. Because of cancellations, we expect that the sum of these random generators can be obtained by summing roughly copies of each of the generators, or generators in total. So a random edge of corresponds roughly to edges of , and this is by how much, at most, the fraction of cut edges can grow.

**2. Definitions **

Now we present more details and definitions. Recall that if is a group, for which we use additive notation, and is a set or multiset of group elements, then the Cayley graph is the graph that has a vertex for every group element and an edge for every group element and every element . We restrict ourselves to undirected graphs, so we will always assume that if is an element of , then is also an element of (with the same multiplicity, if is a multiset). Note that the resulting undirected graph is -regular.

Several families of graphs can be seen to be Cayley graphs, including cycles, cliques, balanced complete bipartite graphs, hypercubes, toruses, and so on. All the above examples are actually Cayley graphs of *Abelian* groups. Several interesting families of graphs, for example several families of expanders, are Cayley graphs of non-Abelian groups, but the result of this post will apply only to Abelian groups.
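For concreteness, here is a sketch that builds Cayley graphs of the cyclic group of integers mod n: with generators {1, -1} we get the n-cycle, and any generating multiset closed under negation yields an undirected regular graph with degree equal to the number of generators:

```python
import numpy as np

def cayley_adjacency(n, gens):
    """Adjacency matrix of the Cayley graph of Z_n with generating multiset
    gens; gens should be closed under negation (g in gens iff -g in gens)."""
    A = np.zeros((n, n), dtype=int)
    for x in range(n):
        for g in gens:
            A[x, (x + g) % n] += 1
    return A

n = 12
A = cayley_adjacency(n, [1, -1])           # the n-cycle
assert np.array_equal(A, A.T)              # undirected
assert np.all(A.sum(axis=1) == 2)          # degree = number of generators

H = cayley_adjacency(n, [1, -1, 3, -3])    # a 4-regular Abelian Cayley graph
assert np.all(H.sum(axis=1) == 4)
```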

To define the ARV relaxation, let us take it slowly and start from the definition of the sparsest cut problem. The edge expansion problem is closely related to the *sparsest cut* problem, which can be defined as

where is the normalized Laplacian of and is the normalized Laplacian of the clique on vertices. We wrote it this way to emphasize the similarity with the computation of the second smallest normalized Laplacian eigenvalue, which can be written as

where the second equality is perhaps not obvious but is easily proved (all the solutions have the same cost function on both sides, and the last expression is shift invariant, so there is no loss in optimizing over all or just in the space ). We see that the computation of is just a relaxation of the sparsest cut problem, and so we have

We can write the sparsest cut problem in a less algebraic version as

and recall that

The cost function for does not change if we switch with , so could be defined equivalently as an optimization problem over subsets , and at this point and are the same problem except for an extra factor of in the denominator of . Such a factor is always between and , so we have:

(Note that all these definitions have given us a particularly convoluted proof that .)

Yet another way to characterize is as

where is the Frobenius inner product between matrices. This is also something that is not obvious but not difficult to prove, the main point being that if we write a PSD matrix as , then

and so there is no loss in passing from an optimization over all PSD matrices versus all rank-1 PSD matrices.

We can rewrite in terms of the Cholesky decomposition of as

where the correspondence between PSD matrices and vectors is that we have (that is, is the Cholesky decomposition of and is the Gram matrix of the ). An integral solution of the sparsest cut problem corresponds to choosing rank and solutions in which each 1-dimensional is either 1 or 0, depending on whether is in the set or not. The ARV relaxation is

which is a relaxation of sparsest cut because the “triangle inequality” constraints that we introduced are satisfied by 1-dimensional 0/1 solutions. Thus we have

**3. The Argument **

Let us take any solution for ARV, and let us symmetrize it so that the symmetrized solution satisfies

That is, make sure that the contribution of each edge to the numerator of the cost function depends only on the generator that defines the edge, and not on the pair of endpoints.

It is easier to see that this symmetrization is possible if we view our solution as a PSD matrix . In this case, for every group element , we can let be the solution (with the same cost) obtained by permuting rows and columns according to the mapping ; then we can consider the solution , which satisfies the required symmetry condition.

Because of this condition, the cost function applied to in is

and the cost function applied to in is

meaning that our goal is now simply to prove

or, if we take a probabilistic view

If we let be the number of times that generator appears in the sum , counting cancellations (so that, if appears 4 times and appears 6 times we let and ) we have

where multiplying an integer by a generator means adding the generator to itself that many times. Using the triangle inequality and the symmetrization we have

The next observation is that, for every ,

where the expectation is over the choice of . This is because can be seen as , where we define if , if and otherwise. We have

Combining everything, we have

So every solution of ARV for is also a solution for with a cost that increases at most by a factor, which proves (4) as promised.

**4. A Couple of Interesting Questions **

We showed that ARV has integrality gap at most for every -regular Abelian Cayley graph, but we did not demonstrate a rounding algorithm able to achieve an approximation ratio.

If we follow the argument, starting from an ARV solution of cost , choosing of the order of we see that has an ARV solution (the same as before) of cost at most, say, , and so , implying that , and so Fiedler’s algorithm, applied to an eigenvector of the second Laplacian eigenvalue, finds a cut of sparsity at most .

We can also see that if is the matrix corresponding to an ARV solution of value for , then one of the eigenvectors of must be a test vector of Rayleigh quotient at most for the Laplacian of , where is of the order of . However it is not clear how to get, from such a vector, a test vector of Rayleigh quotient at most for the Laplacian of , though one such vector should be for a properly chosen in the range .

If this actually works, then the following, or something like it, would be a rounding algorithm: given a PSD solution of ARV of cost , consider PSD matrices of the form , which do not necessarily satisfy the triangle inequalities any more, for , and try to round them using a random hyperplane. Would this make sense for other classes of graphs?

It is plausible that sparsest cut in Abelian Cayley graphs actually admits a constant-factor approximation in polynomial or quasi-polynomial time, and maybe even that ARV itself achieves a constant-factor approximation.


The starting point is the Cheeger inequalities on graphs.

If $G$ is a $d$-regular graph, and $A$ is its adjacency matrix, then we define the *normalized Laplacian matrix* of $G$ as $L = I - \frac 1d A$, and we call $\lambda_2$ the second smallest (counting multiplicities) eigenvalue of $L$. It is important in the spectral theory of graphs that this eigenvalue has a variational characterization as the solution of the following optimization problem:

$$\lambda_2 = \min_{x \perp (1, \ldots, 1)} \ \frac{\sum_{(u,v) \in E} (x_u - x_v)^2}{d \sum_v x_v^2}$$
The *normalized edge expansion* of $G$ is defined as

$$\phi(G) = \min_{S \subseteq V \,:\, |S| \le |V|/2} \ \frac{E(S, V \setminus S)}{d \cdot |S|}$$

where $E(S, V \setminus S)$ denotes the number of edges with one endpoint in $S$ and one endpoint outside $S$. We have talked about these quantities several times, so we will just jump to the fact that the following *Cheeger inequalities* hold:

$$\frac{\lambda_2}{2} \ \le \ \phi(G) \ \le \ \sqrt{2 \lambda_2}$$
Those inequalities are called Cheeger inequalities because the upper bound is the discrete analog of a result of Cheeger concerning Riemannian manifolds. There were previous posts about this here and here, and for our current purposes it will be enough to recall that, roughly speaking, a Riemannian manifold defines a space on which we can define real-valued and complex-valued functions, and such functions can be integrated and differentiated. Subsets of a manifold have, if they are measurable, a well defined volume, and their boundary a well defined lower-dimensional volume; it is possible to define a Cheeger constant of a manifold $M$ in a way that is syntactically analogous to the definition of edge expansion:

$$h(M) = \inf_{S \,:\, \mathrm{vol}(S) \le \mathrm{vol}(M)/2} \ \frac{\mathrm{vol}(\partial S)}{\mathrm{vol}(S)}$$
It is also possible to define a Laplacian operator on smooth functions $f : M \rightarrow {\mathbb R}$, and if $\lambda_2$ is the second smallest eigenvalue of the Laplacian we have the characterization

$$\lambda_2 = \min_{f \,:\, \int_M f = 0} \ \frac{\int_M \| \nabla f \|^2}{\int_M f^2}$$

and Cheeger proved

$$\lambda_2 \ge \frac{h^2(M)}{4}$$
which is syntactically almost the same inequality and, actually, has a syntactically very similar proof (the extra factor comes from the fact that things are normalized slightly differently).

The “dictionary” between finite graphs and compact manifolds is that vertices correspond to points, degree corresponds to dimensionality, adjacent vertices correspond to “infinitesimally close” points along one dimension, the set of edges in a cut corresponds to the boundary of a set, “volume” corresponds to number of edges (both when we think of the volume of the boundary and the volume of a set), vectors correspond to smooth functions, and the collection of values of a vector at the neighbors of a vertex corresponds to the gradient of a function at a point.

Having made this long premise, the point of this post is that the inequality

$$\lambda_2 \le 2 h$$

which is the easy direction to show in the graph case, does not hold for manifolds.
which is the easy direction to show in the graph case, does not hold for manifolds.

The easy proof for graphs is that if $S$ is the cut that realizes the minimum edge expansion, we can take $x$ to be the 0/1 indicator vector of $S$, or rather the projection of that vector on the space orthogonal to $(1, \ldots, 1)$, and a two-line calculation gives the inequality.
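Written out explicitly (assuming the normalization $L = I - \frac 1d A$, so that $x^T L x = \frac 1d \sum_{(u,v) \in E} (x_u - x_v)^2$, and that $|S| \le n/2$), the two-line calculation is:

```latex
\lambda_2 \ \le \ \frac{x^T L x}{\| x \|^2}
  \ = \ \frac{\frac 1d \sum_{(u,v) \in E} (x_u - x_v)^2}{\sum_v x_v^2 - \frac 1n \left( \sum_v x_v \right)^2}
  \ = \ \frac{\frac 1d \, E(S, V \setminus S)}{|S| \left( 1 - \frac{|S|}{n} \right)}
  \ \le \ \frac{2 \, E(S, V \setminus S)}{d \, |S|}
  \ \le \ 2 \, \phi(G)
```

where the second equality uses the fact that projecting the indicator vector orthogonally to $(1, \ldots, 1)$ leaves the numerator unchanged and makes the denominator $|S| - |S|^2/n$, and the next inequality uses $|S| \le n/2$.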

If, however, $S$ is a subset of points of the manifold that realizes the Cheeger constant, we cannot define $f$ to be the indicator function of $S$, because that function would not be smooth and its gradient would be undefined.

We could think of rescuing the proof by taking a smoothed version of the indicator of $S$, something that goes to 1 on one side and to 0 on the other side of the cut, as a function of the distance from the boundary. The “quadratic form” of such a function, however, would depend not just on the area of the boundary of $S$, but also on how quickly the volume of the subset of the manifold at distance at most $t$ from the boundary of $S$ grows as a function of $t$.

If the Ricci curvature of the manifold is negative, this volume grows very quickly, and the quadratic form of the smoothed function is very bad. The graph analog of this would be a graph made of two expanders joined by a bridge edge, and the problem is that the analogy between graphs and manifolds breaks down because “infinitesimal distance” in the manifold corresponds to any distance on the graph, and although the bridge edge is a sparse cut in the graph, it is crossed by a lot of paths of length , or at least this is the best intuition that I have been able to build about what breaks down here in the analogy between graphs and manifolds.

If the Ricci curvature of the manifold is non-negative and the dimension is bounded, there is however an inequality in manifolds that goes in the direction of the “easy graph Cheeger inequality,” and it has the form

$$\lambda_2 \le C \cdot h^2(M)$$

This is the *Buser inequality* for manifolds of non-negative Ricci curvature, and note how strong it is: together with Cheeger's bound, it says that $\lambda_2$ and $h^2(M)$ approximate each other up to the constant factor $C$.

If the Ricci curvature is only bounded from below by $-\delta$, for some $\delta > 0$, and $n$ is the dimension of the manifold, then one has the inequality

$$\lambda_2 \le C \cdot \left( \sqrt{\delta (n-1)} \cdot h(M) + h^2(M) \right)$$
Given the importance that curvature has in the study of manifolds, there has been considerable interest in defining a notion of curvature for graphs. For example, curvature relates to how quickly balls around a point grow with the radius of the ball, a concept that is of great importance in graphs as well; curvature is a locally defined concept that has global consequences, and in graph property testing one is precisely interested in understanding global properties based on local ones; and a Buser-type inequality in graphs would be very interesting because it would provide a class of graphs for which Fiedler's algorithm provides a constant-factor approximation for sparsest cut.

There have been multiple attempts at defining notions of curvature for graphs, the main ones being Ollivier curvature and Bakry-Emery curvature. Each captures some but not all of the useful properties of Ricci curvature in manifolds. My (admittedly, poorly informed) intuition is that curvature in manifolds is defined “locally” but, as we mentioned above, “local” in graphs can have multiple meanings depending on the distance scale, and it is difficult to come up with a clean and usable definition that talks about multiple distance scales.

Specifically to the point of the Buser inequality, Bauer et al and Klartag et al prove Buser inequalities for graphs with respect to the Bakry-Emery definition. Because Cayley graphs of Abelian groups happen to have curvature 0 according to this definition, their work gives the following result:

**Theorem 1 (Buser inequality for Cayley graphs of Abelian groups)** If $G$ is a $d$-regular Cayley graph of an Abelian group, $\lambda_2$ is the second smallest normalized Laplacian eigenvalue of $G$, and $\phi$ is the normalized edge expansion of $G$, we have

$$\lambda_2 \le C \cdot d \cdot \phi^2$$
(Klartag et al. state the result as above; Bauer et al. state it only for certain groups, but I think that their proof applies to all Abelian groups)

In particular, the above statement implies that if we have a $d$-regular Abelian Cayley graph then $\lambda_2$ determines the sparsest cut up to a multiplicative error $O(\sqrt d)$, and that Fiedler's algorithm is an $O(\sqrt d)$-approximate algorithm for sparsest cut in Abelian Cayley graphs.
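The cycle $C_n$, which is the Cayley graph of ${\mathbb Z}_n$ with generators $\{+1, -1\}$, is a good sanity check: Cheeger's quadratic loss is tight on it, and the ratio between $\lambda_2$ and $\phi^2$ stays bounded, as the theorem predicts. A quick numerical check (NumPy; the sizes are arbitrary):

```python
import numpy as np

def cycle_stats(n):
    """lambda_2 of the normalized Laplacian and edge expansion of the n-cycle,
    the Cayley graph of Z_n with generators +1 and -1 (so d = 2)."""
    A = np.zeros((n, n))
    for i in range(n):
        A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
    lam2 = np.sort(np.linalg.eigvalsh(np.eye(n) - A / 2))[1]
    phi = 2.0 / n          # the optimal cut, an arc of n/2 vertices, cuts 2 edges
    return lam2, phi

for n in [100, 200, 400]:
    lam2, phi = cycle_stats(n)
    # Cheeger holds with quadratic loss, while lam2 / phi^2 stays bounded (Buser)
    print(n, lam2 / 2 <= phi <= np.sqrt(2 * lam2), lam2 / phi ** 2)
```

Here $\lambda_2 = 1 - \cos(2\pi/n) \approx 2\pi^2/n^2$ while $\phi = 2/n$, so the ratio $\lambda_2 / \phi^2$ converges to $\pi^2/2$ as $n$ grows.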

At some point in 2015 (or maybe 2014, during the Simons program on spectral graph theory?), I read the paper of Bauer et al. and wondered about a few questions. First of all, is there a way to prove the above theorem without using any notion of curvature?

Secondly, the theorem implies that Fiedler's algorithm has a good approximation ratio, but it does so in a very unusual way: $\lambda_2$ is a continuous relaxation of the sparsest cut problem; Fiedler's algorithm, as analyzed through Cheeger's inequality, provides a somewhat bad rounding of it; and the Buser inequality tells us that the relaxation has a very poor integrality gap on all Abelian Cayley graphs, so even though the rounding seems bad it is actually good compared to the integral optimum. There should be an underlying reason for all this, meaning some other relaxation whose integrality gap for sparsest cut is at most $O(\sqrt d)$.

I mentioned these questions to Shayan Oveis Gharan, and we were able to prove that if $ARV(G)$ is the value of the Arora-Rao-Vazirani (or Goemans-Linial) semidefinite programming relaxation of sparsest cut, then, for all $d$-regular Abelian Cayley graphs, we have

$$\lambda_2 \le C \cdot d \cdot (ARV(G))^2$$

which proves the above theorem and, together with the Cheeger inequality $\phi(G) \le \sqrt{2 \lambda_2}$, also shows

$$\phi(G) \le O(\sqrt d) \cdot ARV(G)$$

that is, the ARV relaxation has integrality gap at most $O(\sqrt d)$ on Abelian Cayley graphs. I will discuss the proof (which is a sort of “sum-of-squares version” of the proof of Bauer et al.) in the next post.

Today the Italian academic community, along with lots of other people, was delighted to hear that Giorgio Parisi is one of the three recipients of the 2021 Nobel Prize for Physics.

Parisi has been a giant in the area of understanding “complex” and “disordered” systems. Perhaps, his most influential contribution has been his “replica method” for the analysis of the Sherrington-Kirkpatrick model. His ideas have led to several breakthroughs in statistical physics by Parisi and his collaborators, and they have also found applications in computer science: to tight analyses on a number of questions about combinatorial optimization on random graphs, to results on random constraint satisfaction problems (including the famous connection with random k-SAT analyzed by Mezard, Parisi and Zecchina) and random error correcting codes, and to understanding the solution landscape in optimization problems arising from machine learning. Furthermore these ideas have also led to the development and analysis of algorithms.

The news was particularly well received at Bocconi, where most of the faculty of the future CS department has done work that involved the replica method. (Not to be left out, even I have recently used replica methods.)

Mezard and Montanari have written a book-length treatment on the interplay between ideas from statistical physics, algorithms, optimization, information theory and coding theory that arise from this tradition. Readers of *in theory* looking for a shorter exposition aimed at theoretical computer scientists will enjoy these notes posted by Boaz Barak, or this even shorter post by Boaz.

In this post, I will try to give the reader a sense of what the replica method for the Sherrington-Kirkpatrick model looks like when applied to the average-case analysis of optimization problems, stripped of all the physics. Of course, without the physics nothing makes any sense, and the interested reader should look at Boaz's posts (and at the references that he provides) for an introduction to the context. I did not have time to check too carefully what I wrote, so be aware that several details could be wrong.

What is the typical value of the max cut in a random graph with $n$ vertices?

Working out an upper bound using union bounds and the Chernoff bound, and a lower bound by thinking about a greedy algorithm, we can quickly convince ourselves that the answer is $\frac{n^2}{8} + \Theta(n^{3/2})$. Great, but *what is the constant in front of the $n^{3/2}$?* This question is answered by the *Parisi formula*, though this fact was not rigorously established by Parisi. (Guerra proved that the formula gives an upper bound, and Talagrand proved that it is tight.)

Some manipulations can reduce the question above to the following: suppose that I pick a random symmetric matrix $M$, say with zero diagonal, and such that (up to the symmetry requirement) the entries are mutually independent, with each entry equally likely to be $+1$ or $-1$, or perhaps each entry distributed according to a standard normal distribution (the two versions can be proved to be equivalent); what is the typical value of

$$\max_{x \in \{ -1, +1 \}^n} \ x^T M x$$

up to lower-order additive terms?

As a first step, we could replace the maximum with a “soft-max,” and note that, for every choice of $\beta > 0$, we have

$$\max_x \ x^T M x \ \le \ \frac 1\beta \ln \sum_{x \in \{-1,+1\}^n} e^{\beta \, x^T M x}$$
The above upper bound gets tighter and tighter for larger $\beta$, so if we were able to estimate

$$\mathop{\mathbb E} \ \ln \sum_x e^{\beta \, x^T M x}$$

for every $\beta$ (where the expectation is over the randomness of $M$), then we would be in good shape.
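The soft-max bound itself is easy to verify by exhaustive enumeration on a tiny instance (the size $n = 8$ and the values of $\beta$ below are arbitrary illustrative choices):

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n = 8
M = np.triu(rng.choice([-1.0, 1.0], size=(n, n)), 1)
M = M + M.T                                   # random symmetric +-1 matrix, zero diagonal

xs = np.array(list(itertools.product([-1, 1], repeat=n)))
vals = np.einsum('bi,ij,bj->b', xs, M, xs)    # x^T M x for all 2^n sign vectors
true_max = vals.max()
for beta in [0.1, 0.5, 2.0]:
    soft = np.log(np.exp(beta * vals).sum()) / beta
    # the soft-max upper-bounds the max, and exceeds it by at most n*ln(2)/beta
    assert true_max <= soft <= true_max + n * np.log(2) / beta
```

The excess of the soft-max over the max is at most $\frac{n \ln 2}{\beta}$, which is why taking $\beta$ large makes the bound tight.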

We could definitely use convexity and write

$$\mathop{\mathbb E} \ \ln \sum_x e^{\beta \, x^T M x} \ \le \ \ln \ \mathop{\mathbb E} \sum_x e^{\beta \, x^T M x}$$

and then use linearity of expectation and independence of the entries of $M$ to get to

$$\ln \ \sum_x \prod_{i < j} \mathop{\mathbb E} \ e^{2 \beta M_{ij} x_i x_j}$$
Now things simplify quite a bit because, for all $i \neq j$, the expression $M_{ij} x_i x_j$, in the Rademacher setting, is equally likely to be $+1$ or $-1$, so that, for every $i < j$, we have

$$\mathop{\mathbb E} \ e^{2 \beta M_{ij} x_i x_j} = \cosh (2 \beta)$$

and

$$\cosh(2\beta) \le e^{2 \beta^2}$$

so that

$$\mathop{\mathbb E} \ \max_x \ x^T M x \ \le \ \frac{n \ln 2}{\beta} + \beta n^2$$

which, choosing $\beta = \sqrt{\ln 2} / \sqrt n$, gives an upper bound $O(n^{3/2})$, which is in the right ballpark. Note that these are exactly the same calculations as come out of a Chernoff bound and union bound. Even if we optimize the choice of $\beta$, we unfortunately do not get the right constant in front of $n^{3/2}$.
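To see the numbers, one can minimize the Chernoff/union bound $\frac 1\beta \left( n \ln 2 + \binom n2 \ln \cosh(2\beta) \right)$ over $\beta$ numerically: the optimum scales like $c \cdot n^{3/2}$ with $c = 2\sqrt{\ln 2} \approx 1.665$, while the true constant, coming from the Parisi formula, is about $1.526$ (this last value is quoted from the literature, not computed here):

```python
import numpy as np

def annealed_bound(n, beta):
    """(1/beta) * ln( 2^n * cosh(2*beta)^(n(n-1)/2) ), the union-bound estimate."""
    return (n * np.log(2) + n * (n - 1) / 2 * np.log(np.cosh(2 * beta))) / beta

n = 10_000
cs = np.linspace(0.01, 2.0, 20_000)            # try beta = c / sqrt(n)
best = min(annealed_bound(n, c / np.sqrt(n)) for c in cs)
print(best / n ** 1.5)                         # about 1.665 = 2*sqrt(ln 2)
```

So the first-moment computation is off by roughly 9 percent in the leading constant, which is exactly the slack that the replica method is meant to recover.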

So, if we call

$$Z = \sum_{x \in \{-1,+1\}^n} e^{\beta \, x^T M x}$$

we see that we lose too much if we do the step

$$\mathop{\mathbb E} \ \ln Z \ \le \ \ln \ \mathop{\mathbb E} \ Z$$

But what else can we do to get rid of the logarithm and to reduce to an expression in which we take expectations of products of independent quantities? (If we are not able to exploit the assumption that $M$ has mutually independent entries, we will not be able to make progress.)

The idea is that if $k$ is a small enough quantity (something much smaller than $1 / \ln Z$), then $Z^k$ is close to 1 and we have the approximation

$$\ln Z^k \approx Z^k - 1$$

and we obviously have

$$\ln Z = \frac 1k \ln Z^k$$

so we can use the approximation

$$\mathop{\mathbb E} \ \ln Z \ \approx \ \frac 1k \left( \mathop{\mathbb E} \ Z^k - 1 \right)$$

and

$$\mathop{\mathbb E} \ \max_x \ x^T M x \ \approx \ \frac{1}{\beta k} \left( \mathop{\mathbb E} \ Z^k - 1 \right)$$

Let's forget for a moment that we want $k$ to be a very small parameter. If $k$ was an integer, we would have

$$\mathop{\mathbb E} \ Z^k \ = \ \sum_{x^{(1)}, \ldots, x^{(k)}} \ \prod_{i < j} \cosh \left( 2 \beta \sum_{a=1}^k x^{(a)}_i x^{(a)}_j \right)$$
Note that the above expression involves sums over $k$-tuples of feasible solutions of our maximization problem. These are the “replicas” in “replica method.”
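Both the small-$k$ approximation and the loss in the convexity step can be sanity-checked by simulation on a tiny instance; all the sizes and parameters below ($n$, $\beta$, $k$, the number of samples) are arbitrary illustrative choices:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, beta, k = 5, 0.3, 1e-3
xs = np.array(list(itertools.product([-1, 1], repeat=n)))

def partition_function():
    """Z = sum_x exp(beta x^T M x) for a fresh Rademacher matrix M."""
    M = np.triu(rng.choice([-1.0, 1.0], size=(n, n)), 1)
    M = M + M.T
    return np.exp(beta * np.einsum('bi,ij,bj->b', xs, M, xs)).sum()

zs = np.array([partition_function() for _ in range(20_000)])
quenched = np.log(zs).mean()          # E ln Z: the quantity we actually want
replica = (np.mean(zs ** k) - 1) / k  # (E Z^k - 1)/k, the small-k approximation
annealed = np.log(zs.mean())          # ln E Z: the lossy convexity bound
assert abs(replica - quenched) < 0.05 # the small-k approximation is accurate
assert quenched < annealed            # while the convexity step genuinely loses something
```

The difficulty, of course, is that this numerical shortcut is only available for toy sizes; the replica method is a way of evaluating $\mathop{\mathbb E} Z^k$ analytically.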

The above expression does not look too bad, and note how we were fully able to use the independence assumption and “simplify” the expression. Unfortunately, it is actually still very hard to handle. At this point it is preferable to assume the entries of $M$ to be Gaussian, write the expectation as an integral, and do a change of variable and some tricks so that we reduce to computing the maximum of a certain function, call it $F$, whose input is a $k \times k$ matrix; we then have to guess an input of maximum value for this function. If we are lucky, the maximum is achieved by a matrix in which all entries are identical, the *replica symmetric solution*. In the Sherrington-Kirkpatrick model we don't have such luck, and the next guess is that the optimal input is a block-diagonal matrix, a *replica symmetry-breaking solution*. For a large number of replicas, and a large number of blocks, we can approximate the choice of such matrices by writing down a system of differential equations, the *Parisi equations*; we then assume that such equations do indeed describe an optimal input, and hence a solution to the integral, so that they give us a computation of $\mathop{\mathbb E} Z^k$.

After all this, we get an expression for $\mathop{\mathbb E} Z^k$ for every sufficiently large integer $k$, as a function of $k$, up to lower-order errors. What next? Remember how we wanted $k$ to be a tiny real number and not a sufficiently large integer? Well, we take the expression, we forget about the error terms, and we set $k \rightarrow 0$.

We will soon put up a call for nominations for the test of time award to be given at FOCS 2021 (which will take place in Boulder, Colorado, in early 2022). There are three award categories, recognizing, respectively, papers from FOCS 2011, FOCS 2001, and FOCS 1991. In each category, it is also possible to nominate older papers, up to four years before the target conference. For example, for the thirty-year category, it is possible to nominate papers from FOCS 1987, FOCS 1988, FOCS 1989, FOCS 1990, in addition to the target conference FOCS 1991.

Nominations should be sent by October 31, 2021 to focs.tot.2021@gmail.com with a subject line of “FOCS Test of Time Award”. Nominations should contain an explanation of the impact of the nominated paper(s), including references to follow-on work. Self-nominations are discouraged.

In the second week of November, 2021, the Simons Institute will host a workshop on using cryptographic assumptions to prove average-case hardness of problems in high-dimensional statistics. This is such a new topic that the goal of the workshop will be more to explore new directions than to review known results, and we (think that we have) already invited all the authors of recent published work of this type. If you have proved results of this type, and you have not been invited (perhaps because your results are still unpublished?) and you would like to participate in the workshop, there is still space in the schedule so feel free to contact me or one of the other organizers. For both speakers and attendees, physical participation is preferred, but remote participation will be possible.

Last year was characterized by a sudden acceleration of Bocconi's plans to develop a computer science group. Instead of the slow growth of a couple of people a year that had been planned, which would have given us, in 5-7 years, the basis to create a new department, it was decided that a new computer science department would start operating next year — perhaps as soon as February 2022, but definitely, or at least to the extent that one can make definite plans in these crazy times, by September 2022.

Consequently, we went on a hiring spree that was surprisingly successful. Five computer scientists and four statistical physicists have accepted our offers and are coming between now and next summer. In computer science, Andrea Celli (who won the NeurIPS best paper award last year) and Marek Elias started today. Andrea, who is coming from Facebook London, works in algorithmic game theory, and Marek, who is coming from TU Eindhoven, works in optimization. Within the next couple of weeks, or as soon as his visa issues are sorted out, Alon Rosen will join us from IDC Herzliya as a full professor. Readers of *in theory* may know Alon from his work on lattice-based cryptography, or his work on zero-knowledge, or perhaps his work on the cryptographic hardness of finding Nash equilibria. Two other computer science tenured faculty members are going to come, respectively, in February and September 2022, but I am not sure if their moves are public yet.

Meanwhile, I have been under-spending my ERC grant, but perhaps this is going to change and some of my readers will help me out.

If you are interested in coming to Milan for a post-doc, do get in touch with me. A call will be out in a month or so.

After twenty years in Northern California, I am still readjusting to seasonal weather. September is among Milan’s best months: the oppressive heat of the summer gives way to comfortable days and cool nights, but the days are still bright and sunny. Currently, there is no quarantine requirement or other travel restrictions for fully vaccinated international travellers. If you want to visit, this might be your best chance until Spring Break (last year we had a semi-lockdown from late October until after New Year, which might very well happen again; January and February are not Milan’s best months; March features spectacular cherry blossoms, and it is again an ok time to visit).

Piazza Duomo, in Milan, on July 11, 2021
