Online Optimization Post 7: Matrix Multiplicative Weights Update

This is the seventh in a series of posts on online optimization, where we alternate one post explaining a result from the theory of online convex optimization and one post explaining an “application” in computational complexity or combinatorics. The first two posts were about the technique of Multiplicative Weights Updates and its application to “derandomizing” probabilistic arguments based on combining a Chernoff bound and a union bound. The third and fourth post were about the Follow-the-Regularized-Leader framework, which unifies multiplicative weights and gradient descent, and a “gradient descent view” of the Frieze-Kannan Weak Regularity Lemma. The fifth and sixth post were about the constrained version of the Follow-the-Regularized-Leader framework, and the Impagliazzo Hard-Core Set Lemma. Today we shall see the technique of Matrix Multiplicative Weights Updates.

1. Matrix Multiplicative Weights Update

In this post we consider the following generalization, introduced and studied by Arora and Kale, of the “learning from expert advice” setting and the multiplicative weights update method. In the “experts” model, we have a repeated game in which, at each time step {t}, we have the option of following the advice of one of {n} experts; if we follow the advice of expert {i} at time {t}, we incur a loss of {\ell_t (i)}, which is unknown to us (although, at time {t} we know the loss functions {\ell_1(\cdot),\ldots,\ell_{t-1}(\cdot)}). We are allowed to choose a probabilistic strategy, whereby we follow the advice of expert {i} with probability {x_t(i)}, so that our expected loss at time {t} is {\sum_{i=1}^n x_t(i) \ell_t(i)}.

In the matrix version, instead of choosing an expert {i} we are allowed to choose a unit {n}-dimensional vector {v_t}, and the loss incurred in choosing the vector {v_t} is {v_t ^T L_t v_t}, where {L_t} is an unknown symmetric {n\times n} matrix. We are also allowed to choose a probabilistic strategy, so that with probability {x_t(j)} we choose the unit vector {v_t^{(j)}}, and we incur the expected loss

\displaystyle  \sum_j x_t (j) \cdot (v_t^{(j)})^T L_t v_t^{(j)}

Continue reading

Postdoc Positions

The call is out for two postdoctoral positions at Bocconi to work in my group (see below for how to apply). If you are interested and you have any questions, feel free to email me (L.Trevisan at Unibocconi dot it)

The negotiable start date is September 1st, 2022. Each position is for one year, renewable for a second. The positions offer an internationally competitive salary (up to 65,000 Euro per year, tax-free, plus relocation assistance and travel allowance), in a wonderful location that, at long last, is back to more or less normal life. The application deadline is December 17, 2021.

Among the topics that I am interested in are spectral graph theory, average-case complexity, “applications” of semidefinite programming, random processes on networks, approximation algorithms, pseudorandomness and combinatorial constructions.

Bocconi Computer Science is building up a theory group: besides me, we have Alon Rosen, Marek Elias, a tenured person that will join next Fall, and more hires are on the horizon. Now that traveling is ok again, and considering that Alon and I both have ERC grants, we should expect a big stream of theory visitors coming and going through Bocconi from week-long visits to semester or year long sabbaticals.

To apply, go to https://www.unibocconi.eu/faculty-postdoc and look for the position advertised as “BIDSA Informatics”, which looks like this:

and click on “apply online”. Currently it is the second position from the top in the list

Online Optimization Post 6: The Impagliazzo Hard-Core Set Lemma

(This is the sixth in a series of posts on online optimization techniques and their “applications” to complexity theory, combinatorics and pseudorandomness. The plan for this series of posts is to alternate one post explaining a result from the theory of online convex optimization and one post explaining an “application.” The first two posts were about the technique of multiplicative weight updates and its application to “derandomizing” probabilistic arguments based on combining a Chernoff bound and a union bound. The third and fourth post were about the Follow-the-Regularized-Leader framework, and how it unifies multiplicative weights and gradient descent, and a “gradient descent view” of the Frieze-Kannan Weak Regularity Lemma. The fifth post was about the constrained version of the Follow-the-Regularized-Leader framework, and today we shall see how to apply that to a proof of the Impagliazzo Hard-Core Lemma.)

Continue reading

ARV on Abelian Cayley Graphs

Continuing from the previous post, we are going to prove the following result: let {G} be a {d}-regular Cayley graph of an Abelian group, {\phi(G)} be the normalized edge expansion of {G}, {ARV(G)} be the value of the ARV semidefinite programming relaxation of sparsest cut on {G} (we will define it below), and {\lambda_2(G)} be the second smallest normalized Laplacian eigenvalue of {G}. Then we have

\displaystyle   \lambda_2 (G) \leq O(d) \cdot (ARV (G))^2 \ \ \ \ \ (1)

which, together with the fact that {ARV(G) \leq 2 \phi(G)} and {\phi(G) \leq \sqrt{2 \lambda_2}}, implies the Buser inequality

\displaystyle   \lambda_2 (G) \leq O(d) \cdot \phi^2 (G) \ \ \ \ \ (2)

and the approximation bound

\displaystyle   \phi(G) \leq O(\sqrt d) \cdot ARV(G) \ \ \ \ \ (3)

The proof of (1), due to Shayan Oveis Gharan and myself, is very similar to the proof by Bauer et al. of (2).

Continue reading

Hasselmann, Manabe and Parisi win 2021 Physics Nobel Prize

itscomingrome-lorenza-parisi

Today the Italian academic community, along with lots of other people, was delighted to hear that Giorgio Parisi is one of the three recipients of the 2021 Nobel Prize for Physics.

Parisi has been a giant in the area of understanding “complex” and “disordered” systems. Perhaps, his most influential contribution has been his “replica method” for the analysis of the Sherrington-Kirkpatrick model. His ideas have led to several breakthroughs in statistical physics by Parisi and his collaborators, and they have also found applications in computer science: to tight analyses on a number of questions about combinatorial optimization on random graphs, to results on random constraint satisfaction problems (including the famous connection with random k-SAT analyzed by Mezard, Parisi and Zecchina) and random error correcting codes, and to understanding the solution landscape in optimization problems arising from machine learning. Furthermore these ideas have also led to the development and analysis of algorithms.

The news was particularly well received at Bocconi, where most of the faculty of the future CS department has done work that involved the replica method. (Not to be left out, even I have recently used replica methods.)

Mezard and Montanari have written a book-length treatment on the interplay between ideas from statistical physics, algorithms, optimization, information theory and coding theory that arise from this tradition. Readers of in theory looking for a shorter exposition aimed at theoretical computer scientists will enjoy these notes posted by Boaz Barak, or this even shorter post by Boaz.

In this post, I will try to give a sense to the reader of what the replica method for the Sherrington-Kirkpatrick model looks like when applied to the average-case analysis of optimization problems, stripped of all the physics. Of course, without the physics, nothing makes any sense, and the interested reader should look at Boaz’s posts (and to references that he provides) for an introduction to the context. I did not have time to check too carefully what I wrote, so be aware that several details could be wrong.

What is the typical value of the max cut in a {G_{n,\frac 12}} random graph with {n} vertices?

Working out an upper bound using union bounds and Chernoff bound, and a lower bound by thinking about a greedy algorithm, we can quickly convince ourselves that the answer is {\frac {n^2}{4} + \Theta(n^{1.5})}. Great, but what is the constant in front of the {n^{1.5}}? This question is answered by the Parisi formula, though this fact was not rigorously established by Parisi. (Guerra proved that the formula gives an upper bound, Talagrand proved that it gives a tight bound.)

Some manipulations can reduce the question above to the following question: suppose that I pick a random {n\times n} symmetric matrix {M}, say with zero diagonal, and such that (up to the symmetry requirement) the entries are mutually independent and each entry is equally likely to be {+1} or {-1}, or perhaps each entry is distributed according to a standard normal distribution (the two versions can be proved to be equivalent), what is the typical value of

\displaystyle  \max _{x \in \{+1,1\}^n } \ \ \frac 1{n^{1.5}} x^T M x

up to {o_n(1)} additive terms,?

As a first step, we could replace the maximum with a “soft-max,” and note that, for every choice of {\beta>0}, we have

\displaystyle  \max _{x \in \{+1,1\}^n } \ \ x^T M x \leq \frac 1 \beta \log \sum_{x \in \{+1,1\}^n } e^{\beta x^T Mx}

The above upper bound gets tighter and tighter for larger {\beta}, so if we were able to estimate

\displaystyle  \mathop{\mathbb E} \log \sum_{x \in \{+1,1\}^n } e^{\beta x^T Mx}

for every {\beta} (where the expectation is over the randomness of {M}) then we would be in good shape.

We could definitely use convexity and write

\displaystyle  \mathop{\mathbb E} \max _{x \in \{+1,1\}^n } \ \ x^T M x \leq \frac 1 \beta \mathop{\mathbb E} \log \sum_{x \in \{+1,1\}^n } e^{\beta x^T Mx} \leq \frac 1 \beta \log \mathop{\mathbb E} \sum_{x \in \{+1,1\}^n } e^{\beta x^T Mx}

and then use linearity of expectation and independence of the entries of {M} to get to

\displaystyle  \leq \frac 1 \beta \log \sum_{x \in \{+1,1\}^n } \prod_{1\leq i < j\leq n} \mathop{\mathbb E} e^{2\beta M_{i,j} x_i x_j }

Now things simplify quite a bit, because for all {i<j} the expression {M_{i,j} x_i x_j}, in the Rademacher setting, is equally likely to be {+1} or {-1}, so that, for {\beta = o(1)}, we have

\displaystyle  \mathop{\mathbb E} e^{2\beta M_{i,j} x_i x_j } = cosh (2\beta) \leq 1 + O(\beta^2) \leq e^{O(\beta^2)}

and

\displaystyle  \sum_{x \in \{+1,1\}^n } \prod_{1\leq i < j\leq n} \mathop{\mathbb E} e^{2\beta M_{i,j} x_i x_j } \leq 2^n \cdot e^{O(\beta^2 n^2)}

so that

\displaystyle  \frac 1 \beta \log \sum_{x \in \{+1,1\}^n } \prod_{1\leq i < j\leq n} \mathop{\mathbb E} e^{2\beta M_{i,j} x_i x_j } \leq \frac {O(n)}{\beta} + O(\beta n^2)

which, choosing {\beta = 1/\sqrt n}, gives an {O(n^{1.5})} upper bound which is in the right ballpark. Note that this is exactly the same calculations coming out of a Chernoff bound and union bound. If we optimize the choice of {\beta} we unfortunately do not get the right constant in front of {n^{1.5}}.

So, if we call

\displaystyle  F := \sum_{x \in \{+1,1\}^n } e^{\beta x^T Mx}

we see that we lose too much if we do the step

\displaystyle  \mathop{\mathbb E} \log F \leq \log \mathop{\mathbb E} F

But what else can we do to get rid of the logarithm and to reduce to an expression in which we take expectations of products of independent quantities (if we are not able to exploit the assumption that {M} has mutually independent entries, we will not be able to make progress)?

The idea is that if {k>0} is a small enough quantity (something much smaller than {1/\log F}), then {F^k} is close to 1 and we have the approximation

\displaystyle  \log F^k \approx F^k-1

and we obviously have

\displaystyle  \log F^k = k \log F

so we can use the approximation

\displaystyle  \log F \approx \frac 1k (F^k - 1)

and

\displaystyle  \mathop{\mathbb E} \log F \approx \frac 1k (\mathop{\mathbb E} F^k - 1)

Let’s forget for a moment that we want {k} to be a very small parameter. If {k} was an integer, we would have

\displaystyle  \mathop{\mathbb E} F^k = \mathop{\mathbb E} \left( \sum_{x \in \{+1,1\}^n } e^{\beta x^T Mx} \right)^k = \sum_{x^{(1)},\ldots x^{(k)} \in \{+1,-1\}^n} \mathop{\mathbb E} e^{\beta \cdot ( x^{(1) T} M x^{(1)} + \cdots + x^{(k)T} M x^{(k)}) }

\displaystyle  = \sum_{x^{(1)},\ldots x^{(k)} \in \{+1,-1\}^n} \ \ \prod_{i< j} \ \mathop{\mathbb E} e^{2\beta M_{i,j} \cdot ( x^{(1)}_i x^{(1)}_j + \cdots + x^{(k)}_i x^{(k)}_j )}

Note that the above expression involves choices of {k}-tuples of feasible solutions of our maximization problem. These are the “replicas” in “replica method.”

The above expression does not look too bad, and note how we were fully able to use the independence assumption and “simplify” the expression. Unfortunately, it is actually still very bad. In this case it is preferable to assume the {M_{i,j}} to be Gaussian, write the expectation as an integral, do a change of variable and some tricks so that we reduce to computing the maximum of a certain function, let’s call it {G(z)}, where the input {z} is a {k \times k} matrix, and then we have to guess what is an input of maximum value for this function. If we are lucky, the maximum is equivalent by a {z} in which all entries are identical, the replica symmetric solution. In the Sherrington-Kirkpatrick model we don’t have such luck, and the next guess is that the optimal {z} is a block-diagonal matrix, or a replica symmetry-breaking solution. For large {k}, and large number of blocks, we can approximate the choice of such matrices by writing down a system of differential equations, the Parisi equations, and we are going to assume that such equations do indeed describe an optimal {z} and so a solution to the integral, and so they give as a computation of {(\mathop{\mathbb E} F^k - 1)/k}.

After all this, we get an expression for {(\mathop{\mathbb E} F^k - 1)/k} for every sufficiently large integer {k}, as a function of {k} up to lower order errors. What next? Remember how we wanted {k} to be a tiny real number and not a sufficiently large integer? Well, we take the expression, we forget about the error terms, and we set {k=0\ldots}

A Couple of Announcements

In the second week of July, 2022, there will be a summer school on algorithmic fairness at IPAM, on the UCLA campus, with Cynthia Dwork and Guy Rothblum among the lecturers. Applications (see the above link) are due by March 11, 2022.

We will soon put up a call for nominations for the test of time award to be given at FOCS 2021 (which will take place in Boulder, Colorado, in early 2022). There are three award categories, recognizing, respectively, papers from FOCS 2011, FOCS 2001, and FOCS 1991. In each category, it is also possible to nominate older papers, up to four years before the target conference. For example, for the thirty-year category, it is possible to nominate papers from FOCS 1987, FOCS 1988, FOCS 1989, FOCS 1990, in addition to the target conference FOCS 1991.

Nominations should be sent by October 31, 2021 to focs.tot.2021@gmail.com with a subject line of “FOCS Test of Time Award”. Nominations should contain an explanation of the impact of the nominated paper(s), including references to follow-on work. Self-nominations are discouraged.

In the second week of November, 2021, the Simons Institute will host a workshop on using cryptographic assumptions to prove average-case hardness of problems in high-dimensional statistics. This is such a new topic that the goal of the workshop will be more to explore new directions than to review known results, and we (think that we have) already invited all the authors of recent published work of this type. If you have proved results of this type, and you have not been invited (perhaps because your results are still unpublished?) and you would like to participate in the workshop, there is still space in the schedule so feel free to contact me or one of the other organizers. For both speakers and attendees, physical participation is preferred, but remote participation will be possible.

The Third Annual “Why am I in Italy and you are not?” post

I moved back to Italy exactly two years ago. I was looking for some change and for new challenges and, man, talk about being careful what you wish for!

Last year was characterized by a sudden acceleration of Bocconi’s plans to develop a computer science group. From planning for a slow growth of a couple of people a year until, in 5-7 years, we could have the basis to create a new department, it was decided that a new computer science department would start operating next year — perhaps as soon as February 2022, but definitely, or at least to the extent that one can make definite plans in these crazy times, by September 2022.

Consequently, we went on a hiring spree that was surprisingly successful. Five computer scientists and four statistical physicists have accepted our offers and are coming between now and next summer. In computer science, Andrea Celli (who won the NeurIPS best paper award last year) and Marek Elias started today. Andrea, who is coming from Facebook London, works in algorithmic game theory, and Marek, who is coming TU Eindhoven, works in optimization. Within the next couple of weeks, or as soon as his visa issues are sorted out, Alon Rosen will join us from IDC Herzliya as a full professor. Readers of in theory may know Alon from his work on lattice-based cryptography, or his work on zero-knowledge, or perhaps his work on the cryptographic hardness of finding Nash equilibria. Two other computer science tenured faculty members are going to come, respectively, in February and September 2022, but I am not sure if their moves are public yet.

Meanwhile, I have been under-spending my ERC grant, but perhaps this is going to change and some of my readers will help me out.

If you are interested in coming to Milan for a post-doc, do get in touch with me. A call will be out in a month or so.

After twenty years in Northern California, I am still readjusting to seasonal weather. September is among Milan’s best months: the oppressive heat of the summer gives way to comfortable days and cool nights, but the days are still bright and sunny. Currently, there is no quarantine requirement or other travel restrictions for fully vaccinated international travellers. If you want to visit, this might be your best chance until Spring Break (last year we had a semi-lockdown from late October until after New Year, which might very well happen again; January and February are not Milan’s best months; March features spectacular cherry blossoms, and it is again an ok time to visit).