Online Optimization Post 1: Multiplicative Weights

The multiplicative weights or hedge algorithm is the most well known and most frequently rediscovered algorithm in online optimization.

The problem it solves is usually described in the following language: we want to design an algorithm that makes the best possible use of the advice coming from ${n}$ self-described experts. At each time step ${t=1,2,\ldots}$, the algorithm has to decide with what probability to follow the advice of each of the experts, that is, the algorithm has to come up with a probability distribution ${x_t = (x_t(1),\ldots,x_t(n))}$ where ${x_t (i) \geq 0}$ and ${\sum_{i=1}^n x_t(i)=1}$. After the algorithm makes this choice, it is revealed that following the advice of expert ${i}$ at time ${t}$ leads to loss ${\ell_t (i)}$, so that the expected loss of the algorithm at time ${t}$ is ${\sum_{i=1}^n x_t(i) \ell_t (i)}$. A loss can be negative, in which case its absolute value can be interpreted as a profit.

After ${T}$ steps, the algorithm “regrets” that it did not just always follow the advice of the expert that, with hindsight, was the best one, so that the regret of the algorithm after ${T}$ steps is

$\displaystyle {\rm Regret}_T = \left( \sum_{t=1}^T\sum_{i=1}^n x_t(i) \ell_t(i) \right) - \left( \min_{i=1,\ldots,n} \ \ \sum_{t=1}^T \ell_t(i) \right)$

This corresponds to the instantiation of the framework we described in the previous post to the special case in which the set of feasible solutions ${K}$ is the set ${\Delta \subseteq {\mathbb R}^n}$ of probability distributions over the sample space ${\{ 1,\ldots,n\}}$ and in which the loss functions ${f_t (x)}$ are linear functions of the form ${f_t (x) = \sum_i x(i) \ell_t (i)}$. In order to bound the regret, we also have to bound the “magnitude” of the loss functions, so in the following we will assume that for all ${t}$ and all ${i}$ we have ${| \ell_t (i) | \leq 1}$, and otherwise we can scale everything by a known upper bound on ${\max_{t,i} |\ell_t |}$.

We now describe the algorithm.

The algorithm maintains at each step ${t}$ a vector of weights ${w_t = (w_t(1),\ldots,w_t(n))}$ which is initialized as ${w_1 := (1,\ldots,1)}$. The algorithm performs the following operations at time ${t}$:

• ${w_t (i) := w_{t-1} (i) \cdot e^{-\epsilon \ell_{t-1} (i) }}$
• ${x_t (i) := \displaystyle \frac {w_t (i) }{\sum_{j=1}^n w_t(j) }}$

That is, the weight of expert ${i}$ at time ${t}$ is ${e^{-\epsilon \sum_{k=1}^{t-1} \ell_k (i)}}$, and the probability ${x_t(i)}$ of following the advice of expert ${i}$ at time ${t}$ is proportional to the weight. The parameter ${\epsilon>0}$ is hardwired into the algorithm and we will optimize it later. Note that the algorithm gives higher weight to experts that produced small losses (or negative losses of large absolute value) in the past, and thus puts higher probability on such experts.

We will prove the following bound.

Theorem 1 Assuming that for all ${t}$ and ${i}$ we have ${| \ell_t(i) | \leq 1}$, for every ${0 < \epsilon < 1/2}$, after ${T}$ steps the multiplicative weight algorithm experiences a regret that is always bounded as

$\displaystyle {\rm Regret}_T \leq \epsilon \sum_{t=1}^T \sum_{i=1}^n x_t(i) \ell^2 _t (i) + \frac {\ln n}{\epsilon} \leq \epsilon T + \frac {\ln n}{\epsilon}$

In particular, if ${T > 4 \ln n}$, by setting ${\epsilon = \sqrt{\frac{\ln n}{T}}}$ we achieve a regret bound

$\displaystyle {\rm Regret}_T \leq 2 \sqrt{T \ln n}$

The more things change the more they stay the same

From a 1981 (!!) New York Times Article titled “Changing San Francisco is foreseen as a haven for wealthy and childless”:

A major reason for the exodus of the middle class from San Francisco, demographers say, is the high cost of housing, the highest in the mainland United States. Last month, the median cost of a dwelling in the San Francisco Standard Metropolitan Statistical Area was $129,000, according to the Federal Home Loan Bank Board in Washington, D.C. The comparable figure for New York, Newark and Jersey City was$90,400, and for Los Angeles, the second most expensive city, $118,400. ”This city dwarfs anything I’ve ever seen in terms of housing prices,” said Mr. Witte. Among factors contributing to high housing cost, according to Mr. Witte and others, is its relative scarcity, since the number of housing units has not grown significantly in a decade; the influx of Asians, whose first priority is usually to buy a home; the high incidence of adults with good incomes and no children, particularly homosexuals who pool their incomes to buy homes, and the desirability of San Francisco as a place to live.$129,000 in 1981 dollars is \$360,748 in 2019 dollars.

Online Optimization Post 0: Definitions

Online convex optimization deals with the following setup: we want to design an algorithm that, at each discrete time step ${t=1,2,\ldots}$, comes up with a solution ${x_t \in K}$, where ${K}$ is a certain convex set of feasible solution. After the algorithm has selected its solution ${x_t}$, a convex cost function ${f_t : K \rightarrow {\mathbb R}}$, coming from a known restricted set of admissible cost functions ${{\cal F}}$, is revealed, and the algorithm pays the loss ${f_t (x_t)}$.

Again, the algorithm has to come up with a solution without knowing what cost functions it is supposed to be optimizing. Furthermore, we will think of the sequence of cost functions ${f_1,f_2, \ldots,f_t,\ldots}$ not as being fixed in advanced and unknown to the algorithm, but as being dynamically generated by an adversary, after seeing the solutions provided by the algorithm. (This resilience to adaptive adversaries will be important in most of the applications.)

The offline optimum after ${T}$ steps is the total cost that the best possible fixed solution would have incurred when evaluated against the cost functions seen by the algorithm, that is, it is a solution to

$\displaystyle \min_{x\in K} \ \ \sum_{t=1}^T f_t (x)$

The regret after ${T}$ steps is the difference between the loss suffered by the algorithm and the offline optimum, that is,

$\displaystyle {\rm Regret}_T = \sum_{t=1}^T f_t (x_t) - \min_{x\in K} \ \ \sum_{t=1}^T f_t (x)$

The remarkable results that we will review give algorithms that achieve regret

$\displaystyle {\rm Regret}_T \leq O_{K, {\cal F}} (\sqrt T)$

that is, for fixed ${K}$ and ${{\cal F}}$, the regret-per-time-step goes to zero with the number of steps, as ${O\left( \frac 1 {\sqrt T} \right)}$. It is intuitive that our bounds will have to depend on how big is the “diameter” of ${K}$ and how large is the “magnitude” and “smoothness” of the functions ${f\in {\cal F}}$, but depending on how we choose to formalize these quantities we will be led to define different algorithms.

Online Optimization for Complexity Theorists

Last year I took some time off to study online convex optimization in some detail. The reason for doing that was similar to the reason why at some point I took time off to study spectral graph theory: it was coming up in several papers that I wanted to understand, and I felt that I was missing out by not mastering an important tool. In particular, I wanted to understand:

1. The Barak-Hardt-Kale proof of the Impagliazzo hard-core lemma.
2. The online convex optimization viewpoint on the Frieze-Kannan weak regularity lemma, on the dense model theorem of (RTTV), and on the abstract weak regularity lemma of (TTV) that were described to me by Madhur Tulsiani a few years ago. Furthermore, I wanted to see if Russel Impagliazzo’s subsequent improvements to the dense model theorem and to the abstract weak regularity lemma could be recovered from this point of view.
3. The Arora-Kale algorithms for semidefinite programming, including their nearly linear-time algorithm for approximating the Goemans-Williamson relaxation of Max Cut.
4. The meaning of the sentence “multiplicative weights and gradient descent are both special cases of follow-the-regularized-leader, using negative entropy and ${\ell_2^2}$ as regularizer, respectively.”
5. The AllenZhu-Liao-Orecchia online optimization proof of the Batson-Spielman-Srivastava sparsification result.

I am happy to say that, except for the “furthermore” part of (2), I achieved my goals. To digest this material a bit better, I came up with the rather ambitious plan of writing a series of posts, in which I would alternate between (i) explaining a notion or theorem from online convex optimization (at a level that someone learning about optimization or machine learning might find useful) and (ii) explaining a complexity-theoretic application. Now that a very intense Spring semester is almost over, I plan to get started on this plan, although it is not clear that I will see it through the end. So stay tuned for the forthcoming first episode, which will be about the good old multiplicative weights algorithm.

The Early Years of Computing in Italy

Here are in theory‘s first ever book reviews! The books are

Giorgio Garuzzo
Quando in Italia si facevano i computer

Giorgio Ausiello
The Making of a New Science

Both books talk about the early years of computing in Italy, on the industrial and academic side, respectively. They briefly intersect with the story of Olivetti’s Elea computer.

Knuth Prize to Avi Wigderson

Congratulations to the Knuth prize committee chaired by Avrim Blum for the excellent choice of awarding the 2019 Knuth prize to Avi Wigderson.

Avi has worked on all aspects of computational complexity theory, and he has had a transformative influence on the way theoretical computer science relates to pure mathematics. I will not repeat what I wrote about his work on the occasion of his 60th birthday here and here. Long-term readers of in theory will remember that I consider him as one of the saints of computational complexity.

The organizers of the coming FOCS would like me to remind you that the deadline is this Friday, and that someone, for some reason, has set up a fake submission site (on the domain aconf dot org) but the true submission site (that, to be honest, looks less legit than the fake one) is at focs19.cs.utexas.edu.

Also, the deadline to submit nominations for the inaugural FOCS test of time award is in three weeks. There will be three awards, one for papers appeared in FOCS 1985-89, one for FOCS 1995-99 and one for FOCS 2005-09.

On an unrelated note, GMW appeared in FOCS 1986 and the Nisan-Wigderson “Hardness versus randomness” paper appeared in FOCS 1988.

And now for something completely different

After 22 years in the United States, 19 of which spent in the San Francisco Bay Area, this Summer I will move to Milan to take a job at Bocconi University.

Like a certain well-known Bay Area institution, Bocconi is a private university that was endowed by a rich merchant in memory of his dead son. Initially characterized by an exclusive focus on law, economics and business, it has had for a while a high domestic recognition for the quality of teaching and, more recently, a good international profile both in teaching and research. Despite its small size, compared to Italy’s giant public universities, in 2017 Bocconi was the Italian university which had received the most ERC grants during the first ten years of existence of the European Research Council (in second place was my Alma Mater, the Sapienza University of Rome, which has about nine times more professors) (source).

About three years ago, Bocconi started planning for a move in the space of computing, in the context of their existing efforts in data science. As a first step, they recruited Riccardo Zecchina. You may remember Riccardo from his work providing a non-rigorous calculation of the threshold of random 3-SAT, his work on the “survey propagation” algorithm for SAT and other constraint satisfaction problems, as well as other work that brought statistical physics techniques to computer science. Currently, Riccardo and his group are doing very exciting work on the theory of deep learning.

Though I knew of his work, I had never met Riccardo until I attended a 2017 workshop at the Santa Fe Institute on “Thermodynamics and computation,” an invitation that I had accepted on whim, mostly based on the fact that I had never been to New Mexico and I had really liked Breaking Bad. Riccardo had just moved to Bocconi, he told me about their future plans, and he asked me if I was interested. I initially politely declined, but one thing led to another, and now here I am putting up my San Francisco house for sale.

Last August, as I was considering this move, I applied for an ERC grant from the European Union, and I just learned that the grant has been approved. This grant is approximately the same amount as the total of all the grants that I have received from the NSF over the past twenty years, and it will support several postdoc positions, as well as visitors ranging from people coming for a week to give a talk and meet with my group to a full-year sabbatical visit.

Although it’s a bit late for that, I am looking for postdocs starting as early as this September: if you are interested please contact me. The postdoc positions will pay a highly competitive salary, which will be free of Italian income tax (although American citizens will owe federal income tax to the IRS correction: American citizens would not owe anything to IRS either). As a person from Rome, I am not allowed to say good things about Milan or else I will have to return my Roman card (it’s kind of a NY versus LA thing), but I think that the allure of the city speaks for itself.

Likewise, if you are a senior researcher, and you have always wanted to visit me and work together on spectral methods, approximation algorithms, graph theory or graph algorithms, but you felt that Berkeley had insufficiently many Leonardo mural paintings and opera houses, and that it was too far from the Alps, then now you are in luck!

Tested by time

I was delighted (and not at all surprised) to hear that this year’s Turing Award will go to LeCun, Hinton, and Y. Bengio for their work on deep learning.

Like public-key cryptography, deep learning was ahead of its time when first studied, but, thanks to the pioneering efforts of its founders, it was ready to be used when the technology caught up.

Mathematical developments take a long time to mature, so it is essential that applied mathematical research be done ahead of the time of its application, that is, at a time when it is basic research. Maybe quantum computing will be the next example to teach this lesson.

By the way, this summer the Simons Institute will host a program on the foundations of deep learning, co-organized by Samy Bengio, Aleks Madry, Elchanan Mossel and Matus Telgarsky.

Sometimes, it is not just the practical applications of a mathematical advance that take time to develop: the same can be true even for its theoretical applications! Which brings me to the next announcement of this post, namely that the call for nominations for the FOCS test of time award is out. Nominations are due in about four weeks.

Average Case Complexity News

Greetings from Taiwan, a tropical country with democracy, free press, and rule of law, in which the trains run on time, and the food is awesome. Also people are friendly, drivers don’t honk, and the “close doors” buttons in elevators actually work. Talk about exceptionalism! On November 24, In Theory endorses voting No on 10, No on 11, No on 12, Yes on 13, Yes on 14 and Yes on 15.

More on Taiwan in a later post. Today I would like to give a couple of updates on the survey paper on average-case complexity theory that Andrej Bogdanov and I wrote in 2006: apologies to the readers for a number of missing references, and news on the front of worst-case to average-case reductions.