*(Image credit: The New Yorker)*

Happy New Year!

The positions have a very competitive salary and relocation benefits. Funding for travel is available.

Application information is at this link. The deadline is **December 15**. If you apply, please also send me an email (L.Trevisan at unibocconi.it) to let me know.

According to our national tradition, I like to complain, but Bocconi is really not giving me much to work with.

But enough about me, let’s talk about you. I am going to assume that you want to come to Milan, because, really, why not? Here are some ways in which this can happen:

- **You are a high school senior and you would like to study computer science in college, but you like math as well**: next academic year, Bocconi is starting a new undergraduate program on math and CS.
- **You are an undergraduate or masters student applying to PhD programs and you would like to work with me**: next academic year, Bocconi is starting a new PhD program on statistics and computer science.
- **You are a theory PhD student and you are looking for what to do next summer**: I would be interested in hosting graduate students for part of next summer, especially during the month of July. Ask your advisor to contact me if this is something that you would be interested in.
- **You are a graduating theory PhD student and you are looking for a postdoc next year**: I will have one or two openings for postdocs next year. The call for applications will be up soon. The tax-free salary will be very competitive and Bocconi has exceptionally well-functioning processes to get non-Italian-speakers settled in, help them do the immigration paperwork, look for housing, find English-speaking primary care physicians, etc.
- **You are (or might be tempted to be) on the job market for a faculty position**: in light of all the new initiatives on computing, and the current staff of one computer science professor, Bocconi would like to hire at all levels, preferably at the levels of Associate Professor or Full Professor, in computer science, especially in theory and in AI. Salaries are very competitive, they are essentially tax-free for six years for people who have not lived in Italy in the past two years, all teaching is in English, and the university makes it very easy for foreigners to settle in.
- **You are a professor and your sabbatical is coming up, or you arranged your teaching to have a semester without teaching to do some traveling**: contact me if you are interested in visiting for any length of time.

I am in Baltimore for FOCS through Tuesday morning if you want to talk to me in person.

For example, although, as of yesterday, I finally have working wired internet access in my place, I still do not have a bus card (obtaining the latter has been one of the most stubbornly intractable problems I have encountered) and all the stuff that I did not carry in two bags is still in transit in a container.

Meanwhile, the busyness of handling the move, getting settled, trying to get a bus card, and teaching two courses, has meant that I did not really have time to sit down with my thoughts and process my feelings about such a major life change. If people ask me what I miss about San Francisco I will, truthfully, say something like UberX, or Thai food, or getting a bus card from a vending machine, because I still have not had a chance to miss the bigger stuff. Similarly, this post will be about random small stuff.

Milan is lovely this time of the year. Unlike most other Italian cities, there is no well-preserved historical neighborhood: the city was devastated by WW2 bombings, and then rebuilt; in the center, majestic old buildings are scattered among generic 1960s era condos. Nonetheless, walking along the Naviglio Grande (a canal running roughly in an East-West direction flanked mostly by 100+ year old buildings) at sunset is quite beautiful.

Milan is one of the most expensive cities in Italy, but one can have an awesome sit-down multi-course lunch for 12 euros (tax included, no tip), or a sit-down coffee for one euro. In one of my new favorite dinner places, the old but energetic lady that runs the place comes to the table to offer a choice of a handful of first courses, a couple of second courses and a couple of desserts (no menu); last time I was there she charged 4 euros for two glasses of wine.

Bocconi keeps rolling out its strategy to expand into computing. After introducing an undergraduate degree in Economics, Management and Computer Science, and a masters degree in Data Science, next year will see the start of a new PhD Program in Statistics and Computer Science. The deadline to apply is in February, and I would be happy to talk to students interested in applying. The language of instruction, like for almost all degree programs at Bocconi, is English. A new undergraduate program on computing is in the final planning stages, and I will say more about it when it is official.

Between teaching two courses, getting settled, and trying to get a bus card, I have not had a lot of time for research, but the local physicists have been giving us a crash course on the replica method to study optimization problems on random instances. I plan, in the near future, to complete my series of posts on online optimization, and to write whatever I understand of the replica method. An example of what is going on:

physicist: “so the smallest eigenvalue of this nxn matrix is , which is negative, and this is a problem because it means that the matrix is not PSD”

me: “wait, what?”

physicist: “don’t forget that we are interested in the limit $n \rightarrow 0$.”

Coming here, I thought that my command of the Italian language, which I haven’t spoken on a regular basis for a while, would be almost perfect, and this has been true, except that the “almost” part has come from unexpected directions.

For example, one can find a number of places serving real (and very good) Chinese food, which was not the case at the time I grew up in Italy. However, I have no idea what Chinese dishes are called in Italian, and for the most part I don’t know what they are called in Chinese either, so I have to resort to the English menu, if available, or to guesswork. Quite reasonably, dumplings are called “ravioli” and noodles are called “pasta” or “spaghetti.” In some places, buns are called, also quite reasonably, “panini”. That dish that I like that is usually called, with much understatement, “boiled fish with preserved vegetables” in America? “Zuppa di pesce”. How do you say “bok choy”? I am still not sure.

In Italian, like in French, you can address someone in a formal or an informal way (in Italian the formal way is to use the third person) and the subtle cues of when to use which form have shifted a little with time, so I have been feeling a bit confused and I have mostly been following other people’s lead.

Bocconi is an exceptionally well run institution. This is true in many ways, from maintenance and cleaning all the way to campus strategic planning, but one of the things that most impressed me, a luxury you do not see on American campuses any more, is that they have *secretaries*. Not directors of operations, not grant administrators, not coordinators of this or that with lots of underlings, but people that you can go talk to, describe a problem you are having or something you need, and then walk away and forget about it, because they will take care of it, with lots of common sense and professionalism. This is *amazing*! The cognitive load of worrying about lots of things other than teaching and research is something that it feels just wonderful not to have.

Despite this, the Bocconi administration operates on a shoestring budget compared to American universities. How does it do it? I am collecting figures and I plan to write a separate post about it. If only the Bocconi administration were in charge of issuing bus cards for the city!

It should be noted that, while it has a lot of autonomy as a private university, Bocconi is not a sovereign entity and it is subject to Italian law. For example, I recently had to see two doctors who had to certify that I had no medical condition that would make it dangerous for me to be a professor, and the procedure to hire postdocs is a bit Byzantine (but the secretaries take care of most of it!).

In conclusion, I have had six very intense, very exciting, and very rewarding weeks and, while I do not yet have a bus card, I have a medical certificate stating that I can teach and do research. Now I am off to eat some Chinese ravioli and maybe a Chinese panino.

Dear friends,

We are happy to announce the birth of a new conference on Information-Theoretic Cryptography (ITC). Information-theoretic cryptography studies security in the presence of computationally unbounded adversaries and covers a wide array of topics at the intersection of cryptography, coding theory, information theory and theory of computation. Notable examples include randomness extraction and privacy amplification, secret sharing, secure multiparty computation and proof systems, private-information retrieval and locally decodable codes, authentication codes and non-malleable codes, differential privacy, quantum information processing, and information-theoretic foundations of physical-layer security. See https://itcrypto.github.io for more information.

ITC replaces the International Conference on Information Theoretic Security (ICITS), which was dedicated to the same topic and ran 2005-2017. ITC can be seen as a reboot of ICITS with a new name, a new steering committee and renewed excitement. (Beware: there is a fake website for ICITS 2019 created by a known fraudulent organization.)

The conference will have two tracks: a conference track and a “greatest hits” track. The conference track will operate like a traditional conference with the usual review process and published proceedings. The “greatest hits” track consists of invited talks (not included in the proceedings) that highlight the most exciting recent advances in the area. We solicit nominations for “greatest hits” talks from the community.

The first ITC conference will take place in Boston, MA on June 17-19, 2020 (just before STOC). The submission deadline for ITC 2020 is Dec 16, 2019 and the call for papers (including a nomination procedure for the greatest hits track) is available here: https://itcrypto.github.io/2020.html

Please submit your best work to ITC 2020! We hope to see many of you there!

best regards,

The Steering Committee: Benny Applebaum (Chair), Ivan Damgård, Yevgeniy Dodis, Yuval Ishai, Ueli Maurer, Kobbi Nissim, Krzysztof Pietrzak, Manoj Prabhakaran, Adam Smith, Yael Tauman Kalai, Stefano Tessaro, Vinod Vaikuntanathan, Hoeteck Wee, Daniel Wichs, Mary Wootters, Chaoping Xing, Moti Yung

This seemed like a reasonable model to balance profitability for publishers and open access, but there was no way to agree on it with Elsevier. Meanwhile, U.C. has not renewed its Elsevier subscriptions and Elsevier has cut off access to U.C. libraries.

I was very impressed to see the University of California central administration do something right, so I wondered if this was the kind of portent that is a harbinger of the apocalypse, or just a fluke. Subsequent events suggest the latter.

The University of California has spent a lot of time and money to build a centralized system for job applications and for job applicant review. I was first made aware of this when I chaired the recruiting committee for the Simons Director position. At first we were told that we could solicit applications through the (vastly superior) EECS-built system for job applications and reviews. After the application deadline passed, we were told that, in fact, we could *not* use the EECS system, and so the already overworked EECS faculty HR person had to manually copy all the data into the central campus system.

The American Mathematical Society has created a wonderfully functional system, called Mathjobs, where applicants for academic mathematics jobs (ranging from postdocs to professorships) can upload their application material once, and their recommenders can upload their letters once, and then all the universities that the candidate applies to have access to this material. Furthermore, both applicants and recommenders can tailor their material for a particular university or universities, if they want to.

Everybody was living happily, but not ever after, because the U.C. central campus administration decided that *everybody* in the University of California had to use the centralized system for *all* jobs. Both the AMS and U.C. mathematicians tried to find a reasonable accommodation, such as allowing the U.C. system to access the letters posted on Mathjobs. The campus administration's reasoned response was roughly “sucks to be you.” There is more of the story in an AMS Notices article by the chair of math at U.C. Davis.

Finally, this year U.C. Berkeley will not be listed in the US News and World Report rankings because it has submitted wrong data in the past.

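where $K \subseteq {\mathbb R}^n$ is the convex set of feasible solutions that the algorithm is allowed to produce, $x \rightarrow \langle \ell_k, x \rangle$ is the linear loss function at time $k$, and $R: K \rightarrow {\mathbb R}$ is the strictly convex regularizer.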

If we have an unconstrained problem, that is, if $K = {\mathbb R}^n$, then the optimization problem (1) has a unique solution: the $x_t$ such that

$$ \nabla R(x_t) = - \sum_{k=1}^{t-1} \ell_k $$

and we can usually both compute $x_t$ efficiently in an algorithm and reason about $x_t$ effectively in an analysis.

Unfortunately, we are almost always interested in constrained settings, and then it becomes difficult both to compute $x_t$ and to reason about it.

A very nice special case happens when the regularizer $R$ acts as a *barrier function* for $K$, that is, the (norm of the) gradient of $R$ goes to infinity when one approaches the boundary of $K$. In such a case, it is impossible for the minimum of (1) to occur at the boundary and the solution will be again the unique $x_t$ in $K$ such that

$$ \nabla R(x_t) = - \sum_{k=1}^{t-1} \ell_k $$

We swept this point under the rug when we studied FTRL with negative-entropy regularizer in the settings of experts, in which $K = \Delta$ is the set of probability distributions. When we proceeded to solve (1) using Lagrange multipliers, we ignored the non-negativity constraints. The reason why it was ok to do so was that the negative-entropy is a barrier function for the non-negative orthant $({\mathbb R}_{\geq 0})^n$.

Another important special case occurs when the regularizer $R(x) = c || x||^2$ is a multiple of length-squared. In this case, we saw that we could “decouple” the optimization problem by first solving the unconstrained optimization problem, and then projecting the solution of the unconstrained problem to $K$:

$$ y_t = \arg\min_{y \in {\mathbb R}^n} \ c || y||^2 + \sum_{k=1}^{t-1} \langle \ell_k, y \rangle $$

$$ x_t = \arg\min_{x\in K} || x - y_t || $$

Then we have the closed-form solution $y_t = - \frac 1{2c} \sum_{k=1}^{t-1} \ell_k$ and, depending on the set $K$, the projection might also have a nice closed-form, as in the case $K = [0,1]^n$ that comes up in results related to regularity lemmas.

As we will see today, this approach of solving the unconstrained problem and then projecting on $K$ works for every regularizer, for an appropriate notion of projection called the *Bregman projection* (the projection will depend on the regularizer).

To define the Bregman projection, we will first define the *Bregman divergence* with respect to the regularizer $R$, which is a non-negative “distance” $D_R(x,y)$ defined on ${\mathbb R}^n$ (or possibly a subset of ${\mathbb R}^n$ for which the regularizer $R$ is a barrier function). Then, the Bregman projection of $y$ on $K$ is defined as $\arg\min_{x\in K} D_R(x,y)$.

Unfortunately, it is not so easy to reason about Bregman projections either, but the notion of Bregman divergence offers a way to reinterpret the FTRL algorithm from another point of view, called *mirror descent*. Via this reinterpretation, we will prove the regret bound

which carries the intuition that the regret comes from a combination of the “distance” of our initial solution from the offline optimum and of the “stability” of the algorithm, that is, the “distance” between consecutive solutions. Nicely, the above bound measures both quantities using the same “distance” function.

**1. Bregman Divergence and Bregman Projection**

For a strictly convex function $R: {\mathbb R}^n \rightarrow {\mathbb R}$, we define the *Bregman divergence* associated to $R$ as

$$ D_R(x,y) := R(x) - R(y) - \langle \nabla R(y), x-y \rangle $$

that is, the difference between the value of $R$ at $x$ and the value of the linear approximation of $R$ at $x$ (centered at $y$). By the strict convexity of $R$ we have $D_R(x,y) \geq 0$ and $D_R(x,y) = 0$ iff $x=y$. These properties suggest that we may think of $D_R(x,y)$ as a kind of “distance” between $x$ and $y$, which is a useful intuition although it is important to keep in mind that the divergence need not be symmetric and need not satisfy the triangle inequality.

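Now we show that, assuming that $R$ is well defined and strictly convex on all ${\mathbb R}^n$, and that the losses are linear, the constrained optimization problem (1) can be solved by first solving the unconstrained problem and then “projecting” the solution on $K$ by finding the point in $K$ of smallest Bregman divergence from the unconstrained optimum:

$$ y_t = \arg\min_{y\in {\mathbb R}^n} \ R(y) + \sum_{k=1}^{t-1} \langle \ell_k, y \rangle $$

$$ x_t = \arg\min_{x\in K} D_R(x,y_t) $$

The proof is very simple. The optimum of the unconstrained optimization problem is the unique $y_t$ such that

$$ \nabla R(y_t) + \sum_{k=1}^{t-1} \ell_k = 0 $$

that is, the unique $y_t$ such that

$$ \nabla R(y_t) = - \sum_{k=1}^{t-1} \ell_k $$

On the other hand, $x_t$ is defined as

$$ x_t = \arg\min_{x\in K} \ R(x) - R(y_t) - \langle \nabla R(y_t), x - y_t \rangle $$

that is,

$$ x_t = \arg\min_{x\in K} \ R(x) - R(y_t) + \left\langle \sum_{k=1}^{t-1} \ell_k , x - y_t \right\rangle = \arg\min_{x\in K} \ R(x) + \sum_{k=1}^{t-1} \langle \ell_k, x \rangle $$

where the second equality above follows from the fact that two functions that differ by a constant have the same optimal solutions.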

Indeed we see that the above “decoupled” characterization of the FTRL algorithm would have worked for any definition of a function of the form

$$ D(x,y) = R(x) - \langle \nabla R(y), x \rangle + \mbox{(stuff dependent only on } y \mbox{)} $$

and that our particular choice of what “stuff dependent only on $y$” to add makes $D_R(x,x) = 0$, which is reasonable for something that we want to think of as a “distance function.”

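Note that, in all of the above, we can replace ${\mathbb R}^n$ with a convex set $S$ provided that $R$ is a barrier function for $S$. In that case

$$ y_t = \arg\min_{y\in S} \ R(y) + \sum_{k=1}^{t-1} \langle \ell_k, y \rangle $$

is the unique $y_t$ such that

$$ \nabla R(y_t) = - \sum_{k=1}^{t-1} \ell_k $$

and everything else follows analogously.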

**2. Examples**

**2.1. Bregman Divergence of Length-Squared**

If $R(x) = || x||^2$, then

$$ D_R(x,y) = || x||^2 - || y||^2 - \langle 2y, x-y \rangle = || x - y||^2 $$

so Bregman divergence is distance-squared, and Bregman projection is just (Euclidean) projection.

**2.2. Bregman Divergence of Negative Entropy**

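If, for $x \in ({\mathbb R}_{> 0})^n$, we define

$$ R(x) = \sum_{i=1}^n x_i \ln x_i $$

then the associated Bregman divergence is the generalized *KL divergence.*

$$ D_R(x,y) = \sum_{i=1}^n x_i \ln x_i - \sum_{i=1}^n y_i \ln y_i - \langle \nabla R(y), x-y \rangle $$

where $(\nabla R(y))_i = 1 + \ln y_i$ so that

$$ D_R(x,y) = \sum_{i=1}^n x_i \ln \frac{x_i}{y_i} - \sum_{i=1}^n x_i + \sum_{i=1}^n y_i $$

Note that, if $x$ and $y$ are probability distributions, then the final two terms above cancel out, leaving just the KL divergence $\sum_i x_i \ln \frac{x_i}{y_i}$.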
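The two examples can be checked numerically. The following is a minimal sketch, assuming the definition $D_R(x,y) = R(x) - R(y) - \langle \nabla R(y), x-y \rangle$, with $R(x) = ||x||^2$ for the first example and $R(x) = \sum_i x_i \ln x_i$ for the second. The helper names (`bregman_divergence`, `generalized_kl`, and so on) and the two test vectors are made up for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def bregman_divergence(R, grad_R, x, y):
    """D_R(x, y) = R(x) - R(y) - <grad R(y), x - y>."""
    diff = [a - b for a, b in zip(x, y)]
    return R(x) - R(y) - dot(grad_R(y), diff)

# Length-squared regularizer R(x) = ||x||^2 and its gradient 2x.
def sq_norm(x):
    return dot(x, x)

def grad_sq_norm(x):
    return [2.0 * a for a in x]

# Negative entropy R(x) = sum_i x_i ln x_i (coordinates must be positive).
def neg_entropy(x):
    return sum(a * math.log(a) for a in x)

def grad_neg_entropy(x):
    return [1.0 + math.log(a) for a in x]

def generalized_kl(x, y):
    """sum_i x_i ln(x_i / y_i) - sum_i x_i + sum_i y_i."""
    return sum(a * math.log(a / b) for a, b in zip(x, y)) - sum(x) + sum(y)

# Two arbitrary probability distributions used only as test points.
x = [0.2, 0.5, 0.3]
y = [0.6, 0.1, 0.3]

d_sq = bregman_divergence(sq_norm, grad_sq_norm, x, y)
d_ent = bregman_divergence(neg_entropy, grad_neg_entropy, x, y)
d_ent_reversed = bregman_divergence(neg_entropy, grad_neg_entropy, y, x)
```

Here `d_sq` agrees with the squared Euclidean distance between the two points and `d_ent` agrees with `generalized_kl(x, y)`, while `d_ent` and `d_ent_reversed` differ, which illustrates that the divergence need not be symmetric.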

**3. Mirror Descent**

We now introduce a new perspective on FTRL.

In the unconstrained setting, if $R$ is a strictly convex function and $D_R$ is the associated Bregman divergence, the *mirror descent* algorithm for online optimization has the update rule

$$ x_{t+1} = \arg\min_{x\in {\mathbb R}^n} \ D_R(x,x_t) + \langle \ell_t, x \rangle $$

The idea is that we want to find a solution that is good for the past loss functions, but that does not “overfit” too much. If, in past steps, $x_t$ had been chosen to be such a solution for the loss functions $\ell_1,\ldots,\ell_{t-1}$, then, in choosing $x_{t+1}$, we want to balance staying close to $x_t$ but also doing well with respect to $\ell_t$, hence the above definition.

**Theorem 1** Initialized with $x_1 = \arg\min_{x\in {\mathbb R}^n} R(x)$, the unconstrained mirror descent algorithm is identical to FTRL with regularizer $R$.

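*Proof:* We will proceed by induction on $t$. At $t=1$, the definition of $x_1$ is the same. For larger $t$, we know that FTRL will choose the unique $x_t$ such that $\nabla R(x_t) = - \sum_{k=1}^{t-1} \ell_k$, so we will assume that this is true for the mirror descent algorithm for $x_t$ and prove it for $x_{t+1}$.

First, we note that the function $x \rightarrow D_R(x,x_t) + \langle \ell_t, x \rangle$ is strictly convex, because it equals

$$ R(x) - R(x_t) - \langle \nabla R(x_t), x - x_t \rangle + \langle \ell_t, x \rangle $$

and so it is a sum of a strictly convex function $R(x)$, linear functions in $x$, and constants independent of $x$. This means that $x_{t+1}$ is the unique point at which the gradient of the above function is zero, that is,

$$ \nabla R(x_{t+1}) - \nabla R(x_t) + \ell_t = 0 $$

and so

$$ \nabla R(x_{t+1}) = \nabla R(x_t) - \ell_t $$

and, using the inductive hypothesis, we have

$$ \nabla R(x_{t+1}) = - \sum_{k=1}^{t} \ell_k $$

as desired.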
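As an illustration of Theorem 1, here is a small numerical sketch that assumes the negative-entropy regularizer $R(x) = \sum_i x_i \ln x_i$ on the positive orthant, where it acts as a barrier. Under that assumption both algorithms have closed forms: FTRL solves $1 + \ln x_{t,i} = - \sum_{k<t} \ell_{k,i}$, while mirror descent solves $\nabla R(x_{t+1}) = \nabla R(x_t) - \ell_t$, that is, $x_{t+1,i} = x_{t,i} e^{-\ell_{t,i}}$, starting from the minimizer $x_{1,i} = e^{-1}$ of $R$. The function names and the loss vectors are made up for the example.

```python
import math

def ftrl_iterates(losses):
    """FTRL with negative entropy: x_t solves grad R(x_t) = -sum_{k<t} l_k."""
    n = len(losses[0])
    cumulative = [0.0] * n
    iterates = []
    for loss in losses:
        iterates.append([math.exp(-1.0 - s) for s in cumulative])
        cumulative = [s + l for s, l in zip(cumulative, loss)]
    return iterates

def mirror_descent_iterates(losses):
    """Mirror descent with negative entropy: grad R(x_{t+1}) = grad R(x_t) - l_t."""
    n = len(losses[0])
    x = [math.exp(-1.0)] * n  # x_1 = argmin R, where 1 + ln x_i = 0
    iterates = []
    for loss in losses:
        iterates.append(list(x))
        x = [xi * math.exp(-li) for xi, li in zip(x, loss)]
    return iterates

# Arbitrary linear losses, chosen only for illustration.
losses = [[0.3, -0.1, 0.5], [0.0, 0.4, -0.2], [-0.6, 0.1, 0.1], [0.2, 0.2, -0.3]]
ftrl = ftrl_iterates(losses)
md = mirror_descent_iterates(losses)
max_gap = max(abs(a - b) for u, v in zip(ftrl, md) for a, b in zip(u, v))
```

Up to floating-point error, `max_gap` is zero: the two sequences of iterates coincide, as the induction in the proof predicts.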

In the constrained case, there are two variants of mirror descent. Using the terminology from Elad Hazan’s survey, *agile* mirror descent is the natural generalization of the unconstrained algorithm:

$$ x_{t+1} = \arg\min_{x\in K} \ D_R(x,x_t) + \langle \ell_t, x \rangle $$

Following the same steps as the proof in the previous section, it is possible to show that agile mirror descent is equivalent to solving, at each iteration, the “decoupled” optimization problems

$$ y_{t+1} = \arg\min_{y\in {\mathbb R}^n} \ D_R(y,x_t) + \langle \ell_t, y \rangle $$

$$ x_{t+1} = \arg\min_{x\in K} D_R(x,y_{t+1}) $$

That is, we can first solve the unconstrained problem and then project on $K$. (Again, we can always replace ${\mathbb R}^n$ by a set $S$ for which $R$ is a barrier function and such that $K \subseteq S$.)

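The *lazy* mirror descent algorithm has the update rule

$$ y_{t+1} = \arg\min_{y\in {\mathbb R}^n} \ D_R(y,y_t) + \langle \ell_t, y \rangle $$

$$ x_{t+1} = \arg\min_{x\in K} D_R(x,y_{t+1}) $$

The initialization is

$$ y_1 = \arg\min_{y\in {\mathbb R}^n} R(y) $$

$$ x_1 = \arg\min_{x\in K} D_R(x,y_1) $$

**Fact 2** Lazy mirror descent is equivalent to FTRL.

*Proof:* The solutions $y_t$ are the unconstrained optimum of FTRL, and $x_t$ is the Bregman projection of $y_t$ on $K$. We proved in the previous section that this characterizes constrained FTRL.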

What about agile mirror descent? In certain special cases it is equivalent to lazy mirror descent, and hence to FTRL, but it usually leads to a different set of solutions.
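To see the two variants diverge, here is a minimal sketch under assumptions that are not in the text above: the regularizer is $R(x) = \frac 12 ||x||^2$, so that the Bregman projection is the Euclidean projection, and the feasible set is $K = [0,1]^n$, so that the projection is coordinate-wise clipping. In this setting the unconstrained step is $y_{t+1} = y_t - \ell_t$ for the lazy variant and $y_{t+1} = x_t - \ell_t$ for the agile variant. The function names and the loss sequence are illustrative.

```python
def clip01(v):
    """Euclidean projection onto the box [0,1]^n."""
    return [min(1.0, max(0.0, a)) for a in v]

def lazy_mirror_descent(losses, n):
    """Lazy variant: the unprojected point y_t accumulates all past losses."""
    y = [0.0] * n              # y_1 = argmin R
    xs = [clip01(y)]           # x_1 = projection of y_1
    for loss in losses:
        y = [a - l for a, l in zip(y, loss)]
        xs.append(clip01(y))
    return xs

def agile_mirror_descent(losses, n):
    """Agile variant: each step restarts from the projected point x_t."""
    x = clip01([0.0] * n)
    xs = [x]
    for loss in losses:
        y = [a - l for a, l in zip(x, loss)]
        x = clip01(y)
        xs.append(x)
    return xs

# One-dimensional example: a positive loss followed by a negative one.
losses = [[0.5], [-0.5]]
lazy = lazy_mirror_descent(losses, 1)
agile = agile_mirror_descent(losses, 1)
```

After the first loss both variants sit at the boundary point 0. The lazy variant remembers that its unprojected point is at -0.5, so the second loss only brings it back to 0, while the agile variant has forgotten the overshoot and moves to 0.5.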

We will provide an analysis of agile mirror descent, but first we will give an analysis of the regret of unconstrained FTRL in terms of Bregman divergence, which will be the model on which we will build the proof for the constrained case.

**4. A Regret Bound for FTRL in Terms of Bregman Divergence**

In this section we prove the following regret bound.

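**Theorem 3** Unconstrained FTRL with regularizer $R$ satisfies the regret bound

$$ {\rm Regret}_T(x) \leq D_R(x,x_1) + \sum_{t=1}^T D_R(x_t,x_{t+1}) $$

where $D_R$ is the Bregman divergence associated with $R$.

We will take the mirror descent view of unconstrained FTRL, so that

$$ x_{t+1} = \arg\min_{x\in {\mathbb R}^n} \ D_R(x,x_t) + \langle \ell_t, x \rangle $$

We proved that

$$ \nabla R(x_{t+1}) = \nabla R(x_t) - \ell_t $$

This means that we can rewrite the regret suffered at step $t$ with respect to $x$ as

$$ \langle \ell_t, x_t - x \rangle = \langle \nabla R(x_t) - \nabla R(x_{t+1}), x_t - x \rangle = D_R(x,x_t) - D_R(x,x_{t+1}) + D_R(x_t,x_{t+1}) $$

and the theorem follows by adding up the above expression for $t=1,\ldots,T$ and recalling that $D_R(x,x_{T+1}) \geq 0$.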

Unfortunately I have no geometric intuition about the above identity, although, as you can check yourself, the algebra works neatly.
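For readers who want to check the algebra, here is one way to expand the three divergences, assuming the definition $D_R(x,y) = R(x) - R(y) - \langle \nabla R(y), x-y \rangle$ from Section 1. The values of $R$ cancel in pairs, and only inner products with the two gradients survive.

```latex
\begin{align*}
D_R(x,x_t) &= R(x) - R(x_t) - \langle \nabla R(x_t), x - x_t \rangle \\
D_R(x,x_{t+1}) &= R(x) - R(x_{t+1}) - \langle \nabla R(x_{t+1}), x - x_{t+1} \rangle \\
D_R(x_t,x_{t+1}) &= R(x_t) - R(x_{t+1}) - \langle \nabla R(x_{t+1}), x_t - x_{t+1} \rangle \\[6pt]
D_R(x,x_t) - D_R(x,x_{t+1}) + D_R(x_t,x_{t+1})
  &= - \langle \nabla R(x_t), x - x_t \rangle
     + \langle \nabla R(x_{t+1}), (x - x_{t+1}) - (x_t - x_{t+1}) \rangle \\
  &= \langle \nabla R(x_t), x_t - x \rangle - \langle \nabla R(x_{t+1}), x_t - x \rangle \\
  &= \langle \nabla R(x_t) - \nabla R(x_{t+1}), x_t - x \rangle
\end{align*}
```

The same expansion goes through verbatim if $x_{t+1}$ is replaced by any other point at which $R$ is differentiable.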

**5. A Regret Bound for Agile Mirror Descent**

In this section we prove the following generalization of the regret bound from the previous section.

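**Theorem 4** Agile mirror descent satisfies the regret bound

$$ {\rm Regret}_T(x) \leq D_R(x,x_1) + \sum_{t=1}^T D_R(x_t,y_{t+1}) $$

The first part of the update rule of agile mirror descent is

$$ y_{t+1} = \arg\min_{y\in {\mathbb R}^n} \ D_R(y,x_t) + \langle \ell_t, y \rangle $$

and, following steps that we have already carried out before, $y_{t+1}$ satisfies

$$ \nabla R(y_{t+1}) = \nabla R(x_t) - \ell_t $$

This means that we can rewrite the regret suffered at step $t$ with respect to $x$ as

$$ \langle \ell_t, x_t - x \rangle = \langle \nabla R(x_t) - \nabla R(y_{t+1}), x_t - x \rangle = D_R(x,x_t) - D_R(x,y_{t+1}) + D_R(x_t,y_{t+1}) $$

where the same mystery cancellations as before make the above identity true.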

Now I will wield another piece of magic, and I will state without proof the following fact about Bregman projections

**Lemma 5** If $x\in K$ and $z$ is the Bregman projection on $K$ of a point $y$, then

$$ D_R(x,z) + D_R(z,y) \leq D_R(x,y) $$

That is, if we think of $D_R$ as a “distance,” the distance from $y$ to its closest point $z$ in $K$ plus the distance from $z$ to $x$ is at most the distance from $y$ to $x$. Note that this goes in the opposite direction as the triangle inequality (which is ok, because $D_R$ typically does not satisfy the triangle inequality).

In particular, the above lemma gives us

$$ D_R(x,x_{t+1}) \leq D_R(x,y_{t+1}) $$

and so

$$ \langle \ell_t, x_t - x \rangle \leq D_R(x,x_t) - D_R(x,x_{t+1}) + D_R(x_t,y_{t+1}) $$

Now summing over $t$ and recalling that $D_R(x,x_{T+1}) \geq 0$ we have our theorem.
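The bound of Theorem 4 can be sanity-checked numerically. The sketch below makes assumptions that are not part of the theorem: the regularizer is $R(x) = \frac 12 ||x||^2$, so that $D_R(x,y) = \frac 12 ||x-y||^2$ and $y_{t+1} = x_t - \ell_t$, and the feasible set is the box $[0,1]^n$. The losses are pseudo-random and the comparator points are arbitrary; all names are illustrative.

```python
import random

def clip01(v):
    """Euclidean projection onto the box [0,1]^n."""
    return [min(1.0, max(0.0, a)) for a in v]

def half_sq_dist(u, v):
    """Bregman divergence of R(x) = ||x||^2 / 2."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(u, v))

def agile_regret_and_bound(losses, comparator):
    """Return (regret against comparator, D(x, x_1) + sum_t D(x_t, y_{t+1}))."""
    n = len(comparator)
    x = clip01([0.0] * n)      # x_1 = projection of argmin R onto the box
    x1 = list(x)
    regret = 0.0
    stability = 0.0
    for loss in losses:
        regret += sum(l * (a - c) for l, a, c in zip(loss, x, comparator))
        y = [a - l for a, l in zip(x, loss)]   # unconstrained step y_{t+1}
        stability += half_sq_dist(x, y)        # D(x_t, y_{t+1})
        x = clip01(y)                          # projection gives x_{t+1}
    return regret, half_sq_dist(comparator, x1) + stability

rng = random.Random(0)
losses = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(50)]
comparators = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [1.0, 0.0, 1.0], [0.5, 0.25, 0.75]]
results = [agile_regret_and_bound(losses, c) for c in comparators]
```

For every comparator in the box, the first component of the result (the regret) should be no larger than the second (the bound).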


I would like to congratulate my Taiwanese readers for being in the first Asian country to introduce same-sex marriage.

The Szemeredi Regularity Lemma states (in modern language) that every dense graph is well approximated by a graph with a very simple structure, made of the (edge-disjoint) union of a constant number of weighted complete bipartite subgraphs. The notion of approximation is a bit complicated to describe, but it enables the proof of *counting lemmas*, which show that, for example, the number of triangles in the original graph is well approximated by the (appropriately weighted) number of triangles in the approximating graph.

Analogous regularity lemmas, in which an arbitrary object is approximated by a low-complexity object, have been proved for hypergraphs, for subsets of abelian groups (for applications to additive combinatorics), in an analytic setting (for applications to graph limits) and so on.

The *weak regularity lemma* of Frieze and Kannan provides, as the name suggests, a weaker kind of approximation than the one promised by Szemeredi’s lemma, but one that is achievable with a graph that has a much smaller number of pieces. If $\epsilon$ is the “approximation error” that one is willing to tolerate, Szemeredi’s lemma constructs a graph that is the union of $2^{2^{\cdot^{\cdot^{\cdot}}}}$ weighted complete bipartite subgraphs where the height of the tower of exponentials is polynomial in $1/\epsilon$. In the Frieze-Kannan construction, that number is cut down to a single exponential $2^{O(1/\epsilon^2)}$. This result too can be generalized to graph limits, subsets of groups, and so on.

With Tulsiani and Vadhan, we proved an abstract version of the Frieze-Kannan lemma (which can be applied to graphs, functions, distributions, etc.) in which the “complexity” of the approximation is $O(1/\epsilon^2)$. In the graph case, the approximating graph is still the union of $2^{O(1/\epsilon^2)}$ complete bipartite subgraphs, but it has a more compact representation. One consequence of this result is that for every high-min-entropy distribution $D$, there is an efficiently samplable distribution $M$ with the same min-entropy as $D$, that is indistinguishable from $D$. Such a result could be taken to be a proof that what GANs attempt to achieve is possible in principle, except that our result requires an unrealistically high entropy (and we achieve “efficient samplability” and “indistinguishability” only in a weak sense).

All these results are proved with a similar strategy: one starts from a trivial approximator, for example the empty graph, and then repeats the following iteration: if the current approximator achieves the required approximation, then we are done; otherwise take a counterexample, and modify the approximator using the counterexample. Then one shows that:

- The number of iterations is bounded, by keeping track of an appropriate potential function;
- The “complexity” of the approximator does not increase too much from iteration to iteration.

Typically, the number of iterations is $O(1/\epsilon^2)$, and the difference between the various results is given by whether at each iteration the “complexity” increases exponentially, or by a multiplicative factor, or by an additive term.

Like in the post on pseudorandom constructions, one can view such constructions as an online game between a “builder” and an “inspector,” except that now the online optimization algorithm will play the role of the builder, and the inspector is the one acting as an adversary. The $O(1/\epsilon^2)$ bound on the number of rounds comes from the fact that the online optimization algorithms that we have seen so far achieve amortized error per round $O(1/\sqrt T)$ after $T$ rounds, so it takes $O(1/\epsilon^2)$ rounds for the error bound to go below $\epsilon$.

We will see that the abstract weak regularity lemma of my paper with Tulsiani and Vadhan (and hence the graph weak regularity lemma of Frieze and Kannan) can be immediately deduced from the theory developed in the previous post.

When I was preparing these notes, I was asked by several people if the same can be done for Szemeredi’s lemma. I don’t see a natural way of doing that. For such results, one should maybe use the online optimization techniques as a guide rather than as a black box. In general, iterative arguments (in which one constructs an object through a series of improvements) require the choice of a potential function, and an argument about how much the potential function changes at every step. The power of the FTRL method is that it creates the potential function and a big part of the analysis automatically and, even where it does not work directly, it can serve as an inspiration.

One could imagine a counterfactual history in which people first proved the weak regularity lemma using online optimization out of the box, as we do in this post, and then decided to try and use an L2 potential function and an iterative method to get the Szemeredi lemma, subsequently trying to see what happens if the potential function is entropy, thus discovering Jacob Fox’s major improvement on the “triangle removal lemma,” which involves the construction of an approximator that just approximates the number of triangles.

**1. A “vanilla” weak regularity lemma**

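Frieze and Kannan proved the following basic result about graph approximations, which has a number of algorithmic applications. If $V$ is a set of vertices which is understood from the context, and $A,B \subseteq V$ are disjoint subsets of vertices, then let $K_{A,B} = {\bf 1}_A \cdot {\bf 1}_B^T$, that is, the boolean matrix such that $K_{A,B}(i,j) = 1$ iff $i\in A$ and $j\in B$.

The *cut norm* of a matrix $M$ is

$$ || M ||_{\square} := \max_{A,B} \left| \sum_{i\in A, j\in B} M_{i,j} \right| $$

In the following we will identify a graph with its adjacency matrix.

**Theorem 1** Let $G=(V,E)$ be a graph on $n$ vertices and $\epsilon > 0$ be an approximation parameter. Then there are sets $A_1,B_1,\ldots,A_T,B_T$ and scalars $\alpha_1,\ldots,\alpha_T$, where $T \leq O(1/\epsilon^2)$, such that if we define

$$ H := \sum_{i=1}^T \alpha_i K_{A_i,B_i} $$

we have

$$ || G - H ||_{\square} \leq \epsilon n^2 $$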

We will prove the following more general version.

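**Theorem 2** Let $X$ be a set, $g: X \rightarrow [0,1]$ be a bounded function, ${\cal F}$ be a family of functions mapping $X$ to $[0,1]$ and $\epsilon$ be an approximation parameter. Then there are functions $f_1,\ldots,f_T$ in ${\cal F}$ and scalars $\alpha_1,\ldots,\alpha_T$, with $T = O(1/\epsilon^2)$, such that if we define

$$ h := \sum_{i=1}^T \alpha_i f_i $$

we have

$$ \forall f\in {\cal F}: \ \ | \langle f, g - h \rangle | \leq \epsilon |X| $$

We could also, with the same proof, argue about a possibly infinite set $X$ with a measure $\mu$ such that $\mu(X)$ is finite, and, after defining the inner product

$$ \langle f, g \rangle := \int_X f \cdot g \ d\mu $$

we could prove the same conclusion of the theorem, with $\epsilon \cdot \mu(X)$ instead of $\epsilon |X|$ as an error bound.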

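Here is the proof: run the FTRL algorithm with L2-squared regularizer in the setup in which the space of solutions $K$ is the set of all functions $X \rightarrow {\mathbb R}$ and the loss functions are linear. Every time the algorithm proposes a solution $h_t$, if there is a function $f \in {\cal F}$ such that either $\langle h_t - g, f \rangle > \epsilon |X|$ or $\langle h_t - g, f \rangle < - \epsilon |X|$, the adversary will pick, respectively, $f$ or $-f$ as a loss function $\ell_t$. When the adversary has no such choice, we stop and the function $h_t$ is our desired approximation.

First of all, let us analyze the number of rounds. Here the maximum norm of the functions in ${\cal F}$ is $\sqrt{|X|}$, so after $T$ rounds we have the regret bound

$$ {\rm Regret}_T(x) \leq |X| \cdot \sqrt{2T} \qquad \mbox{for every } x \mbox{ such that } || x||^2 \leq |X| $$

Now let us consider $g$ to be our offline solution: we have

$$ \epsilon |X| \cdot T < \sum_{t=1}^T \langle \ell_t, h_t - g \rangle \leq |X| \cdot \sqrt{2T} $$

which implies

$$ T < \frac 2{\epsilon^2} $$

Finally, recall that

$$ h_T = - \frac 1{2c} \sum_{t=1}^{T-1} \ell_t $$

where $c$ is the scaling constant in the definition of the regularizer ($1/c$ is of order of $\epsilon$ when $T$ is order of $1/\epsilon^2$), and so our final approximator $h_T$ computed at the last round is a weighted sum of functions from ${\cal F}$.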
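The procedure in the proof can be sketched in a few lines. This is an illustration under stated assumptions, not the exact parameter choice of the analysis above: the set $X$ is $\{0,\ldots,|X|-1\}$, functions are stored as lists of values, the family is given as an explicit list, and the FTRL step is fixed to $\epsilon/2$ (that is, $c = 1/\epsilon$ in a regularizer $c||h||^2$). With this step each violated test decreases $||h-g||^2$ by more than $\frac 34 \epsilon^2 |X|$, so fewer than $2/\epsilon^2$ updates can occur. The names `weak_regularity`, `family` and the toy instance are hypothetical.

```python
def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

def weak_regularity(g, family, eps):
    """Return (h, terms) with h = sum of alpha * family[index] over terms and
    |<f, g - h>| <= eps * |X| for every f in the family."""
    size = len(g)
    step = eps / 2.0                      # assumed step 1/(2c) with c = 1/eps
    h = [0.0] * size                      # FTRL starts from the zero function
    terms = []
    max_rounds = int(2.0 / eps ** 2) + 1  # more rounds than the analysis allows
    for _ in range(max_rounds):
        gap = [a - b for a, b in zip(h, g)]
        choice = None
        for index, f in enumerate(family):
            correlation = inner(gap, f)
            if abs(correlation) > eps * size:
                # the adversary plays f or -f, whichever has positive loss gap
                choice = (index, 1.0 if correlation > 0 else -1.0)
                break
        if choice is None:
            return h, terms
        index, sign = choice
        # FTRL with a squared-norm regularizer: subtract step * loss function
        h = [a - step * sign * b for a, b in zip(h, family[index])]
        terms.append((-step * sign, index))
    raise RuntimeError("no approximator found within the round budget")

# Toy instance: g is an indicator function, the family holds four indicators.
g = [1, 1, 1, 0, 0, 1, 0, 0]
family = [
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 0, 0, 0],
    [1, 0, 1, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 1, 1, 1, 1],
]
eps = 0.1
h, terms = weak_regularity(g, family, eps)
```

The returned `terms` list records the scalars $\alpha_i$ and the indices of the functions $f_i$, so that `h` can be rebuilt as the weighted sum promised by Theorem 2.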

**2. The weak regularity lemma**

Frieze and Kannan’s weak regularity lemma has the following form.

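**Theorem 3** Let $G=(V,E)$ be a graph on $n$ vertices and $\epsilon > 0$ be an approximation parameter. Then there is a partition of $V$ into $k = 2^{O(1/\epsilon^2)}$ sets $S_1,\ldots,S_k$, and there are bounded weights $0 \leq p_{i,j} \leq 1$ for $i,j \in \{1,\ldots,k\}$ such that if we define the weighted graph $H$ where the weight of the edge $(u,v)$ in $H$ is $p_{i,j}$, where $u\in S_i$ and $v\in S_j$, then we have

$$ || G - H ||_{\square} \leq \epsilon n^2 $$

Notice that if we did not require the weights to be between 0 and 1 then the result of the previous section can also be cast in the above language, because we can take the partition to be the “Sigma-algebra generated by” the sets $A_1,B_1,\ldots,A_T,B_T$.

For a scalar $z \in {\mathbb R}$, let $\tau(z)$ be defined as

$$ \tau(z) = \left\{ \begin{array}{ll} 0 & \mbox{if } z < 0 \\ z & \mbox{if } 0 \leq z \leq 1 \\ 1 & \mbox{if } z > 1 \end{array} \right. $$

where $\tau$ stands for *t*runcation. Note that $\tau(z)$ is the L2 projection of $z$ on $[0,1]$.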

Theorem 3 is a special case of the following result, proved in our paper with Tulsiani and Vadhan.

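**Theorem 4** Let $X$ be a set, $g: X \rightarrow [0,1]$ be a bounded function, ${\cal F}$ be a family of functions mapping $X$ to $[0,1]$ and $\epsilon$ be an approximation parameter. Then there are functions $f_1,\ldots,f_T$ in ${\cal F}$ and scalars $\alpha_1,\ldots,\alpha_T$, with $T = O(1/\epsilon^2)$, such that if we define

$$ h := \tau \left( \sum_{i=1}^T \alpha_i f_i \right) $$

we have

$$ \forall f\in {\cal F}: \ \ | \langle f, g - h \rangle | \leq \epsilon |X| $$

To prove Theorem 4 we play the same online game as in the previous section: the online algorithm proposes a solution $h_t$; if $\forall f\in {\cal F}: | \langle f, g - h_t \rangle | \leq \epsilon |X|$ then we stop and output $h = h_t$, otherwise we let the loss function be a function $\ell_t$ such that either $\ell_t$ or $-\ell_t$ is in ${\cal F}$ and

$$ \langle \ell_t, h_t - g \rangle > \epsilon |X| $$

The only difference is that we use the FTRL algorithm with L2 regularizer that has the set of feasible solutions $K$ defined to be the set of all functions $h: X \rightarrow [0,1]$ rather than the set of all functions $h: X \rightarrow {\mathbb R}$. Then each function $h_t$ is the projection to $K$ of $- \frac 1{2c} \sum_{k=1}^{t-1} \ell_k$, and the projection to $K$ is just composition with $\tau$. The bound on the number of steps is the same as the one in the previous section.

Looking at the case in which $X$ is the set of edges of a clique on $V$, ${\cal F}$ is the set of graphs of the form $K_{A,B}$, and considering the Sigma-algebra generated by $A_1,B_1,\ldots,A_T,B_T$ gives Theorem 3 from Theorem 4.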

**3. Sampling High-Entropy Distributions**

Finally we discuss the application to sampling high-entropy distributions.

Suppose that $D$ is a distribution over $\{0,1\}^n$ of min-entropy $\geq n-d$, meaning that for every $x\in \{0,1\}^n$ we have

$$ D(x) \leq 2^{-(n-d)} $$

where we think of the *entropy deficiency* $d$ as being small, such as a constant or $O(\log n)$

Let be a class of functions that we think of as being “efficient.” For example, could be the set of all functions computable by circuits of size for some size bound , such as, for example . We will assume that is in . Define

to be a bounded function . Fix an approximation parameter .

Then from Theorem 4 we have that there are functions , and scalars , all equal to for a certain parameter , such that if we define

Now define the probability distribution

Applying (1) to the case , we have

and we know that , so

and we can rewrite (2) as

and, finally

that is

which says that and are -indistinguishable by functions in . If we chose , for example, to be the class of functions computable by circuits of size , then and are -indistinguishable by circuits of size .

But $H$ is also samplable in a relatively efficient way using rejection sampling: pick a random $z \in \{0,1\}^n$, then output $z$ with probability $h(z)$ and fail with probability $1-h(z)$. Repeat the above until the procedure does not fail. At each step, the probability of success is $2^{-n} \sum_x h(x) \geq 2^{-d} - \epsilon$, so, assuming (because otherwise all of the above makes no sense) that, say, $\epsilon < 2^{-d-1}$, the procedure succeeds on average in at most $2^{d+1}$ attempts. And if each $f_t$ is computable by a circuit of size $S$, then $h$ is computable by a circuit of size $O(\epsilon^{-2} \cdot S)$.
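The rejection-sampling loop described above can be sketched in a few lines of Python. This is only an illustration: `toy_h` is a hypothetical stand-in for the clamped sum $h$ (any function with values in $[0,1]$ would do), and the names `rejection_sample` and `max_attempts` are not from the original text.

```python
import random

def rejection_sample(h, n, rng, max_attempts=100_000):
    """Draw one sample from H(x) = h(x) / sum_y h(y) over n-bit strings.

    Returns the accepted string together with the number of attempts used.
    """
    for attempt in range(1, max_attempts + 1):
        z = tuple(rng.randrange(2) for _ in range(n))  # uniform n-bit string
        if rng.random() < h(z):                        # accept with probability h(z)
            return z, attempt
    raise RuntimeError("no sample accepted within max_attempts")

def toy_h(z):
    # Hypothetical stand-in for the clamped sum h; values must lie in [0, 1].
    return 1.0 if z[0] == 1 else 0.25

rng = random.Random(0)
draws = [rejection_sample(toy_h, 4, rng) for _ in range(4000)]
mean_attempts = sum(a for _, a in draws) / len(draws)
frac_first_bit_one = sum(z[0] for z, _ in draws) / len(draws)
print(mean_attempts, frac_first_bit_one)
```

For this toy choice the acceptance probability is $10/16 = 0.625$, so one expects about $1.6$ attempts per sample, and $H$ puts mass $0.8$ on strings whose first bit is 1; the empirical values should be close to these.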

The undesirable features of this result are that the complexity of sampling and the quality of indistinguishability depend exponentially on the randomness deficiency, and the sampling circuit is a non-uniform circuit that it’s not clear how to construct without advice. Impagliazzo’s recent results address both these issues.


Furthermore, it is not clear how we would generalize the ideas of multiplicative weights to the case in which the set of feasible solutions $K$ is anything other than the set of distributions.

Today we discuss the *“Follow the Regularized Leader”* method, which provides a framework to design and analyze online algorithms in a versatile and well-motivated way. We will then see how we can “discover” the definition and analysis of multiplicative weights, and how to “discover” another online algorithm which can be seen as a generalization of projected gradient descent (that is, one can derive the projected gradient descent algorithm and its analysis from this other online algorithm).

**1. Follow The Regularized Leader**

We will first state some results in full generality, making no assumptions on the set $K$ of feasible solutions or on the set of loss functions $f_t : K \rightarrow \mathbb{R}$ encountered by the algorithm at each step.

Let us try to define an online optimization algorithm from scratch. The solution $x_t$ proposed by the algorithm at time $t$ can only depend on the previous cost functions $f_1,\ldots,f_{t-1}$; how should it depend on them? If the offline optimal solution $x^*$ is consistently better than all others at each time step, then we would like $x_t$ to be that solution, so we want $x_t$ to be a solution that would have worked well in the previous steps. The most extreme way of implementing this idea is the *Follow the Leader* algorithm (abbreviated FTL), in which we set the solution at time $t$

$$ x_t := \arg\min_{x \in K} \ \sum_{k=1}^{t-1} f_k(x) $$

to be the best solution for the previous steps. (Note that the algorithm does not prescribe what solution $x_1$ to use at step $t=1$.)

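It is possible for FTL to perform very badly. Consider for example the “experts” setting in which we analyzed multiplicative weights: the set of feasible solutions $K$ is the set $\Delta \subseteq \mathbb{R}^n$ of probability distributions over $\{1,\ldots,n\}$, and the cost functions are linear $f_t(x) = \langle \ell_t, x \rangle$ with coefficients $0 \leq \ell_t(i) \leq 1$. Suppose that $n=2$ and that $x_1 = (0.5,0.5)$. Then a possible run of the algorithm could be:

- $x_1 = (0.5,0.5)$, $\ell_1 = (0,0.5)$
- $x_2 = (1,0)$, $\ell_2 = (1,0)$
- $x_3 = (0,1)$, $\ell_3 = (0,1)$
- $x_4 = (1,0)$, $\ell_4 = (1,0)$
- $x_5 = (0,1)$, $\ell_5 = (0,1)$

In which, after $T$ steps, the algorithm suffers a loss of $T - O(1)$ while the offline optimum is $T/2 + O(1)$. Thus, the regret is about $T/2$, which compares very unfavorably to the $O(\sqrt T)$ regret of the multiplicative weights algorithm. For general $n$, a similar example shows that the regret of FTL can be as high as about $T \cdot \left( 1 - \frac 1n \right)$.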
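To see the bad behavior concretely, here is a small Python simulation of FTL in the experts setting with two experts. It is a sketch under stated assumptions: the loss sequence is one in which the loss vectors alternate between $(1,0)$ and $(0,1)$ after a first step with loss $(0,0.5)$, ties are broken toward the lowest index (no ties actually occur in this run), and the function name `follow_the_leader` is made up for the illustration.

```python
def follow_the_leader(losses, x1):
    """Run FTL over the simplex with linear losses.

    Returns (total loss of the algorithm, loss of the best fixed expert).
    """
    n = len(x1)
    cumulative = [0.0] * n
    x = list(x1)
    total = 0.0
    for loss in losses:
        total += sum(l * xi for l, xi in zip(loss, x))
        cumulative = [c + l for c, l in zip(cumulative, loss)]
        # A linear function over the simplex is minimized at a vertex,
        # so FTL puts all the mass on the expert with least cumulative loss.
        leader = min(range(n), key=lambda i: cumulative[i])
        x = [1.0 if i == leader else 0.0 for i in range(n)]
    return total, min(cumulative)

T = 1000
losses = [(0.0, 0.5)] + [
    (1.0, 0.0) if t % 2 == 0 else (0.0, 1.0) for t in range(2, T + 1)
]
alg_loss, best_loss = follow_the_leader(losses, (0.5, 0.5))
regret = alg_loss - best_loss
print(alg_loss, best_loss, regret)
```

With $T = 1000$ the algorithm loses $999.25$ while the best expert loses $499.5$, so the regret is $499.75$, that is, about $T/2$.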

In the above bad example, the algorithm keeps “overfitting” to the past history: if an expert is a bit better than the others, the algorithm puts all its probability mass on that expert, and the algorithm keeps changing its mind at every step. Interestingly, this is the only failure mode of the algorithm.

**Theorem 1 (Analysis of FTL)** For any sequence of cost functions $f_1,\ldots,f_T$ and any number of time steps $T$, the FTL algorithm satisfies the regret bound

$$ \mathrm{Regret}_T \leq \sum_{t=1}^T \left( f_t(x_t) - f_t(x_{t+1}) \right) $$

So that if the functions $f_t$ are Lipschitz with respect to a distance function on $K$, then the only way for the regret to be large is for $x_t$ to typically be far, in that distance, from $x_{t+1}$.

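*Proof:* Recalling the definition of regret,

$$ \mathrm{Regret}_T = \sum_{t=1}^T f_t(x_t) - \min_{x\in K} \sum_{t=1}^T f_t(x) $$

the theorem is claiming that, for every $T$,

$$ \sum_{t=1}^T f_t(x_{t+1}) \leq \min_{x\in K} \sum_{t=1}^T f_t(x) \ \ \ \ \ (1) $$

We will prove (1) by induction. The base case $T=1$ is just the definition of $x_2$. Assuming that (1) is true up to $T$ we have

$$ \sum_{t=1}^{T+1} f_t(x_{t+1}) = f_{T+1}(x_{T+2}) + \sum_{t=1}^{T} f_t(x_{t+1}) \leq f_{T+1}(x_{T+2}) + \sum_{t=1}^{T} f_t(x_{T+2}) = \min_{x\in K} \sum_{t=1}^{T+1} f_t(x) $$

where the last step is the definition of $x_{T+2}$ and the middle step follows from the use of the inductive assumption, which gives

$$ \sum_{t=1}^{T} f_t(x_{t+1}) \leq \min_{x\in K} \sum_{t=1}^{T} f_t(x) \leq \sum_{t=1}^{T} f_t(x_{T+2}) $$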

The above example and analysis suggest that we should modify FTL in such a way that the choices of the algorithm don’t change too much from step to step, and that the solution at time $t$ should be a compromise between optimizing with respect to previous cost functions and not changing too much from step to step.

In order to do this, we introduce a new function $R: K \rightarrow \mathbb{R}$, called a *regularizer* (more on it later), and, at each step, we compute the solution

$$ x_t := \arg\min_{x\in K} \ R(x) + \sum_{k=1}^{t-1} f_k(x) $$

This algorithm is called *Follow the Regularized Leader* or FTRL. Typically, the function $R$ is chosen to be strictly convex and to take values that are rather big in magnitude. Then $x_1$ will be the unique minimum of $R$ and, at each subsequent step, $x_t$ will be selected in a way to balance the pull toward the minimum of $R$ and the pull toward the FTL solution $\arg\min_{x\in K} \sum_{k=1}^{t-1} f_k(x)$. In particular, if $R$ is large in magnitude compared to each $f_t$, the solution will not change too much from step to step.

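We have the following analysis that makes no assumptions on $K$, on the cost functions and on the regularizer (not even that the regularizer is convex).

**Theorem 2 (Analysis of FTRL)** For every sequence of cost functions $f_1,\ldots,f_T$ and every regularizer function $R$, the regret after $T$ steps of the FTRL algorithm is bounded as follows: for every $x \in K$,

$$ \mathrm{Regret}_T(x) \leq \left( \sum_{t=1}^T f_t(x_t) - f_t(x_{t+1}) \right) + R(x) - R(x_1) $$

where

$$ \mathrm{Regret}_T(x) := \sum_{t=1}^T \left( f_t(x_t) - f_t(x) \right) $$

*Proof:* Let us run for $T$ steps the FTRL algorithm with regularizer $R$ and cost functions $f_1,\ldots,f_T$, and call $x_1,\ldots,x_{T+1}$ the solutions computed by the FTRL algorithm.

Now consider the following mental experiment: we run the FTL algorithm for $T+1$ steps, with the sequence of cost functions $R, f_1,\ldots,f_T$, and we use $x_1$ as a first solution. Then we see that the solutions computed by the FTL algorithm will be precisely $x_1, x_1, x_2, \ldots, x_T$. The regret bound for FTL implies that, for every $x \in K$,

$$ R(x_1) - R(x) + \sum_{t=1}^T \left( f_t(x_t) - f_t(x) \right) \leq R(x_1) - R(x_1) + \sum_{t=1}^T \left( f_t(x_t) - f_t(x_{t+1}) \right) $$

and the theorem follows by rearranging.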

Having established these results, the general recipe to solve an online optimization problem will be to find a regularizer function $R$ such that the minimum of $R$ “pulls away from” solutions that would make the FTL algorithm overfit, and such that there is a good balance between how big $R$ gets over $K$ (because we pay $R(x^*) - R(x_1)$ in the regret, where $x^*$ is the offline optimum) and how stable is the minimum of $R(x) + \sum_{k=1}^{t} f_k(x)$ as $t$ varies.

**2. Negative-Entropy Regularization**

Let us consider again the “experts” setting, that is, the online optimization setup in which $K$ is the set $\Delta$ of probability distributions over $\{1,\ldots,n\}$ and the cost functions $f_t(x) = \langle \ell_t, x \rangle$ are linear with bounded coefficients $0 \leq \ell_t(i) \leq 1$.

The example above showed that FTL will tend to put all the probability mass on one expert. We would like to choose a regularizer that fights this tendency by penalizing “concentrated” distributions and favoring “spread-out” distributions. This observation might trigger the thought that the *entropy* of a distribution is a good measure of how concentrated or spread out it is, although the entropy is actually higher for spread-out distributions and smaller for concentrated ones. So we will use as a regularizer *minus the entropy*, multiplied by an appropriate scaling factor $c > 0$:

$$ R(x) := c \cdot \sum_{i=1}^n x_i \ln x_i $$

(Entropy is usually defined using logarithms in base 2, but using natural logarithms will make it cleaner to take derivatives, and it only affects the constant factor $c$.) With this choice of regularizer, we have

$$ x_t = \arg\min_{x\in \Delta} \ \left( \sum_{k=1}^{t-1} \langle \ell_k , x \rangle \right) + c \cdot \sum_{i=1}^n x_i \ln x_i $$

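To compute the minimum of the above function we will use the method of Lagrange multipliers. Specialized to our setting, the method of Lagrange multipliers states that if we want to solve the constrained minimization problem

$$ \min_{x \in \mathbb{R}^n : \ \langle a, x \rangle = b} \ f(x) $$

we introduce a new parameter $\lambda$ and define the function

$$ g(x,\lambda) := f(x) + \lambda \cdot ( \langle a, x \rangle - b ) $$

Then it is possible to prove that if $x^*$ is a feasible minimizer of $f$, then there is at least a value of $\lambda$ such that $\nabla_x g(x^*,\lambda) = 0$, that is, such that $x^*$ is a stable point of $g(\cdot,\lambda)$. So one can proceed by finding all $(x,\lambda)$ such that $\nabla_x g(x,\lambda) = 0$ and then filtering out the values of $x$ such that $\langle a, x \rangle \neq b$, and finally looking at which of the remaining $x$ minimizes $f$.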

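Ignoring for a moment the non-negativity constraints, the constraint $x \in \Delta$ reduces to $\sum_i x_i = 1$, so we have to consider the function

$$ \left( \sum_{k=1}^{t-1} \langle \ell_k , x \rangle \right) + c \cdot \left( \sum_{i=1}^n x_i \ln x_i \right) + \lambda \cdot \left( \sum_{i=1}^n x_i - 1 \right) $$

The partial derivative of the above expression with respect to $x_i$ is

$$ \left( \sum_{k=1}^{t-1} \ell_k(i) \right) + c \cdot ( 1 + \ln x_i ) + \lambda $$

If we want the gradient to be zero then we want all the above expressions to be zero, which translates to

$$ x_i = \exp \left( - 1 - \frac{\lambda}{c} - \frac 1c \sum_{k=1}^{t-1} \ell_k(i) \right) $$

There is only one value of $\lambda$ that makes the above solution a probability distribution (and the solution is then automatically non-negative), and the corresponding solution is

$$ x_t(i) = \frac{ \exp \left( - \frac 1c \sum_{k=1}^{t-1} \ell_k(i) \right) }{ \sum_{j=1}^n \exp \left( - \frac 1c \sum_{k=1}^{t-1} \ell_k(j) \right) } $$

Notice that this is exactly the solution computed by the multiplicative weights algorithm, if we choose $c = 1/\epsilon$, where $\epsilon$ is the parameter of the multiplicative update $w_{t+1}(i) = w_t(i) \cdot e^{-\epsilon \ell_t(i)}$. So we have “rediscovered” the multiplicative weights algorithm and we have also “explained” what it does: at every step it balances the goals of finding a solution that is good for the past and that has large entropy.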
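As a sanity check of the closed form, the following Python sketch compares the entropy-regularized FTRL solution with iteratively updated multiplicative weights, and also checks numerically that the closed form beats a few other distributions on the regularized objective. The function names, the sample loss vectors and the value $c = 2$ are arbitrary choices made for the illustration, and the multiplicative update is assumed to be of the form $w_{t+1}(i) = w_t(i) e^{-\epsilon \ell_t(i)}$ with $\epsilon = 1/c$.

```python
import math

def ftrl_entropy(past_losses, n, c):
    """Closed form: x(i) proportional to exp(-(cumulative loss of i) / c)."""
    totals = [sum(loss[i] for loss in past_losses) for i in range(n)]
    shift = min(totals)  # a common shift cancels out after normalization
    weights = [math.exp(-(s - shift) / c) for s in totals]
    z = sum(weights)
    return [w / z for w in weights]

def multiplicative_weights(past_losses, n, eps):
    """Iterative update w(i) <- w(i) * exp(-eps * loss(i)), then normalize."""
    w = [1.0] * n
    for loss in past_losses:
        w = [wi * math.exp(-eps * li) for wi, li in zip(w, loss)]
    z = sum(w)
    return [wi / z for wi in w]

def objective(x, past_losses, c):
    """Sum of past linear losses plus c times the negative entropy of x."""
    linear = sum(li * xi for loss in past_losses for li, xi in zip(loss, x))
    neg_entropy = sum(xi * math.log(xi) for xi in x if xi > 0)
    return linear + c * neg_entropy

losses = [(0.2, 0.9, 0.5), (0.7, 0.1, 0.4), (0.3, 0.3, 1.0)]
c = 2.0
x_ftrl = ftrl_entropy(losses, 3, c)
x_mw = multiplicative_weights(losses, 3, 1.0 / c)
print(x_ftrl, x_mw)
```

The two vectors agree up to floating-point error, and `objective(x_ftrl, losses, c)` is no larger than the objective at, say, the uniform distribution.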

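Now it remains to bound, at each time step,

$$ f_t(x_t) - f_t(x_{t+1}) = \langle \ell_t , x_t - x_{t+1} \rangle $$

For this, it is convenient to return to the notation that we used in describing the multiplicative weights algorithm, that is, it is convenient to work with the weights defined as

$$ w_1(i) = 1, \ \ \ \ w_{t+1}(i) = w_t(i) \cdot e^{-\ell_t(i)/c} $$

so that, at each time step

$$ x_t(i) = \frac{w_t(i)}{\sum_j w_t(j)} $$

We are assuming $0 \leq \ell_t(i) \leq 1$, so the weights are non-increasing with time. Then

$$ x_{t+1}(i) = \frac{w_{t+1}(i)}{\sum_j w_{t+1}(j)} \geq \frac{w_{t+1}(i)}{\sum_j w_{t}(j)} = x_t(i) \cdot e^{-\ell_t(i)/c} $$

For every $z$ we have $e^{-z} \geq 1 - z$, so

$$ x_{t+1}(i) \geq x_t(i) \cdot \left( 1 - \frac{\ell_t(i)}{c} \right) $$

and

$$ f_t(x_t) - f_t(x_{t+1}) = \sum_i \ell_t(i) \cdot ( x_t(i) - x_{t+1}(i) ) \leq \frac 1c \sum_i \ell_t^2(i) x_t(i) \leq \frac 1c $$

Putting it all together, and noting that $R(x) \leq 0$ for every $x \in \Delta$ while $R(x_1) = - c \ln n$ because $x_1$ is the uniform distribution, we have

$$ \mathrm{Regret}_T(x) \leq \frac 1c \sum_{t=1}^T \sum_i \ell_t^2(i) x_t(i) + R(x) - R(x_1) \leq \frac Tc + c \ln n $$

Choosing $c = \sqrt{\frac{T}{\ln n}}$, we have

$$ \mathrm{Regret}_T \leq 2 \sqrt{T \ln n} $$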

Thus, we have reconstructed the analysis of the multiplicative weights algorithm.
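A quick numerical check of a regret bound of this kind can be sketched as follows. The code runs FTRL with the negative-entropy regularizer, in its closed form $x_t(i) \propto \exp(-\frac 1c \sum_{k<t} \ell_k(i))$, on random losses in $[0,1]$ and compares the regret with $2\sqrt{T \ln n}$ for the choice $c = \sqrt{T/\ln n}$. The function name, the random seed and the sizes $T = 2000$, $n = 5$ are assumptions made for the example.

```python
import math
import random

def run_entropy_ftrl(losses, c):
    """Play x_t(i) proportional to exp(-(cumulative loss of i) / c).

    Returns the regret against the best fixed expert in hindsight.
    """
    n = len(losses[0])
    cumulative = [0.0] * n
    alg_loss = 0.0
    for loss in losses:
        shift = min(cumulative)  # for numerical stability only
        w = [math.exp(-(s - shift) / c) for s in cumulative]
        z = sum(w)
        x = [wi / z for wi in w]
        alg_loss += sum(li * xi for li, xi in zip(loss, x))
        cumulative = [s + li for s, li in zip(cumulative, loss)]
    return alg_loss - min(cumulative)

rng = random.Random(1)
T, n = 2000, 5
losses = [[rng.random() for _ in range(n)] for _ in range(T)]
c = math.sqrt(T / math.log(n))
regret = run_entropy_ftrl(losses, c)
bound = 2 * math.sqrt(T * math.log(n))
print(regret, bound)
```

On random losses the observed regret is typically far below the bound; the bound is a worst-case guarantee, so it also holds on adversarial sequences such as one alternating between $(1,0)$ and $(0,1)$.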

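Interestingly, the analysis that we derived today is not exactly identical to the one from the post on multiplicative weights. There, we derived the bound

$$ \mathrm{Regret}_T \leq \epsilon \sum_{t=1}^T \sum_{i=1}^n x_t(i) \ell_t^2(i) + \frac{\ln n}{\epsilon} $$

while here, setting $c = 1/\epsilon$, we derived

$$ \mathrm{Regret}_T \leq \epsilon \sum_{t=1}^T \sum_{i=1}^n x_t(i) \ell_t^2(i) + \frac{\ln n}{\epsilon} - \frac 1\epsilon H(x^*) $$

where $x^*$ is the offline optimum and $H(x) = \sum_i x_i \ln \frac 1{x_i}$ is the entropy function (computed using natural logarithms).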

**3. L2 Regularization**

Now that we have a general method, let us apply it to a new context: suppose that, as before, our cost functions are linear, but let $K = \mathbb{R}^n$. With linear cost functions and no bound on the size of solutions, it will not be possible to talk about regret with respect to the offline optimum, because the offline optimum will always be $-\infty$, but it will be possible to talk about regret with respect to a particular offline solution $x$, which will already lead to interesting consequences.

What regularizer should we use? In reasoning about regularizers, it can be helpful to think about what would go wrong if we use FTL, and then consider what regularizer would successfully “pull away” from the bad solutions found by FTL. In this context of linear loss functions and unbounded solutions, FTL will pick an infinitely big solution at each step, or, to be more precise, the “min” in the definition of FTL is undefined. To fight this tendency of FTL to go off to infinity, it makes sense for the regularizer to be a measure of how big a solution is. Since we are going to have to compute derivatives, it is good to use a measure of “bigness” with a nice gradient, and $\| x \|^2$ is a natural choice. So, for a scale parameter $c > 0$ to be optimized later, our regularizer will be

$$ R(x) := c \| x \|^2 $$

This tells us that

$$ x_1 = \arg\min_{x \in \mathbb{R}^n} c \| x\|^2 = 0 $$

and

$$ x_{t+1} = \arg\min_{x \in \mathbb{R}^n} \ c \| x \|^2 + \sum_{k=1}^t \langle \ell_k , x \rangle $$

The function that we are minimizing in the above expression is convex, so we just have to compute the gradient and set it to zero

$$ 2 c x + \sum_{k=1}^t \ell_k = 0 $$

$$ x_{t+1} = - \frac 1{2c} \sum_{k=1}^t \ell_k $$

Which can be also expressed as

$$ x_1 = 0, \ \ \ \ x_{t+1} = x_t - \frac 1{2c} \ell_t $$

This makes perfect sense because, in the “experts” interpretation, we want to penalize the experts that performed badly in the past. Here we have no constraints on our allocations, so we simply decrease (additively this time, not multiplicatively) the allocation to the experts that caused a higher loss.
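The additive update is short enough to write out directly. The sketch below assumes the FTRL rule with regularizer $c\|x\|^2$ over $\mathbb{R}^n$, that is, the update $x_{t+1} = x_t - \frac{1}{2c}\ell_t$ starting from $x_1 = 0$; it runs the update on a few made-up loss vectors and computes the regret against a made-up comparison point `u`. All names and numbers are illustrative.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def ftrl_l2(loss_vectors, c):
    """Return the iterates x_1, ..., x_{T+1} of x_{t+1} = x_t - loss_t / (2c)."""
    n = len(loss_vectors[0])
    x = [0.0] * n
    iterates = [x]
    for loss in loss_vectors:
        x = [xi - li / (2 * c) for xi, li in zip(x, loss)]
        iterates.append(x)
    return iterates

def regret_against(loss_vectors, iterates, u):
    """Sum over t of <loss_t, x_t> - <loss_t, u> (zip stops after x_T)."""
    return sum(dot(l, x) - dot(l, u) for l, x in zip(loss_vectors, iterates))

c = 1.0
loss_vectors = [(1.0, 0.0), (0.0, 2.0), (-1.0, 1.0)]
u = (1.0, -1.0)
iterates = ftrl_l2(loss_vectors, c)
regret = regret_against(loss_vectors, iterates, u)
# The quantity c * ||u||^2 + (1 / 2c) * sum_t ||loss_t||^2
bound = c * dot(u, u) + sum(dot(l, l) for l in loss_vectors) / (2 * c)
print(iterates[-1], regret, bound)
```

Here the last iterate is $-\frac{1}{2c}$ times the sum of the loss vectors, namely $(0, -1.5)$, and the regret against `u` is $2.5$, below the value $5.5$ of $c\|u\|^2 + \frac{1}{2c}\sum_t \|\ell_t\|^2$.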

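To compute the regret bound, we have

$$ f_t(x_t) - f_t(x_{t+1}) = \langle \ell_t , x_t - x_{t+1} \rangle = \frac 1{2c} \| \ell_t \|^2 $$

and so, since $R(x_1) = 0$, the regret with respect to a solution $x$ is

$$ \mathrm{Regret}_T (x) \leq c \| x \|^2 + \frac 1{2c} \sum_{t=1}^T \| \ell_t \|^2 $$

If we know a bound

$$ \forall t: \ \ \| \ell_t \| \leq L $$

then we can optimize $c$, by setting $c = \frac{L}{\| x \|} \sqrt{\frac T2}$, and we have

$$ \mathrm{Regret}_T (x) \leq \| x \| \cdot L \cdot \sqrt{2T} $$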

**3.1. Dealing with Constraints**

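Consider now the case in which the loss functions are linear and $K$ is an arbitrary convex set. Using the same regularizer $R(x) = c \| x \|^2$ we have the algorithm

$$ x_1 = \arg\min_{x \in K} \ c \| x \|^2 $$

$$ x_{t+1} = \arg\min_{x \in K} \ c \| x \|^2 + \sum_{k=1}^t \langle \ell_k , x \rangle $$

How can we solve the above constrained optimization problem? A very helpful observation is that we can first solve the unconstrained optimization and then project on $K$, that is we can proceed as follows:

$$ y_{t+1} = \arg\min_{y \in \mathbb{R}^n} \ c \| y \|^2 + \sum_{k=1}^t \langle \ell_k , y \rangle $$

$$ x'_{t+1} = \Pi_K (y_{t+1}) = \arg\min_{x\in K} \| x - y_{t+1} \| $$

and we claim that we always have $x'_{t+1} = x_{t+1}$. The fact that we can reduce a regularized constrained optimization problem to an unconstrained problem and a projection is part of a broader theory that we will describe in a later post. For now, we will limit ourselves to proving the equivalence in this specific setting. First of all, we already have an expression for $y_{t+1}$, namely

$$ y_{t+1} = - \frac 1{2c} \sum_{k=1}^t \ell_k $$

Now the definition of $x'_{t+1}$ is

$$ x'_{t+1} = \arg\min_{x\in K} \| x - y_{t+1} \|^2 = \arg\min_{x\in K} \ \| x \|^2 - 2 \langle x, y_{t+1} \rangle + \| y_{t+1} \|^2 = \arg\min_{x\in K} \ \| x \|^2 + \frac 1c \sum_{k=1}^t \langle \ell_k , x \rangle = \arg\min_{x\in K} \ c \| x \|^2 + \sum_{k=1}^t \langle \ell_k , x \rangle = x_{t+1} $$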

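In order to bound the regret, we have to compute

$$ f_t(x_t) - f_t(x_{t+1}) = \langle \ell_t , x_t - x_{t+1} \rangle \leq \| \ell_t \| \cdot \| x_t - x_{t+1} \| = \| \ell_t \| \cdot \| \Pi_K (y_t) - \Pi_K (y_{t+1}) \| $$

and since L2 projections cannot increase L2 distances, we have

$$ f_t(x_t) - f_t(x_{t+1}) \leq \| \ell_t \| \cdot \| y_t - y_{t+1} \| = \frac 1{2c} \| \ell_t \|^2 $$

So the regret bound is

$$ \mathrm{Regret}_T(x) \leq c \| x\|^2 - c \| x_1 \|^2 + \frac 1 {2c} \sum_{t=1}^T \| \ell_t \|^2 $$

If $D$ is an upper bound to $\max_{x\in K} \| x \|$, and $L$ is an upper bound to the norm $\| \ell_t \|$ of all the loss vectors, then

$$ \mathrm{Regret}_T \leq c D^2 + \frac 1{2c} T L^2 $$

which can be optimized to

$$ \mathrm{Regret}_T \leq D L \sqrt{2T} $$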

**3.2. Deriving the Analysis of Gradient Descent**

Suppose that $g: K \rightarrow \mathbb{R}$ is a convex function whose gradient is well defined at all points in $K$, and that we are interested in minimizing $g$. Then a way to reduce this problem to online optimization would be to use the function $g$ as loss function at each step. Then the offline optimum would be the minimizer of $g$, and achieving small regret means that $\frac 1T \sum_t g(x_t)$ is close to the minimum of $g$, and so the best $x_t$ is an approximate minimizer.

Unfortunately, this is not a very helpful idea, because if we ran an FTRL algorithm against an adversary that keeps proposing $g$ as a cost function at each step then we would have

$$ x_{t+1} = \arg\min_{x\in K} \ t \cdot g(x) + R(x) $$

which, for large $t$, is essentially the same problem as minimizing $g$, so we have basically reduced the problem of minimizing $g$ to itself.

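Indeed, the power of the FTRL algorithm is that the algorithm does well even though it does not know the cost function, and if we keep using the same cost function at each step we are not making a good use of its power. Now, suppose that we use cost functions $f_t$ such that

$$ f_t(x_t) = g(x_t) \ \ \ \mbox{and} \ \ \ \forall x\in K: \ f_t(x) \leq g(x) $$

Then, after $T$ steps, calling $x^*$ a minimizer of $g$ over $K$, we have

$$ \sum_{t=1}^T g(x_t) - \sum_{t=1}^T g(x^*) \leq \sum_{t=1}^T f_t(x_t) - \sum_{t=1}^T f_t(x^*) = \mathrm{Regret}_T(x^*) $$

meaning

$$ \frac 1T \sum_{t=1}^T g(x_t) \leq g(x^*) + \frac{\mathrm{Regret}_T(x^*)}{T} $$

and so one of the $x_t$ is an approximate minimizer. Indeed, using convexity, we also have

$$ g \left( \frac 1T \sum_{t=1}^T x_t \right) \leq \frac 1T \sum_{t=1}^T g(x_t) \leq g(x^*) + \frac{\mathrm{Regret}_T(x^*)}{T} $$

and so the average of the $x_t$ is also an approximate minimizer. From the point of view of exploiting FTRL to minimize $g$, cost functions $f_t$ as above work just as well as presenting $g$ as a cost function at each step.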

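How do we find cost functions that satisfy the above two properties and for which the FTRL algorithm is easy to implement? The idea is to let $f_t$ be the linear approximation of $g$ at $x_t$:

$$ f_t(x) := g(x_t) + \langle \nabla g(x_t) , x - x_t \rangle $$

The condition $f_t(x_t) = g(x_t)$ is immediate, and

$$ g(x) \geq g(x_t) + \langle \nabla g(x_t) , x - x_t \rangle = f_t(x) $$

is a consequence of the convexity of $g$.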

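The cost functions that we have defined are affine functions, that is, each of them equals a constant plus a linear function

$$ f_t(x) = \left( g(x_t) - \langle \nabla g(x_t) , x_t \rangle \right) + \langle \nabla g(x_t) , x \rangle $$

Adding a constant term to a cost function does not change the iteration of FTRL, and does not change the regret (because the same term is added both to the solution found by the algorithm and to the offline optimum), so the algorithm is just initialized with

$$ y_1 = 0, \ \ \ \ x_1 = \arg\min_{x\in K} c \| x \|^2 = \Pi_K (y_1) $$

and then continues with the update rules

$$ y_{t+1} = y_t - \frac 1{2c} \nabla g(x_t) $$

$$ x_{t+1} = \Pi_K (y_{t+1}) = \arg\min_{x\in K} \| x - y_{t+1} \| $$

which is just projected gradient descent.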

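If we have known upper bounds

$$ \forall x \in K: \ \ \| x \| \leq D $$

and

$$ \forall x \in K: \ \ \| \nabla g(x) \| \leq L $$

then we have

$$ g \left( \frac 1T \sum_{t=1}^T x_t \right) - \min_{x\in K} g(x) \leq \frac{D L \sqrt{2T}}{T} = D L \sqrt{\frac 2T} $$

which means that to achieve additive error $\epsilon$ it is enough to proceed for $T = 2 D^2 L^2 / \epsilon^2$ steps.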
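The whole derivation can be exercised on a toy problem. The sketch below assumes that $K$ is the Euclidean ball of radius $D$ (so that the projection has a simple closed form), uses the choice $c = \frac LD \sqrt{T/2}$, and minimizes the made-up convex function $g(x) = \| x - p \|^2$ with $p = (2,0)$, whose minimum over the unit ball is $1$, attained at $(1,0)$. The function names and the test function are illustrative and not taken from the text.

```python
import math

def project_ball(y, radius):
    """L2 projection onto the Euclidean ball of the given radius."""
    norm = math.sqrt(sum(v * v for v in y))
    if norm <= radius:
        return list(y)
    return [v * radius / norm for v in y]

def ftrl_gradient_descent(grad, n, radius, lipschitz, steps):
    """FTRL with R(x) = c ||x||^2 and linearized losses; returns the average iterate."""
    c = (lipschitz / radius) * math.sqrt(steps / 2)
    y = [0.0] * n
    x = project_ball(y, radius)
    avg = [0.0] * n
    for _ in range(steps):
        avg = [a + xi / steps for a, xi in zip(avg, x)]  # accumulate x_1, ..., x_T
        g = grad(x)
        y = [yi - gi / (2 * c) for yi, gi in zip(y, g)]  # unconstrained step
        x = project_ball(y, radius)                      # project back onto K
    return avg

p = (2.0, 0.0)
def g(x):
    return sum((xi - pi) ** 2 for xi, pi in zip(x, p))
def grad_g(x):
    return [2 * (xi - pi) for xi, pi in zip(x, p)]

D, L, T = 1.0, 6.0, 2000   # on the unit ball, ||grad g(x)|| <= 2 * (1 + 2) = 6
x_avg = ftrl_gradient_descent(grad_g, 2, D, L, T)
error = g(x_avg) - 1.0     # the minimum of g over the unit ball is 1
bound = D * L * math.sqrt(2 / T)
print(x_avg, error, bound)
```

The additive error of the averaged iterate should come out below $D L \sqrt{2/T} \approx 0.19$; the bound is a worst-case guarantee, so the observed error is usually smaller.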
