This seemed like a reasonable model to balance profitability for publishers and open access, but there was no way to agree on it with Elsevier. Meanwhile, U.C. has not renewed its Elsevier subscriptions and Elsevier has cut off access to U.C. libraries.

I was very impressed to see the University of California central administration do something right, so I wondered if this was the kind of portent that is a harbinger of the apocalypse, or just a fluke. Subsequent events suggest the latter.

The University of California has spent a lot of time and money to build a centralized system for job applications and for job applicant review. I was first made aware of this when I chaired the recruiting committee for the Simons Director position. At first we were told that we could solicit applications through the (vastly superior) EECS-built system for job applications and reviews. After the application deadline passed, we were told that, in fact, we could *not* use the EECS system, and so the already overworked EECS faculty HR person had to manually copy all the data into the central campus system.

The American Mathematical Society has created a wonderfully functional system, called Mathjobs, where applicants for academic mathematics jobs (ranging from postdocs to professorships) can upload their application material once, and their recommenders can upload their letters once, and then all the universities that the candidate applies to have access to this material. Furthermore, both applicants and recommenders can tailor their material for a particular university or universities, if they want to.

Everybody was living happily, but not ever after, because the U.C. central campus administration decided that *everybody* in the University of California had to use the centralized system for *all* jobs. Both the AMS and U.C. mathematicians tried to find a reasonable accommodation, such as allowing the U.C. system to access the letters posted on Mathjobs. The campus administration's reasoned response was roughly “sucks to be you.” More of the story is told in an AMS Notices article by the chair of math at U.C. Davis.

Finally, this year U.C. Berkeley will not be listed in the US News and World Report rankings because it has submitted wrong data in the past.

where is the convex set of feasible solutions that the algorithm is allowed to produce, is the linear loss function at time , and is the strictly convex regularizer.

If we have an unconstrained problem, that is, if , then the optimization problem (1) has a unique solution: the such that

and we can usually both compute efficiently in an algorithm and reason about effectively in an analysis.

Unfortunately, we are almost always interested in constrained settings, and then it becomes difficult both to compute and to reason about it.

A very nice special case happens when the regularizer acts as a *barrier function* for , that is, the (norm of the) gradient of goes to infinity when one approaches the boundary of . In such a case, it is impossible for the minimum of (1) to occur at the boundary and the solution will be again the unique in such that

We swept this point under the rug when we studied FTRL with negative-entropy regularizer in the settings of experts, in which is the set of probability distributions. When we proceeded to solve (1) using Lagrange multipliers, we ignored the non-negativity constraints. The reason why it was ok to do so was that the negative-entropy is a barrier function for the non-negative orthant .

Another important special case occurs when the regularizer is a multiple of length-squared. In this case, we saw that we could “decouple” the optimization problem by first solving the unconstrained optimization problem, and then projecting the solution of the unconstrained problem to :

Then we have the closed-form solution and, depending on the set , the projection might also have a nice closed-form, as in the case that comes up in results related to regularity lemmas.

As we will see today, this approach of solving the unconstrained problem and then projecting on works for every regularizer, for an appropriate notion of projection called the *Bregman projection* (the projection will depend on the regularizer).

To define the Bregman projection, we will first define the *Bregman divergence* with respect to the regularizer , which is a non-negative “distance” defined on (or possibly a subset of for which the regularizer is a barrier function). Then, the Bregman projection of on is defined as .

Unfortunately, it is not so easy to reason about Bregman projections either, but the notion of Bregman divergence offers a way to reinterpret the FTRL algorithm from another point of view, called *mirror descent*. Via this reinterpretation, we will prove the regret bound

which carries the intuition that the regret comes from a combination of the “distance” of our initial solution from the offline optimum and of the “stability” of the algorithm, that is, the “distance” between consecutive solutions. Nicely, the above bound measures both quantities using the same “distance” function.

**1. Bregman Divergence and Bregman Projection **

For a strictly convex function , we define the *Bregman divergence* associated to as

that is, the difference between the value of at and the value of the linear approximation of at (centered at ). By the strict convexity of we have and iff . These properties suggest that we may think of as a kind of “distance” between and , which is a useful intuition although it is important to keep in mind that the divergence need not be symmetric and need not satisfy the triangle inequality.
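As a quick numerical sanity check (a sketch in Python; the helper names are ours), the standard definition D_R(x, y) = R(x) − R(y) − ⟨∇R(y), x − y⟩ can be implemented directly, and the two properties above verified for, say, R(x) = ‖x‖²:

```python
import numpy as np

def bregman_divergence(R, grad_R, x, y):
    """D_R(x, y) = R(x) - R(y) - <grad R(y), x - y>."""
    return R(x) - R(y) - np.dot(grad_R(y), x - y)

# Example: R(x) = ||x||^2, for which D_R(x, y) = ||x - y||^2.
R = lambda x: np.dot(x, x)
grad_R = lambda x: 2 * x

x = np.array([1.0, 2.0])
y = np.array([0.5, -1.0])
assert np.isclose(bregman_divergence(R, grad_R, x, y), np.dot(x - y, x - y))
assert bregman_divergence(R, grad_R, x, x) == 0.0   # D_R(x, x) = 0
assert bregman_divergence(R, grad_R, x, y) >= 0.0   # non-negativity
```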

Now we show that, assuming that is well defined and strictly convex on all , and that the losses are linear, the constrained optimization problem (1) can be solved by first solving the unconstrained problem and then “projecting” the solution on by finding the point in of smallest Bregman divergence from the unconstrained optimum:

The proof is very simple. The optimum of the unconstrained optimization problem is the unique such that

that is, the unique such that

On the other hand, is defined as

that is,

where the second equality above follows from the fact that two functions that differ by a constant have the same optimal solutions.

Indeed we see that the above “decoupled” characterization of the FTRL algorithm would have worked for any definition of a function of the form

and that our particular choice of what “stuff dependent only on ” to add makes which is reasonable for something that we want to think of as a “distance function.”

Note that, in all of the above, we can replace with a convex set provided that is a barrier function for . In that case

is the unique such that

and everything else follows analogously.

**2. Examples **

** 2.1. Bregman Divergence of Length-Squared **

If , then

so Bregman divergence is distance-squared, and Bregman projection is just (Euclidean) projection.

** 2.2. Bregman Divergence of Negative Entropy **

If, for , we define

then the associated Bregman divergence is the generalized *KL divergence.*

where so that

Note that, if and are probability distributions, then the final two terms above cancel out, leaving just the KL divergence .
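The cancellation is easy to verify numerically. Below is a small sketch (function names are ours) that computes the Bregman divergence of the negative entropy and compares it with the generalized KL divergence:

```python
import numpy as np

def neg_entropy(x):
    return np.sum(x * np.log(x))

def grad_neg_entropy(x):
    return 1.0 + np.log(x)

def bregman(R, gR, x, y):
    return R(x) - R(y) - np.dot(gR(y), x - y)

def generalized_kl(x, y):
    return np.sum(x * np.log(x / y)) - np.sum(x) + np.sum(y)

x = np.array([0.2, 0.3, 0.5])   # probability distributions
y = np.array([0.4, 0.4, 0.2])
d = bregman(neg_entropy, grad_neg_entropy, x, y)
assert np.isclose(d, generalized_kl(x, y))
# For distributions the last two terms cancel, leaving plain KL(x || y):
assert np.isclose(d, np.sum(x * np.log(x / y)))
```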

**3. Mirror Descent **

We now introduce a new perspective on FTRL.

In the unconstrained setting, if is a strictly convex function and is the associated Bregman divergence, the *mirror descent* algorithm for online optimization has the update rule

The idea is that we want to find a solution that is good for the past loss functions, but that does not “overfit” too much. If, in past steps, had been chosen to be such a solution for the loss functions , then, in choosing , we want to balance staying close to but also doing well with respect to , hence the above definition.

Theorem 1. Initialized with , the unconstrained mirror descent algorithm is identical to FTRL with regularizer .

*Proof:* We will proceed by induction on . At , the definition of is the same. For larger , we know that FTRL will choose the unique such that , so we will assume that this is true for the mirror descent algorithm for and prove it for .

First, we note that the function is strictly convex, because it equals

and so it is a sum of a strictly convex function , linear functions in , and constants independent of . This means that is the unique point at which the gradient of the above function is zero, that is,

and so

and, using the inductive hypothesis, we have

as desired.
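For the special case of the L2-squared regularizer, both updates have closed forms, so the equivalence stated in Theorem 1 can be checked directly (a sketch; the constant c and the random losses are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
c, d = 5.0, 3                    # regularizer scale and dimension
losses = [rng.standard_normal(d) for _ in range(20)]

# With R(x) = c||x||^2:
#   FTRL:           x_{t+1} = argmin_x sum_{s<=t} <l_s, x> + c||x||^2
#                           = -(sum of past losses) / (2c)
#   mirror descent: y_{t+1} = argmin_x <l_t, x> + c||x - y_t||^2
#                           = y_t - l_t / (2c)
y = np.zeros(d)                  # y_1 = argmin_x R(x) = 0
total = np.zeros(d)
for l in losses:
    total += l
    y = y - l / (2 * c)          # mirror descent step
    x = -total / (2 * c)         # FTRL closed form
    assert np.allclose(x, y)     # the two algorithms coincide
```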

In the constrained case, there are two variants of mirror descent. Using the terminology from Elad Hazan’s survey, *agile* mirror descent is the natural generalization of the unconstrained algorithm:

Following the same steps as the proof in the previous section, it is possible to show that agile mirror descent is equivalent to solving, at each iteration, the “decoupled” optimization problems

That is, we can first solve the unconstrained problem and then project on . (Again, we can always replace by a set for which is a barrier function and such that .)

The *lazy* mirror descent algorithm has the update rule

The initialization is

Fact 2. Lazy mirror descent is equivalent to FTRL.

*Proof:* The solutions are the unconstrained optimum of FTRL, and is the Bregman projection of on . We proved in the previous section that this characterizes constrained FTRL.

What about agile mirror descent? In certain special cases it is equivalent to lazy mirror descent, and hence to FTRL, but it usually leads to a different set of solutions.

We will provide an analysis of lazy mirror descent, but first we will give an analysis of the regret of unconstrained FTRL in terms of Bregman divergence, which will be the model on which we will build the proof for the constrained case.

**4. A Regret Bound for FTRL in Terms of Bregman Divergence **

In this section we prove the following regret bound.

Theorem 3. Unconstrained FTRL with regularizer satisfies the regret bound

where is the Bregman divergence associated with .

We will take the mirror descent view of unconstrained FTRL, so that

We proved that

This means that we can rewrite the regret suffered at step with respect to as

and the theorem follows by adding up the above expression for and recalling that .

Unfortunately I have no geometric intuition about the above identity, although, as you can check yourself, the algebra works neatly.

**5. A Regret Bound for Agile Mirror Descent **

In this section we prove the following generalization of the regret bound from the previous section.

Theorem 4. Agile mirror descent satisfies the regret bound

The first part of the update rule of agile mirror descent is

and, following steps that we have already carried out before, satisfies

This means that we can rewrite the regret suffered at step with respect to as

where the same mystery cancellations as before make the above identity true.

Now I will wield another piece of magic, and I will state without proof the following fact about Bregman projections:

Lemma 5. If and is the Bregman projection on of a point , then

That is, if we think of as a “distance,” the distance from to its closest point in plus the distance from to is at most the distance from to . Note that this goes in the opposite direction from the triangle inequality (which is ok, because typically does not satisfy the triangle inequality).
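In the Euclidean case, where the Bregman divergence is distance-squared and the Bregman projection is the ordinary Euclidean projection, this is the “generalized Pythagorean theorem,” and it is easy to test numerically (a sketch, with the box [0, 1]^d as an arbitrary choice of convex set):

```python
import numpy as np

rng = np.random.default_rng(1)

def project_box(x):
    """Euclidean projection onto the convex set K = [0, 1]^d."""
    return np.clip(x, 0.0, 1.0)

ok = True
for _ in range(1000):
    x = rng.uniform(-3, 3, size=4)    # an arbitrary point
    y = rng.uniform(0, 1, size=4)     # a point of K
    z = project_box(x)                # the projection of x onto K
    # D(y, z) + D(z, x) <= D(y, x), with D = squared Euclidean distance
    lhs = np.sum((y - z) ** 2) + np.sum((z - x) ** 2)
    rhs = np.sum((y - x) ** 2)
    ok = ok and (lhs <= rhs + 1e-9)
assert ok
```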

In particular, the above lemma gives us

and so

Now summing over and recalling that we have our theorem.


I would like to congratulate my Taiwanese readers for being in the first Asian country to introduce same-sex marriage.

The Szemeredi Regularity Lemma states (in modern language) that every dense graph is well approximated by a graph with a very simple structure, made of the (edge-disjoint) union of a constant number of weighted complete bipartite subgraphs. The notion of approximation is a bit complicated to describe, but it enables the proof of *counting lemmas*, which show that, for example, the number of triangles in the original graph is well approximated by the (appropriately weighted) number of triangles in the approximating graph.

Analogous regularity lemmas, in which an arbitrary object is approximated by a low-complexity object, have been proved for hypergraphs, for subsets of abelian groups (for applications to additive combinatorics), in an analytic setting (for applications to graph limits) and so on.

The *weak regularity lemma* of Frieze and Kannan provides, as the name suggests, a weaker kind of approximation than the one promised by Szemeredi’s lemma, but one that is achievable with a graph that has a much smaller number of pieces. If is the “approximation error” that one is willing to tolerate, Szemeredi’s lemma constructs a graph that is the union of a weighted complete bipartite subgraphs where the height of the tower of exponentials is polynomial in . In the Frieze-Kannan construction, that number is cut down to a single exponential . This result too can be generalized to graph limits, subsets of groups, and so on.

With Tulsiani and Vadhan, we proved an abstract version of the Frieze-Kannan lemma (which can be applied to graphs, functions, distributions, etc.) in which the “complexity” of the approximation is . In the graph case, the approximating graph is still the union of complete bipartite subgraphs, but it has a more compact representation. One consequence of this result is that for every high-min-entropy distribution , there is an efficiently samplable distribution with the same min-entropy as , that is indistinguishable from . Such a result could be taken to be a proof that what GANs attempt to achieve is possible in principle, except that our result requires an unrealistically high entropy (and we achieve “efficient samplability” and “indistinguishability” only in a weak sense).

All these results are proved with a similar strategy: one starts from a trivial approximator, for example the empty graph, and then repeats the following iteration: if the current approximator achieves the required approximation, then we are done; otherwise take a counterexample, and modify the approximator using the counterexample. Then one shows that:

- The number of iterations is bounded, by keeping track of an appropriate potential function;
- The “complexity” of the approximator does not increase too much from iteration to iteration.

Typically, the number of iterations is , and the difference between the various results is given by whether at each iteration the “complexity” increases exponentially, or by a multiplicative factor, or by an additive term.

Like in the post on pseudorandom constructions, one can view such constructions as an online game between a “builder” and an “inspector,” except that now the online optimization algorithm will play the role of the builder, and the inspector is the one acting as an adversary. The bound on the number of rounds comes from the fact that the online optimization algorithms that we have seen so far achieve amortized error per round after rounds, so it takes rounds for the error bound to go below .

We will see that the abstract weak regularity lemma of my paper with Tulsiani and Vadhan (and hence the graph weak regularity lemma of Frieze and Kannan) can be immediately deduced from the theory developed in the previous post.

When I was preparing these notes, I was asked by several people if the same can be done for Szemeredi’s lemma. I don’t see a natural way of doing that. For such results, one should maybe use the online optimization techniques as a guide rather than as a black box. In general, iterative arguments (in which one constructs an object through a series of improvements) require the choice of a potential function, and an argument about how much the potential function changes at every step. The power of the FTRL method is that it creates the potential function and a big part of the analysis automatically and, even where it does not work directly, it can serve as an inspiration.

One could imagine a counterfactual history in which people first proved the weak regularity lemma using online optimization out of the box, as we do in this post, and then decided to try and use an L2 potential function and an iterative method to get the Szemeredi lemma, subsequently trying to see what happens if the potential function is entropy, thus discovering Jacob Fox’s major improvement on the “triangle removal lemma,” which involves the construction of an approximator that just approximates the number of triangles.

**1. A “vanilla” weak regularity lemma **

Frieze and Kannan proved the following basic result about graph approximations, which has a number of algorithmic applications. If is a set of vertices which is understood from the context, and are disjoint subsets of vertices, then let , that is, the boolean matrix such that iff and .

The *cut norm* of a matrix is

In the following we will identify a graph with its adjacency matrix.

Theorem 1. Let be a graph on vertices and be an approximation parameter. Then there are sets and scalars , where , such that if we define

we have

We will prove the following more general version.

Theorem 2. Let be a set, be a bounded function, be a family of functions mapping to and be an approximation parameter. Then there are functions in and scalars , with , such that if we define we have

We could also, with the same proof, argue about a possibly infinite set with a measure such that is finite, and, after defining the inner product

we could prove the same conclusion of the theorem, with instead of as an error bound.

Here is the proof: run the FTRL algorithm with L2-squared regularizer in the setup in which the space of solutions is the set of all functions and the loss functions are linear. Every time the algorithm proposes a solution , if there is a function such that either or , the adversary will pick, respectively, or as a loss function . When the adversary has no such choice, we stop and the function is our desired approximation.

First of all, let us analyze the number of rounds. Here the maximum norm of the functions in is , so after rounds we have the regret bound

Now let us consider to be our offline solution: we have

which implies

Finally, recall that

where is the scaling constant in the definition of the regularizer ( is of order of when is order of ), and so our final approximator computed at the last round is a weighted sum of functions from .
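The whole builder/inspector game can be simulated in a few lines. The sketch below makes arbitrary choices of graph size, error parameter and step size, and its inspector only samples random cuts rather than maximizing exactly; the builder's FTRL-with-L2 update is just an additive step, so the approximator stays a short weighted sum of cut matrices:

```python
import numpy as np

rng = np.random.default_rng(2)
n, eps, eta = 30, 0.1, 0.05     # graph size, error, FTRL step (arbitrary)

G = (rng.random((n, n)) < 0.5).astype(float)   # a random graph

def random_cut_matrix():
    """The matrix CUT(S, T) for random vertex sets S and T."""
    S = rng.random(n) < 0.5
    T = rng.random(n) < 0.5
    return np.outer(S, T).astype(float)

x = np.zeros((n, n))             # trivial initial approximator
for step in range(200):
    # Inspector: look for a cut with |<G - x, CUT(S,T)>| > eps * n^2
    # (here only among random samples, not by exact maximization).
    violation = None
    for _ in range(200):
        C = random_cut_matrix()
        corr = np.sum((G - x) * C)
        if abs(corr) > eps * n * n:
            violation = np.sign(corr) * C
            break
    if violation is None:        # no witness found: x approximates G
        break
    # Builder: FTRL with the L2 regularizer reduces to an additive update.
    x = x + eta * violation
```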

**2. The weak regularity lemma **

Frieze and Kannan’s weak regularity lemma has the following form.

Theorem 3. Let be a graph on vertices and be an approximation parameter. Then there is a partition of into sets , and there are bounded weights for such that if we define the weighted graph where the weight of the edge in is , where and , then we have

Notice that if we did not require the weights to be between 0 and 1 then the result of the previous section can also be cast in the above language, because we can take the partition to be the “Sigma-algebra generated by” the sets .

For a scalar , let be defined as

where stands for *truncation*. Note that is the L2 projection of on .
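In code, the truncation operator is just a pointwise clip, and the projection property can be checked directly (a sketch; the candidate comparison points are arbitrary):

```python
import numpy as np

def truncate(g, lo=0.0, hi=1.0):
    """Pointwise truncation: the L2 projection of g onto the set of
    [lo, hi]-valued vectors."""
    return np.clip(g, lo, hi)

g = np.array([-0.3, 0.4, 1.7])
t = truncate(g)
assert np.allclose(t, [0.0, 0.4, 1.0])
# Projection property: no [0,1]-valued vector is closer to g than t is.
for h in (np.array([0.0, 0.4, 1.0]), np.array([0.1, 0.5, 0.9])):
    assert np.sum((g - t) ** 2) <= np.sum((g - h) ** 2)
```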

Theorem 3 is a special case of the following result, proved in our paper with Tulsiani and Vadhan.

Theorem 4. Let be a set, be a bounded function, be a family of functions mapping to and be an approximation parameter. Then there are functions in and scalars , with , such that if we define we have

To prove Theorem 4 we play the same online game as in the previous section: the online algorithm proposes a solution ; if then we stop and output , otherwise we let the loss function be a function such that either or is in and

The only difference is that we use the FTRL algorithm with L2 regularizer that has the set of feasible solutions defined to be the set of all functions rather than the set of all functions . Then each function is the projection to of , and the projection to is just composition with . The bound on the number of steps is the same as the one in the previous section.

Looking at the case in which is the set of edges of a clique on , is the set of graphs of the form , and considering the Sigma-algebra generated by gives Theorem 3 from Theorem 4.

**3. Sampling High-Entropy Distributions **

Finally we discuss the application to sampling high-entropy distributions.

Suppose that is a distribution over of min-entropy , meaning that for every we have

where we think of the *entropy deficiency* as being small, such as a constant or

Let be a class of functions that we think of as being “efficient.” For example, could be the set of all functions computable by circuits of size for some size bound , such as, for example . We will assume that is in . Define

to be a bounded function . Fix an approximation parameter .

Then from Theorem 4 we have that there are functions , and scalars , all equal to for a certain parameter , such that if we define

Now define the probability distribution

Applying (1) to the case , we have

and we know that , so

and we can rewrite (2) as

and, finally

that is

which says that and are -indistinguishable by functions in . If we chose , for example, to be the class of functions computable by circuits of size , then and are -indistinguishable by circuits of size .

But is also samplable in a relatively efficient way using rejection sampling: pick a random , then output with probability and fail with probability . Repeat the above until the procedure does not fail. At each step, the probability of success is , so, assuming (because otherwise all of the above makes no sense) that, say, , the procedure succeeds on average in at most attempts. And if each is computable by a circuit of size , then is computable by a circuit of size .
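The rejection sampler described above can be sketched as follows (the function h below is a hypothetical stand-in for the bounded acceptance-probability function produced by the approximation argument):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 8

def h(x):
    """Hypothetical bounded function {0,1}^n -> [0,1]; the target
    distribution is proportional to h."""
    return 0.9 if np.sum(x) % 2 == 1 else 0.5   # favors odd parity

def sample():
    while True:                          # repeat until no failure
        x = rng.integers(0, 2, size=n)   # pick a uniformly random point
        if rng.random() < h(x):          # accept with probability h(x)
            return x

samples = [sample() for _ in range(5000)]
odd_frac = np.mean([np.sum(s) % 2 for s in samples])
# Odd-parity strings should occur with frequency 0.9/(0.9+0.5), about 0.64.
assert 0.58 < odd_frac < 0.70
```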

The undesirable features of this result are that the complexity of sampling and the quality of indistinguishability depend exponentially on the randomness deficiency, and that the sampling circuit is a non-uniform circuit which it is not clear how to construct without advice. Impagliazzo’s recent results address both these issues.


Furthermore, it is not clear how we would generalize the ideas of multiplicative weights to the case in which the set of feasible solutions is anything other than the set of distributions.

Today we discuss the *“Follow the Regularized Leader”* method, which provides a framework to design and analyze online algorithms in a versatile and well-motivated way. We will then see how we can “discover” the definition and analysis of multiplicative weights, and how to “discover” another online algorithm which can be seen as a generalization of projected gradient descent (that is, one can derive the projected gradient descent algorithm and its analysis from this other online algorithm).

**1. Follow The Regularized Leader **

We will first state some results in full generality, making no assumptions on the set of feasible solutions or on the set of loss functions encountered by the algorithm at each step.

Let us try to define an online optimization algorithm from scratch. The solution proposed by the algorithm at time can only depend on the previous cost functions ; how should it depend on them? If the offline optimal solution is consistently better than all others at each time step, then we would like to be that solution, so we want to be a solution that would have worked well in the previous steps. The most extreme way of implementing this idea is the *Follow the Leader* algorithm (abbreviated FTL), in which we set the solution at time

to be the best solution for the previous steps. (Note that the algorithm does not prescribe what solution to use at step .)

It is possible for FTL to perform very badly. Consider for example the “experts” setting in which we analyzed multiplicative weights: the set of feasible solutions is the set of probability distributions over , and the cost functions are linear with coefficients . Suppose that and that . Then a possible run of the algorithm could be:

- ,
- ,
- ,
- ,
- ,

Here, after steps, the algorithm suffers a loss of while the offline optimum is . Thus, the regret is about , which compares very unfavorably to the regret of the multiplicative weights algorithm. For general , a similar example shows that the regret of FTL can be as high as about .
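A run of this kind is easy to simulate (a sketch with two experts, T = 100 steps, and the arbitrary convention that FTL plays the uniform distribution at the first step):

```python
import numpy as np

T, n = 100, 2
losses = np.zeros((T, n))
losses[0] = [0.5, 0.0]
for t in range(1, T):
    # After step 0 expert 1 leads; the adversary then alternates so that
    # the current leader always suffers loss 1 at the next step.
    losses[t] = [0.0, 1.0] if t % 2 == 1 else [1.0, 0.0]

cum = np.zeros(n)                        # cumulative losses of the experts
alg_loss = 0.0
for t in range(T):
    if t == 0:
        x = np.ones(n) / n               # arbitrary first solution
    else:
        x = np.zeros(n)
        x[np.argmin(cum)] = 1.0          # FTL: all mass on the current leader
    alg_loss += np.dot(x, losses[t])
    cum += losses[t]

offline = np.min(cum)                    # best single expert in hindsight
# FTL pays about T while the offline optimum pays about T/2: linear regret.
assert alg_loss - offline > T / 2 - 2
```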

In the above bad example, the algorithm keeps “overfitting” to the past history: if an expert is a bit better than the others, the algorithm puts all its probability mass on that expert, and the algorithm keeps changing its mind at every step. Interestingly, this is the only failure mode of the algorithm.

Theorem 1 (Analysis of FTL). For any sequence of cost functions and any number of time steps , the FTL algorithm satisfies the regret bound

Thus, if the functions are Lipschitz with respect to a distance function on , the only way for the regret to be large is for to typically be far, in that distance, from .

*Proof:* Recalling the definition of regret,

We will prove (1) by induction. The base case is just the definition of . Assuming that (1) is true up to we have

where the middle step follows from the use of the inductive assumption, which gives

The above example and analysis suggest that we should modify FTL in such a way that the choices of the algorithm don’t change too much from step to step, and that the solution at time should be a compromise between optimizing with respect to previous cost functions and not changing too much from step to step.

In order to do this, we introduce a new function , called a *regularizer* (more on it later), and, at each step, we compute the solution

This algorithm is called *Follow the Regularized Leader* or FTRL. Typically, the function is chosen to be strictly convex and to take values that are rather big in magnitude. Then will be the unique minimum of and, at each subsequent step, will be selected in a way to balance the pull toward the minimum of and the pull toward the FTL solution . In particular, if is large in magnitude compared to each , the solution will not change too much from step to step.

We have the following analysis that makes no assumptions on , on the cost functions and on the regularizer (not even that the regularizer is convex).

Theorem 2 (Analysis of FTRL). For every sequence of cost functions and every regularizer function, the regret after steps of the FTRL algorithm is bounded as follows: for every , where

*Proof:* Let us run for steps the FTRL algorithm with regularizer and cost functions , and call the solutions computed by the FTL algorithm.

Now consider the following mental experiment: we run the FTL algorithm for steps, with the sequence of cost functions , and we use as a first solution. Then we see that the solutions computed by the FTL algorithm will be precisely . The regret bound for FTL implies that, for every ,

Having established these results, the general recipe to solve an online optimization problem will be to find a regularizer function such that the minimum of “pulls away from” solutions that would make the FTL algorithm overfit, and such that there is a good balance between how big gets over (because we pay in the regret, where is the offline optimum) and how stable is the minimum of as varies.

**2. Negative-Entropy Regularization **

Let us consider again the “experts” setting, that is, the online optimization setup in which is the set of probability distributions over and the cost functions are linear with bounded coefficients.

The example above showed that FTL tends to put all the probability mass on one expert. We would like to choose a regularizer that fights this tendency by penalizing “concentrated” distributions and favoring “spread-out” distributions. This observation might trigger the thought that the *entropy* of a distribution is a good measure of how concentrated or spread out it is, although the entropy is actually higher for spread-out distributions and smaller for concentrated ones. So we will use as a regularizer *minus the entropy*, multiplied by an appropriate scaling factor:

(Entropy is usually defined using logarithms in base 2, but using natural logarithms will make it cleaner to take derivatives, and it only affects the constant factor .) With this choice of regularizer, we have

To compute the minimum of the above function we will use the method of Lagrange multipliers. Specialized to our setting, the method of Lagrange multiplier states that if we want to solve the constrained minimization problem

we introduce a new parameter and define the function

Then it is possible to prove that if is a feasible minimizer of , then there is at least one value of such that , that is, such that is a stationary point of . So one can proceed by finding all such that and then filtering out the values of such that , and finally looking at which of the remaining minimizes .

Ignoring for a moment the non-negativity constraints, the constraint reduces to , so we have to consider the function

The partial derivative of the above expression with respect to is

If we want the gradient to be zero then we want all the above expressions to be zero, which translates to

There is only one value of that makes the above solution a probability distribution, and the corresponding solution is

Notice that this is exactly the solution computed by the multiplicative weights algorithm, if we choose . So we have “rediscovered” the multiplicative weights algorithm and we have also “explained” what it does: at every step it balances the goals of finding a solution that is good for the past and that has large entropy.
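The resulting algorithm is just “exponential weights,” and its regret can be checked numerically against the standard bound of 2√(T ln n) for losses in [0, 1] (a sketch; the step size is the usual optimized choice, and since the losses here are random rather than adversarial the observed regret is typically far below the worst-case bound):

```python
import numpy as np

rng = np.random.default_rng(4)
T, n = 2000, 10
eps = np.sqrt(np.log(n) / T)    # the usual optimized step size

cum = np.zeros(n)                # cumulative losses of the experts
alg_loss = 0.0
for t in range(T):
    # FTRL with the scaled negative-entropy regularizer has the closed form
    # x_i proportional to exp(-eps * cumulative loss of expert i):
    w = np.exp(-eps * cum)
    x = w / np.sum(w)
    l = rng.random(n)            # adversary: losses in [0, 1]
    alg_loss += np.dot(x, l)
    cum += l

regret = alg_loss - np.min(cum)
# For losses in [0, 1]: regret <= eps*T + ln(n)/eps = 2*sqrt(T ln n).
assert regret <= 2 * np.sqrt(T * np.log(n))
```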

Now it remains to bound, at each time step,

For this, it is convenient to return to the notation that we used in describing the multiplicative weights algorithm, that is, it is convenient to work with the weights defined as

so that, at each time step

We are assuming , so the weights are non-increasing with time. Then

For every we have , so

and

Putting it all together, we have

Choosing , we have

Thus, we have reconstructed the analysis of the multiplicative weights algorithm.

Interestingly, the analysis that we derived today is not exactly identical to the one from the post on multiplicative weights. There, we derived the bound

while here, setting , we derived

where is the offline optimum and is the entropy function (computed using natural logarithms).

**3. L2 Regularization **

Now that we have a general method, let us apply it to a new context: suppose that, as before, our cost functions are linear, but let . With linear cost functions and no bound on the size of solutions, it will not be possible to talk about regret with respect to the offline optimum, because the offline optimum will always be , but it will be possible to talk about regret with respect to a particular offline solution , which will already lead to interesting consequences.

What regularizer should we use? In reasoning about regularizers, it can be helpful to think about what would go wrong if we use FTL, and then considering what regularizer would successfully “pull away” from the bad solutions found by FTL. In this context of linear loss functions and unbounded solutions, FTL will pick an infinitely big solution at each step, or, to be more precise, the “max” in the definition of FTL is undefined. To fight this tendency of FTL to go off to infinity, it makes sense for the regularizer to be a measure of how big a solution is. Since we are going to have to compute derivatives, it is good to use a measure of “bigness” with a nice gradient, and is a natural choice. So, for a scale parameter to be optimized later, our regularizer will be

This tells us that

and

The function that we are minimizing in the above expression is convex, so we just have to compute the gradient and set it to zero

Which can be also expressed as

This makes perfect sense because, in the “experts” interpretation, we want to penalize the experts that performed badly in the past. Here we have no constraints on our allocations, so we simply decrease (additively this time, not multiplicatively) the allocation to the experts that caused a higher loss.
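The additive update, together with the standard FTRL regret bound regret(u) ≤ R(u) − R(x₁) + Σ ℓ_t·(x_t − x_{t+1}), can be checked numerically (a sketch; the comparator u and the scale c are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(5)
T, d, c = 500, 4, 10.0

x = np.zeros(d)                 # x_1 = argmin_x c ||x||^2 = 0
total = np.zeros(d)
alg_loss, loss_vecs = 0.0, []
for t in range(T):
    l = rng.standard_normal(d)
    alg_loss += np.dot(l, x)
    total += l
    loss_vecs.append(l)
    x = -total / (2 * c)        # closed form; equivalently x <- x - l/(2c)

u = -total / np.linalg.norm(total)   # an arbitrary fixed comparator point
regret_u = alg_loss - np.dot(total, u)
# Since x_t - x_{t+1} = l_t/(2c), the FTRL bound reads
# regret(u) <= c||u||^2 + sum_t ||l_t||^2 / (2c).
bound = c * np.dot(u, u) + sum(np.dot(l, l) for l in loss_vecs) / (2 * c)
assert regret_u <= bound + 1e-9
```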

To compute the regret bound, we have

and so the regret with respect to a solution is

If we know a bound

then we can optimize and we have

**3.1. Dealing with Constraints**

Consider now the case in which the loss functions are linear and is an arbitrary convex set. Using the same regularizer we have the algorithm

How can we solve the above constrained optimization problem? A very helpful observation is that we can first solve the unconstrained optimization and then project on , that is, we can proceed as follows:

and we claim that we always have . The fact that we can reduce a regularized constrained optimization problem to an unconstrained problem and a projection is part of a broader theory that we will describe in a later post. For now, we will limit ourselves to proving the equivalence in this specific setting. First of all, we already have an expression for , namely

Now the definition of is

In order to bound the regret, we have to compute

and since L2 projections cannot increase L2 distances, we have

So the regret bound is

If is an upper bound to , and is an upper bound to the norm of all the loss vectors, then

which can be optimized to

**3.2. Deriving the Analysis of Gradient Descent**

Suppose that is a convex function whose gradient is well defined at all points in , and that we are interested in minimizing . Then a way to reduce this problem to online optimization would be to use the function as loss function at each step. Then the offline optimum would be the minimizer of , and achieving small regret means that is close to the minimum of , and so the best is an approximate minimizer.

Unfortunately, this is not a very helpful idea, because if we ran an FTRL algorithm against an adversary that keeps proposing as a cost function at each step then we would have

which, for large , is essentially the same problem as minimizing , so we have basically reduced the problem of minimizing to itself.

Indeed, the power of the FTRL algorithm is that the algorithm does well even though it does not know the cost function, and if we keep using the same cost function at each step we are not making good use of its power. Now, suppose that we use cost functions such that

Then, after steps, we have

meaning

and so one of the is an approximate minimizer. Indeed, using convexity, we also have

and so the average of the is also an approximate minimizer. From the point of view of exploiting FTRL to minimize , cost functions as above work just as well as presenting as the cost function at each step.

How do we find cost functions that satisfy the above two properties and for which the FTRL algorithm is easy to implement? The idea is to let be the linear approximation of at :

The condition is immediate, and

is a consequence of the convexity of .
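Both properties of the linearization are easy to check numerically; this snippet (names and the example objective are mine) verifies, for f(x) = x², that the affine function agrees with f at the point of linearization and underestimates f everywhere else:

```python
def f(x):
    return x * x

def grad_f(x):
    return 2 * x

def linearization(x_t):
    """The affine cost function f_t(x) = f(x_t) + <grad f(x_t), x - x_t>."""
    return lambda x: f(x_t) + grad_f(x_t) * (x - x_t)

f_t = linearization(3.0)
assert f_t(3.0) == f(3.0)  # f_t agrees with f at x_t
assert all(f_t(x) <= f(x) for x in [-2.0, 0.0, 1.0, 5.0])  # f_t underestimates f
```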

The cost functions that we have defined are affine functions, that is, each of them equals a constant plus a linear function

Adding a constant term to a cost function does not change the iteration of FTRL, and does not change the regret (because the same term is added both to the solution found by the algorithm and to the offline optimum), so the algorithm is just initialized with

and then continues with the update rules

which is just projected gradient descent.
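In code, the procedure is the familiar "take a gradient step, then project back onto the feasible set" loop. Here is a self-contained sketch (the function names and the toy objective are mine), minimizing f(x) = (x - 2)² over the interval [-1, 1]:

```python
def projected_gradient_descent(grad_f, project, x0, eta, steps):
    """Projected gradient descent: x <- project(x - eta * grad_f(x))."""
    x = x0
    for _ in range(steps):
        x = project(x - eta * grad_f(x))
    return x

# Minimize f(x) = (x - 2)^2 over K = [-1, 1]; the constrained minimizer is x = 1.
x_min = projected_gradient_descent(
    grad_f=lambda x: 2 * (x - 2),
    project=lambda x: max(-1.0, min(1.0, x)),
    x0=0.0,
    eta=0.25,
    steps=50,
)
```

Starting from 0, the first gradient step lands exactly on the boundary point x = 1, and the projection keeps every later iterate there.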

If we have known upper bounds

and

then we have

which means that to achieve additive error it is enough to proceed for steps.

The method will yield constructions that are optimal in terms of the size of the pseudorandom set, but not very efficient, although there is at least one case (getting an “almost pairwise independent” pseudorandom generator) in which the method does something that I am not sure how to replicate with other techniques.

Mostly, the point of this post is to illustrate a concept that will reoccur in more interesting contexts: that we can use an online optimization algorithm in order to construct a combinatorial object satisfying certain desired properties. The idea is to run a game between a “builder” and an “inspector,” in which the inspector runs the online optimization algorithm with the goal of finding a violated property in what the builder is building, and the builder plays the role of the adversary selecting the cost functions, with the advantage that it gets to build a piece of the construction after seeing what property the “inspector” is looking for. By the regret analysis of the online optimization problem, if the builder did well at each round against the inspector, then it will also do well against the “offline optimum” that looks for a violated property after seeing the whole construction. For example, the construction of graph sparsifiers by Allen-Zhu, Liao and Orecchia can be cast in this framework.

(In some other applications, it will be the “builder” that runs the algorithm and the “inspector” who plays the role of the adversary. This will be the case of the Frieze-Kannan weak regularity lemma and of the Impagliazzo hard-core lemma. In those cases we capitalize on the fact that we know that there is a very good offline optimum, and we keep going for as long as the adversary is able to find violated properties in what the builder is constructing. After a sufficiently large number of rounds, the regret experienced by the algorithm would exceed the general regret bound, so the process must terminate in a small number of rounds. I have been told that this is just the “dual view” of what I described in the previous paragraph.)

But, back to the pseudorandom sets: if is a collection of boolean functions , for example the functions computed by circuits of a certain type and a certain size, then a multiset is -pseudorandom for if, for every , we have

That is, sampling uniformly from , which we can do with random bits, is as good as sampling uniformly from , which requires bits, as far as the functions in are concerned.

It is easy to use Chernoff bounds and union bounds to argue that there is such a set of size , so that we can sample from it using only random bits.
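The existence argument can be checked empirically on a toy family (this illustration is entirely mine: the family of dictator functions and their complements, the sample size, and the seed are arbitrary choices). A random multiset of a few thousand points already makes every function in the family nearly balanced:

```python
import itertools
import random

n = 4
cube = list(itertools.product([0, 1], repeat=n))
# A small family F: the n "dictator" functions x -> x[i] and their complements.
functions = [lambda x, i=i: x[i] for i in range(n)] + \
            [lambda x, i=i: 1 - x[i] for i in range(n)]

# Chernoff + union bound: a random multiset of size O(log|F| / eps^2)
# is eps-pseudorandom for F with high probability.
random.seed(0)
S = [random.choice(cube) for _ in range(2000)]

for f in functions:
    avg = sum(f(x) for x in S) / len(S)
    assert abs(avg - 0.5) <= 0.1  # each f has mean 1/2 under the uniform distribution
```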

We will prove this result (while also providing an “algorithm” for the construction) using multiplicative weights.

First of all, possibly by changing to , we may assume that for every function the function is also in . This simplifies things a bit because then the pseudorandom condition is equivalent to just

We will make up an “experts” setup in which there is an expert for each function . Thus, the algorithm, at each step, comes up with a probability distribution over the functions, which we can think of as a “probabilistic function.” At time , the adversary chooses a string and defines the cost function

where the adversary chooses an such that . At this point, the reader should try, without reading ahead, to establish:

- That such a choice of is always possible;
- That the cost function is of the form , where the loss vector satisfies , so that the regret after steps is ;
- That the sequence of choices by the adversary determines a -pseudorandom multiset for , and, in particular, we get an -pseudorandom multiset of cardinality

For the first point, note that for a random we have

so there is an such that

For the second point we just have to inspect the definition, and for the last point we have, by construction

so the regret bound is

which, after dividing by , is

Consider now the application of constructing a small-support distribution over that is -almost-pairwise-independent, meaning that if is a random string sampled according to this distribution, then, for every , the marginal is -close to the uniform distribution over in total variation distance. This is the same thing as asking for a small-support distribution that is -pseudorandom for all functions that depend on only two input variables. There are only such functions, so the above construction gives us a pseudorandom distribution that is uniform over a set of size , meaning that the distribution can be sampled using random bits. Furthermore, the algorithm can be implemented to run in time .

The only tricky step is how to find the string at each step. For a string , the loss obtained by choosing as the “reference string” is a polynomial of degree 2 in the bits of , so we can find a no-worse-than-average using the method of conditional expectations.

I am not sure if there is a more standard way of doing this construction, perhaps one in which the bit of the -th string in the sample space can be generated in time . The standard approach is to combine a small-bias generator with a linear family of pairwise independent hash functions, but even using Ta-Shma’s construction of small-bias generators we would not get the correct dependency on .

This framework can “derandomize Chernoff bounds” in other settings as well, such as randomized rounding of packing and covering integer linear programs, and it is basically the same thing as the method of “pessimistic estimators” described in the Motwani-Raghavan book on randomized algorithms.
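The builder/inspector game above can be sketched end to end on a toy instance (this is entirely my own instantiation: the family of dictator functions, the parameters, and the helper names are made up, and the builder finds its string by brute-force search over the whole cube rather than by the method of conditional expectations):

```python
import itertools
import numpy as np

def mw_pseudorandom_multiset(functions, mus, points, T, eps):
    """Build a multiset S of points on which every f in `functions`
    (a family assumed closed under complement, with uniform means `mus`)
    has an average close to its mean.

    The "inspector" runs multiplicative weights over the functions; the
    "builder" picks, at each step, a point on which the inspector's
    current mixture of functions does at least as well as on a uniformly
    random point (such a point exists by averaging)."""
    n_f = len(functions)
    w = np.ones(n_f)
    S = []
    for _ in range(T):
        p = w / w.sum()
        # Builder: choose x maximizing sum_f p(f) * (f(x) - mu_f) >= 0.
        x = max(points,
                key=lambda x: sum(p[i] * (functions[i](x) - mus[i])
                                  for i in range(n_f)))
        S.append(x)
        # Inspector: penalize functions that are already over-satisfied,
        # shifting weight to functions that still look violated.
        losses = np.array([f(x) - mu for f, mu in zip(functions, mus)])
        w = w * np.exp(-eps * losses)
    return S

n = 4
points = list(itertools.product([0, 1], repeat=n))
# F: the n "dictator" functions x -> x[i] and their complements.
functions = [lambda x, i=i: x[i] for i in range(n)] + \
            [lambda x, i=i: 1 - x[i] for i in range(n)]
mus = [0.5] * (2 * n)
S = mw_pseudorandom_multiset(functions, mus, points, T=400, eps=0.1)
```

By the regret analysis, every function in the family ends up with an average over S within a small additive error of its true mean 1/2.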

The *multiplicative weights* or *hedge* algorithm is the most well known and most frequently rediscovered algorithm in online optimization.

The problem it solves is usually described in the following language: we want to design an algorithm that makes the best possible use of the advice coming from self-described experts. At each time step , the algorithm has to decide with what probability to follow the advice of each of the experts, that is, the algorithm has to come up with a probability distribution where and . After the algorithm makes this choice, it is revealed that following the advice of expert at time leads to loss , so that the expected loss of the algorithm at time is . A loss can be negative, in which case its absolute value can be interpreted as a profit.

After steps, the algorithm “regrets” that it did not just always follow the advice of the expert that, with hindsight, was the best one, so that the regret of the algorithm after steps is
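In symbols (notation mine, following the description above: p_t(i) is the probability the algorithm places on expert i at time t, and ℓ_t(i) is that expert's loss), the regret is

```latex
\mathrm{Regret}_T \;=\; \sum_{t=1}^T \sum_{i=1}^n p_t(i)\,\ell_t(i) \;-\; \min_{1 \le i \le n} \sum_{t=1}^T \ell_t(i)
```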

This corresponds to the instantiation of the framework we described in the previous post to the special case in which the set of feasible solutions is the set of probability distributions over the sample space and in which the loss functions are linear functions of the form . In order to bound the regret, we also have to bound the “magnitude” of the loss functions, so in the following we will assume that for all and all we have , and otherwise we can scale everything by a known upper bound on .

We now describe the algorithm.

The algorithm maintains at each step a vector of *weights* which is initialized as . The algorithm performs the following operations at time :

That is, the weight of expert at time is , and the probability of following the advice of expert at time is proportional to the weight. The parameter is hardwired into the algorithm and we will optimize it later. Note that the algorithm gives higher weight to experts that produced small losses (or negative losses of large absolute value) in the past, and thus puts higher probability on such experts.
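As a minimal sketch of the algorithm (the function name `hedge` is mine, and I use the exponential form of the weight update, which matches the Gibbs-distribution interpretation discussed later in the post):

```python
import numpy as np

def hedge(losses, eps):
    """Multiplicative weights: start from w_1 = (1, ..., 1), play the
    distribution p_t proportional to w_t, then update
    w_{t+1}(i) = w_t(i) * exp(-eps * l_t(i))."""
    n = losses.shape[1]
    w = np.ones(n)
    alg_loss = 0.0
    for l in losses:
        p = w / w.sum()
        alg_loss += float(p @ l)   # expected loss of the algorithm at this step
        w = w * np.exp(-eps * l)   # penalize experts that incurred higher loss
    return alg_loss, w / w.sum()

# Expert 0 always incurs loss 0, expert 1 always incurs loss 1.
losses = np.array([[0.0, 1.0]] * 100)
alg_loss, p = hedge(losses, eps=0.1)
```

After 100 steps almost all the probability mass sits on the better expert, and the algorithm's total loss exceeds that of the best expert (zero here) only by an amount controlled by the regret bound.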

We will prove the following bound.

Theorem 1 Assuming that for all and we have , for every , after steps the multiplicative weights algorithm experiences a regret that is always bounded as

In particular, if , by setting we achieve a regret bound

We will start by giving a short proof of the above theorem.

For each time step , define the quantity

We want to prove that, roughly speaking, the only way for an adversary to make the algorithm incur a large loss is to produce a sequence of loss functions such that *even the best expert incurs a large loss*. The proof will work by showing that if the algorithm incurs a large loss after steps, then is small, and that if is small, then even the best expert incurs a large loss.

Let us define

to be the loss of the best expert. Then we have

Lemma 2 (If is small, then is large)

*Proof:* Let be an index such that . Then we have

Lemma 3 (If the loss of the algorithm is large then is small)

where is the vector whose -th coordinate is

*Proof:* Since we know that , it is enough to prove that, for every , we have

where we used the definitions of our quantities and the fact that for .

Using the fact that for all , the above lemmas can be restated as

and

which together imply

as desired.

Personally, I find all of the above very unsatisfactory, because both the algorithm and the analysis, but especially the analysis, seem to come out of nowhere. In fact, I never felt that I actually understood this analysis until I saw it presented as a special case of the *Follow The Regularized Leader* framework that we will discuss in a future post. (We will actually prove a slightly weaker bound, but with a much more satisfying proof.)

Here is, however, a story of how a statistical physicist might have invented the algorithm and might have come up with the analysis. Let’s call the loss caused by expert after steps the *energy* of expert at time :

Note that we have defined it in such a way that the algorithm knows at time . Our offline optimum is the energy of the lowest-energy expert at time , that is, the energy of the *ground state* at time . When we have a collection of numbers , a nice lower bound to their minimum is

which is true for every . The right-hand side above is the *free energy* at temperature at time . This seems like the kind of expression that we could use to bound the offline optimum, so let’s give it a name

In terms of coming up with an algorithm, all that we have got to work with at time are the losses of the experts at times . If the adversary chooses to make one of the experts consistently much better than the others, it is clear that, in order to get any reasonable regret bound, the algorithm will have to put much of the probability mass in most of the steps on that expert. This suggests that the algorithm should put higher probability on experts that have done well in the first steps, that is, on “lower-energy” experts. When we have a system in which, at time , state has energy , a standard distribution that puts higher probability on lower energy states is the *Gibbs distribution* at temperature , defined as

where the denominator above is also called the *partition function* at time

So far we have “rediscovered” our multiplicative weights algorithm, and the quantity that we had in our analysis gets interpreted as the partition function . The fact that bounds the offline optimum suggests that we should use as a potential function, and aim for an analysis involving a telescoping sum. Indeed, some manipulations (the same as in the short proof above, but now more mechanical) give that the loss of the algorithm at time is

which telescopes to give

Recalling that

and

we have again

As mentioned above, we will give a better story when we get to the *Follow The Regularized Leader* framework. In the next post, we will discuss complexity-theory consequences of the result we just proved.

A major reason for the exodus of the middle class from San Francisco, demographers say, is the high cost of housing, the highest in the mainland United States. Last month, the median cost of a dwelling in the San Francisco Standard Metropolitan Statistical Area was $129,000, according to the Federal Home Loan Bank Board in Washington, D.C. The comparable figure for New York, Newark and Jersey City was $90,400, and for Los Angeles, the second most expensive city, $118,400.

“This city dwarfs anything I’ve ever seen in terms of housing prices,” said Mr. Witte. Among the factors contributing to high housing costs, according to Mr. Witte and others, are its relative scarcity, since the number of housing units has not grown significantly in a decade; the influx of Asians, whose first priority is usually to buy a home; the high incidence of adults with good incomes and no children, particularly homosexuals who pool their incomes to buy homes; and the desirability of San Francisco as a place to live.

$129,000 in 1981 dollars is $360,748 in 2019 dollars.

Again, the algorithm has to come up with a solution *without knowing what cost functions it is supposed to be optimizing*. Furthermore, we will think of the sequence of cost functions not as being fixed in advance and unknown to the algorithm, but as being dynamically generated by an adversary, after seeing the solutions provided by the algorithm. (This resilience to adaptive adversaries will be important in most of the applications.)

The *offline optimum* after steps is the total cost that the best possible fixed solution would have incurred when evaluated against the cost functions seen by the algorithm, that is, it is a solution to

The *regret* after steps is the difference between the loss suffered by the algorithm and the offline optimum, that is,

The remarkable results that we will review give algorithms that achieve regret

that is, for fixed and , the regret-per-time-step goes to zero with the number of steps, as . It is intuitive that our bounds will have to depend on how big is the “diameter” of and how large is the “magnitude” and “smoothness” of the functions , but depending on how we choose to formalize these quantities we will be led to define different algorithms.


1. The Barak-Hardt-Kale proof of the Impagliazzo hard-core lemma.
2. The online convex optimization viewpoint on the Frieze-Kannan weak regularity lemma, on the dense model theorem of (RTTV), and on the abstract weak regularity lemma of (TTV) that were described to me by Madhur Tulsiani a few years ago. Furthermore, I wanted to see if Russell Impagliazzo’s subsequent improvements to the dense model theorem and to the abstract weak regularity lemma could be recovered from this point of view.
3. The Arora-Kale algorithms for semidefinite programming, including their nearly linear-time algorithm for approximating the Goemans-Williamson relaxation of Max Cut.
4. The meaning of the sentence “multiplicative weights and gradient descent are both special cases of follow-the-regularized-leader, using negative entropy and the squared L2 norm as regularizers, respectively.”
5. The Allen-Zhu-Liao-Orecchia online optimization proof of the Batson-Spielman-Srivastava sparsification result.

I am happy to say that, except for the “furthermore” part of (2), I achieved my goals. To digest this material a bit better, I came up with the rather ambitious plan of writing a series of posts, in which I would alternate between (i) explaining a notion or theorem from online convex optimization (at a level that someone learning about optimization or machine learning might find useful) and (ii) explaining a complexity-theoretic application. Now that a very intense Spring semester is almost over, I plan to get started, although it is not clear that I will see the plan through to the end. So stay tuned for the forthcoming first episode, which will be about the good old multiplicative weights algorithm.
