# Online Optimization Post 5: Bregman Projections and Mirror Descent

In this post we return to the generic form of the FTRL online optimization algorithm. If the cost functions are linear, as they will be in all the applications that I plan to talk about, the algorithm is:

$\displaystyle x_t := \arg\min_{x\in K} \ R(x) + \sum_{k=1}^{t-1} \langle \ell_k, x \rangle \ \ \ \ \ (1)$

where $K\subseteq {\mathbb R}^n$ is the convex set of feasible solutions that the algorithm is allowed to produce, ${x \rightarrow \langle \ell_k , x \rangle}$ is the linear loss function at time ${k}$, and ${R: K \rightarrow {\mathbb R}}$ is the strictly convex regularizer.

If we have an unconstrained problem, that is, if ${K= {\mathbb R}^n}$, then the optimization problem (1) has a unique solution: the ${x_t}$ such that

$\displaystyle \nabla R(x_t) = - \sum_{k=1}^{t-1} \ell_k$

and we can usually both compute ${x_t}$ efficiently in an algorithm and reason about ${x_t}$ effectively in an analysis.

Unfortunately, we are almost always interested in constrained settings, and then it becomes difficult both to compute ${x_t}$ and to reason about it.

A very nice special case happens when the regularizer ${R}$ acts as a barrier function for ${K}$, that is, the (norm of the) gradient of ${R}$ goes to infinity when one approaches the boundary of ${K}$. In such a case, it is impossible for the minimum of (1) to occur at the boundary and the solution will be again the unique ${x_t}$ in ${K}$ such that

$\displaystyle \nabla R(x_t) = - \sum_{k=1}^{t-1} \ell_k$

We swept this point under the rug when we studied FTRL with negative-entropy regularizer in the settings of experts, in which ${K = \Delta}$ is the set of probability distributions. When we proceeded to solve (1) using Lagrange multipliers, we ignored the non-negativity constraints. The reason why it was ok to do so was that the negative-entropy is a barrier function for the non-negative orthant ${({\mathbb R}_{\geq 0})^n}$.

Another important special case occurs when the regularizer ${R(x) = c || x||^2}$ is a multiple of length-squared. In this case, we saw that we could “decouple” the optimization problem by first solving the unconstrained optimization problem, and then projecting the solution of the unconstrained problem to ${K}$:

$\displaystyle y_{t} = \arg\min_{y\in {\mathbb R}^n} \ c || y||^2 + \sum_{k=1}^{t-1} \langle \ell_k, y \rangle$

$\displaystyle x_t = \Pi_K (y_t) = \arg\min _{x\in K} || x - y_t ||$

Then we have the closed-form solution ${y_t = - \frac 1{2c} \sum_{k=1}^{t-1} \ell _k}$ and, depending on the set ${K}$, the projection might also have a nice closed-form, as in the case ${K= [0,1]^n}$ that comes up in results related to regularity lemmas.

As we will see today, this approach of solving the unconstrained problem and then projecting on ${K}$ works for every regularizer, for an appropriate notion of projection called the Bregman projection (the projection will depend on the regularizer).

To define the Bregman projection, we will first define the Bregman divergence with respect to the regularizer ${R}$, which is a non-negative “distance” ${D(x,y)}$ defined on ${{\mathbb R}^n}$ (or possibly a subset of ${{\mathbb R}^n}$ for which the regularizer ${R}$ is a barrier function). Then, the Bregman projection of ${y}$ on ${K}$ is defined as ${\arg\min_{x\in K} \ D(x,y)}$.

Unfortunately, it is not so easy to reason about Bregman projections either, but the notion of Bregman divergence offers a way to reinterpret the FTRL algorithm from another point of view, called mirror descent. Via this reinterpretation, we will prove the regret bound

$\displaystyle {\rm Regret}_T(x) \leq D(x,x_1) + \sum_{t=1}^T D(x_t,y_{t+1})$

which carries the intuition that the regret comes from a combination of the “distance” of our initial solution from the offline optimum and of the “stability” of the algorithm, that is, the “distance” between consecutive soltuions. Nicely, the above bound measures both quantities using the same “distance” function.

# Online Optimization Post 4: Regularity Lemmas

We now discuss how to view proofs of certain regularity lemmas as applications of the FTRL methodology.

The Szemeredi Regularity Lemma states (in modern language) that every dense graph is well approximate by a graph with a very simple structure, made of the (edge-disjoint) union of a constant number of weighted complete bipartite subgraphs. The notion of approximation is a bit complicated to describe, but it enables the proof of counting lemmas, which show that, for example, the number of triangles in the original graph is well approximated by the (appropriately weighted) number of triangles in the approximating graph.

Analogous regularity lemmas, in which an arbitrary object is approximated by a low-complexity object, have been proved for hypergraphs, for subsets of abelian groups (for applications to additive combinatorics), in an analytic setting (for applications to graph limits) and so on.

The weak regularity lemma of Frieze and Kannan provides, as the name suggests, a weaker kind of approximation than the one promised by Szemeredi’s lemma, but one that is achievable with a graph that has a much smaller number of pieces. If ${\epsilon}$ is the “approximation error” that one is willing to tolerate, Szemeredi’s lemma constructs a graph that is the union of a ${2^{2^{\vdots}}}$ weighted complete bipartite subgraphs where the height of the tower of exponentials is polynomial in ${1/\epsilon}$. In the Frieze-Kannan construction, that number is cut down to a single exponential ${2^{O(1/\epsilon^2)}}$. This result too can be generalized to graph limits, subsets of groups, and so on.

With Tulsiani and Vadhan, we proved an abstract version of the Frieze-Kannan lemma (which can be applied to graphs, functions, distributions, etc.) in which the “complexity” of the approximation is ${O(1/\epsilon^2)}$. In the graph case, the approximating graph is still the union of ${2^{O(1/\epsilon^2)}}$ complete bipartite subgraphs, but it has a more compact representation. One consequence of this result is that for every high-min-entropy distribution ${\cal D}$, there is an efficiently samplable distribution with the same min-entropy as ${\cal D}$, that is indistinguishable from ${\cal D}$. Such a result could be taken to be a proof that what GANs attempt to achieve is possible in principle, except that our result requires an unrealistically high entropy (and we achieve “efficient samplability” and “indistinguishability” only in a weak sense).

All these results are proved with a similar strategy: one starts from a trivial approximator, for example the empty graph, and then repeats the following iteration: if the current approximator achieves the required approximation, then we are done; otherwise take a counterexample, and modify the approximator using the counterexample. Then one shows that:

• The number of iterations is bounded, by keeping track of an appropriate potential function;
• The “complexity” of the approximator does not increase too much from iteration to iteration.

Typically, the number of iterations is ${O(1/\epsilon^2)}$, and the difference between the various results is given by whether at each iteration the “complexity” increases exponentially, or by a multiplicative factor, or by an additive term.

Like in the post on pseudorandom constructions, one can view such constructions as an online game between a “builder” and an “inspector,” except that now the online optimization algorithm will play the role of the builder, and the inspector is the one acting as an adversary. The ${O(1/\epsilon^2)}$ bound on the number of rounds comes from the fact that the online optimization algorithms that we have seen so far achieve amortized error per round ${O(1/\sqrt T)}$ after ${T}$ rounds, so it takes ${O(1/\epsilon^2)}$ rounds for the error bound to go below ${\epsilon}$.

We will see that the abstract weak regularity lemma of my paper with Tulsiani and Vadhan (and hence the graph weak regularity lemma of Frieze and Kannan) can be immediately deduced from the theory developed in the previous post.

When I was preparing these notes, I was asked by several people if the same can be done for Szemeredi’s lemma. I don’t see a natural way of doing that. For such results, one should maybe use the online optimization techniques as a guide rather than as a black box. In general, iterative arguments (in which one constructs an object through a series of improvements) require the choice of a potential function, and an argument about how much the potential function changes at every step. The power of the FTRL method is that it creates the potential function and a big part of the analysis automatically and, even where it does not work directly, it can serve as an inspiration.

One could imagine a counterfactual history in which people first proved the weak regularity lemma using online optimization out of the box, as we do in this post, and then decided to try and use an L2 potential function and an iterative method to get the Szemeredi lemma, subsequently trying to see what happens if the potential function is entropy, thus discovering Jacob Fox’s major improvement on the “triangle removal lemma,” which involves the construction of an approximator that just approximates the number of triangles.

The multiplicative weights algorithm is simple to define and analyze, and it has several applications, but both its definition and its analysis seem to come out of nowhere. We mentioned that all the quantities arising in the algorithm and its analysis have statistical physics interpretations, but even this observation brings up more questions than it answers. The Gibbs distribution, for example, does put more weight on lower-energy states, and so it makes sense in an optimization setting, but to get good approximations one wants to use lower temperatures, while the distributions used by the multiplicative weights algorithms have temperature ${1/\epsilon}$, where ${2\epsilon}$ is the final “amortized” regret bound, so that one uses, quite counterintuitively, higher temperatures for better approximations.

Furthermore, it is not clear how we would generalize the ideas of multiplicative weights to the case in which the set of feasible solutions ${K}$ is anything other than the set of distributions.

Today we discuss the “Follow the Regularized Leader” method, which provides a framework to design and analyze online algorithms in a versatile and well-motivated way. We will then see how we can “discover” the definition and analysis of multiplicative weights, and how to “discover” another online algorithm which can be seen as a generalization of projected gradient descent (that is, one can derive the projected gradient descent algorithm and its analysis from this other online algorithm).

# Online Optimization Post 2: Constructing Pseudorandom Sets

Today we will see how to use the analysis of the multiplicative weights algorithm in order to construct pseudorandom sets.

The method will yield constructions that are optimal in terms of the size of the pseudorandom set, but not very efficient, although there is at least one case (getting an “almost pairwise independent” pseudorandom generator) in which the method does something that I am not sure how to replicate with other techniques.

Mostly, the point of this post is to illustrate a concept that will reoccur in more interesting contexts: that we can use an online optimization algorithm in order to construct a combinatorial object satisfying certain desired properties. The idea is to run a game between a “builder” against an “inspector,” in which the inspector runs the online optimization algorithm with the goal of finding a violated property in what the builder is building, and the builder plays the role of the adversary selecting the cost functions, with the advantage that it gets to build a piece of the construction after seeing what property the “inspector” is looking for. By the regret analysis of the online optimization problem, if the builder did well at each round against the inspector, then it will do well also against the “offline optimum” that looks for a violated property after seeing the whole construction. For example, the construction of graph sparsifiers by Allen-Zhu, Liao and Orecchia can be cast in this framework.

(In some other applications, it will be the “builder” that runs the algorithm and the “inspector” who plays the role of the adversary. This will be the case of the Frieze-Kannan weak regularity lemma and of the Impagliazzo hard-core lemma. In those cases we capitalize on the fact that we know that there is a very good offline optimum, and we keep going for as long as the adversary is able to find violated properties in what the builder is constructing. After a sufficiently large number of rounds, the regret experienced by the algorithm would exceed the general regret bound, so the process must terminate in a small number of rounds. I have been told that this is just the “dual view” of what I described in the previous paragraph.)

But, back the pseudorandom sets: if ${{\cal C} = \{ C_1,\ldots,C_N \}}$ is a collection of boolean functions ${C_i : \{ 0,1 \}^n \rightarrow \{ 0,1 \}}$, for example the functions computed by circuits of a certain type and a certain size, then a multiset ${S\subseteq \{ 0,1 \}^n}$ is ${\epsilon}$-pseudorandom for ${\cal C}$ if, for every ${C_i \in \cal C}$, we have

$\displaystyle | \mathop{\mathbb P}_{u \sim \{ 0,1 \}^n} [ C_i (u) =1] - \mathop{\mathbb P}_{s \sim S} [C_i(s) = 1 ] | \leq \epsilon$

That is, sampling uniformly from ${S}$, which we can do with ${\log_2 |S|}$ random bits, is as good as sampling uniformly from ${\{ 0,1 \}^n}$, which requires ${n}$ bits, as far as the functions in ${\cal C}$ are concerned.

It is easy to use Chernoff bounds and union bounds to argue that there is such a set of size ${O((\log N)/\epsilon^2)}$, so that we can sample from it using only ${\log\log N + 2\log \frac 1 \epsilon + O(1)}$ random bits.

We will prove this result (while also providing an “algorithm” for the construction) using multiplicative weights.

# Online Optimization Post 1: Multiplicative Weights

The multiplicative weights or hedge algorithm is the most well known and most frequently rediscovered algorithm in online optimization.

The problem it solves is usually described in the following language: we want to design an algorithm that makes the best possible use of the advice coming from ${n}$ self-described experts. At each time step ${t=1,2,\ldots}$, the algorithm has to decide with what probability to follow the advice of each of the experts, that is, the algorithm has to come up with a probability distribution ${x_t = (x_t(1),\ldots,x_t(n))}$ where ${x_t (i) \geq 0}$ and ${\sum_{i=1}^n x_t(i)=1}$. After the algorithm makes this choice, it is revealed that following the advice of expert ${i}$ at time ${t}$ leads to loss ${\ell_t (i)}$, so that the expected loss of the algorithm at time ${t}$ is ${\sum_{i=1}^n x_t(i) \ell_t (i)}$. A loss can be negative, in which case its absolute value can be interpreted as a profit.

After ${T}$ steps, the algorithm “regrets” that it did not just always follow the advice of the expert that, with hindsight, was the best one, so that the regret of the algorithm after ${T}$ steps is

$\displaystyle {\rm Regret}_T = \left( \sum_{t=1}^T\sum_{i=1}^n x_t(i) \ell_t(i) \right) - \left( \min_{i=1,\ldots,n} \ \ \sum_{t=1}^T \ell_t(i) \right)$

This corresponds to the instantiation of the framework we described in the previous post to the special case in which the set of feasible solutions ${K}$ is the set ${\Delta \subseteq {\mathbb R}^n}$ of probability distributions over the sample space ${\{ 1,\ldots,n\}}$ and in which the loss functions ${f_t (x)}$ are linear functions of the form ${f_t (x) = \sum_i x(i) \ell_t (i)}$. In order to bound the regret, we also have to bound the “magnitude” of the loss functions, so in the following we will assume that for all ${t}$ and all ${i}$ we have ${| \ell_t (i) | \leq 1}$, and otherwise we can scale everything by a known upper bound on ${\max_{t,i} |\ell_t |}$.

We now describe the algorithm.

The algorithm maintains at each step ${t}$ a vector of weights ${w_t = (w_t(1),\ldots,w_t(n))}$ which is initialized as ${w_1 := (1,\ldots,1)}$. The algorithm performs the following operations at time ${t}$:

• ${w_t (i) := w_{t-1} (i) \cdot e^{-\epsilon \ell_{t-1} (i) }}$
• ${x_t (i) := \displaystyle \frac {w_t (i) }{\sum_{j=1}^n w_t(j) }}$

That is, the weight of expert ${i}$ at time ${t}$ is ${e^{-\epsilon \sum_{k=1}^{t-1} \ell_k (i)}}$, and the probability ${x_t(i)}$ of following the advice of expert ${i}$ at time ${t}$ is proportional to the weight. The parameter ${\epsilon>0}$ is hardwired into the algorithm and we will optimize it later. Note that the algorithm gives higher weight to experts that produced small losses (or negative losses of large absolute value) in the past, and thus puts higher probability on such experts.

We will prove the following bound.

Theorem 1 Assuming that for all ${t}$ and ${i}$ we have ${| \ell_t(i) | \leq 1}$, for every ${0 < \epsilon < 1/2}$, after ${T}$ steps the multiplicative weight algorithm experiences a regret that is always bounded as

$\displaystyle {\rm Regret}_T \leq \epsilon \sum_{t=1}^T \sum_{i=1}^n x_t(i) \ell^2 _t (i) + \frac {\ln n}{\epsilon} \leq \epsilon T + \frac {\ln n}{\epsilon}$

In particular, if ${T > 4 \ln n}$, by setting ${\epsilon = \sqrt{\frac{\ln n}{T}}}$ we achieve a regret bound

$\displaystyle {\rm Regret}_T \leq 2 \sqrt{T \ln n}$

# Online Optimization Post 0: Definitions

Online convex optimization deals with the following setup: we want to design an algorithm that, at each discrete time step ${t=1,2,\ldots}$, comes up with a solution ${x_t \in K}$, where ${K}$ is a certain convex set of feasible solution. After the algorithm has selected its solution ${x_t}$, a convex cost function ${f_t : K \rightarrow {\mathbb R}}$, coming from a known restricted set of admissible cost functions ${{\cal F}}$, is revealed, and the algorithm pays the loss ${f_t (x_t)}$.

Again, the algorithm has to come up with a solution without knowing what cost functions it is supposed to be optimizing. Furthermore, we will think of the sequence of cost functions ${f_1,f_2, \ldots,f_t,\ldots}$ not as being fixed in advanced and unknown to the algorithm, but as being dynamically generated by an adversary, after seeing the solutions provided by the algorithm. (This resilience to adaptive adversaries will be important in most of the applications.)

The offline optimum after ${T}$ steps is the total cost that the best possible fixed solution would have incurred when evaluated against the cost functions seen by the algorithm, that is, it is a solution to

$\displaystyle \min_{x\in K} \ \ \sum_{t=1}^T f_t (x)$

The regret after ${T}$ steps is the difference between the loss suffered by the algorithm and the offline optimum, that is,

$\displaystyle {\rm Regret}_T = \sum_{t=1}^T f_t (x_t) - \min_{x\in K} \ \ \sum_{t=1}^T f_t (x)$

The remarkable results that we will review give algorithms that achieve regret

$\displaystyle {\rm Regret}_T \leq O_{K, {\cal F}} (\sqrt T)$

that is, for fixed ${K}$ and ${{\cal F}}$, the regret-per-time-step goes to zero with the number of steps, as ${O\left( \frac 1 {\sqrt T} \right)}$. It is intuitive that our bounds will have to depend on how big is the “diameter” of ${K}$ and how large is the “magnitude” and “smoothness” of the functions ${f\in {\cal F}}$, but depending on how we choose to formalize these quantities we will be led to define different algorithms.

# Online Optimization for Complexity Theorists

Last year I took some time off to study online convex optimization in some detail. The reason for doing that was similar to the reason why at some point I took time off to study spectral graph theory: it was coming up in several papers that I wanted to understand, and I felt that I was missing out by not mastering an important tool. In particular, I wanted to understand:

1. The Barak-Hardt-Kale proof of the Impagliazzo hard-core lemma.
2. The online convex optimization viewpoint on the Frieze-Kannan weak regularity lemma, on the dense model theorem of (RTTV), and on the abstract weak regularity lemma of (TTV) that were described to me by Madhur Tulsiani a few years ago. Furthermore, I wanted to see if Russel Impagliazzo’s subsequent improvements to the dense model theorem and to the abstract weak regularity lemma could be recovered from this point of view.
3. The Arora-Kale algorithms for semidefinite programming, including their nearly linear-time algorithm for approximating the Goemans-Williamson relaxation of Max Cut.
4. The meaning of the sentence “multiplicative weights and gradient descent are both special cases of follow-the-regularized-leader, using negative entropy and ${\ell_2^2}$ as regularizer, respectively.”
5. The AllenZhu-Liao-Orecchia online optimization proof of the Batson-Spielman-Srivastava sparsification result.

I am happy to say that, except for the “furthermore” part of (2), I achieved my goals. To digest this material a bit better, I came up with the rather ambitious plan of writing a series of posts, in which I would alternate between (i) explaining a notion or theorem from online convex optimization (at a level that someone learning about optimization or machine learning might find useful) and (ii) explaining a complexity-theoretic application. Now that a very intense Spring semester is almost over, I plan to get started on this plan, although it is not clear that I will see it through the end. So stay tuned for the forthcoming first episode, which will be about the good old multiplicative weights algorithm.