In this post we return to the generic form of the FTRL online optimization algorithm. If the cost functions are linear, as they will be in all the applications that I plan to talk about, the algorithm is:

where is the convex set of feasible solutions that the algorithm is allowed to produce, is the linear loss function at time , and is the strictly convex regularizer.

If we have an unconstrained problem, that is, if , then the optimization problem (1) has a unique solution: the such that

and we can usually both compute efficiently in an algorithm and reason about effectively in an analysis.

Unfortunately, we are almost always interested in constrained settings, and then it becomes difficult both to compute and to reason about it.

A very nice special case happens when the regularizer acts as a *barrier function* for , that is, the (norm of the) gradient of goes to infinity when one approaches the boundary of . In such a case, it is impossible for the minimum of (1) to occur at the boundary and the solution will be again the unique in such that

We swept this point under the rug when we studied FTRL with negative-entropy regularizer in the settings of experts, in which is the set of probability distributions. When we proceeded to solve (1) using Lagrange multipliers, we ignored the non-negativity constraints. The reason why it was ok to do so was that the negative-entropy is a barrier function for the non-negative orthant .

Another important special case occurs when the regularizer is a multiple of length-squared. In this case, we saw that we could “decouple” the optimization problem by first solving the unconstrained optimization problem, and then projecting the solution of the unconstrained problem to :

Then we have the closed-form solution and, depending on the set , the projection might also have a nice closed-form, as in the case that comes up in results related to regularity lemmas.

As we will see today, this approach of solving the unconstrained problem and then projecting on works for every regularizer, for an appropriate notion of projection called the *Bregman projection* (the projection will depend on the regularizer).

To define the Bregman projection, we will first define the *Bregman divergence* with respect to the regularizer , which is a non-negative “distance” defined on (or possibly a subset of for which the regularizer is a barrier function). Then, the Bregman projection of on is defined as .

Unfortunately, it is not so easy to reason about Bregman projections either, but the notion of Bregman divergence offers a way to reinterpret the FTRL algorithm from another point of view, called *mirror descent*. Via this reinterpretation, we will prove the regret bound

which carries the intuition that the regret comes from a combination of the “distance” of our initial solution from the offline optimum and of the “stability” of the algorithm, that is, the “distance” between consecutive soltuions. Nicely, the above bound measures both quantities using the same “distance” function.

**1. Bregman Divergence and Bregman Projection **

For a strictly convex function , we define the *Bregman divergence* associated to as

that is, the difference between the value of at and the value of the linear approximation of at (centered at ). By the strict convexity of we have and iff . These properties suggest that we may think of as a kind of “distance” between and , which is a useful intuition although it is important to keep in mind that the divergence need not be symmetric and need not satisfy the triangle inequality.

Now we show that, assuming that is well defined and strictly convex on all , and that the losses are linear, the constrained optimization problem (1) can be solved by first solving the unconstrained problem and then “projecting” the solution on by finding the point in of smallest Bregman divergence from the unconstrained optimum:

The proof is very simple. The optimum of the unconstrained optimization problem is the unique such that

that is, the unique such that

On the other hand, is defined as

that is,

where the second equality above follows from the fact that two functions that differ by a constant have the same optimal solutions.

Indeed we see that the above “decoupled” characterization of the FTRL algorithm would have worked for any definition of a function of the form

and that our particular choice of what “stuff dependent only on ” to add makes which is reasonable for something that we want to think of as a “distance function.”

Note that, in all of the above, we can replace with a convex set provided that is a barrier function for . In that case

is the unique such that

and everything else follows analogously.

**2. Examples **

** 2.1. Bregman Divergence of Length-Squared **

If , then

so Bregman divergence is distance-squared, and Bregman projection is just (Euclidean) projection.

** 2.2. Bregman Divergence of Negative Entropy **

If, for , we define

then the associated Bregman divergence is the generalized *KL divergence.*

where so that

Note that, if and are probability distributions, then the final two terms above cancel out, leaving just the KL divergence .

**3. Mirror Descent **

We now introduce a new perspective on FTRL.

In the unconstrained setting, if is a strictly convex function and is the associated Bregman divergence, the *mirror descent* algorithm for online optimization has the update rule

The idea is that we want to find a solution that is good for the past loss functions, but that does not “overfit” too much. If, in past steps, had been chosen to be such a solution for the loss functions , then, in choosing , we want to balance staying close to but also doing well with respect to , hence the above definition.

Theorem 1Initialized with , the unconstrained mirror descent algorithm is identical to FTRL with regularizer .

*Proof:* We will proceed by induction on . At , the definition of is the same. For larger , we know that FTRL will choose the unique such that , so we will assume that this is true for the mirror descent algorithm for and prove it for .

First, we note that the function is strictly convex, because it equals

and so it is a sum of a strictly convex function , linear functions in , and constants independent of . This means that is the unique point at which the gradient of the above function is zero, that is,

and so

and, using the inductive hypothesis, we have

as desired.

In the constrained case, there are two variants of mirror descent. Using the terminology from Elad Hazan’s survey, *agile* mirror descent is the natural generalization of the unconstrained algorithm:

Following the same steps as the proof in the previous section, it is possible to show that agile mirror descent is equivalent to solving, at each iteration, the “decoupled” optimization problems

That is, we can first solve the unconstrained problem and then project on . (Again, we can always replace by a set for which is a barrier function and such that .)

The *lazy* mirror descent algorithm has the update rule

The initialization is

Fact 2Lazy mirror descent is equivalent to FTRL.

*Proof:* The solutions are the unconstrained optimum of FTRL, and is the Bregman projection of on . We proved in the previous section that this characterizes constrained FTRL.

What about agile mirror descent? In certain special cases it is equivalent to lazy mirror descent, and hence to FTRL, but it usually leads to a different set of solutions.

We will provide an analysis of lazy mirror descent, but first we will give an analysis of the regret of unconstrained FTRL in terms of Bregman divergence, which will be the model on which we will build the proof for the constrained case.

**4. A Regret Bound for FTRL in Terms of Bregman Divergence **

In this section we prove the following regret bound.

Theorem 3Unconstrained FTRL with regularizer satisfies the regret bound

where is the Bregman divergence associated with .

We will take the mirror descent view of unconstrained FTRL, so that

We proved that

This means that we can rewrite the regret suffered at step with respect to as

and the theorem follows by adding up the above expression for and recalling that .

Unfortunately I have no geometric intuition about the above identity, although, as you can check yourself, the algebra works neatly.

**5. A Regret Bound for Agile Mirror Descent **

In this section we prove the following generalization of the regret bound from the previous section.

Theorem 4Agile mirror descent satisfies the regret bound

The first part of the update rule of agile mirror descent is

and, following steps that we have already carried out before, satisfies

This means that we can rewrite the regret suffered at step with respect to as

where the same mystery cancellations as before make the above identity true.

Now I will wield another piece of magic, and I will state without proof the following fact about Bregman projections

Lemma 5If and is the Bregman projection on of a point , then

That is, if we think of as a “distance,” the distance from to its closest point in plus the distance from to is at most the distance from to . Note that this goes in the opposite direction as the triangle inequality (which ok, because typically does not satisfy the triangle inequality).

In particular, the above lemma gives us

and so

Now summing over and recalling that we have our theorem.