In this post we return to the generic form of the FTRL online optimization algorithm. If the cost functions are linear, as they will be in all the applications that I plan to talk about, the algorithm is:
where is the convex set of feasible solutions that the algorithm is allowed to produce, is the linear loss function at time , and is the strictly convex regularizer.
If we have an unconstrained problem, that is, if , then the optimization problem (1) has a unique solution: the such that
and we can usually both compute efficiently in an algorithm and reason about effectively in an analysis.
Unfortunately, we are almost always interested in constrained settings, and then it becomes difficult both to compute and to reason about it.
A very nice special case happens when the regularizer acts as a barrier function for , that is, the (norm of the) gradient of goes to infinity when one approaches the boundary of . In such a case, it is impossible for the minimum of (1) to occur at the boundary and the solution will be again the unique in such that
We swept this point under the rug when we studied FTRL with negative-entropy regularizer in the settings of experts, in which is the set of probability distributions. When we proceeded to solve (1) using Lagrange multipliers, we ignored the non-negativity constraints. The reason why it was ok to do so was that the negative-entropy is a barrier function for the non-negative orthant .
Another important special case occurs when the regularizer is a multiple of length-squared. In this case, we saw that we could “decouple” the optimization problem by first solving the unconstrained optimization problem, and then projecting the solution of the unconstrained problem to :
Then we have the closed-form solution and, depending on the set , the projection might also have a nice closed-form, as in the case that comes up in results related to regularity lemmas.
As we will see today, this approach of solving the unconstrained problem and then projecting on works for every regularizer, for an appropriate notion of projection called the Bregman projection (the projection will depend on the regularizer).
To define the Bregman projection, we will first define the Bregman divergence with respect to the regularizer , which is a non-negative “distance” defined on (or possibly a subset of for which the regularizer is a barrier function). Then, the Bregman projection of on is defined as .
Unfortunately, it is not so easy to reason about Bregman projections either, but the notion of Bregman divergence offers a way to reinterpret the FTRL algorithm from another point of view, called mirror descent. Via this reinterpretation, we will prove the regret bound
which carries the intuition that the regret comes from a combination of the “distance” of our initial solution from the offline optimum and of the “stability” of the algorithm, that is, the “distance” between consecutive soltuions. Nicely, the above bound measures both quantities using the same “distance” function.
1. Bregman Divergence and Bregman Projection
For a strictly convex function , we define the Bregman divergence associated to as
that is, the difference between the value of at and the value of the linear approximation of at (centered at ). By the strict convexity of we have and iff . These properties suggest that we may think of as a kind of “distance” between and , which is a useful intuition although it is important to keep in mind that the divergence need not be symmetric and need not satisfy the triangle inequality.
Now we show that, assuming that is well defined and strictly convex on all , and that the losses are linear, the constrained optimization problem (1) can be solved by first solving the unconstrained problem and then “projecting” the solution on by finding the point in of smallest Bregman divergence from the unconstrained optimum:
The proof is very simple. The optimum of the unconstrained optimization problem is the unique such that
that is, the unique such that
On the other hand, is defined as
where the second equality above follows from the fact that two functions that differ by a constant have the same optimal solutions.
Indeed we see that the above “decoupled” characterization of the FTRL algorithm would have worked for any definition of a function of the form
and that our particular choice of what “stuff dependent only on ” to add makes which is reasonable for something that we want to think of as a “distance function.”
Note that, in all of the above, we can replace with a convex set provided that is a barrier function for . In that case
is the unique such that
and everything else follows analogously.
2.1. Bregman Divergence of Length-Squared
If , then
so Bregman divergence is distance-squared, and Bregman projection is just (Euclidean) projection.
2.2. Bregman Divergence of Negative Entropy
If, for , we define
then the associated Bregman divergence is the generalized KL divergence.
where so that
Note that, if and are probability distributions, then the final two terms above cancel out, leaving just the KL divergence .
3. Mirror Descent
We now introduce a new perspective on FTRL.
In the unconstrained setting, if is a strictly convex function and is the associated Bregman divergence, the mirror descent algorithm for online optimization has the update rule
The idea is that we want to find a solution that is good for the past loss functions, but that does not “overfit” too much. If, in past steps, had been chosen to be such a solution for the loss functions , then, in choosing , we want to balance staying close to but also doing well with respect to , hence the above definition.
Theorem 1 Initialized with , the unconstrained mirror descent algorithm is identical to FTRL with regularizer .
Proof: We will proceed by induction on . At , the definition of is the same. For larger , we know that FTRL will choose the unique such that , so we will assume that this is true for the mirror descent algorithm for and prove it for .
First, we note that the function is strictly convex, because it equals
and so it is a sum of a strictly convex function , linear functions in , and constants independent of . This means that is the unique point at which the gradient of the above function is zero, that is,
and, using the inductive hypothesis, we have
In the constrained case, there are two variants of mirror descent. Using the terminology from Elad Hazan’s survey, agile mirror descent is the natural generalization of the unconstrained algorithm:
Following the same steps as the proof in the previous section, it is possible to show that agile mirror descent is equivalent to solving, at each iteration, the “decoupled” optimization problems
That is, we can first solve the unconstrained problem and then project on . (Again, we can always replace by a set for which is a barrier function and such that .)
The lazy mirror descent algorithm has the update rule
The initialization is
Fact 2 Lazy mirror descent is equivalent to FTRL.
Proof: The solutions are the unconstrained optimum of FTRL, and is the Bregman projection of on . We proved in the previous section that this characterizes constrained FTRL.
What about agile mirror descent? In certain special cases it is equivalent to lazy mirror descent, and hence to FTRL, but it usually leads to a different set of solutions.
We will provide an analysis of lazy mirror descent, but first we will give an analysis of the regret of unconstrained FTRL in terms of Bregman divergence, which will be the model on which we will build the proof for the constrained case.
4. A Regret Bound for FTRL in Terms of Bregman Divergence
In this section we prove the following regret bound.
Theorem 3 Unconstrained FTRL with regularizer satisfies the regret bound
where is the Bregman divergence associated with .
We will take the mirror descent view of unconstrained FTRL, so that
We proved that
This means that we can rewrite the regret suffered at step with respect to as
and the theorem follows by adding up the above expression for and recalling that .
Unfortunately I have no geometric intuition about the above identity, although, as you can check yourself, the algebra works neatly.
5. A Regret Bound for Agile Mirror Descent
In this section we prove the following generalization of the regret bound from the previous section.
Theorem 4 Agile mirror descent satisfies the regret bound
The first part of the update rule of agile mirror descent is
and, following steps that we have already carried out before, satisfies
This means that we can rewrite the regret suffered at step with respect to as
where the same mystery cancellations as before make the above identity true.
Now I will wield another piece of magic, and I will state without proof the following fact about Bregman projections
Lemma 5 If and is the Bregman projection on of a point , then
That is, if we think of as a “distance,” the distance from to its closest point in plus the distance from to is at most the distance from to . Note that this goes in the opposite direction as the triangle inequality (which ok, because typically does not satisfy the triangle inequality).
In particular, the above lemma gives us
Now summing over and recalling that we have our theorem.