In this post we return to the generic form of the FTRL online optimization algorithm. If the cost functions are linear, as they will be in all the applications that I plan to talk about, the algorithm is:

$$x_t := \arg\min_{x \in K} \ R(x) + \sum_{k=1}^{t-1} \langle \ell_k, x \rangle \ \ \ \ \ (1)$$

where $K \subseteq \mathbb{R}^n$ is the convex set of feasible solutions that the algorithm is allowed to produce, $x \mapsto \langle \ell_t, x \rangle$ is the linear loss function at time $t$, and $R$ is the strictly convex regularizer.
If we have an unconstrained problem, that is, if $K = \mathbb{R}^n$, then the optimization problem (1) has a unique solution: the $x_t$ such that

$$\nabla R(x_t) = -\sum_{k=1}^{t-1} \ell_k$$

and we can usually both compute $x_t$ efficiently in an algorithm and reason about it effectively in an analysis.
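For concreteness, here is a minimal Python sketch of this closed form for the quadratic regularizer $R(x) = c\|x\|^2$, where the condition above reads $2c\, x_t = -\sum_{k=1}^{t-1} \ell_k$ (the loss vectors are made-up values, for illustration only):

```python
import numpy as np

def unconstrained_ftrl(losses, c=1.0):
    """Unconstrained FTRL with R(x) = c * ||x||^2: the condition
    grad R(x_t) = -sum_{k<t} l_k gives x_t = -(1/(2c)) sum_{k<t} l_k."""
    cumulative = np.zeros_like(losses[0])
    iterates = []
    for l in losses:
        iterates.append(-cumulative / (2 * c))  # x_t, chosen before seeing l_t
        cumulative += l
    return iterates

rng = np.random.default_rng(0)
losses = [rng.standard_normal(3) for _ in range(5)]
for t, x_t in enumerate(unconstrained_ftrl(losses), start=1):
    print(t, x_t)
```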
Unfortunately, we are almost always interested in constrained settings, and then it becomes difficult both to compute $x_t$ and to reason about it.
A very nice special case happens when the regularizer acts as a barrier function for $K$, that is, the (norm of the) gradient of $R$ goes to infinity when one approaches the boundary of $K$. In such a case, it is impossible for the minimum of (1) to occur at the boundary, and the solution will again be the unique $x_t$ in $K$ such that

$$\nabla R(x_t) = -\sum_{k=1}^{t-1} \ell_k$$
We swept this point under the rug when we studied FTRL with the negative-entropy regularizer in the setting of experts, in which $K = \Delta$ is the set of probability distributions. When we proceeded to solve (1) using Lagrange multipliers, we ignored the non-negativity constraints. The reason why it was ok to do so was that the negative entropy is a barrier function for the non-negative orthant $(\mathbb{R}_{\geq 0})^n$.
Another important special case occurs when the regularizer $R(x) = c\|x\|^2$ is a multiple of length-squared. In this case, we saw that we could “decouple” the optimization problem by first solving the unconstrained optimization problem, and then projecting the solution of the unconstrained problem to $K$:

$$y_t := \arg\min_{y \in \mathbb{R}^n} \ R(y) + \sum_{k=1}^{t-1} \langle \ell_k, y \rangle$$

$$x_t := \arg\min_{x \in K} \ \| x - y_t \|$$

Then we have the closed-form solution $y_t = -\frac{1}{2c} \sum_{k=1}^{t-1} \ell_k$ and, depending on the set $K$, the projection might also have a nice closed form, as in the case $K = \{ x : \|x\| \leq 1 \}$ that comes up in results related to regularity lemmas.
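A minimal sketch of this decoupling, assuming $K$ is the unit Euclidean ball, so that the projection is $y / \max\{1, \|y\|\}$:

```python
import numpy as np

def project_unit_ball(y):
    """Euclidean projection onto K = {x : ||x|| <= 1}."""
    norm = np.linalg.norm(y)
    return y if norm <= 1.0 else y / norm

def ftrl_quadratic_ball(losses, c=1.0):
    """FTRL over the unit ball with R(x) = c * ||x||^2: first solve the
    unconstrained problem in closed form, then project onto K."""
    cumulative = np.zeros_like(losses[0])
    iterates = []
    for l in losses:
        y_t = -cumulative / (2 * c)          # unconstrained optimum
        iterates.append(project_unit_ball(y_t))
        cumulative += l
    return iterates

print(ftrl_quadratic_ball([np.array([3.0, 0.0]), np.array([0.0, -3.0])]))
```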
As we will see today, this approach of solving the unconstrained problem and then projecting on $K$ works for every regularizer, for an appropriate notion of projection called the Bregman projection (the projection will depend on the regularizer).
To define the Bregman projection, we will first define the Bregman divergence with respect to the regularizer $R$, which is a non-negative “distance” $D(x,y)$ defined on $\mathbb{R}^n$ (or possibly a subset of $\mathbb{R}^n$ for which the regularizer $R$ is a barrier function). Then, the Bregman projection of $y$ on $K$ is defined as $\Pi_K(y) := \arg\min_{x \in K} D(x,y)$.
Unfortunately, it is not so easy to reason about Bregman projections either, but the notion of Bregman divergence offers a way to reinterpret the FTRL algorithm from another point of view, called mirror descent. Via this reinterpretation, we will prove the regret bound

$$\text{Regret}_T(x) \leq D(x, x_1) + \sum_{t=1}^T D(x_t, x_{t+1})$$

which carries the intuition that the regret comes from a combination of the “distance” of our initial solution from the offline optimum and of the “stability” of the algorithm, that is, the “distance” between consecutive solutions. Nicely, the above bound measures both quantities using the same “distance” function.
1. Bregman Divergence and Bregman Projection
For a strictly convex function $R$, we define the Bregman divergence associated to $R$ as

$$D(x,y) := R(x) - R(y) - \langle \nabla R(y), x - y \rangle$$

that is, the difference between the value of $R$ at $x$ and the value of the linear approximation of $R$ at $x$ (centered at $y$). By the strict convexity of $R$ we have $D(x,y) \geq 0$, and $D(x,y) = 0$ iff $x = y$. These properties suggest that we may think of $D(x,y)$ as a kind of “distance” between $x$ and $y$, which is a useful intuition, although it is important to keep in mind that the divergence need not be symmetric and need not satisfy the triangle inequality.
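The definition translates directly into code; here is a small sketch (the helper name is illustrative), sanity-checked against the case $R(x) = \|x\|^2$, whose divergence we compute in Section 2.1 below:

```python
import numpy as np

def bregman_divergence(R, grad_R, x, y):
    """D(x, y) = R(x) - R(y) - <grad R(y), x - y>."""
    return R(x) - R(y) - np.dot(grad_R(y), x - y)

# Sanity check with R(x) = ||x||^2, for which D(x, y) = ||x - y||^2.
R = lambda x: np.dot(x, x)
grad_R = lambda x: 2.0 * x
x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(bregman_divergence(R, grad_R, x, y))   # 9.25
print(np.dot(x - y, x - y))                  # 9.25
```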
Now we show that, assuming that $R$ is well defined and strictly convex on all of $\mathbb{R}^n$, and that the losses are linear, the constrained optimization problem (1) can be solved by first solving the unconstrained problem and then “projecting” the solution on $K$ by finding the point in $K$ of smallest Bregman divergence from the unconstrained optimum:

$$y_t := \arg\min_{y \in \mathbb{R}^n} \ R(y) + \sum_{k=1}^{t-1} \langle \ell_k, y \rangle$$

$$x_t := \arg\min_{x \in K} \ D(x, y_t)$$
The proof is very simple. The optimum of the unconstrained optimization problem is the unique $y_t$ such that

$$\nabla R(y_t) + \sum_{k=1}^{t-1} \ell_k = 0$$

that is, the unique $y_t$ such that

$$\nabla R(y_t) = -\sum_{k=1}^{t-1} \ell_k$$

On the other hand, $x_t$ is defined as

$$x_t := \arg\min_{x \in K} \ R(x) + \sum_{k=1}^{t-1} \langle \ell_k, x \rangle$$

that is,

$$x_t = \arg\min_{x \in K} \ R(x) - \langle \nabla R(y_t), x \rangle = \arg\min_{x \in K} \ R(x) - R(y_t) - \langle \nabla R(y_t), x - y_t \rangle = \arg\min_{x \in K} \ D(x, y_t)$$

where the second equality above follows from the fact that two functions that differ by a constant have the same optimal solutions.
Indeed we see that the above “decoupled” characterization of the FTRL algorithm would have worked for any definition of a function of the form

$$D(x,y) = R(x) - \langle \nabla R(y), x \rangle + (\text{stuff dependent only on } y)$$

and that our particular choice of what “stuff dependent only on $y$” to add makes $D(x,y) \geq 0$, with $D(x,x) = 0$, which is reasonable for something that we want to think of as a “distance function.”
Note that, in all of the above, we can replace $\mathbb{R}^n$ with a convex set $S \supseteq K$ provided that $R$ is a barrier function for $S$. In that case $y_t$ is the unique point of $S$ such that

$$\nabla R(y_t) = -\sum_{k=1}^{t-1} \ell_k$$

and everything else follows analogously.
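A quick numerical sketch of this decoupling in the experts setting, assuming the standard facts that for $R(x) = c\sum_i x_i \ln x_i$ on $S = (\mathbb{R}_{>0})^n$ the unconstrained optimum is $y_{t,i} = e^{-1 - L_i/c}$, where $L := \sum_{k=1}^{t-1} \ell_k$, and that the Bregman projection onto the simplex is normalization (both consistent with Section 2.2 below):

```python
import numpy as np

c = 0.5
L = np.array([0.3, 1.2, 0.7])   # made-up cumulative losses sum_{k<t} l_k

# Unconstrained optimum over the positive orthant:
# grad R(y)_i = c(1 + ln y_i) = -L_i  =>  y_i = exp(-1 - L_i / c).
y_t = np.exp(-1.0 - L / c)

# Bregman (KL) projection onto the simplex is normalization.
x_decoupled = y_t / y_t.sum()

# Direct constrained FTRL over the simplex: the softmax weights
# from the earlier experts posts.
w = np.exp(-L / c)
x_direct = w / w.sum()

print(np.allclose(x_decoupled, x_direct))   # True
```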
2. Examples
2.1. Bregman Divergence of Length-Squared
If $R(x) = c\|x\|^2$, then

$$D(x,y) = c\|x\|^2 - c\|y\|^2 - \langle 2cy, x - y \rangle = c\|x\|^2 + c\|y\|^2 - 2c \langle x, y \rangle = c\|x - y\|^2$$

so Bregman divergence is distance-squared, and Bregman projection is just (Euclidean) projection.
2.2. Bregman Divergence of Negative Entropy
If, for $x \in (\mathbb{R}_{>0})^n$, we define

$$R(x) = c \sum_i x_i \ln x_i$$

then the associated Bregman divergence is the generalized KL divergence

$$D(x,y) = c \sum_i x_i \ln x_i - c \sum_i y_i \ln y_i - \sum_i c(1 + \ln y_i)(x_i - y_i)$$

where $\nabla R(y)_i = c(1 + \ln y_i)$, so that

$$D(x,y) = c \sum_i x_i \ln \frac{x_i}{y_i} - c \sum_i x_i + c \sum_i y_i$$

Note that, if $x$ and $y$ are probability distributions, then the final two terms above cancel out, leaving just the KL divergence $c \sum_i x_i \ln \frac{x_i}{y_i}$.
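In code, with an illustrative helper name:

```python
import numpy as np

def generalized_kl(x, y, c=1.0):
    """Bregman divergence of R(x) = c * sum_i x_i ln x_i on the positive
    orthant: c * sum_i [ x_i ln(x_i / y_i) - x_i + y_i ]."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return c * np.sum(x * np.log(x / y) - x + y)

# For probability distributions the last two terms cancel, leaving KL(x||y).
p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.4, 0.2])
print(generalized_kl(p, q), np.sum(p * np.log(p / q)))  # equal
```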
3. Mirror Descent
We now introduce a new perspective on FTRL.
In the unconstrained setting, if $R$ is a strictly convex function and $D$ is the associated Bregman divergence, the mirror descent algorithm for online optimization has the update rule

$$x_t := \arg\min_x \ D(x, x_{t-1}) + \langle \ell_{t-1}, x \rangle$$

The idea is that we want to find a solution that is good for the past loss functions, but that does not “overfit” too much. If, in past steps, $x_{t-1}$ had been chosen to be such a solution for the loss functions $\ell_1, \ldots, \ell_{t-2}$, then, in choosing $x_t$, we want to balance staying close to $x_{t-1}$ but also doing well with respect to $\ell_{t-1}$, hence the above definition.
Theorem 1 Initialized with $x_1 := \arg\min_x R(x)$, the unconstrained mirror descent algorithm is identical to FTRL with regularizer $R$.
Proof: We will proceed by induction on $t$. At $t = 1$, the definition of $x_1$ is the same. For larger $t$, we know that FTRL will choose the unique $x_t$ such that $\nabla R(x_t) = -\sum_{k=1}^{t-1} \ell_k$, so we will assume that this is true for the mirror descent algorithm for $t-1$ and prove it for $t$.

First, we note that the function $x \mapsto D(x, x_{t-1}) + \langle \ell_{t-1}, x \rangle$ is strictly convex, because it equals

$$R(x) - R(x_{t-1}) - \langle \nabla R(x_{t-1}), x - x_{t-1} \rangle + \langle \ell_{t-1}, x \rangle$$

and so it is a sum of a strictly convex function $R(x)$, linear functions in $x$, and constants independent of $x$. This means that $x_t$ is the unique point at which the gradient of the above function is zero, that is,

$$\nabla R(x_t) - \nabla R(x_{t-1}) + \ell_{t-1} = 0$$

and so

$$\nabla R(x_t) = \nabla R(x_{t-1}) - \ell_{t-1}$$

and, using the inductive hypothesis $\nabla R(x_{t-1}) = -\sum_{k=1}^{t-2} \ell_k$, we have

$$\nabla R(x_t) = -\sum_{k=1}^{t-1} \ell_k$$

as desired. $\Box$
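The equivalence is easy to check numerically; here is a minimal sketch with $R(x) = \|x\|^2$, so that $D(x,y) = \|x-y\|^2$ and the mirror descent update solves $2(x_t - x_{t-1}) + \ell_{t-1} = 0$:

```python
import numpy as np

rng = np.random.default_rng(1)
losses = [rng.standard_normal(3) for _ in range(6)]

x_md = np.zeros(3)                  # x_1 = argmin R(x) = 0
cumulative = np.zeros(3)
for l in losses:
    x_ftrl = -cumulative / 2.0      # FTRL: grad R(x_t) = -sum_{k<t} l_k
    assert np.allclose(x_md, x_ftrl)
    x_md = x_md - l / 2.0           # mirror descent: x_t = x_{t-1} - l_{t-1}/2
    cumulative += l
print("mirror descent and FTRL produce the same iterates")
```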
In the constrained case, there are two variants of mirror descent. Using the terminology from Elad Hazan’s survey, agile mirror descent is the natural generalization of the unconstrained algorithm:

$$x_t := \arg\min_{x \in K} \ D(x, x_{t-1}) + \langle \ell_{t-1}, x \rangle$$

Following the same steps as the proof in the previous section, it is possible to show that agile mirror descent is equivalent to solving, at each iteration, the “decoupled” optimization problems

$$y_t := \arg\min_{y \in \mathbb{R}^n} \ D(y, x_{t-1}) + \langle \ell_{t-1}, y \rangle$$

$$x_t := \arg\min_{x \in K} \ D(x, y_t)$$

That is, we can first solve the unconstrained problem and then project on $K$. (Again, we can always replace $\mathbb{R}^n$ by a set $S$ for which $R$ is a barrier function and such that $K \subseteq S$.)
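For the negative-entropy regularizer $R(x) = c \sum_i x_i \ln x_i$ and $K$ the simplex, the unconstrained step gives $y_{t,i} = x_{t-1,i}\, e^{-\ell_{t-1,i}/c}$ and the KL projection is normalization, so agile mirror descent is just multiplicative weights with a per-step renormalization; a minimal sketch, assuming those standard facts:

```python
import numpy as np

def agile_md_simplex(losses, c=1.0):
    """Agile mirror descent over the simplex with R(x) = c sum_i x_i ln x_i.
    Unconstrained step: y_{t,i} = x_{t-1,i} * exp(-l_{t-1,i} / c);
    Bregman (KL) projection onto the simplex: normalization."""
    n = losses[0].shape[0]
    x = np.full(n, 1.0 / n)          # argmin of R over the simplex
    iterates = [x]
    for l in losses:
        y = x * np.exp(-l / c)       # unconstrained mirror descent step
        x = y / y.sum()              # Bregman projection onto the simplex
        iterates.append(x)
    return iterates

rng = np.random.default_rng(5)
print(agile_md_simplex([rng.uniform(0, 1, 3) for _ in range(4)])[-1])
```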
The lazy mirror descent algorithm has the update rule

$$y_t := \arg\min_y \ D(y, y_{t-1}) + \langle \ell_{t-1}, y \rangle$$

$$x_t := \arg\min_{x \in K} \ D(x, y_t)$$

The initialization is

$$y_1 := \arg\min_y R(y) \ , \ \ \ x_1 := \arg\min_{x \in K} D(x, y_1)$$
Fact 2 Lazy mirror descent is equivalent to FTRL.
Proof: The solutions $y_t$ are, by Theorem 1, the unconstrained optima of FTRL, and $x_t$ is the Bregman projection of $y_t$ on $K$. We proved in the previous section that this characterizes constrained FTRL. $\Box$
What about agile mirror descent? In certain special cases it is equivalent to lazy mirror descent, and hence to FTRL, but it usually leads to a different set of solutions.
We will provide an analysis of agile mirror descent, but first we will give an analysis of the regret of unconstrained FTRL in terms of Bregman divergence, which will be the model on which we will build the proof for the constrained case.
4. A Regret Bound for FTRL in Terms of Bregman Divergence
In this section we prove the following regret bound.
Theorem 3 Unconstrained FTRL with regularizer $R$ satisfies the regret bound

$$\forall x : \ \ \text{Regret}_T(x) \leq D(x, x_1) + \sum_{t=1}^T D(x_t, x_{t+1})$$

where $D$ is the Bregman divergence associated with $R$.
We will take the mirror descent view of unconstrained FTRL, so that

$$x_{t+1} = \arg\min_x \ D(x, x_t) + \langle \ell_t, x \rangle$$

We proved that

$$\nabla R(x_{t+1}) = \nabla R(x_t) - \ell_t$$

This means that we can rewrite the regret suffered at step $t$ with respect to $x$ as

$$\langle \ell_t, x_t - x \rangle = \langle \nabla R(x_t) - \nabla R(x_{t+1}), x_t - x \rangle = D(x, x_t) - D(x, x_{t+1}) + D(x_t, x_{t+1})$$

and the theorem follows by adding up the above expression for $t = 1, \ldots, T$ and recalling that $D(x, x_{T+1}) \geq 0$.
Unfortunately I have no geometric intuition about the above identity, although, as you can check yourself, the algebra works neatly.
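If you would rather have the computer check it, here is a sketch with the negative-entropy regularizer, for which the update $\nabla R(x_{t+1}) = \nabla R(x_t) - \ell_t$ becomes $x_{t+1,i} = x_{t,i} e^{-\ell_{t,i}}$ (the points and losses are arbitrary made-up values):

```python
import numpy as np

def D(a, b):
    """Generalized KL: the Bregman divergence of R(x) = sum_i x_i ln x_i."""
    return np.sum(a * np.log(a / b) - a + b)

rng = np.random.default_rng(2)
x_t = rng.uniform(0.1, 1.0, 4)      # arbitrary positive points
x = rng.uniform(0.1, 1.0, 4)
l_t = rng.standard_normal(4)

x_next = x_t * np.exp(-l_t)         # grad R(x_{t+1}) = grad R(x_t) - l_t

lhs = np.dot(l_t, x_t - x)
rhs = D(x, x_t) - D(x, x_next) + D(x_t, x_next)
print(np.isclose(lhs, rhs))         # True
```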
5. A Regret Bound for Agile Mirror Descent
In this section we prove the following generalization of the regret bound from the previous section.
Theorem 4 Agile mirror descent satisfies the regret bound

$$\forall x \in K : \ \ \text{Regret}_T(x) \leq D(x, x_1) + \sum_{t=1}^T D(x_t, y_{t+1})$$
The first part of the update rule of agile mirror descent is

$$y_{t+1} := \arg\min_y \ D(y, x_t) + \langle \ell_t, y \rangle$$

and, following steps that we have already carried out before, $y_{t+1}$ satisfies

$$\nabla R(y_{t+1}) = \nabla R(x_t) - \ell_t$$

This means that we can rewrite the regret suffered at step $t$ with respect to $x$ as

$$\langle \ell_t, x_t - x \rangle = \langle \nabla R(x_t) - \nabla R(y_{t+1}), x_t - x \rangle = D(x, x_t) - D(x, y_{t+1}) + D(x_t, y_{t+1})$$

where the same mystery cancellations as before make the above identity true.
Now I will wield another piece of magic, and I will state without proof the following fact about Bregman projections.

Lemma 5 If $x \in K$ and $z := \arg\min_{z' \in K} D(z', y)$ is the Bregman projection on $K$ of a point $y$, then

$$D(x, z) + D(z, y) \leq D(x, y)$$

That is, if we think of $D$ as a “distance,” the distance from $x$ to its closest point $z$ in $K$, plus the distance from $z$ to $y$, is at most the distance from $x$ to $y$. Note that this goes in the opposite direction of the triangle inequality (which is ok, because $D$ typically does not satisfy the triangle inequality).
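Lemma 5 is also easy to test numerically. In the following sketch $K$ is the simplex and $D$ is the generalized KL divergence, so the projection is normalization; in this particular case the inequality even holds with equality, since the simplex is the intersection of an affine hyperplane with the domain:

```python
import numpy as np

def D(a, b):
    """Generalized KL divergence (Bregman divergence of sum_i x_i ln x_i)."""
    return np.sum(a * np.log(a / b) - a + b)

rng = np.random.default_rng(3)
y = rng.uniform(0.2, 2.0, 5)        # an arbitrary positive point
x = rng.uniform(0.1, 1.0, 5)
x = x / x.sum()                     # an arbitrary point of K, the simplex

z = y / y.sum()                     # Bregman (KL) projection of y onto K

print(D(x, z) + D(z, y) <= D(x, y) + 1e-12)    # True
print(np.isclose(D(x, z) + D(z, y), D(x, y)))  # True (equality here)
```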
In particular, the above lemma gives us

$$D(x, x_{t+1}) + D(x_{t+1}, y_{t+1}) \leq D(x, y_{t+1})$$

and so

$$\langle \ell_t, x_t - x \rangle \leq D(x, x_t) - D(x, x_{t+1}) + D(x_t, y_{t+1})$$

Now, summing over $t = 1, \ldots, T$ and recalling that $D(x, x_{T+1}) \geq 0$, we have our theorem. $\Box$
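To close, a small end-to-end sketch (with made-up losses) that runs agile mirror descent with the negative-entropy regularizer on the simplex and compares the regret against the best expert with the bound of Theorem 4:

```python
import numpy as np

def D(a, b):
    """Generalized KL divergence; 0 ln 0 is treated as 0."""
    safe = np.where(a > 0, a, 1.0)
    return np.sum(np.where(a > 0, a * np.log(safe / b), 0.0)) - a.sum() + b.sum()

rng = np.random.default_rng(4)
T, n = 50, 4
losses = rng.uniform(0.0, 1.0, size=(T, n))   # made-up loss vectors in [0, 1]

x = np.full(n, 1.0 / n)             # x_1 = argmin of R over the simplex
xs, ys = [x], []
for l in losses:
    y = x * np.exp(-l)              # unconstrained step: y_{t+1}
    x = y / y.sum()                 # KL projection onto the simplex: x_{t+1}
    ys.append(y)
    xs.append(x)

comparator = np.zeros(n)
comparator[np.argmin(losses.sum(axis=0))] = 1.0   # best fixed expert

regret = sum(np.dot(l, x_t - comparator) for l, x_t in zip(losses, xs))
bound = D(comparator, xs[0]) + sum(D(x_t, y_next)
                                   for x_t, y_next in zip(xs, ys))
print(regret <= bound, float(regret), float(bound))
```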