The multiplicative weights algorithm is simple to define and analyze, and it has several applications, but both its definition and its analysis seem to come out of nowhere. We mentioned that all the quantities arising in the algorithm and its analysis have statistical physics interpretations, but even this observation brings up more questions than it answers. The Gibbs distribution, for example, does put more weight on lower-energy states, and so it makes sense in an optimization setting, but to get good approximations one wants to use lower temperatures, while the distributions used by the multiplicative weights algorithm have temperature $1/\epsilon$, where $\epsilon$ is the final "amortized" regret bound, so that one uses, quite counterintuitively, higher temperatures for better approximations.
Furthermore, it is not clear how we would generalize the ideas of multiplicative weights to the case in which the set of feasible solutions is anything other than the set of distributions.
Today we discuss the “Follow the Regularized Leader” method, which provides a framework to design and analyze online algorithms in a versatile and well-motivated way. We will then see how we can “discover” the definition and analysis of multiplicative weights, and how to “discover” another online algorithm which can be seen as a generalization of projected gradient descent (that is, one can derive the projected gradient descent algorithm and its analysis from this other online algorithm).
1. Follow The Regularized Leader
We will first state some results in full generality, making no assumptions on the set $K$ of feasible solutions or on the loss functions $f_1, f_2, \ldots$ encountered by the algorithm at each step.
Let us try to define an online optimization algorithm from scratch. The solution $x_t$ proposed by the algorithm at time $t$ can only depend on the previous cost functions $f_1, \ldots, f_{t-1}$; how should it depend on them? If the offline optimal solution $x^*$ is consistently better than all others at each time step, then we would like $x_t$ to be that solution, so we want $x_t$ to be a solution that would have worked well in the previous steps. The most extreme way of implementing this idea is the Follow the Leader algorithm (abbreviated FTL), in which we set the solution at time $t$ to be the best solution for the previous steps:

$$ x_t = \arg\min_{x \in K} \ \sum_{k=1}^{t-1} f_k(x) . $$

(Note that the algorithm does not prescribe what solution to use at step $1$.)
It is possible for FTL to perform very badly. Consider for example the "experts" setting in which we analyzed multiplicative weights: the set of feasible solutions is the set $\Delta \subseteq \mathbb{R}^n$ of probability distributions over $\{1,\ldots,n\}$, and the cost functions are linear functions $f_t(x) = \langle \ell_t, x \rangle$ with coefficients $\ell_t \in [0,1]^n$. Suppose that $n = 2$ and that $\ell_1 = (1/2, 0)$. Then a possible run of the algorithm could be:
- $x_1 = (1,0)$ (an arbitrary choice, since FTL does not prescribe the first solution), $\ell_1 = (1/2, 0)$, and the algorithm suffers loss $1/2$;
- $x_2 = (0,1)$, because expert 2 is the current leader, $\ell_2 = (0,1)$, and the algorithm suffers loss $1$;
- $x_3 = (1,0)$, because expert 1 is now the leader, $\ell_3 = (1,0)$, and the algorithm suffers loss $1$;
- $x_4 = (0,1)$, $\ell_4 = (0,1)$, and the algorithm suffers loss $1$;
- $x_5 = (1,0)$, $\ell_5 = (1,0)$, and the algorithm suffers loss $1$, and so on.

In which, after $T$ steps, the algorithm suffers a loss of about $T$ while the offline optimum is about $T/2$. Thus, the regret is about $T/2$, which compares very unfavorably to the $O(\sqrt{T \log n})$ regret of the multiplicative weights algorithm. For general $n$, a similar example shows that the regret of FTL can be as high as about $\left( 1 - \frac 1n \right) \cdot T$.
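To see the failure concretely, here is a small simulation of FTL on the loss sequence just described (a sketch assuming `numpy`; not code from the original post); the regret it reports grows linearly in $T$:

```python
import numpy as np

def ftl_two_experts(T):
    """Follow-the-Leader on the two-expert example above: after the first loss
    vector (1/2, 0), the adversary always charges 1 to the current leader."""
    losses = [np.array([0.5, 0.0])]
    losses += [np.array([0.0, 1.0]) if t % 2 == 1 else np.array([1.0, 0.0])
               for t in range(1, T)]
    cumulative = np.zeros(2)
    algorithm_loss = 0.0
    for ell in losses:
        leader = int(np.argmin(cumulative))   # FTL: all mass on the best past expert
        algorithm_loss += ell[leader]         # (at step 1 we arbitrarily pick expert 1)
        cumulative += ell
    return algorithm_loss, cumulative.min(), algorithm_loss - cumulative.min()

print(ftl_two_experts(1000))   # regret grows like T/2
```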
In the above bad example, the algorithm keeps “overfitting” to the past history: if an expert is a bit better than the others, the algorithm puts all its probability mass on that expert, and the algorithm keeps changing its mind at every step. Interestingly, this is the only failure mode of the algorithm.
Theorem 1 (Analysis of FTL) For any sequence of cost functions $f_1, \ldots, f_T$ and any number of time steps $T$, the FTL algorithm satisfies the regret bound

$$ \text{Regret}_T \ \leq \ \sum_{t=1}^T f_t(x_t) - f_t(x_{t+1}) . $$

So that if the functions $f_t$ are Lipschitz with respect to a distance function on $K$, then the only way for the regret to be large is for $x_t$ to typically be far, in that distance, from $x_{t+1}$.
Proof: Recalling the definition of regret,

$$ \text{Regret}_T = \sum_{t=1}^T f_t(x_t) - \min_{x \in K} \sum_{t=1}^T f_t(x), $$

the theorem follows once we prove that, for every $T$,

$$ \sum_{t=1}^T f_t(x_{t+1}) \ \leq \ \min_{x \in K} \sum_{t=1}^T f_t(x) . \ \ \ \ \ (1)$$

We will prove (1) by induction. The base case $T=1$ is just the definition of $x_2$. Assuming that (1) is true up to $T-1$, we have

$$ \sum_{t=1}^T f_t(x_{t+1}) = f_T(x_{T+1}) + \sum_{t=1}^{T-1} f_t(x_{t+1}) \ \leq \ f_T(x_{T+1}) + \sum_{t=1}^{T-1} f_t(x_{T+1}) = \min_{x \in K} \sum_{t=1}^T f_t(x), $$

where the middle step follows from the use of the inductive assumption, which gives

$$ \sum_{t=1}^{T-1} f_t(x_{t+1}) \ \leq \ \min_{x \in K} \sum_{t=1}^{T-1} f_t(x) \ \leq \ \sum_{t=1}^{T-1} f_t(x_{T+1}) . \ \ \Box $$
The above example and analysis suggest that we should modify FTL in such a way that the choices of the algorithm don't change too much from step to step, and that the solution at time $t$ should be a compromise between optimizing with respect to the previous cost functions and staying close to the previous solution.
In order to do this, we introduce a new function $R : K \rightarrow \mathbb{R}$, called a regularizer (more on it later), and, at each step, we compute the solution

$$ x_t = \arg\min_{x \in K} \ R(x) + \sum_{k=1}^{t-1} f_k(x) . $$

This algorithm is called Follow the Regularized Leader or FTRL. Typically, the function $R$ is chosen to be strictly convex and to take values that are rather big in magnitude. Then $x_1$ will be the unique minimum of $R$ and, at each subsequent step, $x_t$ will be selected in a way to balance the pull toward the minimum of $R$ and the pull toward the FTL solution $\arg\min_{x \in K} \sum_{k=1}^{t-1} f_k(x)$. In particular, if $R$ is large in magnitude compared to each $f_k$, the solution will not change too much from step to step.
We have the following analysis that makes no assumptions on $K$, on the cost functions $f_1, \ldots, f_T$, and on the regularizer $R$ (not even that the regularizer is convex).

Theorem 2 (Analysis of FTRL) For every sequence of cost functions and every regularizer function, the regret after $T$ steps of the FTRL algorithm is bounded as follows: for every $x \in K$,

$$ \text{Regret}_T(x) \ \leq \ R(x) - R(x_1) + \sum_{t=1}^T f_t(x_t) - f_t(x_{t+1}), $$

where

$$ \text{Regret}_T(x) := \sum_{t=1}^T f_t(x_t) - \sum_{t=1}^T f_t(x) . $$
Proof: Let us run for $T$ steps the FTRL algorithm with regularizer $R$ and cost functions $f_1, \ldots, f_T$, and call $x_1, \ldots, x_T, x_{T+1}$ the solutions computed by the FTRL algorithm (including the solution $x_{T+1}$ that it would compute at step $T+1$).

Now consider the following mental experiment: we run the FTL algorithm for $T+1$ steps, with the sequence of cost functions $R, f_1, \ldots, f_T$, and we use $x_1$ as a first solution. Then we see that the solutions computed by the FTL algorithm will be precisely $x_1, x_1, x_2, \ldots, x_T$. The regret bound for FTL implies that, for every $x \in K$,

$$ R(x_1) + \sum_{t=1}^T f_t(x_t) - R(x) - \sum_{t=1}^T f_t(x) \ \leq \ \big( R(x_1) - R(x_1) \big) + \sum_{t=1}^T f_t(x_t) - f_t(x_{t+1}), $$

and rearranging the terms gives the statement of the theorem. $\Box$
Having established these results, the general recipe to solve an online optimization problem will be to find a regularizer function $R$ such that the minimum of $R$ "pulls away from" solutions that would make the FTL algorithm overfit, and such that there is a good balance between how big $R$ gets over $K$ (because we pay $R(x^*) - R(x_1)$ in the regret, where $x^*$ is the offline optimum) and how stable the minimum of $R(x) + \sum_{k=1}^{t-1} f_k(x)$ is as $t$ varies.
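As a reference point before we specialize the recipe, here is a minimal sketch of the generic FTRL scheme (not code from the post): the feasible set $K$ is represented by an abstract `argmin_oracle` supplied by the caller, and the regularizer and cost functions are arbitrary callables.

```python
def ftrl(regularizer, argmin_oracle, cost_stream, T):
    """Generic Follow-the-Regularized-Leader (sketch).

    regularizer:   a function R(x) -> float
    argmin_oracle: takes an objective g and returns (approximately) argmin_{x in K} g(x);
                   it encodes the feasible set K
    cost_stream:   a function t -> f_t, where f_t is revealed only after x_t is chosen
    """
    past = []          # the cost functions f_1, ..., f_{t-1} seen so far
    solutions = []
    for t in range(1, T + 1):
        # x_t minimizes R(x) + sum_{k < t} f_k(x) over K
        x_t = argmin_oracle(lambda x: regularizer(x) + sum(f(x) for f in past))
        solutions.append(x_t)
        past.append(cost_stream(t))
    return solutions

# toy usage: K = [0, 1] (approximated by a grid), quadratic regularizer,
# linear costs whose sign flips at every step
grid = [i / 1000 for i in range(1001)]
oracle = lambda g: min(grid, key=g)
xs = ftrl(lambda x: 5.0 * x ** 2, oracle, lambda t: (lambda x, s=(-1) ** t: s * x), 6)
```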
2. Negative-Entropy Regularization
Let us consider again the "experts" setting, that is, the online optimization setup in which $K = \Delta$ is the set of probability distributions over $\{1,\ldots,n\}$ and the cost functions are linear functions $f_t(x) = \langle \ell_t, x \rangle$ with bounded coefficients $\ell_t \in [0,1]^n$.
The example above showed that FTL will tend to put all the probability mass on one expert. We would like to choose a regularizer that fights this tendency by penalizing "concentrated" distributions and favoring "spread-out" distributions. This observation might trigger the thought that the entropy of a distribution is a good measure of how concentrated or spread out it is, although the entropy is actually higher for spread-out distributions and smaller for concentrated ones. So we will use as a regularizer minus the entropy, multiplied by an appropriate scaling factor:

$$ R(x) = c \sum_{i=1}^n x_i \ln x_i . $$

(Entropy is usually defined using logarithms in base 2, but using natural logarithms will make it cleaner to take derivatives, and it only affects the constant factor $c$.) With this choice of regularizer, we have

$$ x_t = \arg\min_{x \in \Delta} \ \sum_{k=1}^{t-1} \langle \ell_k, x \rangle + c \sum_{i=1}^n x_i \ln x_i . $$
To compute the minimum of the above function we will use the method of Lagrange multipliers. Specialized to our setting, the method of Lagrange multipliers states that if we want to solve the constrained minimization problem

$$ \min_{x} \ g(x) \ \ \text{subject to} \ \ h(x) = 0, $$

we introduce a new parameter $\lambda$ and define the function

$$ L(x,\lambda) := g(x) + \lambda \cdot h(x) . $$

Then it is possible to prove that if $x$ is a feasible minimizer of $g$, then there is at least a value of $\lambda$ such that $\nabla_x L(x,\lambda) = 0$, that is, such that $(x,\lambda)$ is a stationary point of $L$. So one can proceed by finding all $(x,\lambda)$ such that $\nabla_x L(x,\lambda) = 0$, then filtering out the values of $x$ such that $h(x) \neq 0$, and finally looking at which of the remaining $x$ minimizes $g(x)$.
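Before applying this recipe to our problem, here is a tiny illustrative example (not from the original post): minimize $g(x) = x_1^2 + x_2^2$ subject to $h(x) = 1 - x_1 - x_2 = 0$. Then

$$ L(x,\lambda) = x_1^2 + x_2^2 + \lambda \, (1 - x_1 - x_2), \qquad \frac{\partial L}{\partial x_i} = 2 x_i - \lambda = 0 \ \Rightarrow \ x_1 = x_2 = \frac{\lambda}{2}, $$

and the only value of $\lambda$ for which this point satisfies the constraint is $\lambda = 1$, giving the minimizer $x = (1/2, 1/2)$.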
Ignoring for a moment the non-negativity constraints, the constraint $x \in \Delta$ reduces to $\sum_i x_i = 1$, so we have to consider the function

$$ L(x,\lambda) = \left( \sum_{k=1}^{t-1} \langle \ell_k , x \rangle \right) + c \sum_{i=1}^n x_i \ln x_i + \lambda \cdot \left( 1 - \sum_{i=1}^n x_i \right) . $$

The partial derivative of the above expression with respect to $x_i$ is

$$ \sum_{k=1}^{t-1} \ell_{k,i} + c \ln x_i + c - \lambda . $$

If we want the gradient to be zero then we want all the above expressions to be zero, which translates to

$$ x_i = e^{\frac{\lambda}{c} - 1} \cdot e^{-\frac 1c \sum_{k=1}^{t-1} \ell_{k,i}} . $$

There is only one value of $\lambda$ that makes the above solution a probability distribution, and the corresponding solution is

$$ x_{t,i} = \frac{ e^{-\frac 1c \sum_{k=1}^{t-1} \ell_{k,i}} }{ \sum_{j=1}^n e^{-\frac 1c \sum_{k=1}^{t-1} \ell_{k,j}} } . $$

Notice that this is exactly the solution computed by the multiplicative weights algorithm, if we choose $\epsilon = 1/c$. So we have "rediscovered" the multiplicative weights algorithm and we have also "explained" what it does: at every step it balances the goals of finding a solution that is good for the past and that has large entropy.
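As a quick sanity check, here is a minimal sketch (assuming `numpy`; not code from the post) of this entropy-regularized FTRL in its closed form, which is exactly the multiplicative weights update with $\epsilon = 1/c$:

```python
import numpy as np

def entropy_ftrl(loss_vectors, epsilon):
    """FTRL over the simplex with regularizer (1/epsilon) * sum_i x_i ln x_i.
    The closed-form solution is the multiplicative weights distribution."""
    n = len(loss_vectors[0])
    cumulative = np.zeros(n)                # sum of the loss vectors seen so far
    solutions = []
    for ell in loss_vectors:
        w = np.exp(-epsilon * cumulative)   # w_{t,i} = exp(-eps * sum_{k<t} ell_{k,i})
        solutions.append(w / w.sum())       # x_{t,i} = w_{t,i} / sum_j w_{t,j}
        cumulative += np.asarray(ell, dtype=float)  # ell_t is revealed after committing to x_t
    return solutions
```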
Now it remains to bound, at each time step,

$$ f_t(x_t) - f_t(x_{t+1}) = \langle \ell_t, x_t - x_{t+1} \rangle = \sum_{i=1}^n \ell_{t,i} \cdot ( x_{t,i} - x_{t+1,i} ) . $$

For this, it is convenient to return to the notation that we used in describing the multiplicative weights algorithm, that is, it is convenient to work with the weights defined as

$$ w_{t,i} := e^{-\epsilon \sum_{k=1}^{t-1} \ell_{k,i}}, $$

so that, at each time step,

$$ x_{t,i} = \frac{w_{t,i}}{\sum_j w_{t,j}} . $$

We are assuming $\ell_{t,i} \in [0,1]$, so the weights are non-increasing with time. Then

$$ x_{t+1,i} = \frac{w_{t+1,i}}{\sum_j w_{t+1,j}} \ \geq \ \frac{w_{t+1,i}}{\sum_j w_{t,j}} . $$

For every $t$ and $i$ we have

$$ w_{t+1,i} = w_{t,i} \cdot e^{-\epsilon \ell_{t,i}} \ \geq \ w_{t,i} \cdot ( 1 - \epsilon \ell_{t,i} ) \ \geq \ w_{t,i} \cdot (1 - \epsilon), $$

so

$$ x_{t+1,i} \ \geq \ (1-\epsilon) \cdot x_{t,i} $$

and

$$ f_t(x_t) - f_t(x_{t+1}) = \sum_i \ell_{t,i} ( x_{t,i} - x_{t+1,i} ) \ \leq \ \epsilon \sum_i \ell_{t,i} \, x_{t,i} \ \leq \ \epsilon . $$
Putting it all together, we have

$$ \text{Regret}_T(x) \ \leq \ R(x) - R(x_1) + \sum_{t=1}^T f_t(x_t) - f_t(x_{t+1}) \ \leq \ \frac{\ln n - H(x)}{\epsilon} + \epsilon T \ \leq \ \frac{\ln n}{\epsilon} + \epsilon T, $$

because $x_1$ is the uniform distribution, so that $R(x_1) = - \frac{\ln n}{\epsilon}$, and $R(x) = - \frac{H(x)}{\epsilon}$, where $H(\cdot)$ denotes the entropy. Choosing $\epsilon = \sqrt{\frac{\ln n}{T}}$, we have

$$ \text{Regret}_T \ \leq \ 2 \sqrt{T \ln n} . $$

Thus, we have reconstructed the analysis of the multiplicative weights algorithm.
Interestingly, the analysis that we derived today is not exactly identical to the one from the post on multiplicative weights. There, we derived the bound

$$ \text{Regret}_T \ \leq \ \epsilon T + \frac{\ln n}{\epsilon}, $$

while here, setting $c = 1/\epsilon$, we derived

$$ \text{Regret}_T(x^*) \ \leq \ \epsilon T + \frac{\ln n - H(x^*)}{\epsilon}, $$

where $x^*$ is the offline optimum and $H(\cdot)$ is the entropy function (computed using natural logarithms).
3. L2 Regularization
Now that we have a general method, let us apply it to a new context: suppose that, as before, our cost functions are linear, but let $K = \mathbb{R}^n$. With linear cost functions and no bound on the size of solutions, it will not be possible to talk about regret with respect to the offline optimum, because the offline optimum will always be $-\infty$, but it will be possible to talk about regret with respect to a particular offline solution $x$, which will already lead to interesting consequences.
What regularizer should we use? In reasoning about regularizers, it can be helpful to think about what would go wrong if we used FTL, and then to consider what regularizer would successfully "pull away" from the bad solutions found by FTL. In this context of linear loss functions and unbounded solutions, FTL will pick an infinitely big solution at each step, or, to be more precise, the "min" in the definition of FTL is undefined. To fight this tendency of FTL to go off to infinity, it makes sense for the regularizer to be a measure of how big a solution is. Since we are going to have to compute derivatives, it is good to use a measure of "bigness" with a nice gradient, and $\|x\|^2 = \sum_i x_i^2$ is a natural choice. So, for a scale parameter $c > 0$ to be optimized later, our regularizer will be

$$ R(x) = \frac{1}{2c} \| x \|^2 . $$

This tells us that

$$ x_1 = \arg\min_{x \in \mathbb{R}^n} \ \frac 1{2c} \| x \|^2 = {\bf 0} $$

and

$$ x_t = \arg\min_{x \in \mathbb{R}^n} \ \frac 1{2c} \| x \|^2 + \sum_{k=1}^{t-1} \langle \ell_k , x \rangle . $$

The function that we are minimizing in the above expression is convex, so we just have to compute the gradient and set it to zero:

$$ \frac{x}{c} + \sum_{k=1}^{t-1} \ell_k = {\bf 0}, \qquad \text{that is,} \qquad x_t = - c \sum_{k=1}^{t-1} \ell_k, $$

which can be also expressed as

$$ x_{t+1} = x_t - c \, \ell_t . $$

This makes perfect sense because, in the "experts" interpretation, we want to penalize the experts that performed badly in the past. Here we have no constraints on our allocations, so we simply decrease (additively this time, not multiplicatively) the allocation to the experts that caused a higher loss.
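In code, the whole unconstrained algorithm collapses to this additive update; a minimal sketch (with `numpy` assumed, not code from the post):

```python
import numpy as np

def l2_ftrl_unconstrained(loss_vectors, c):
    """FTRL on R^n with regularizer ||x||^2/(2c) and linear losses <ell_t, x>:
    x_1 = 0 and x_{t+1} = x_t - c * ell_t."""
    x = np.zeros(len(loss_vectors[0]))
    solutions = []
    for ell in loss_vectors:
        solutions.append(x.copy())                 # commit to x_t before seeing ell_t
        x = x - c * np.asarray(ell, dtype=float)   # additively penalize costly coordinates
    return solutions
```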
To compute the regret bound, we have

$$ f_t(x_t) - f_t(x_{t+1}) = \langle \ell_t , x_t - x_{t+1} \rangle = c \, \| \ell_t \|^2, $$

and so the regret with respect to a solution $x$ is

$$ \text{Regret}_T(x) \ \leq \ R(x) - R(x_1) + \sum_{t=1}^T f_t(x_t) - f_t(x_{t+1}) \ \leq \ \frac{\| x \|^2}{2c} + c \sum_{t=1}^T \| \ell_t \|^2 . $$

If we know a bound $\| x \| \leq D$, and $\| \ell_t \| \leq L$ for every $t$, then we can optimize $c$ and, choosing $c = \frac{D}{L \sqrt{2T}}$, we have

$$ \text{Regret}_T(x) \ \leq \ D L \sqrt{2T} . $$
3.1. Dealing with Constraints
Consider now the case in which the loss functions are linear and $K$ is an arbitrary convex set. Using the same regularizer

$$ R(x) = \frac 1{2c} \| x \|^2, $$

we have the algorithm

$$ x_t = \arg\min_{x \in K} \ \frac 1{2c} \| x \|^2 + \sum_{k=1}^{t-1} \langle \ell_k , x \rangle . $$

How can we solve the above constrained optimization problem? A very helpful observation is that we can first solve the unconstrained optimization and then project on $K$, that is, we can proceed as follows:

$$ y_t = \arg\min_{y \in \mathbb{R}^n} \ \frac 1{2c} \| y \|^2 + \sum_{k=1}^{t-1} \langle \ell_k , y \rangle, \qquad \text{then output} \ \ \Pi_K (y_t) := \arg\min_{x \in K} \ \| x - y_t \|, $$

and we claim that we always have $\Pi_K(y_t) = x_t$. The fact that we can reduce a regularized constrained optimization problem to an unconstrained problem and a projection is part of a broader theory that we will describe in a later post. For now, we will limit ourselves to proving the equivalence in this specific setting. First of all, we already have an expression for $y_t$, namely

$$ y_t = - c \sum_{k=1}^{t-1} \ell_k . $$

Now the definition of $\Pi_K(y_t)$ is

$$ \Pi_K(y_t) = \arg\min_{x \in K} \ \| x - y_t \|^2 = \arg\min_{x \in K} \ \| x \|^2 - 2 \langle x , y_t \rangle = \arg\min_{x \in K} \ \frac 1{2c} \| x \|^2 + \sum_{k=1}^{t-1} \langle \ell_k , x \rangle = x_t . $$
In order to bound the regret, we have to compute

$$ \sum_{t=1}^T f_t(x_t) - f_t(x_{t+1}) = \sum_{t=1}^T \langle \ell_t , x_t - x_{t+1} \rangle \ \leq \ \sum_{t=1}^T \| \ell_t \| \cdot \| x_t - x_{t+1} \|, $$

and since L2 projections cannot increase L2 distances, we have

$$ \| x_t - x_{t+1} \| = \| \Pi_K(y_t) - \Pi_K(y_{t+1}) \| \ \leq \ \| y_t - y_{t+1} \| = c \, \| \ell_t \| . $$

So the regret bound is

$$ \text{Regret}_T(x) \ \leq \ \frac{\| x \|^2}{2c} + c \sum_{t=1}^T \| \ell_t \|^2 . $$

If $D$ is an upper bound to $\| x \|$ for every $x \in K$, and $L$ is an upper bound to the norm $\| \ell_t \|$ of all the loss vectors, then

$$ \text{Regret}_T \ \leq \ \frac{D^2}{2c} + c T L^2, $$

which can be optimized to

$$ \text{Regret}_T \ \leq \ D L \sqrt{2T} $$

by choosing $c = \frac{D}{L \sqrt{2T}}$.
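For concreteness, here is a sketch of the "solve unconstrained, then project" scheme when $K$ is the Euclidean ball of radius $D$, a set chosen here only because its projection has a one-line formula (`numpy` assumed; not code from the post):

```python
import numpy as np

def project_to_ball(y, D):
    """Euclidean projection onto K = {x : ||x|| <= D}."""
    norm = np.linalg.norm(y)
    return y if norm <= D else (D / norm) * y

def l2_ftrl_projected(loss_vectors, c, D):
    """FTRL with regularizer ||x||^2/(2c) over the ball of radius D:
    maintain the unconstrained iterate y_t and play x_t = Pi_K(y_t)."""
    y = np.zeros(len(loss_vectors[0]))
    solutions = []
    for ell in loss_vectors:
        solutions.append(project_to_ball(y, D))      # x_t = Pi_K(y_t)
        y = y - c * np.asarray(ell, dtype=float)     # y_{t+1} = y_t - c * ell_t
    return solutions
```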
3.2. Deriving the Analysis of Gradient Descent
Suppose that $f : K \rightarrow \mathbb{R}$ is a convex function whose gradient $\nabla f$ is well defined at all points in $K$, and that we are interested in minimizing $f$. Then a way to reduce this problem to online optimization would be to use the function $f$ itself as loss function at each step. Then the offline optimum would be the minimizer of $f$, and achieving small regret means that the average $\frac 1T \sum_{t=1}^T f(x_t)$ is close to the minimum of $f$, and so the best $x_t$ is an approximate minimizer.

Unfortunately, this is not a very helpful idea, because if we ran an FTRL algorithm against an adversary that keeps proposing $f$ as a cost function at each step then we would have

$$ x_t = \arg\min_{x \in K} \ R(x) + (t-1) \cdot f(x), $$

which, for large $t$, is essentially the same problem as minimizing $f$ over $K$, so we have basically reduced the problem of minimizing $f$ to itself.
Indeed, the power of the FTRL algorithm is that the algorithm does well even though it does not know the cost function, and if we keep using the same cost function at each step we are not making good use of this power. Now, suppose that we use cost functions $f_t$ such that

$$ f_t(x_t) = f(x_t) \qquad \text{and} \qquad f_t(x) \leq f(x) \ \ \text{for every} \ x \in K . $$

Then, after $T$ steps, we have

$$ \sum_{t=1}^T f(x_t) - f(x) \ \leq \ \sum_{t=1}^T f_t(x_t) - f_t(x) \ = \ \text{Regret}_T(x), $$

meaning

$$ \frac 1T \sum_{t=1}^T f(x_t) \ \leq \ f(x) + \frac{\text{Regret}_T(x)}{T}, $$

and so one of the $x_t$ is an approximate minimizer. Indeed, using convexity, we also have

$$ f \left( \frac 1T \sum_{t=1}^T x_t \right) \ \leq \ \frac 1T \sum_{t=1}^T f(x_t), $$

and so the average of the $x_t$ is also an approximate minimizer. From the point of view of exploiting FTRL to minimize $f$, cost functions $f_t$ as above work just as well as presenting $f$ as the cost function at each step.
How do we find cost functions that satisfy the above two properties and for which the FTRL algorithm is easy to implement? The idea is to let $f_t$ be the linear approximation of $f$ at $x_t$:

$$ f_t(x) := f(x_t) + \langle \nabla f(x_t) , x - x_t \rangle . $$

The condition $f_t(x_t) = f(x_t)$ is immediate, and the condition $f_t(x) \leq f(x)$ for every $x$ is a consequence of the convexity of $f$.

The cost functions that we have defined are affine functions, that is, each of them equals a constant plus a linear function:

$$ f_t(x) = \Big( f(x_t) - \langle \nabla f(x_t) , x_t \rangle \Big) + \langle \nabla f(x_t) , x \rangle . $$

Adding a constant term to a cost function does not change the iteration of FTRL, and does not change the regret (because the same term is added both to the loss of the solution found by the algorithm and to the loss of the offline optimum), so the algorithm is just initialized with

$$ y_1 = {\bf 0}, \qquad x_1 = \Pi_K(y_1), $$

and then continues with the update rules

$$ y_{t+1} = y_t - c \, \nabla f(x_t), \qquad x_{t+1} = \Pi_K(y_{t+1}), $$

which is just projected gradient descent.
If we have known upper bounds $\| x \| \leq D$ for every $x \in K$ and $\| \nabla f(x) \| \leq L$ for every $x \in K$, then we have

$$ f \left( \frac 1T \sum_{t=1}^T x_t \right) - \min_{x \in K} f(x) \ \leq \ \frac{\text{Regret}_T}{T} \ \leq \ D L \sqrt{\frac 2T}, $$

which means that to achieve additive error $\epsilon$ it is enough to proceed for $T \geq \frac{2 D^2 L^2}{\epsilon^2}$ steps.
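Putting the linearization and the projection step together, here is a minimal sketch of the resulting projected gradient descent, again over a Euclidean ball for concreteness; the objective, step size, and horizon in the usage line are arbitrary placeholder choices (`numpy` assumed; not code from the post):

```python
import numpy as np

def projected_gradient_descent(grad_f, D, c, T, n):
    """Minimize a convex f over K = {x : ||x|| <= D} by running the L2-regularized
    FTRL on the linearizations f_t(x) = f(x_t) + <grad f(x_t), x - x_t>."""
    def project(y):
        norm = np.linalg.norm(y)
        return y if norm <= D else (D / norm) * y
    y = np.zeros(n)
    iterates = []
    for _ in range(T):
        x = project(y)                 # x_t = Pi_K(y_t)
        iterates.append(x)
        y = y - c * grad_f(x)          # y_{t+1} = y_t - c * grad f(x_t)
    return np.mean(iterates, axis=0)   # the average iterate is an approximate minimizer

# example: minimize ||x - b||^2 over the unit ball; the true minimizer is (1, 0)
b = np.array([2.0, 0.0])
print(projected_gradient_descent(lambda x: 2.0 * (x - b), D=1.0, c=0.05, T=500, n=2))
```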