The multiplicative weights or hedge algorithm is the most well known and most frequently rediscovered algorithm in online optimization.
The problem it solves is usually described in the following language: we want to design an algorithm that makes the best possible use of the advice coming from self-described experts. At each time step , the algorithm has to decide with what probability to follow the advice of each of the experts, that is, the algorithm has to come up with a probability distribution where and . After the algorithm makes this choice, it is revealed that following the advice of expert at time leads to loss , so that the expected loss of the algorithm at time is . A loss can be negative, in which case its absolute value can be interpreted as a profit.
After steps, the algorithm “regrets” that it did not just always follow the advice of the expert that, with hindsight, was the best one, so that the regret of the algorithm after steps is
This corresponds to the instantiation of the framework we described in the previous post to the special case in which the set of feasible solutions is the set of probability distributions over the sample space and in which the loss functions are linear functions of the form . In order to bound the regret, we also have to bound the “magnitude” of the loss functions, so in the following we will assume that for all and all we have , and otherwise we can scale everything by a known upper bound on .
We now describe the algorithm.
The algorithm maintains at each step a vector of weights which is initialized as . The algorithm performs the following operations at time :
That is, the weight of expert at time is , and the probability of following the advice of expert at time is proportional to the weight. The parameter is hardwired into the algorithm and we will optimize it later. Note that the algorithm gives higher weight to experts that produced small losses (or negative losses of large absolute value) in the past, and thus puts higher probability on such experts.
We will prove the following bound.
Theorem 1 Assuming that for all and we have , for every , after steps the multiplicative weight algorithm experiences a regret that is always bounded as
In particular, if , by setting we achieve a regret bound
We will start by giving a short proof of the above theorem.
For each time step , define the quantity
We want to prove that, roughly speaking, the only way for an adversary to make the algorithm incur a large loss is to produce a sequence of loss functions such that even the best expert incurs a large loss. The proof will work by showing that if the algorithm incurs a large loss after steps, then is small, and that if is small, then even the best expert incurs a large loss.
Let us define
to be the loss of the best expert. Then we have
Lemma 2 (If is small, then is large)
Proof: Let be an index such that . Then we have
Lemma 3 (If the loss of the algorithm is large then is small)
where is the vector whose -th coordinate is
Proof: Since we know that , it is enough to prove that, for every , we have
And we see that
where we used the definitions of our quantities and the fact that for .
Using the fact that for all , the above lemmas can be restated as
which together imply
Personally, I find all of the above very unsatisfactory, because both the algorithm and the analysis, but especially the analysis, seem to come out of nowhere. In fact, I never felt that I actually understood this analysis until I saw it presented as a special case of the Follow The Regularized Leader framework that we will discuss in a future post. (We will actually prove a slightly weaker bound, but with a much more satisfying proof.)
Here is, however, a story of how a statistical physicist might have invented the algorithm and might have come up with the analysis. Let’s call the loss caused by expert after steps the energy of expert at time :
Note that we have defined it in such a way that the algorithm knows at time . Our offline optimum is the energy of the lowest energy expert at time , that, is, the energy of the ground state at time . When we have a collection of numbers , a nice lower bound to their minimum is
which is true for every . The right-hand side above is the free energy at temperature at time . This seems like the kind of expression that we could use to bound the offline optimum, so let’s give it a name
In terms of coming up with an algorithm, all that we have got to work with at time are the losses of the experts at times . If the adversary chooses to make one of the experts consistently much better than the others, it is clear that, in order to get any reasonable regret bound, the algorithm will have to put much of the probability mass in most of the steps on that expert. This suggests that the should put higher probability on experts that have done well in the first steps, that is should put higher probability on “lower-energy” experts. When we have a system in which, at time , state has energy , a standard distribution that puts higher probability on lower energy states is the Gibbs distribution at temperature , defined as
where the denominator above is also called the partition function at time
So far we have “rediscovered” our multiplicative weights algorithm, and the quantity that we had in our analysis gets interpreted as the partition function . The fact that bounds the offline optimum suggests that we should use as a potential function, and aim for an analysis involving a telescoping sum. Indeed some manipulations (the same as in the short proof above, but which are now more mechanical) give that the loss of the algorithm at time is
which telescopes to give
we have again
As mentioned above, we will give a better story when we get to the Follow The Regularized Leader framework. In the next post, we will discuss complexity-theory consequences of the result we just proved.
Very nice writeup!
The intuition that makes the analysis clear to me is the following:
– if an expert makes a mistake, its weight goes down by some factor.
– if we make a mistake, the _total_ weight goes down by some (other) factor.
– if we made much more mistakes than the expert, the total would go down so much that the expert’s weight would not fit in it 🙂
I think there is a typo right after you state Lemma 3, the product goes from 1 to T (and not n) as you are iterating over time and not the experts.
Pingback: Online Optimization Post 7: Matrix Multiplicative Weights Update | in theory