The *multiplicative weights* or *hedge* algorithm is the most well known and most frequently rediscovered algorithm in online optimization.

The problem it solves is usually described in the following language: we want to design an algorithm that makes the best possible use of the advice coming from self-described experts. At each time step , the algorithm has to decide with what probability to follow the advice of each of the experts, that is, the algorithm has to come up with a probability distribution where and . After the algorithm makes this choice, it is revealed that following the advice of expert at time leads to loss , so that the expected loss of the algorithm at time is . A loss can be negative, in which case its absolute value can be interpreted as a profit.

After steps, the algorithm “regrets” that it did not just always follow the advice of the expert that, with hindsight, was the best one, so that the regret of the algorithm after steps is

This corresponds to the instantiation of the framework we described in the previous post to the special case in which the set of feasible solutions is the set of probability distributions over the sample space and in which the loss functions are linear functions of the form . In order to bound the regret, we also have to bound the “magnitude” of the loss functions, so in the following we will assume that for all and all we have , and otherwise we can scale everything by a known upper bound on .

We now describe the algorithm.

The algorithm maintains at each step a vector of *weights* which is initialized as . The algorithm performs the following operations at time :

That is, the weight of expert at time is , and the probability of following the advice of expert at time is proportional to the weight. The parameter is hardwired into the algorithm and we will optimize it later. Note that the algorithm gives higher weight to experts that produced small losses (or negative losses of large absolute value) in the past, and thus puts higher probability on such experts.

We will prove the following bound.

Theorem 1Assuming that for all and we have , for every , after steps the multiplicative weight algorithm experiences a regret that is always bounded asIn particular, if , by setting we achieve a regret bound

We will start by giving a short proof of the above theorem.

For each time step , define the quantity

We want to prove that, roughly speaking, the only way for an adversary to make the algorithm incur a large loss is to produce a sequence of loss functions such that *even the best expert incurs a large loss*. The proof will work by showing that if the algorithm incurs a large loss after steps, then is small, and that if is small, then even the best expert incurs a large loss.

Let us define

to be the loss of the best expert. Then we have

Lemma 2 (If is small, then is large)

*Proof:* Let be an index such that . Then we have

Lemma 3 (If the loss of the algorithm is large then is small)

where is the vector whose -th coordinate is

*Proof:* Since we know that , it is enough to prove that, for every , we have

where we used the definitions of our quantities and the fact that for .

Using the fact that for all , the above lemmas can be restated as

and

which together imply

as desired.

Personally, I find all of the above very unsatisfactory, because both the algorithm and the analysis, but especially the analysis, seem to come out of nowhere. In fact, I never felt that I actually understood this analysis until I saw it presented as a special case of the *Follow The Regularized Leader* framework that we will discuss in a future post. (We will actually prove a slightly weaker bound, but with a much more satisfying proof.)

Here is, however, a story of how a statistical physicist might have invented the algorithm and might have come up with the analysis. Let’s call the loss caused by expert after steps the *energy* of expert at time :

Note that we have defined it in such a way that the algorithm knows at time . Our offline optimum is the energy of the lowest energy expert at time , that, is, the energy of the *ground state* at time . When we have a collection of numbers , a nice lower bound to their minimum is

which is true for every . The right-hand side above is the *free energy* at temperature at time . This seems like the kind of expression that we could use to bound the offline optimum, so let’s give it a name

In terms of coming up with an algorithm, all that we have got to work with at time are the losses of the experts at times . If the adversary chooses to make one of the experts consistently much better than the others, it is clear that, in order to get any reasonable regret bound, the algorithm will have to put much of the probability mass in most of the steps on that expert. This suggests that the should put higher probability on experts that have done well in the first steps, that is should put higher probability on “lower-energy” experts. When we have a system in which, at time , state has energy , a standard distribution that puts higher probability on lower energy states is the *Gibbs distribution* at temperature , defined as

where the denominator above is also called the *partition function* at time

So far we have “rediscovered” our multiplicative weights algorithm, and the quantity that we had in our analysis gets interpreted as the partition function . The fact that bounds the offline optimum suggests that we should use as a potential function, and aim for an analysis involving a telescoping sum. Indeed some manipulations (the same as in the short proof above, but which are now more mechanical) give that the loss of the algorithm at time is

which telescopes to give

Recalling that

and

we have again

As mentioned above, we will give a better story when we get to the *Follow The Regularized Leader* framework. In the next post, we will discuss complexity-theory consequences of the result we just proved.

Very nice writeup!

The intuition that makes the analysis clear to me is the following:

– if an expert makes a mistake, its weight goes down by some factor.

– if we make a mistake, the _total_ weight goes down by some (other) factor.

– if we made much more mistakes than the expert, the total would go down so much that the expert’s weight would not fit in it 🙂