In which we show how to use expert advice, and introduce the powerful “multiplicative weight” algorithm.
We study the following online problem. We have $n$ “experts” that, at each time step $t = 1, 2, \ldots, T$, suggest a strategy about what to do at that time (for example, they might be advising on what technology to use, on what investments to make, they might make predictions on whether something is going to happen, thus requiring certain actions, and so on). Based on the quality of the advice that the experts offered in the past, we decide which advice to follow, or with what fraction of our investment to follow which strategy. Subsequently, we find out which loss or gain was associated with each strategy, and, in particular, what loss or gain we personally incurred with the strategy or mix of strategies that we picked, and we move to step $t+1$.
We want to come up with an algorithm to use the expert advice such that, at the end, that is, at time $T$, we are about as well off as if we had known in advance which expert was the one that gave the best advice, and we had always followed the strategy suggested by that expert at each step. Note that we make no probabilistic assumption, and our analysis will be a worst-case analysis over all possible sequences of events.
The “multiplicative update” algorithm provides a very good solution to this problem, and the analysis of this algorithm is a model for the several other applications of the same ideas, in rather different contexts.
1. A Simplified Setting
We begin with the following simplified setting: at each time step, we have to make a prediction about an event that has two possible outcomes, and we can use the advice of $n$ “experts,” which make predictions about the outcome at each step. Without knowing anything about the reliability of the experts, and without making any probabilistic assumption on the outcomes, we want to come up with a strategy that will lead us to make not many more mistakes than the “offline optimal” strategy of picking the expert which makes the fewest mistakes, and then always following the prediction of that optimal expert.
The algorithm works as follows: at each step $t$, it assigns a weight $w_i^t$ to each expert $i$, which measures the confidence that the algorithm has in the validity of the prediction of the expert. Initially, $w_i^1 := 1$ for all experts $i$. Then the algorithm makes the prediction that is backed by the set of experts with largest total weight. For example, if the experts, and us, are trying to predict whether the following day it will rain or not, we will look at the sum of the weights of the experts that say it will rain, and the sum of the weights of the experts that say it will not, and then we agree with whichever prediction has the largest sum of weights. After the outcome is revealed, we divide by 2 the weight of the experts that were wrong, and leave the weight of the experts that were correct unchanged.
We now formalize the above algorithm in pseudocode. We use $0$ and $1$ to denote the two possible outcomes of the event that we are required to predict at each step.
- for each $i \in \{1, \ldots, n\}$ do $w_i^1 := 1$
- for each time $t \in \{1, \ldots, T\}$
- let $W^t := \sum_{i=1}^n w_i^t$
- if the sum of $w_i^t$ over all the experts $i$ that predict $1$ is $\geq \frac 12 W^t$, then predict $1$
- else predict $0$
- wait until the outcome is revealed
- for each $i$
- if $i$ was wrong then $w_i^{t+1} := \frac 12 w_i^t$, else $w_i^{t+1} := w_i^t$
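The pseudocode above can be sketched in Python. This is an illustrative implementation, not part of the original notes: the function name and the input format are my own, and predictions and outcomes are encoded as 0/1.

```python
# Sketch of the deterministic weighted-majority algorithm with halving.

def weighted_majority(expert_predictions, outcomes):
    """expert_predictions[t][i] is expert i's 0/1 prediction at time t;
    outcomes[t] is the true 0/1 outcome.
    Returns (our total mistakes, per-expert mistake counts)."""
    n = len(expert_predictions[0])
    w = [1.0] * n                        # w_i^1 := 1 for every expert
    mistakes, expert_mistakes = 0, [0] * n
    for preds, outcome in zip(expert_predictions, outcomes):
        weight_for_1 = sum(wi for wi, p in zip(w, preds) if p == 1)
        guess = 1 if weight_for_1 >= sum(w) / 2 else 0
        if guess != outcome:
            mistakes += 1
        for i, p in enumerate(preds):
            if p != outcome:             # halve the weight of wrong experts
                w[i] /= 2
                expert_mistakes[i] += 1
    return mistakes, expert_mistakes
```

For instance, with one expert that is always right and one that is always wrong, the algorithm quickly locks on to the reliable expert and makes essentially no mistakes.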
To analyze the algorithm, let $m_i^t$ be the indicator variable that expert $i$ was wrong at time $t$, that is, $m_i^t = 1$ if the expert was wrong at time $t$ and $m_i^t = 0$ otherwise. (Here $m$ stands for “mistake.”) Let $m_i := \sum_{t=1}^T m_i^t$ be the total number of mistakes made by expert $i$. Let $m^t$ be the indicator variable that our algorithm makes a mistake at time $t$, and $m := \sum_{t=1}^T m^t$ be the total number of mistakes made by our algorithm.
We make the following two observations:
- If the algorithm makes a mistake at time $t$, then the total weight of the experts that are mistaken at time $t$ is $\geq \frac 12 \sum_i w_i^t$, and, at the following step, the weight of those experts is divided by two, and this means that, if we make a mistake at time $t$ then
$$\sum_i w_i^{t+1} \leq \frac 34 \sum_i w_i^t$$
Because the initial total weight is $n$, and the total weight never increases, we have that, at the end,
$$\sum_i w_i^{T+1} \leq n \cdot \left( \frac 34 \right)^m$$
- For each expert $i$, the final weight is $w_i^{T+1} = \left( \frac 12 \right)^{m_i}$, and, clearly,
$$w_i^{T+1} \leq \sum_j w_j^{T+1}$$
Together, the two previous observations mean that, for every expert $i$,
$$\left( \frac 12 \right)^{m_i} \leq n \cdot \left( \frac 34 \right)^m$$
which means, taking logarithms, that, for every expert $i$,
$$m \leq \frac{1}{\log_2 \frac 43} \cdot \left( m_i + \log_2 n \right) \leq 2.41 \cdot \left( m_i + \log_2 n \right)$$
That is, the number of mistakes made by the algorithm is at most a constant (about $2.41$) times the number of mistakes of the best expert, plus an extra $O(\log n)$ mistakes.
We will now discuss an algorithm that improves the above result in two ways. We will show that, for every $\epsilon \in \left( 0, \frac 12 \right)$, the improved algorithm can make the (expected) number of mistakes be at most $(1 + \epsilon) \cdot m_i + O\left( \frac{\ln n}{\epsilon} \right)$ for every $i$, which can be seen to be optimal for small $\epsilon$, and the improved algorithm will be able to handle a more general problem, in which the experts are suggesting arbitrary strategies, and the outcome of each strategy can be an arbitrary gain or loss.
2. The General Result
We now consider the following model. At each time step $t = 1, \ldots, T$, each expert $i \in \{1, \ldots, n\}$ suggests a certain strategy. We choose to follow the advice of expert $i$ with probability $p_i^t$, or, equivalently, we allocate a fraction $p_i^t$ of our resources in the way expert $i$ advised. Then we observe the outcome of the strategies suggested by the experts, and of our own strategy. We call $\ell_i^t$ the loss incurred by following the advice of expert $i$ at time $t$. The loss can be negative, in which case it is a gain, and we normalize losses and gains so that $\ell_i^t \in [-1, 1]$ for every $i$ and every $t$. Our own loss for the time step will then be $\sum_i p_i^t \ell_i^t$. At the end, we would like to say that our own sum of losses is not much higher than the sum of losses of the best expert.
As before, our algorithm maintains a weight $w_i^t$ for each expert, corresponding to our confidence in the expert. The weights are initialized to 1. When an expert causes a loss, we reduce its weight, and when an expert causes a gain, we increase its weight. To express the weight update in a single instruction, we have $w_i^{t+1} := w_i^t \cdot (1 - \epsilon \ell_i^t)$, where $\epsilon \leq \frac 12$ is a parameter of our choice. Our probabilities $p_i^t$ are chosen proportionally to the weights $w_i^t$.
- for each $i \in \{1, \ldots, n\}$ do $w_i^1 := 1$
- for each time $t \in \{1, \ldots, T\}$
- let $W^t := \sum_{i=1}^n w_i^t$
- let $p_i^t := w_i^t / W^t$ for each $i$
- for each $i$, follow the strategy of expert $i$ with probability $p_i^t$
- wait until the outcome is revealed
- let $\ell_i^t \in [-1, 1]$ be the loss of the strategy of expert $i$
- for each $i$, let $w_i^{t+1} := w_i^t \cdot (1 - \epsilon \ell_i^t)$
To analyze the algorithm, we will need the following technical result.
Fact 1 For every $x$ such that $|x| \leq \frac 12$,
$$e^{-x - x^2} \leq 1 - x \leq e^{-x}$$
Proof: We will use the Taylor expansions
$$e^{-x} = 1 - x + \frac{x^2}{2!} - \frac{x^3}{3!} + \cdots \ \ \ \mbox{and} \ \ \ \ln \frac 1{1-x} = x + \frac{x^2}2 + \frac{x^3}3 + \cdots$$
- The upper bound. The Taylor expansion above can be seen as
$$e^{-x} = (1-x) + \sum_{k \geq 1} \frac{x^{2k}}{(2k)!} \cdot \left( 1 - \frac{x}{2k+1} \right)$$
that is, $e^{-x}$ equals $1 - x$ plus a sum of terms that are all non-negative when $x \leq 1$. Thus, in particular, we have $1 - x \leq e^{-x}$ for $|x| \leq \frac 12$.
- The lower bound for positive $x$. We can also see that, for $0 \leq x \leq \frac 12$, we have
$$\ln \frac 1{1-x} = x + \frac{x^2}2 + \frac{x^3}3 + \cdots \leq x + \frac{x^2}2 + \frac 13 \cdot \frac{x^3}{1-x} \leq x + x^2$$
and so, for $0 \leq x \leq \frac 12$ we have
$$1 - x \geq e^{-x - x^2}$$
- The lower bound for negative $x$. Finally, for $-\frac 12 \leq x \leq 0$, the expansion of $\ln(1-x) = \ln(1 + |x|)$ is alternating with terms of decreasing magnitude, so we have
$$\ln (1-x) \geq -x - \frac{x^2}2 \geq -x - x^2$$
and so, for $-\frac 12 \leq x \leq 0$ we have
$$1 - x \geq e^{-x - x^2}$$
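The three bounds of Fact 1 are easy to spot-check numerically. The following quick sanity check over a grid on $[-1/2, 1/2]$ is of course an illustration, not a substitute for the proof:

```python
import math

# Spot-check Fact 1: e^(-x - x^2) <= 1 - x <= e^(-x) for |x| <= 1/2.
for k in range(-500, 501):
    x = k / 1000.0                      # grid over [-1/2, 1/2]
    assert math.exp(-x - x * x) <= 1 - x + 1e-12
    assert 1 - x <= math.exp(-x) + 1e-12
```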
Now the analysis proceeds very similarly to the analysis in the previous section. We let
$$L^t := \sum_i p_i^t \ell_i^t$$
be the loss of the algorithm at time $t$, and $L := \sum_{t=1}^T L^t$ the total loss at the end. We denote by $L_i := \sum_{t=1}^T \ell_i^t$ the total loss of expert $i$.
If we look at the total weight at time $t+1$, it is
$$W^{t+1} = \sum_i w_i^{t+1} = \sum_i w_i^t \cdot ( 1 - \epsilon \ell_i^t) = W^t - \epsilon W^t \sum_i p_i^t \ell_i^t$$
and we can rewrite it as
$$W^{t+1} = W^t \cdot (1 - \epsilon L^t)$$
Recalling that, initially, $W^1 = n$, we have that the total weight at the end is
$$W^{T+1} = n \cdot \prod_{t=1}^T (1 - \epsilon L^t)$$
For each expert $i$, the weight of that expert at the end is
$$w_i^{T+1} = \prod_{t=1}^T (1 - \epsilon \ell_i^t)$$
and, as before, we note that for every expert $i$ we have
$$w_i^{T+1} \leq W^{T+1}$$
Putting everything together, for every expert $i$ we have
$$\prod_{t=1}^T (1 - \epsilon \ell_i^t) \leq n \cdot \prod_{t=1}^T (1 - \epsilon L^t)$$
Now it is just a matter of taking logarithms and of using the inequalities that we proved before. From Fact 1, since $|\epsilon \ell_i^t| \leq \frac 12$ and $|\epsilon L^t| \leq \frac 12$,
$$\prod_{t=1}^T (1 - \epsilon \ell_i^t) \geq e^{-\epsilon L_i - \epsilon^2 \sum_t (\ell_i^t)^2} \ \ \ \mbox{and} \ \ \ \prod_{t=1}^T (1 - \epsilon L^t) \leq e^{-\epsilon L}$$
and, overall,
$$e^{-\epsilon L_i - \epsilon^2 \sum_t (\ell_i^t)^2} \leq n \cdot e^{-\epsilon L}$$
which, after taking logarithms and dividing by $\epsilon$, gives
$$L \leq L_i + \epsilon \sum_t (\ell_i^t)^2 + \frac{\ln n}{\epsilon} \ \ \ \ \ \ (1)$$
In the model of the previous section, at every step the loss of each expert is either 0 or 1, so that $\sum_t (\ell_i^t)^2 = \sum_t \ell_i^t = m_i$, and the above expression simplifies to
$$m \leq (1 + \epsilon) \cdot m_i + \frac{\ln n}{\epsilon}$$
where $m$ is now the expected number of mistakes of our (randomized) algorithm; this shows that we can get arbitrarily close to the number of mistakes of the best expert.
In every case, since $(\ell_i^t)^2 \leq 1$, (1) simplifies to
$$L \leq L_i + \epsilon T + \frac{\ln n}{\epsilon}$$
and, if we choose $\epsilon := \sqrt{\frac{\ln n}{T}}$ (assuming this is at most $\frac 12$), we have
$$L \leq L_i + 2 \sqrt{T \ln n}$$
which means that we come close to the optimum up to a small additive error of order $\sqrt{T \log n}$.
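To illustrate the tuned bound, here is a small simulation with made-up uniform random losses (the instance and parameters are my own, purely for illustration) checking that the regret stays below $2\sqrt{T \ln n}$ when $\epsilon = \sqrt{\ln n / T}$:

```python
import math
import random

# Empirical illustration of the regret bound L <= min_i L_i + 2*sqrt(T ln n)
# for multiplicative weights with epsilon = sqrt(ln n / T).

rng = random.Random(1)
n, T = 10, 2000
eps = math.sqrt(math.log(n) / T)          # about 0.034 here, well below 1/2
w = [1.0] * n                             # w_i^1 := 1
alg_loss = 0.0
expert_loss = [0.0] * n
for _ in range(T):
    losses = [rng.uniform(-1, 1) for _ in range(n)]
    W = sum(w)
    alg_loss += sum(wi / W * li for wi, li in zip(w, losses))
    w = [wi * (1 - eps * li) for wi, li in zip(w, losses)]
    for i, li in enumerate(losses):
        expert_loss[i] += li

regret = alg_loss - min(expert_loss)
assert regret <= 2 * math.sqrt(T * math.log(n))
```

The theorem guarantees the final assertion for every loss sequence, not just random ones; the simulation only makes the guarantee concrete.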
To see that this is essentially the best that we can hope for, consider playing a fair roulette game as follows: for $T$ times, we either bet \$1 on red or \$1 on black. If we win we win \$1, and if we lose we lose \$1; we win and lose with probability 1/2 each at each step. Clearly, for every betting strategy, our expected win at the end is 0. We can think of the problem as there being two experts: the red expert always advises to bet red, and the black expert always advises to bet black. For each run of the game, the strategy of always following the best expert has a non-negative gain and, on average, following the best expert has a gain of $\Omega(\sqrt T)$, because there is constant probability that the best expert has a gain of $\Omega(\sqrt T)$. This means that we cannot hope to always achieve at least the gain of the best expert minus $o(\sqrt T)$, even in a setting with 2 experts.
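The $\Omega(\sqrt T)$ claim is easy to check by simulation. In a run where red comes up $R$ times, the red expert gains $2R - T$ and the black expert gains $T - 2R$, so the best expert gains $|2R - T|$, whose expectation is known to be about $\sqrt{2T/\pi}$. A Monte Carlo estimate (the parameters below are my own illustrative choices):

```python
import random

# Monte Carlo estimate of the best expert's expected gain in the fair
# roulette game: E[|#red - #black|], which grows like sqrt(T).

def best_expert_gain(T, trials, rng):
    total = 0
    for _ in range(trials):
        s = sum(1 if rng.random() < 0.5 else -1 for _ in range(T))
        total += abs(s)           # the better of the two experts gains |s|
    return total / trials

g = best_expert_gain(T=400, trials=2000, rng=random.Random(0))
# For T = 400, E|#red - #black| is about sqrt(2*400/pi), i.e. roughly 16.
assert 10 < g < 22
```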
3. Applications
The general expert setting is very similar to a model of investments in which the experts correspond to stocks (or other investment vehicles) and the outcomes correspond to the variation in value of the stocks. The difference is that in our model we “invest” one unit of money at each step regardless of what happened in previous steps, while in investment strategies we compound our gains (and losses). If we look at the logarithm of the value of our investment, however, it is modeled correctly by the experts setting.
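A one-line check of the compounding remark: if our wealth multiplies by $(1 + r_t)$ at each step, then the logarithm of the final wealth is the sum of the per-step log-returns, which is exactly the kind of additive loss/gain the experts setting handles. The returns below are made up for illustration:

```python
import math

# Compounding multiplies per-step returns, so log-wealth adds
# per-step log-returns: log prod (1 + r_t) = sum log(1 + r_t).

returns = [0.05, -0.02, 0.03, -0.01]     # illustrative per-step returns
wealth = 1.0
for r in returns:
    wealth *= (1 + r)
log_wealth = sum(math.log(1 + r) for r in returns)
assert abs(math.log(wealth) - log_wealth) < 1e-12
```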
The multiplicative update algorithm that we described in the previous section arises in several other contexts, with a similar, or even identical, analysis. For example, it arises in the context of boosting in machine learning, and it leads to efficient approximate algorithms for certain special cases of linear programming.
Very interesting article. As a logician I’m particularly interested in the case when T=infinity. In particular, I’m interested in the case when the “experts” are really expert trolls and are intentionally trying to confuse us. For example, say that, in addition to trying to determine whether it will rain tomorrow, we also have a “secondary mission” which is to determine which expert is the most reliable. For simplicity, assume there are only two experts, A and B, both who have perfect knowledge of the weather. Rather than perfectly predict it, though, they conspire together so that A makes correct predictions and B makes incorrect ones until we eventually suspect A is the best expert. Then, they switch, A making incorrect predictions and B making correct predictions, until we suspect that B is the best expert. This continues indefinitely, forcing us to change our minds infinitely often. And contrary to intuition, it does NOT require the experts be able to see our guesses, provided we’ve nailed down the multiplicative weight algorithm or any other particular algorithm. The experts can merely run that algorithm themselves, allowing them to confuse us even if our guesses are invisible to them. It would be interesting to compute exactly what predictions these mischievous experts would make (this is completely determined, assuming, say, that by coincidence it rains every day).
Some corrections:
1. In both the simplified and general settings, it seems w_{i}^{1} should be 1, not 0.
2. The inequality signs (\leq, \geq) just above the statement of Eq. (1) should be reversed.
[Thanks, fixed now. — L.]