This is the seventh in a series of posts on online optimization, where we alternate one post explaining a result from the theory of online convex optimization and one post explaining an “application” in computational complexity or combinatorics. The first two posts were about the technique of Multiplicative Weights Updates and its application to “derandomizing” probabilistic arguments based on combining a Chernoff bound and a union bound. The third and fourth post were about the Follow-the-Regularized-Leader framework, which unifies multiplicative weights and gradient descent, and a “gradient descent view” of the Frieze-Kannan Weak Regularity Lemma. The fifth and sixth post were about the constrained version of the Follow-the-Regularized-Leader framework, and the Impagliazzo Hard-Core Set Lemma. Today we shall see the technique of Matrix Multiplicative Weights Updates.

**1. Matrix Multiplicative Weights Update **

In this post we consider the following generalization, introduced and studied by Arora and Kale, of the “learning from expert advice” setting and the multiplicative weights update method. In the “experts” model, we have a repeated game in which, at each time step , we have the option of following the advice of one of experts; if we follow the advice of expert at time , we incur a loss of , which is unknown to us (although, at time we know the loss functions ). We are allowed to choose a probabilistic strategy, whereby we follow the advice of expert with probability , so that our expected loss at time is .

In the matrix version, instead of choosing an expert we are allowed to choose a unit -dimensional vector , and the loss incurred in choosing the vector is , where is an unknown symmetric matrix. We are also allowed to choose a probabilistic strategy, so that with probability we choose the unit vector , and we incur the expected loss

The above expression can also be written as

where and we used the Frobenius inner product among square matrices defined as . The matrices that can be obtained as convex combinations of rank-1 matrices of the form where is a unit vector are called *density matrices* and can be characterized as the set of positive semidefinite matrices whose trace is 1.

It is possible to see the above game as the “quantum version” of the experts settings. A choice of a unit vector is a *pure quantum state*, a probability distribution of pure quantum states, described by a density matrix, is a *mixed quantum state*. If is a density matrix describing a mixed quantum state, is a symmetric matrix, and is the spectral decomposition of in terms of its eigenvalues and orthonormal eigenvectors , then is the expected outcome of a measurement of in the basis , and such that is the value of the measurement if the outcome is .

If you have no idea what the above paragraph means, that is perfectly ok because this view will not be particularly helpful in motivating the algorithm and analysis that we will describe. (Here I am reminded of the joke about the way people from Naples give directions: “How do I get to the post office?”, “Well, you see that road over there? After the a couple of blocks there is a pharmacy, where my uncle used to work, though now he is retired.” “Ok?” “Now, if you turn left after the pharmacy, after a while you get to a square with a big fountain and the church of St. Anthony where my niece got married. It was a beautiful ceremony, but the food at the reception was not great.” “Yes, I know that square”, “Good, don’t go there, the post office is not that way. Now, if you instead take that other road over there …”)

The main point of the above game, and of the Matrix Multiplicative Weights Update (MMWU) algorithm that plays it with bounded regret, is that it provides useful generalizations of the standard “experts” game and of the Multiplicative Weights Update (MWU) algorithm. For example, as we have already seen, MWU can provide a “derandomization” of the Chernoff bound; we will see that MMWU provides a derandomization of the *matrix* Chernoff bound. MWU can be used to approximate certain Linear Programming problems; MMWU can be used to approximate certain *Semidefinite Programming* problems.

To define and analyze the MMWU algorithm, we need to introduce certain operations on matrices. We will always work with real-valued symmetric matrices, but everything generalizes to complex-valued Hermitian matrices. If is a symmetric matrix, are the eigenvalues of , and are corresponding orthonormal eigenvectors, then we will define a number of operations and functions on that operate on the eigenvalues while leaving the eigenvectors unchanged.

The first operation is *matrix exponentiation*: we define

The operation always defines a positive definite matrix, and the resulting matrix satisfies a “Taylor expansion”

Indeed, it is more common to use the above expansion as the definition of the matrix exponential, and then derive the expression in terms of eigenvalues.

We also have the useful bounds

which is true for every and

which is true for all such that .

Analogously, if is positive definite, we can define

and we have a number of identities like , , , where is a scalar. We should be careful, however, not to take the analogy with real numbers too far: for example, if and are two symmetric matrices, in general it is not trues that , in fact the above expression is actually always false except when and commute, in which case it is trivially true. We have, however, the following extremely useful fact.

Theorem 1 (Golden-Thompson Inequality)

The Golden-Thompson inequality will be all we need to generalize to this matrix setting everything we have proved about multiplicative weights. See this post by Terry Tao for a proof.

The *Von Neumann entropy* of a density matrix with eigenvalues is defined as

that is, if we view as the mixed quantum state in which the pure state has probability , then is the entropy of the distribution over the pure states. Again, this is not a particularly helpful point of view, and in fact we will be interested in defining not just for density matrices but for arbitrary positive definite matrices, and even positive semidefinite (with the convention that , which is used also in the standard definition of entropy of a distribution).

We will be interested in using Von Neumann entropy as a regularizer, and hence we will want to know what is its Bregman divergence. Some calculations show that the Bregman divergence of the Von Neumann entropy, which is called the quantum relative entropy, is

If and are density matrices, the terms cancel out; the above definition is valid for arbitrary positive definite matrices.

We will have to study the minima of various functions that take a matrix as an input, so it is good to understand how to compute the gradient of such functions. For example what is the gradient of the function ? Working through the definition we see that , and indeed we always have that the gradient of the function is everywhere. Somewhat less obvious is the calculation of the gradient of the Von Neumann entropy, which is

**2. Analysis in the Constrained FTRL Framework **

Suppose that we play that we described above using agile mirror descent and using negative Von Neumann entropy (appropriately scaled) as a regularizer. That is, for some that we will choose later, we use the regularizer

which has the Bregman divergence

and our feasible set is the set of density matrices

To bound the regret, we just have to plug the above definitions into the machinery that we developed in our fifth post.

At time 1, we play the identity matrix scaled by n, which is a density matrix of maximum Von Neumann entropy :

At time , we play the matrix obtained as

and recall that we proved that, after steps,

If is a density matrix with eigenvalues , then the first term is

To complete the analysis we have to understand . We need to compute the gradient and set it to zero. The gradient of is just . The gradient of is

Meaning that we want to solve for

and satisfies

and we can write

Then we can use Golden-Thompson and the fact that , which holds if , to write

Combining everything together we have

and so, provided ,

This is the best bound we can hope for, and it matches Theorem 1 in our first post about the Xultiplicative Weights Update algorithm.

If we have , we can simplify it to

where the last step comes from optimizing .

We can also write, under the condition ,

where is the “absolute value” of the matrix defined in the following way: if is a symmetric matrix, then its absolute value is . Allen-Zhu, Liao and Orecchia state the analysis in this way in their on generalizations of Matrix Multiplicative Weights.

Our next post will discuss applications at length, but for now let us gain a bit of intuition about the usefulness of these regret bounds. Recall that, for every symmetric matrix , we have

and so the regret bound can be reintepreted in the following way: if we let be the loss functions used in a game played against a MMWU algorithm, and the algorithm selects density matrices , then

that is,

provided that . For example, switching with , we have

provided that , which means that if we can choose a sequence of loss matrices that make the MMWU have small loss at each step, then we are guaranteed that the sum of such matrices cannot have any large eigenvalue.