Online Optimization Post 7: Matrix Multiplicative Weights Update

This is the seventh in a series of posts on online optimization, where we alternate one post explaining a result from the theory of online convex optimization and one post explaining an “application” in computational complexity or combinatorics. The first two posts were about the technique of Multiplicative Weights Updates and its application to “derandomizing” probabilistic arguments based on combining a Chernoff bound and a union bound. The third and fourth post were about the Follow-the-Regularized-Leader framework, which unifies multiplicative weights and gradient descent, and a “gradient descent view” of the Frieze-Kannan Weak Regularity Lemma. The fifth and sixth post were about the constrained version of the Follow-the-Regularized-Leader framework, and the Impagliazzo Hard-Core Set Lemma. Today we shall see the technique of Matrix Multiplicative Weights Updates.

1. Matrix Multiplicative Weights Update

In this post we consider the following generalization, introduced and studied by Arora and Kale, of the “learning from expert advice” setting and the multiplicative weights update method. In the “experts” model, we have a repeated game in which, at each time step {t}, we have the option of following the advice of one of {n} experts; if we follow the advice of expert {i} at time {t}, we incur a loss of {\ell_t (i)}, which is unknown to us (although, at time {t} we know the loss functions {\ell_1(\cdot),\ldots,\ell_{t-1}(\cdot)}). We are allowed to choose a probabilistic strategy, whereby we follow the advice of expert {i} with probability {x_t(i)}, so that our expected loss at time {t} is {\sum_{i=1}^n x_t(i) \ell_t(i)}.

In the matrix version, instead of choosing an expert {i} we are allowed to choose a unit {n}-dimensional vector {v_t}, and the loss incurred in choosing the vector {v_t} is {v_t ^T L_t v_t}, where {L_t} is an unknown symmetric {n\times n} matrix. We are also allowed to choose a probabilistic strategy, so that with probability {x_t(j)} we choose the unit vector {v_t^{(j)}}, and we incur the expected loss

\displaystyle  \sum_j x_t (j) \cdot (v_t^{(j)})^T L_t v_t^{(j)}

The above expression can also be written as

\displaystyle  X_t \bullet L_t

where {X_t = \sum_j x_t(j) v_t^{(j)}(v_t^{(j)})^T} and we used the Frobenius inner product among square matrices defined as {A \bullet B = \sum_{i,j} A_{i,j} B_{i,j}}. The matrices {X_t} that can be obtained as convex combinations of rank-1 matrices of the form {vv^T} where {v} is a unit vector are called density matrices and can be characterized as the set of positive semidefinite matrices whose trace is 1.

It is possible to see the above game as the “quantum version” of the experts settings. A choice of a unit vector {v_t^{(j)}} is a pure quantum state, a probability distribution of pure quantum states, described by a density matrix, is a mixed quantum state. If {X} is a density matrix describing a mixed quantum state, {L} is a symmetric matrix, and {L = \sum_i \lambda_i w_i w_i^T} is the spectral decomposition of {L} in terms of its eigenvalues {\lambda_i} and orthonormal eigenvectors {w_i}, then {X \bullet L} is the expected outcome of a measurement of {X} in the basis {w_1,\ldots,w_n}, and such that {\lambda_i} is the value of the measurement if the outcome is {w_i}.

If you have no idea what the above paragraph means, that is perfectly ok because this view will not be particularly helpful in motivating the algorithm and analysis that we will describe. (Here I am reminded of the joke about the way people from Naples give directions: “How do I get to the post office?”, “Well, you see that road over there? After the a couple of blocks there is a pharmacy, where my uncle used to work, though now he is retired.” “Ok?” “Now, if you turn left after the pharmacy, after a while you get to a square with a big fountain and the church of St. Anthony where my niece got married. It was a beautiful ceremony, but the food at the reception was not great.” “Yes, I know that square”, “Good, don’t go there, the post office is not that way. Now, if you instead take that other road over there …”)

The main point of the above game, and of the Matrix Multiplicative Weights Update (MMWU) algorithm that plays it with bounded regret, is that it provides useful generalizations of the standard “experts” game and of the Multiplicative Weights Update (MWU) algorithm. For example, as we have already seen, MWU can provide a “derandomization” of the Chernoff bound; we will see that MMWU provides a derandomization of the matrix Chernoff bound. MWU can be used to approximate certain Linear Programming problems; MMWU can be used to approximate certain Semidefinite Programming problems.

To define and analyze the MMWU algorithm, we need to introduce certain operations on matrices. We will always work with real-valued symmetric matrices, but everything generalizes to complex-valued Hermitian matrices. If {M = \sum_i \lambda_i w_i w_i^T} is a symmetric matrix, {\lambda_i} are the eigenvalues of {X}, and {w_i} are corresponding orthonormal eigenvectors, then we will define a number of operations and functions on {X} that operate on the eigenvalues while leaving the eigenvectors unchanged.

The first operation is matrix exponentiation: we define

\displaystyle  e^X := \sum_i e^{\lambda_i} w_i w_i^T

The operation always defines a positive definite matrix, and the resulting matrix satisfies a “Taylor expansion”

\displaystyle  e^X = \sum_{k=0}^\infty \frac1 {k!} X^k

Indeed, it is more common to use the above expansion as the definition of the matrix exponential, and then derive the expression in terms of eigenvalues.

We also have the useful bounds

\displaystyle  e^X \succeq I + X

which is true for every {X} and

\displaystyle  e^X \preceq I + X +X^2

which is true for all {X} such that {X \preceq I}.

Analogously, if {X} is positive definite, we can define

\displaystyle  \log X := \sum_i (\log \lambda_i) \cdot w_i w_i^T

and we have a number of identities like {\log e^X = X}, {\log X^k = k \log X}, {e^{k X} = e^k \cdot X}, where {k} is a scalar. We should be careful, however, not to take the analogy with real numbers too far: for example, if {A} and {B} are two symmetric matrices, in general it is not trues that {e^{A+B} = e^A \cdot e^B}, in fact the above expression is actually always false except when {A} and {B} commute, in which case it is trivially true. We have, however, the following extremely useful fact.

Theorem 1 (Golden-Thompson Inequality)

\displaystyle  {\rm tr}(e^{A+B}) \leq {\rm tr}(e^A \cdot e^B)

The Golden-Thompson inequality will be all we need to generalize to this matrix setting everything we have proved about multiplicative weights. See this post by Terry Tao for a proof.

The Von Neumann entropy of a density matrix {X} with eigenvalues {\lambda_1,\cdots,\lambda_n} is defined as

\displaystyle  S(X) = \sum_i \lambda_i \log \frac 1 {\lambda_i} = - {\rm tr}(X\log X) = - X \bullet \log X

that is, if we view {X = \sum_i \lambda_i v_i v_i^T} as the mixed quantum state in which the pure state {v_i} has probability {\lambda_i}, then {S(X)} is the entropy of the distribution over the pure states. Again, this is not a particularly helpful point of view, and in fact we will be interested in defining {S(X)} not just for density matrices {X} but for arbitrary positive definite matrices, and even positive semidefinite (with the convention that {0 \log 0 = 0}, which is used also in the standard definition of entropy of a distribution).

We will be interested in using Von Neumann entropy as a regularizer, and hence we will want to know what is its Bregman divergence. Some calculations show that the Bregman divergence of the Von Neumann entropy, which is called the quantum relative entropy, is

\displaystyle  S(X_1|| X_2) = {\rm tr} (X_1 \cdot ( \log X_1 - \log X_2)) + {\rm tr}(X_2) - {\rm tr}(X_1)

\displaystyle  = X_1 \bullet(\log X_1 - \log X_2) + I \bullet (X_2 - X_1)

If {X_1} and {X_2} are density matrices, the terms {{\rm tr}(X_2) - {\rm tr}(X_1)} cancel out; the above definition is valid for arbitrary positive definite matrices.

We will have to study the minima of various functions that take a matrix as an input, so it is good to understand how to compute the gradient of such functions. For example what is the gradient of the function {{\rm tr}(X)}? Working through the definition we see that {\nabla {\rm tr}(X) = I}, and indeed we always have that the gradient of the function {X \rightarrow A\bullet X} is {A} everywhere. Somewhat less obvious is the calculation of the gradient of the Von Neumann entropy, which is

\displaystyle  \nabla (X \bullet \log X) = I + \log X

2. Analysis in the Constrained FTRL Framework

Suppose that we play that we described above using agile mirror descent and using negative Von Neumann entropy (appropriately scaled) as a regularizer. That is, for some {c} that we will choose later, we use the regularizer

\displaystyle  R(X) = c X \bullet \log X

which has the Bregman divergence

\displaystyle  D(X_1,X_2) = c S(X_1 || X_2) = c X_1 \bullet (\log X_1 - \log X_2) + cI \bullet (X_2 - X_1)

and our feasible set is the set of density matrices

\displaystyle  \Delta := \{ X\in {\mathbb R}^{n\times n} : X \succeq {\bf 0} \wedge {\rm tr}(X) = 1 \}

To bound the regret, we just have to plug the above definitions into the machinery that we developed in our fifth post.

At time 1, we play the identity matrix scaled by n, which is a density matrix of maximum Von Neumann entropy {\log n}:

\displaystyle  X_1 := \arg\min_{X \in \Delta} R(X) = \frac 1n I

At time {t+1\geq 2}, we play the matrix {X_t} obtained as

\displaystyle  \hat X_{t+1} = \arg\min_{X} D(X,X_{t}) + X\bullet L_{t}

\displaystyle  X_{t+1} = \arg\min_{X\in \Delta} D(X,\hat X_{t+1})

and recall that we proved that, after {T} steps,

\displaystyle  Regret_T(X) \leq D(X,X_1) + \sum_{t=1}^T D(X_t,\hat X_{t+1})

If {X} is a density matrix with eigenvalues {\lambda_1,\ldots,\lambda_n}, then the first term is

\displaystyle  D \left( X, \frac 1n I \right) = c X \bullet (\log X - \log n^{-1} I) = c \sum_i \lambda_i \log \frac n{\lambda_i} = c \log n - c \sum_i \lambda_i \frac 1 {\lambda_i} \leq c\log n

To complete the analysis we have to understand {\hat X_{t+1}}. We need to compute the gradient {X \rightarrow D(X,X_{t}) + X\bullet L_{t} } and set it to zero. The gradient of {X\bullet L_t} is just {L_t}. The gradient of {D(X,X_{t})} is

\displaystyle  \nabla D(X,X_t) = c \nabla X \bullet \log X - c \nabla X \bullet \log X_t - \nabla c X \bullet I

\displaystyle  = cI + c \log X - c \log X_t - cI

Meaning that we want to solve for

\displaystyle  c \log X - c\log X_t + L_t = 0

and {\hat X_{t+1}} satisfies

\displaystyle \log X_t - \log \hat X_{t+1} = \frac 1c L_t

\displaystyle  \hat X_{t+1} = e^{\log X_t - \frac 1c L_t }

and we can write

\displaystyle  D(X_t,\hat X_{t+1} ) = c \cdot \left( X_t \bullet (\log X_t - \log \hat X_{t+1})\right) + c {\rm tr}( \hat X_{t+1} )

\displaystyle  = c \cdot X_t \bullet \frac 1c L_t + c \cdot {\rm tr}(e^{\log X_t - \frac 1c L_t}) + c {\rm tr} \hat X_{t+1}

Then we can use Golden-Thompson and the fact that {e^-\frac 1c L_t \preceq I - \frac 1c L_t + \frac 1{c^2} L^2_t}, which holds if {L_t \preceq cI}, to write

\displaystyle  {\rm tr}(e^{\log X_t - \frac 1c L_t}) \leq {\rm tr}(e^{\log X_t} \cdot e^{-\frac 1c L_t} ) = X_t \bullet e^{-\frac 1c L_t} \leq X_t \bullet \left( I - \frac 1c L_t + \frac 1{c^2} L^2_t \right)

Combining everything together we have

\displaystyle  D(X_t,\hat X_{t+1} ) \leq \frac 1c X_t \bullet L_t^2

and so, provided {\lambda_{\max} (L_t) \leq c},

\displaystyle  Regret_T \leq c \log n + \frac 1c \sum_{t=1}^T X_t \bullet L_t^2

This is the best bound we can hope for, and it matches Theorem 1 in our first post about the Xultiplicative Weights Update algorithm.

If we have {L_t \preceq I}, we can simplify it to

\displaystyle  Regret_T \leq c \log n + \frac T c = 2 \sqrt{T \log n}

where the last step comes from optimizing {c}.

We can also write, under the condition {L_t \preceq c I},

\displaystyle  Regret_T (X) \leq c \log n + \frac 1c \sum_{t=1}^T (X_t \bullet |L_t| )||L_t||

where {|L_t|} is the “absolute value” of the matrix {L_t} defined in the following way: if {X = \sum_i \lambda_i v_i v_i^T} is a symmetric matrix, then its absolute value is {|X| = \sum_i |\lambda_i| \cdot v_i v_i^T}. Allen-Zhu, Liao and Orecchia state the analysis in this way in their on generalizations of Matrix Multiplicative Weights.

Our next post will discuss applications at length, but for now let us gain a bit of intuition about the usefulness of these regret bounds. Recall that, for every symmetric matrix {M}, we have

\displaystyle  \lambda_{\min} (M) = \min_{X \rm\ density\ matrix} \ \ X \bullet M

and so the regret bound can be reintepreted in the following way: if we let {L_1,\ldots,L_T} be the loss functions used in a game played against a MMWU algorithm, and the algorithm selects density matrices {X_1,\ldots,X_T}, then

\displaystyle  \sum_{t=1}^T X_t \bullet L_t - \min_{X \rm \ density \ matrix} \ \ X \bullet \sum_{t=1}^T L_t \leq c \log n + \frac 1c \sum_{t=1}^T X_t \bullet L_t^2

that is,

\displaystyle  \sum_{t=1}^T X_t \bullet L_t - \lambda_{\min} \left( \sum_{t=1}^T L_t \right) \leq c \log n + \frac 1c \sum_{t=1}^T X_t \bullet L_t^2

provided that {L_t \preceq cI}. For example, switching {L_t} with {-L_t}, we have

\displaystyle   \lambda_{\max} \left( \sum_{t=1}^T L_t \right) \leq \sum_{t=1}^T X_t \bullet L_t + \frac 1c \sum_{t=1}^T X_t \bullet L_t^2 + c\log n \ \ \ \ \ (1)

provided that {L_t \succeq -cI}, which means that if we can choose a sequence of loss matrices that make the MMWU have small loss at each step, then we are guaranteed that the sum of such matrices cannot have any large eigenvalue.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s