Online Optimization Post 5: Bregman Projections and Mirror Descent

In this post we return to the generic form of the FTRL online optimization algorithm. If the cost functions are linear, as they will be in all the applications that I plan to talk about, the algorithm is:

\displaystyle x_t := \arg\min_{x\in K} \ R(x) + \sum_{k=1}^{t-1} \langle \ell_k, x \rangle \ \ \ \ \ (1)

where K\subseteq {\mathbb R}^n is the convex set of feasible solutions that the algorithm is allowed to produce, {x \rightarrow \langle \ell_k , x \rangle} is the linear loss function at time {k}, and {R: K \rightarrow {\mathbb R}} is the strictly convex regularizer.
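For concreteness, here is a minimal numerical sketch of one step of (1); the instantiation (K as the probability simplex, R as the negative entropy) and all names are illustrative choices of mine, not canonical.

```python
# A minimal sketch of one FTRL step (1) with linear losses, solved by a
# generic convex solver. The instantiation (K = probability simplex,
# R = negative entropy) is an illustrative choice.
import numpy as np
from scipy.optimize import minimize

def ftrl_step(R, past_losses, n):
    """x_t = argmin_{x in K} R(x) + <sum of past losses, x>, K = simplex."""
    L = np.sum(past_losses, axis=0) if past_losses else np.zeros(n)
    x0 = np.full(n, 1.0 / n)  # feasible starting point for the solver
    res = minimize(lambda x: R(x) + L @ x, x0, method="SLSQP",
                   bounds=[(0.0, 1.0)] * n,
                   constraints={"type": "eq", "fun": lambda x: x.sum() - 1.0})
    return res.x

def neg_entropy(x, eps=1e-12):
    x = np.clip(x, eps, None)  # avoid log(0) at the boundary
    return np.sum(x * np.log(x))

x1 = ftrl_step(neg_entropy, [], n=3)  # no losses yet: the uniform distribution
```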

If we have an unconstrained problem, that is, if {K= {\mathbb R}^n}, then the optimization problem (1) has a unique solution: the {x_t} such that

\displaystyle \nabla R(x_t) = - \sum_{k=1}^{t-1} \ell_k

and we can usually both compute {x_t} efficiently in an algorithm and reason about {x_t} effectively in an analysis.
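As a toy instance (my choice of regularizer; any strictly convex differentiable {R} works), take the negative entropy {R(x) = \sum_i x_i \ln x_i} on the positive orthant: the condition {\nabla R(x_t) = -\sum_k \ell_k} reads {1 + \ln x_{t,i} = -\sum_k \ell_{k,i}} and can be solved coordinate-wise.

```python
import numpy as np

def unconstrained_ftrl_negentropy(past_losses, n):
    """Solve grad R(x_t) = -sum_k ell_k for R(x) = sum_i x_i ln x_i:
    coordinate-wise, 1 + ln x_i = -L_i, so x_i = exp(-1 - L_i)."""
    L = np.sum(past_losses, axis=0) if past_losses else np.zeros(n)
    return np.exp(-1.0 - L)
```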

Unfortunately, we are almost always interested in constrained settings, and then it becomes difficult both to compute {x_t} and to reason about it.

A very nice special case happens when the regularizer {R} acts as a barrier function for {K}, that is, the (norm of the) gradient of {R} goes to infinity as one approaches the boundary of {K}. In such a case, it is impossible for the minimum of (1) to occur at the boundary, and the solution will again be the unique {x_t} in {K} such that

\displaystyle \nabla R(x_t) = - \sum_{k=1}^{t-1} \ell_k

We swept this point under the rug when we studied FTRL with the negative-entropy regularizer in the experts setting, in which {K = \Delta} is the set of probability distributions. When we solved (1) using Lagrange multipliers, we ignored the non-negativity constraints. It was okay to do so because the negative entropy is a barrier function for the non-negative orthant {({\mathbb R}_{\geq 0})^n}.
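In that computation (sketched here with the entropy scaled by a constant {c > 0}, which is an assumption about the normalization), the single Lagrange multiplier for the constraint {\sum_i x_i = 1} yields the familiar softmax/multiplicative-weights solution:

```python
import numpy as np

def ftrl_experts(past_losses, c, n):
    """FTRL over the simplex with R(x) = c * sum_i x_i ln x_i.
    One Lagrange multiplier for sum_i x_i = 1 gives the softmax of the
    accumulated losses; the constraints x_i >= 0 can be dropped because
    the negative entropy is a barrier for the non-negative orthant."""
    L = np.sum(past_losses, axis=0) if past_losses else np.zeros(n)
    w = np.exp(-L / c)          # exponential weights
    return w / w.sum()          # normalize onto the simplex
```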

Another important special case occurs when the regularizer {R(x) = c || x||^2} is a multiple of the squared Euclidean norm. In this case, we saw that we could “decouple” the optimization problem by first solving the unconstrained problem, and then projecting the unconstrained solution onto {K}:

\displaystyle y_{t} = \arg\min_{y\in {\mathbb R}^n} \ c || y||^2 + \sum_{k=1}^{t-1} \langle \ell_k, y \rangle

\displaystyle x_t = \Pi_K (y_t) = \arg\min _{x\in K} || x - y_t ||

Then we have the closed-form solution {y_t = - \frac 1{2c} \sum_{k=1}^{t-1} \ell _k} and, depending on the set {K}, the projection might also have a nice closed form, as in the case {K= [0,1]^n} that comes up in results related to regularity lemmas.
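A sketch of the decoupled update in that box case (the scaling {c} and the names below are illustrative): the Euclidean projection onto {[0,1]^n} is just coordinate-wise clipping.

```python
import numpy as np

def ftrl_quadratic_box(past_losses, c, n):
    """Decoupled FTRL step for R(y) = c * ||y||^2 and K = [0,1]^n."""
    L = np.sum(past_losses, axis=0) if past_losses else np.zeros(n)
    y = -L / (2.0 * c)            # unconstrained minimizer: 2*c*y = -L
    return np.clip(y, 0.0, 1.0)   # Pi_K(y): coordinate-wise clipping
```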

As we will see today, this approach of solving the unconstrained problem and then projecting on {K} works for every regularizer, for an appropriate notion of projection called the Bregman projection (the projection will depend on the regularizer).

To define the Bregman projection, we will first define the Bregman divergence with respect to the regularizer {R}, which is a non-negative “distance” {D(x,y)} defined on pairs of points of {{\mathbb R}^n} (or possibly of a subset of {{\mathbb R}^n} for which the regularizer {R} is a barrier function). Then, the Bregman projection of {y} on {K} is defined as {\arg\min_{x\in K} \ D(x,y)}.
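Anticipating the definition, the standard choice is {D(x,y) := R(x) - R(y) - \langle \nabla R(y), x - y\rangle}, the error of the first-order approximation of {R} at {y}. For the negative entropy this works out to the generalized KL divergence, and the Bregman projection onto the simplex turns out to be plain normalization; a sketch under those instantiations, assuming strictly positive inputs:

```python
import numpy as np

def bregman_div_negentropy(x, y):
    """D(x,y) = R(x) - R(y) - <grad R(y), x - y> for R(x) = sum_i x_i ln x_i,
    which simplifies to the generalized KL divergence. Assumes x, y > 0."""
    return np.sum(x * np.log(x / y) - x + y)

def bregman_project_simplex(y):
    """argmin over the simplex of D(x, y) for the divergence above: a single
    Lagrange multiplier shows the minimizer is just y normalized."""
    return y / y.sum()
```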

Unfortunately, it is not so easy to reason about Bregman projections either, but the notion of Bregman divergence offers a way to reinterpret the FTRL algorithm from another point of view, called mirror descent. Via this reinterpretation, we will prove the regret bound

\displaystyle {\rm Regret}_T(x) \leq D(x,x_1) + \sum_{t=1}^T D(x_t,y_{t+1})

which carries the intuition that the regret comes from a combination of the “distance” of our initial solution from the offline optimum and of the “stability” of the algorithm, that is, the “distance” between consecutive solutions. Nicely, the above bound measures both quantities using the same “distance” function.
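To preview the reinterpretation, here is mirror descent in the experts setup (the mirror map is the negative entropy; the learning rate {\eta} and the names below are my choices): each round takes an unconstrained step in the “mirror” space and then Bregman-projects back onto the simplex, which, as noted above, is just normalization.

```python
import numpy as np

def mirror_descent_experts(loss_vectors, eta, n):
    """Mirror descent on the simplex with the negative-entropy mirror map
    (a.k.a. exponentiated gradient / multiplicative weights)."""
    x = np.full(n, 1.0 / n)                      # x_1: the uniform distribution
    iterates = [x]
    for ell in loss_vectors:
        y = x * np.exp(-eta * np.asarray(ell))   # unconstrained mirror step
        x = y / y.sum()                          # Bregman projection: normalization
        iterates.append(x)
    return iterates
```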
