# Online Optimization Post 4: Regularity Lemmas

We now discuss how to view proofs of certain regularity lemmas as applications of the FTRL methodology.

The Szemeredi Regularity Lemma states (in modern language) that every dense graph is well approximate by a graph with a very simple structure, made of the (edge-disjoint) union of a constant number of weighted complete bipartite subgraphs. The notion of approximation is a bit complicated to describe, but it enables the proof of counting lemmas, which show that, for example, the number of triangles in the original graph is well approximated by the (appropriately weighted) number of triangles in the approximating graph.

Analogous regularity lemmas, in which an arbitrary object is approximated by a low-complexity object, have been proved for hypergraphs, for subsets of abelian groups (for applications to additive combinatorics), in an analytic setting (for applications to graph limits) and so on.

The weak regularity lemma of Frieze and Kannan provides, as the name suggests, a weaker kind of approximation than the one promised by Szemeredi’s lemma, but one that is achievable with a graph that has a much smaller number of pieces. If ${\epsilon}$ is the “approximation error” that one is willing to tolerate, Szemeredi’s lemma constructs a graph that is the union of a ${2^{2^{\vdots}}}$ weighted complete bipartite subgraphs where the height of the tower of exponentials is polynomial in ${1/\epsilon}$. In the Frieze-Kannan construction, that number is cut down to a single exponential ${2^{O(1/\epsilon^2)}}$. This result too can be generalized to graph limits, subsets of groups, and so on.

With Tulsiani and Vadhan, we proved an abstract version of the Frieze-Kannan lemma (which can be applied to graphs, functions, distributions, etc.) in which the “complexity” of the approximation is ${O(1/\epsilon^2)}$. In the graph case, the approximating graph is still the union of ${2^{O(1/\epsilon^2)}}$ complete bipartite subgraphs, but it has a more compact representation. One consequence of this result is that for every high-min-entropy distribution ${\cal D}$, there is an efficiently samplable distribution with the same min-entropy as ${\cal D}$, that is indistinguishable from ${\cal D}$. Such a result could be taken to be a proof that what GANs attempt to achieve is possible in principle, except that our result requires an unrealistically high entropy (and we achieve “efficient samplability” and “indistinguishability” only in a weak sense).

All these results are proved with a similar strategy: one starts from a trivial approximator, for example the empty graph, and then repeats the following iteration: if the current approximator achieves the required approximation, then we are done; otherwise take a counterexample, and modify the approximator using the counterexample. Then one shows that:

• The number of iterations is bounded, by keeping track of an appropriate potential function;
• The “complexity” of the approximator does not increase too much from iteration to iteration.

Typically, the number of iterations is ${O(1/\epsilon^2)}$, and the difference between the various results is given by whether at each iteration the “complexity” increases exponentially, or by a multiplicative factor, or by an additive term.

Like in the post on pseudorandom constructions, one can view such constructions as an online game between a “builder” and an “inspector,” except that now the online optimization algorithm will play the role of the builder, and the inspector is the one acting as an adversary. The ${O(1/\epsilon^2)}$ bound on the number of rounds comes from the fact that the online optimization algorithms that we have seen so far achieve amortized error per round ${O(1/\sqrt T)}$ after ${T}$ rounds, so it takes ${O(1/\epsilon^2)}$ rounds for the error bound to go below ${\epsilon}$.

We will see that the abstract weak regularity lemma of my paper with Tulsiani and Vadhan (and hence the graph weak regularity lemma of Frieze and Kannan) can be immediately deduced from the theory developed in the previous post.

When I was preparing these notes, I was asked by several people if the same can be done for Szemeredi’s lemma. I don’t see a natural way of doing that. For such results, one should maybe use the online optimization techniques as a guide rather than as a black box. In general, iterative arguments (in which one constructs an object through a series of improvements) require the choice of a potential function, and an argument about how much the potential function changes at every step. The power of the FTRL method is that it creates the potential function and a big part of the analysis automatically and, even where it does not work directly, it can serve as an inspiration.

One could imagine a counterfactual history in which people first proved the weak regularity lemma using online optimization out of the box, as we do in this post, and then decided to try and use an L2 potential function and an iterative method to get the Szemeredi lemma, subsequently trying to see what happens if the potential function is entropy, thus discovering Jacob Fox’s major improvement on the “triangle removal lemma,” which involves the construction of an approximator that just approximates the number of triangles.

1. A “vanilla” weak regularity lemma

Frieze and Kannan proved the following basic result about graph approximations, which has a number of algorithmic applications. If ${V}$ is a set of vertices which is understood from the context, and ${A,B \subseteq V}$ are disjoint subsets of vertices, then let ${K_{A,B} = {\bf 1}_A \cdot {\bf 1}_B^T}$, that is, the boolean matrix such that ${K_{A,B} (i,j) = 1}$ iff ${i\in A}$ and ${j\in B}$.

The cut norm of a matrix ${M}$ is

$\displaystyle || M ||_{\square} := \max_{A,B} | \langle M, K_{A,B} \rangle |$

In the following we will identify a graph with its adjacency matrix.

Theorem 1 Let ${G=(V,E)}$ be an graph on ${n}$ vertices and ${\epsilon >0}$ be an approximation parameter.

Then there are sets ${A_1,B_1,\ldots,A_T,B_T}$ and scalars ${\alpha_1,\ldots,\alpha_T}$, where ${T \leq O(1/\epsilon^2)}$, such that if we define

$\displaystyle H:= \sum_{i=1}^T \alpha_i K_{A_i,B_i}$

we have

$\displaystyle || G - H ||_{\square} \leq \epsilon n^2$

We will prove the following more general version.

Theorem 2 Let ${X}$ be a set, ${g: X \rightarrow [0,1]}$ be a bounded function, ${{\cal F}}$ be a family of functions mapping ${X}$ to ${[0,1]}$ and ${\epsilon}$ be an approximation parameter. Then there are functions ${f_1,\ldots,f_T}$ in ${{\cal F}}$ and scalars ${\alpha_1,\ldots,\alpha_T}$, with ${T = O(1/\epsilon^2)}$, such that if we define

$\displaystyle h := \sum_{i=1}^T \alpha_i f_i$

we have

$\displaystyle \forall f\in {\cal F}: \ \ | \langle f, g- h \rangle | \leq \epsilon |X|$

We could also, with the same proof, argue about a possibly infinite set ${X}$ with a measure ${\mu}$ such that ${\mu(X)}$ is finite, and, after defining the inner product

$\displaystyle \langle f, g \rangle := \int_X f\cdot g\ d \mu \ ,$

we could prove the same conclusion of the theorem, with ${\epsilon \cdot \mu(X)}$ instead of ${\epsilon |X|}$ as an error bound.

Here is the proof: run the FTRL algorithm with L2-squared regularizer in the setup in which the space of solutions ${K}$ is the set of all functions ${ X \rightarrow {\mathbb R}}$ and the loss functions are linear. Every time the algorithm proposes a solution ${h_t}$, if there is a function ${f_t \in {\cal F}}$ such that either ${ \langle h_t - g , f_t \rangle > \epsilon|X|}$ or ${ \langle h_t - g , f_t \rangle < - \epsilon|X|}$, the adversary will pick, respectively, ${f_t}$ or ${-f_t}$ as a loss function ${\ell_t}$. When the adversary has no such choice, we stop and the function ${h_t}$ is our desired approximation.

First of all, let us analyze the number of rounds. Here the maximum norm of the functions in ${{\cal F}}$ is ${\sqrt {|X|}}$, so after ${T}$ rounds we have the regret bound

$\displaystyle \forall h : \ \ \sum_{t=1}^T \langle \ell_t, h_t - h \rangle \leq \sqrt{|X|} \cdot || h || \cdot \sqrt{2T}$

Now let us consider ${g}$ to be our offline solution: we have

$\displaystyle \epsilon T |X| < \sum_{t=1}^T \langle \ell_t, h_t - g \rangle \leq |X| \cdot \sqrt{2T}$

which implies

$\displaystyle T < \frac 2{\epsilon^2}$

Finally, recall that

$\displaystyle h_T = \sum_{t=1}^{T-1} - \frac 1 {2c} \ell_t$

where ${c}$ is the scaling constant in the definition of the regularizer (${1/2c}$ is of order of ${\epsilon}$ when ${T}$ is order of ${1/\epsilon^2}$), and so our final approximator ${h_T}$ computed at the last round is a weighted sum of functions from ${{\cal F}}$.

2. The weak regularity lemma

Frieze and Kannan’s weak regularity lemma has the following form.

Theorem 3 Let ${G=(V,E)}$ be an graph on ${n}$ vertices and ${\epsilon >0}$ be an approximation parameter.

Then there is a partition of ${V}$ into ${k = 2^{O(1/\epsilon^2)}}$ sets ${S_1,\ldots,S_k}$, and there are bounded weights ${0\leq p_{i,j} \leq 1}$ for ${i,j \in \{1,\ldots, k\}}$ such that if we defined the weighted graph ${H}$ where the weight of the edge ${(u,v)}$ in ${H}$ is ${p_{i,j}}$, where ${u\in S_i}$ and ${v\in S_j}$, then we have

$\displaystyle || G - H ||_{\square} \leq \epsilon n^2$

Notice that if we did not require the weights to be between 0 and 1 then the result of the previous section can also be cast in the above language, because we can take the partition ${S_1,\ldots,S_k}$ to be the “Sigma-algebra generated by” the sets ${A_1,B_1,\ldots,A_T,B_T}$.

For a scalar ${z\in {\mathbb R}}$, let ${\tau(z)}$ be defined as

$\displaystyle \tau(z) = \left\{ \begin{array}{rl} 0 & \mbox{ if } z <0\\ z & \mbox{ if } 0\leq z \leq 1\\ 1 & \mbox{ if } z > 1 \end{array} \right.$

where ${\tau}$ stands for truncation. Note that ${\tau(z)}$ is the L2 projection of ${z}$ on ${[0,1]}$.

Theorem 3 is a special case of the following result, proved in our paper with Tulsiani and Vadhan.

Theorem 4 Let ${X}$ be a set, ${g: X \rightarrow [0,1]}$ be a bounded function, ${{\cal F}}$ be a family of functions mapping ${X}$ to ${[0,1]}$ and ${\epsilon}$ be an approximation parameter. Then there are functions ${f_1,\ldots,f_T}$ in ${{\cal F}}$ and scalars ${\alpha_1,\ldots,\alpha_T}$, with ${T = O(1/\epsilon^2)}$, such that if we define

$\displaystyle h := \tau\left ( \sum_{i=1}^T \alpha_i f_i \right)$

we have

$\displaystyle \forall f\in {\cal F}: \ \ | \langle f, g- h \rangle | \leq \epsilon |X|$

To prove Theorem 4 we play the same online game as in the previous section: the online algorithm proposes a solution ${h_t}$; if ${\forall f\in {\cal F}: \ \ | \langle f, g- h \rangle | \leq \epsilon |X| }$ then we stop and output ${h=h_t}$, otherwise we let the loss function be a function ${\ell_t}$ such that either ${\ell_t}$ or ${-\ell_t}$ is in ${{\cal F}}$ and

$\displaystyle \langle \ell_t , g- h_t \rangle \geq \epsilon |X|$

The only difference is that we use the FTRL algorithm with L2 regularizer that has the set feasible solutions ${K}$ defined to be the set of all functions ${h : X \rightarrow [0,1]}$ rather than the set of all functions ${h: X \rightarrow {\mathbb R}}$. Then each function ${h_T}$ is the projection to ${K}$ of ${\sum_{t=1}^{T-1} - \frac 1 {2c}\ell_t}$, and the projection to ${K}$ is just composition with ${\tau}$. The bound on the number of steps is the same as the one in the previous section.

Looking at the case in which ${X}$ is the set of edges of a clique on ${V}$, ${{\cal F}}$ is the set of graphs of the form ${K_{A,B}}$, and considering the Sigma-algebra generated by ${A_1,B_1,\ldots,A_T,B_T}$ gives Theorem 3 from Theorem 4.

3. Sampling High-Entropy Distributions

Finally we discuss the application to sampling high-entropy distributions.

Suppose that ${D}$ is a distribution over ${\{ 0,1 \}^n}$ of min-entropy ${\geq n-d}$, meaning that for every ${x\in \{ 0,1 \}^n}$ we have

$\displaystyle D(x) \leq 2^{-(n-d)}$

where we think of the entropy deficiency ${d}$ as being small, such as a constant or ${O(\log n)}$

Let ${{\cal F}}$ be a class of functions ${\{ 0,1 \}^n \rightarrow \{ 0,1\}}$ that we think of as being “efficient.” For example, ${{\cal F}}$ could be the set of all functions computable by circuits of size ${\leq S(n)}$ for some size bound ${S(\cdot)}$, such as, for example ${S(n) = 10 n^2}$. We will assume that ${f(x) \equiv 1}$ is in ${{\cal F}}$. Define

$\displaystyle g(x) = 2^{n-d} \cdot D(x)$

to be a bounded function ${g: \{ 0,1 \}^n \rightarrow [0,1]}$. Fix an approximation parameter ${\epsilon >0}$.

Then from Theorem 4 we have that there are ${T = O(1/\epsilon^2)}$ functions ${f_1,\ldots,f_T}$, and scalars ${\alpha_1,\ldots,\alpha_T}$, all equal to ${\pm 1/2c}$ for a certain parameter ${c}$, such that if we define

$\displaystyle h(x) = \tau \left( \sum_{t=1}^T \alpha_t f_t(x) \right)$

we have

$\displaystyle \forall f \in {\cal F} : \ \ \ \left | \sum_{x\in \{ 0,1 \}^n} \left( g(x) f(x) - h(x) f(x) \right ) \right | \leq \epsilon 2^n \ \ \ \ \ (1)$

and so, multiplying by ${2^{-(n-d)}}$

$\displaystyle \forall f \in {\cal F} : \ \ \ \left | \sum_{x\in \{ 0,1 \}^n} \left( D(x) f(x) - h(x)2^{-(n-d)} f(x) \right ) \right | \leq \epsilon 2^d \ \ \ \ \ (2)$

Now define the probability distribution

$\displaystyle H(x) = \frac {h(x)}{\sum_{y\in \{ 0,1 \}^n} h(y) }$

Applying (1) to the case ${f(x) \equiv 1}$, we have

$\displaystyle \left | \sum_{x\in \{ 0,1 \}^n} \left( g(x) - h(x) \right ) \right | \leq \epsilon 2^n$

and we know that ${\sum_x g(x) = 2^{n-d} \sum_x D(x) = 2^{n-d}}$, so

$\displaystyle \left | 2^{n-d} - \sum_{x\in \{ 0,1 \}^n} h(x) \right | \leq \epsilon 2^n$

and we can rewrite (2) as

$\displaystyle \forall f \in {\cal F} : \ \ \ \left | \sum_{x\in \{ 0,1 \}^n} \left( D(x) f(x) - H(x) \cdot (\sum_y h(y)) 2^{-(n-d)} f(x) \right ) \right | \leq \epsilon 2^d$

and, finally

$\displaystyle \forall f \in {\cal F} : \ \ \ \left | \sum_{x\in \{ 0,1 \}^n} \left( D(x) f(x) - H(x) f(x) \right ) \right | \leq 2 \epsilon 2^d$

that is

$\displaystyle \forall f \in {\cal F} : \ \ \ \left | \Pr_{x\sim D} [f(x) = 1] - \Pr_{x\sim H} [f(x) =1 ] \right | \leq 2 \epsilon 2^d$

which says that ${H}$ and ${D}$ are ${\epsilon 2^{d+1}}$-indistinguishable by functions in ${{\cal F}}$. If we chose ${{\cal F}}$, for example, to be the class of functions computable by circuits of size ${\leq S(n)}$, then ${H}$ and ${D}$ are ${\epsilon 2^{d+1}}$-indistinguishable by circuits of size ${\leq S(n)}$.

But ${H}$ is also samplable in a relatively efficient way using rejection sampling: pick a random ${x}$, then output ${x}$ with probability ${h(x)}$ and fail with probability ${1-h(x)}$. Repeat the above until the procedure does not fail. At each step, the probability of success is ${\mathop{\mathbb E}_{x\in \{ 0,1 \}^n} h(x) \geq 2{-d} - \epsilon}$, so, assuming (because otherwise all of the above makes no sense) that, say, ${\epsilon < 2^{-d-1}}$, the procedure succeeds on average in at most ${2^{d+1}}$ attempts. And if each ${f_t}$ is computable by a circuit of size ${S(n)}$, then ${h}$ is computable by a circuit of size ${O(1/\epsilon^2) + S(n)}$.

The undesirable features of this result are that the complexity of sampling and the quality of indistinguishability depend exponentially on the randomness deficiency, and the sampling circuit is a non-uniform circuit that it’s not clear how to construct without advice. Impagliazzo’s recent results address both these issues.