We now discuss how to view proofs of certain *regularity lemmas* as applications of the FTRL methodology.

The Szemeredi Regularity Lemma states (in modern language) that every dense graph is well approximated by a graph with a very simple structure, made of the (edge-disjoint) union of a constant number of weighted complete bipartite subgraphs. The notion of approximation is a bit complicated to describe, but it enables the proof of *counting lemmas*, which show that, for example, the number of triangles in the original graph is well approximated by the (appropriately weighted) number of triangles in the approximating graph.

Analogous regularity lemmas, in which an arbitrary object is approximated by a low-complexity object, have been proved for hypergraphs, for subsets of abelian groups (for applications to additive combinatorics), in an analytic setting (for applications to graph limits) and so on.

The *weak regularity lemma* of Frieze and Kannan provides, as the name suggests, a weaker kind of approximation than the one promised by Szemeredi’s lemma, but one that is achievable with a graph that has a much smaller number of pieces. If $\epsilon$ is the “approximation error” that one is willing to tolerate, Szemeredi’s lemma constructs a graph that is the union of $2^{2^{\cdot^{\cdot^{\cdot}}}}$ weighted complete bipartite subgraphs, where the height of the tower of exponentials is polynomial in $1/\epsilon$. In the Frieze-Kannan construction, that number is cut down to a single exponential $2^{O(1/\epsilon^2)}$. This result too can be generalized to graph limits, subsets of groups, and so on.

With Tulsiani and Vadhan, we proved an abstract version of the Frieze-Kannan lemma (which can be applied to graphs, functions, distributions, etc.) in which the “complexity” of the approximation is $O(1/\epsilon^2)$. In the graph case, the approximating graph is still the union of $2^{O(1/\epsilon^2)}$ complete bipartite subgraphs, but it has a more compact representation. One consequence of this result is that for every high-min-entropy distribution $D$, there is an efficiently samplable distribution with the same min-entropy as $D$ that is indistinguishable from $D$. Such a result could be taken to be a proof that what GANs attempt to achieve is possible in principle, except that our result requires an unrealistically high entropy (and we achieve “efficient samplability” and “indistinguishability” only in a weak sense).

All these results are proved with a similar strategy: one starts from a trivial approximator, for example the empty graph, and then repeats the following iteration: if the current approximator achieves the required approximation, then we are done; otherwise take a counterexample, and modify the approximator using the counterexample. Then one shows that:

- The number of iterations is bounded, by keeping track of an appropriate potential function;
- The “complexity” of the approximator does not increase too much from iteration to iteration.

Typically, the number of iterations is $O(1/\epsilon^2)$, and the difference between the various results is given by whether at each iteration the “complexity” increases exponentially, or by a multiplicative factor, or by an additive term.

Like in the post on pseudorandom constructions, one can view such constructions as an online game between a “builder” and an “inspector,” except that now the online optimization algorithm will play the role of the builder, and the inspector is the one acting as an adversary. The bound on the number of rounds comes from the fact that the online optimization algorithms that we have seen so far achieve amortized error per round $O(1/\sqrt{T})$ after $T$ rounds, so it takes $O(1/\epsilon^2)$ rounds for the error bound to go below $\epsilon$.

We will see that the abstract weak regularity lemma of my paper with Tulsiani and Vadhan (and hence the graph weak regularity lemma of Frieze and Kannan) can be immediately deduced from the theory developed in the previous post.

When I was preparing these notes, I was asked by several people if the same can be done for Szemeredi’s lemma. I don’t see a natural way of doing that. For such results, one should maybe use the online optimization techniques as a guide rather than as a black box. In general, iterative arguments (in which one constructs an object through a series of improvements) require the choice of a potential function, and an argument about how much the potential function changes at every step. The power of the FTRL method is that it creates the potential function and a big part of the analysis automatically and, even where it does not work directly, it can serve as an inspiration.

One could imagine a counterfactual history in which people first proved the weak regularity lemma using online optimization out of the box, as we do in this post, and then decided to try and use an L2 potential function and an iterative method to get the Szemeredi lemma, subsequently trying to see what happens if the potential function is entropy, thus discovering Jacob Fox’s major improvement on the “triangle removal lemma,” which involves the construction of an approximator that just approximates the number of triangles.

**1. A “vanilla” weak regularity lemma**

Frieze and Kannan proved the following basic result about graph approximations, which has a number of algorithmic applications. If $V$ is a set of vertices which is understood from the context, and $A, B \subseteq V$ are disjoint subsets of vertices, then let $K_{A,B} := \mathbf{1}_A \cdot \mathbf{1}_B^T$, that is, the boolean matrix such that $K_{A,B}(i,j) = 1$ iff $i \in A$ and $j \in B$.

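The *cut norm* of a matrix $M$ is

$$\| M \|_\square := \max_{A,B} \left| \langle M, K_{A,B} \rangle \right| = \max_{A,B} \left| \sum_{i \in A, j \in B} M_{i,j} \right|$$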

In the following we will identify a graph with its adjacency matrix.
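To make the definition concrete, here is a minimal brute-force sketch of the cut norm for a tiny matrix. The function name `cut_norm` is ours, and the maximum is taken over disjoint pairs of vertex sets, following the convention for $K_{A,B}$ above. It enumerates all $3^n$ ways of placing each vertex in $A$, in $B$, or in neither, so it is only meant for illustration.

```python
from itertools import product

def cut_norm(M):
    """Brute-force cut norm of a square matrix M over disjoint sets A, B.

    Illustration only: the running time is exponential in the number of vertices.
    """
    n = len(M)
    best = 0
    # Label each vertex: 0 = in A, 1 = in B, 2 = in neither set.
    for labels in product(range(3), repeat=n):
        A = [i for i in range(n) if labels[i] == 0]
        B = [j for j in range(n) if labels[j] == 1]
        total = sum(M[i][j] for i in A for j in B)   # <M, K_{A,B}>
        best = max(best, abs(total))
    return best

# Adjacency matrix of the path 0 - 1 - 2.
path = [[0, 1, 0],
        [1, 0, 1],
        [0, 1, 0]]
print(cut_norm(path))
```

For the path, the maximum is attained by taking $A$ to be the middle vertex and $B$ the two endpoints.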

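**Theorem 1** Let $G=(V,E)$ be a graph on $n$ vertices and $\epsilon > 0$ be an approximation parameter. Then there are sets $A_1,B_1,\ldots,A_T,B_T$ and scalars $\alpha_1,\ldots,\alpha_T$, where $T = O(1/\epsilon^2)$, such that if we define

$$H := \sum_{i=1}^T \alpha_i K_{A_i,B_i}$$

we have

$$\| G - H \|_\square \leq \epsilon n^2$$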

We will prove the following more general version.

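**Theorem 2** Let $X$ be a set, $g: X \rightarrow [0,1]$ be a bounded function, $\mathcal{F}$ be a family of functions mapping $X$ to $[0,1]$ and $\epsilon > 0$ be an approximation parameter. Then there are functions $f_1,\ldots,f_T$ in $\mathcal{F}$ and scalars $\alpha_1,\ldots,\alpha_T$, with $T = O(1/\epsilon^2)$, such that if we define

$$h := \sum_{i=1}^T \alpha_i f_i$$

we have

$$\forall f \in \mathcal{F}: \ \ | \langle f, g - h \rangle | \leq \epsilon |X|$$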

We could also, with the same proof, argue about a possibly infinite set $X$ with a measure $\mu$ such that $\mu(X)$ is finite, and, after defining the inner product

$$\langle f, g \rangle := \int_X f \cdot g \ d\mu$$

we could prove the same conclusion of the theorem, with $\epsilon \mu(X)$ instead of $\epsilon |X|$ as an error bound.

Here is the proof: run the FTRL algorithm with L2-squared regularizer in the setup in which the space of solutions $K$ is the set of all functions $h: X \rightarrow \mathbb{R}$ and the loss functions are linear. Every time the algorithm proposes a solution $h_t$, if there is a function $f_t \in \mathcal{F}$ such that either $\langle h_t - g, f_t \rangle > \epsilon |X|$ or $\langle h_t - g, f_t \rangle < -\epsilon |X|$, the adversary will pick, respectively, $f_t$ or $-f_t$ as a loss function $\ell_t$. When the adversary has no such choice, we stop and the function $h_t$ is our desired approximation.

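First of all, let us analyze the number of rounds. Here the maximum norm of the functions in $\mathcal{F}$ is $\sqrt{|X|}$, so after $T$ rounds we have the regret bound

$$\forall h: \ \ \sum_{t=1}^T \langle \ell_t, h_t - h \rangle \leq \sqrt{|X|} \cdot \| h \| \cdot \sqrt{2T}$$

Now let us consider $g$ to be our offline solution. Each round contributes more than $\epsilon |X|$ to the left-hand side, and $\| g \| \leq \sqrt{|X|}$, so we have

$$\epsilon T |X| < \sum_{t=1}^T \langle \ell_t, h_t - g \rangle \leq |X| \cdot \sqrt{2T}$$

which implies

$$T < \frac{2}{\epsilon^2}$$

Finally, recall that

$$h_T = - \frac{1}{2c} \sum_{t=1}^{T-1} \ell_t$$

where $c$ is the scaling constant in the definition of the regularizer ($1/(2c)$ is of order $\epsilon$ when $T$ is of order $1/\epsilon^2$), and so our final approximator $h_T$ computed at the last round is a weighted sum of functions from $\mathcal{F}$.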
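The game just described can be sketched in a few lines for a finite set $X$ and an explicitly listed family. This is a minimal sketch under our own assumptions, not a definitive implementation: the name `weak_approximation` is hypothetical, functions on $X$ are stored as NumPy vectors, the family is a matrix with one row per function, and we fix the step size $1/(2c) = \epsilon/2$ (that is, $c = 1/\epsilon$), one choice for which the regret argument gives fewer than $2/\epsilon^2$ rounds.

```python
import numpy as np

def weak_approximation(g, family, eps):
    """FTRL with L2-squared regularizer against linear losses +f or -f.

    g:      vector with entries in [0, 1] (the function to approximate)
    family: matrix whose rows are the functions in F, entries in [0, 1]
    Returns (h, terms), where terms lists (alpha_i, row index of f_i)
    and h equals the sum of alpha_i * f_i.
    """
    g = np.asarray(g, dtype=float)
    F = np.asarray(family, dtype=float)
    size = g.size                        # |X|
    eta = eps / 2.0                      # assumed step size 1/(2c), c = 1/eps
    h = np.zeros(size)                   # first FTRL solution: h_1 = 0
    terms = []
    while True:
        corr = F @ (h - g)               # <h_t - g, f> for every f in F
        j = int(np.argmax(np.abs(corr)))
        if abs(corr[j]) <= eps * size:   # the adversary has no move: stop
            return h, terms
        sign = 1.0 if corr[j] > 0 else -1.0   # loss l_t = sign * f_j
        h = h - eta * sign * F[j]        # h_{t+1} = -(1/2c) * (sum of losses)
        terms.append((-eta * sign, j))

# Toy instance: a random bounded function and 20 random 0/1 test functions.
rng = np.random.default_rng(0)
g = rng.random(64)
family = (rng.random((20, 64)) < 0.5).astype(float)
h, terms = weak_approximation(g, family, eps=0.1)
print(len(terms), "terms in the approximator")
```

The adversary here picks the most violated function, but any violated function would do for the analysis. Note that $h$ is a weighted sum whose coefficients all have absolute value $\epsilon/2$.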

**2. The weak regularity lemma**

Frieze and Kannan’s weak regularity lemma has the following form.

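**Theorem 3** Let $G=(V,E)$ be a graph on $n$ vertices and $\epsilon > 0$ be an approximation parameter. Then there is a partition of $V$ into $k = 2^{O(1/\epsilon^2)}$ sets $S_1,\ldots,S_k$, and there are bounded weights $0 \leq p_{i,j} \leq 1$ for $i,j \in \{1,\ldots,k\}$ such that if we define the weighted graph $H$ where the weight of the edge $(u,v)$ in $H$ is $p_{i,j}$, where $u \in S_i$ and $v \in S_j$, then we have

$$\| G - H \|_\square \leq \epsilon n^2$$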

Notice that if we did not require the weights to be between 0 and 1 then the result of the previous section can also be cast in the above language, because we can take the partition to be the “Sigma-algebra generated by” the sets $A_1,B_1,\ldots,A_T,B_T$.
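For a finite vertex set, the partition generated by a list of sets consists of its “atoms”: two vertices are in the same part iff they belong to exactly the same sets. The sketch below (the helper name `atoms` is ours) computes it. With $2T$ sets there are at most $2^{2T}$ atoms, which is where a bound of the form $2^{O(1/\epsilon^2)}$ on the number of parts comes from.

```python
def atoms(vertices, sets):
    """Partition `vertices` by their membership pattern in each of `sets`."""
    classes = {}
    for v in vertices:
        signature = tuple(v in s for s in sets)   # which sets contain v
        classes.setdefault(signature, []).append(v)
    return list(classes.values())

# Two sets on six vertices give at most 2^2 = 4 parts.
parts = atoms(range(6), [{0, 1, 2}, {2, 3}])
print(parts)
```

Any weighted sum of matrices $K_{A_i,B_i}$ is constant on each block $S_i \times S_j$ of this partition, because the membership pattern of a vertex determines every indicator $\mathbf{1}_{A_i}$ and $\mathbf{1}_{B_i}$.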

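For a scalar $z \in \mathbb{R}$, let $\tau(z)$ be defined as

$$\tau(z) := \begin{cases} 0 & \text{if } z < 0 \\ z & \text{if } 0 \leq z \leq 1 \\ 1 & \text{if } z > 1 \end{cases}$$

where $\tau$ stands for *t*runcation. Note that $\tau(z)$ is the L2 projection of $z$ on $[0,1]$.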

Theorem 3 is a special case of the following result, proved in our paper with Tulsiani and Vadhan.

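**Theorem 4** Let $X$ be a set, $g: X \rightarrow [0,1]$ be a bounded function, $\mathcal{F}$ be a family of functions mapping $X$ to $[0,1]$ and $\epsilon > 0$ be an approximation parameter. Then there are functions $f_1,\ldots,f_T$ in $\mathcal{F}$ and scalars $\alpha_1,\ldots,\alpha_T$, with $T = O(1/\epsilon^2)$, such that if we define

$$h := \tau \left( \sum_{i=1}^T \alpha_i f_i \right)$$

we have

$$\forall f \in \mathcal{F}: \ \ | \langle f, g - h \rangle | \leq \epsilon |X|$$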

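To prove Theorem 4 we play the same online game as in the previous section: the online algorithm proposes a solution $h_t$; if $| \langle f, h_t - g \rangle | \leq \epsilon |X|$ for every $f \in \mathcal{F}$ then we stop and output $h := h_t$, otherwise we let the loss function be a function $\ell_t$ such that either $\ell_t$ or $-\ell_t$ is in $\mathcal{F}$ and

$$\langle \ell_t, h_t - g \rangle > \epsilon |X|$$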

The only difference is that we use the FTRL algorithm with L2 regularizer that has the set of feasible solutions $K$ defined to be the set of all functions $h: X \rightarrow [0,1]$ rather than the set of all functions $h: X \rightarrow \mathbb{R}$. Then each function $h_t$ is the projection to $K$ of $-\frac{1}{2c} \sum_{i < t} \ell_i$, and the projection to $K$ is just composition with $\tau$. The bound on the number of steps is the same as the one in the previous section.
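A minimal sketch of this constrained variant, under the same assumptions as before (finite $X$, functions stored as NumPy vectors, an explicitly listed family, and step size $1/(2c) = \epsilon/2$). The names `truncate` and `bounded_weak_approximation` are ours. The algorithm keeps the unconstrained sum of past losses and composes it with the truncation at every round.

```python
import numpy as np

def truncate(z):
    """Coordinate-wise L2 projection onto [0, 1]."""
    return np.clip(z, 0.0, 1.0)

def bounded_weak_approximation(g, family, eps):
    """FTRL over functions X -> [0, 1]; returns (h, terms) with
    h = truncate(sum of alpha_i * f_i) and terms = [(alpha_i, row index)]."""
    g = np.asarray(g, dtype=float)
    F = np.asarray(family, dtype=float)
    size = g.size                        # |X|
    eta = eps / 2.0                      # assumed step size 1/(2c), c = 1/eps
    s = np.zeros(size)                   # unconstrained point: -(1/2c) * sum of losses
    terms = []
    while True:
        h = truncate(s)                  # projection to the feasible set
        corr = F @ (h - g)               # <h_t - g, f> for every f in F
        j = int(np.argmax(np.abs(corr)))
        if abs(corr[j]) <= eps * size:   # no violated test function: stop
            return h, terms
        sign = 1.0 if corr[j] > 0 else -1.0   # loss l_t = sign * f_j
        s = s - eta * sign * F[j]
        terms.append((-eta * sign, j))

# Toy instance: a random bounded function and 20 random 0/1 test functions.
rng = np.random.default_rng(1)
g = rng.random(64)
family = (rng.random((20, 64)) < 0.5).astype(float)
h, terms = bounded_weak_approximation(g, family, eps=0.1)
print(len(terms), "terms;", "range of h:", h.min(), h.max())
```

The update is applied to the untruncated sum `s`, not to `h`: truncating only when the solution is read off is what makes each $h_t$ the projection of the full sum of past losses.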

Looking at the case in which $X$ is the set of edges of a clique on $V$, $\mathcal{F}$ is the set of graphs of the form $K_{A,B}$, and considering the Sigma-algebra generated by $A_1,B_1,\ldots,A_T,B_T$ gives Theorem 3 from Theorem 4.

**3. Sampling High-Entropy Distributions**

Finally we discuss the application to sampling high-entropy distributions.

Suppose that $D$ is a distribution over $\{0,1\}^n$ of min-entropy $\geq n-d$, meaning that for every $x \in \{0,1\}^n$ we have

$$D(x) \leq 2^{-(n-d)}$$

where we think of the *entropy deficiency* $d$ as being small, such as a constant or $O(\log n)$.

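Let $\mathcal{F}$ be a class of functions $\{0,1\}^n \rightarrow \{0,1\}$ that we think of as being “efficient.” For example, $\mathcal{F}$ could be the set of all functions computable by circuits of size $\leq S(n)$ for some size bound $S(n)$. We will assume that the constant function $f \equiv 1$ is in $\mathcal{F}$. Define

$$g(x) := 2^{n-d} \cdot D(x)$$

to be a bounded function $g: \{0,1\}^n \rightarrow [0,1]$. Fix an approximation parameter $\epsilon > 0$.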

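Then from Theorem 4 we have that there are $T = O(1/\epsilon^2)$ functions $f_1,\ldots,f_T \in \mathcal{F}$, and scalars $\alpha_1,\ldots,\alpha_T$, all equal to $\pm \eta$ for a certain parameter $\eta$, such that if we define

$$h(x) := \tau \left( \sum_{i=1}^T \alpha_i f_i(x) \right)$$

we have

$$\forall f \in \mathcal{F}: \ \ \left| \sum_x f(x) \cdot (g(x) - h(x)) \right| \leq \epsilon 2^n \ \ \ \ \ (1)$$

Now define the probability distribution

$$H(x) := \frac{h(x)}{\sum_y h(y)}$$

Applying (1) to the case $f \equiv 1$, we have

$$\left| \sum_x g(x) - \sum_x h(x) \right| \leq \epsilon 2^n$$

and we know that $\sum_x g(x) = 2^{n-d}$, so

$$\left| \sum_x h(x) - 2^{n-d} \right| \leq \epsilon 2^n \ \ \ \ \ (2)$$

and we can rewrite (1) as

$$\left| \sum_x f(x) \cdot \left( 2^{n-d} D(x) - H(x) \cdot \sum_y h(y) \right) \right| \leq \epsilon 2^n$$

and, finally, using (2) together with $\sum_x f(x) H(x) \leq 1$ to replace $\sum_y h(y)$ by $2^{n-d}$ at an additional cost of at most $\epsilon 2^n$, and then dividing by $2^{n-d}$,

$$\left| \sum_x f(x) \cdot ( D(x) - H(x) ) \right| \leq 2 \epsilon 2^d$$

that is

$$\left| \Pr_{x \sim D} [ f(x) = 1 ] - \Pr_{x \sim H} [ f(x) = 1 ] \right| \leq 2 \epsilon 2^d$$

which says that $D$ and $H$ are $2\epsilon 2^d$-indistinguishable by functions in $\mathcal{F}$. If we choose $\mathcal{F}$, for example, to be the class of functions computable by circuits of size $\leq S(n)$, then $D$ and $H$ are $2\epsilon 2^d$-indistinguishable by circuits of size $\leq S(n)$.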

But $H$ is also samplable in a relatively efficient way using rejection sampling: pick a uniformly random $x \in \{0,1\}^n$, then output $x$ with probability $h(x)$ and fail with probability $1-h(x)$. Repeat the above until the procedure does not fail. At each step, the probability of success is $\mathbb{E}_x h(x) = 2^{-n} \sum_x h(x) \geq 2^{-d} - \epsilon$, so, assuming (because otherwise all of the above makes no sense) that, say, $\epsilon < 2^{-d-1}$, the procedure succeeds on average in at most $2^{d+1}$ attempts. And if each $f_i$ is computable by a circuit of size $S$, then $h$ is computable by a circuit of size $O(S/\epsilon^2)$.
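The rejection sampling procedure can be sketched as follows. This is an illustration under our own assumptions: the name `rejection_sample` is hypothetical, strings in $\{0,1\}^n$ are encoded as integers below $2^n$, and `toy_h` is a made-up stand-in for the function $h$ of the construction, which in the actual argument would be the truncated weighted sum of the $f_i$.

```python
import random

def rejection_sample(h, n, rng):
    """Sample x with probability proportional to h(x), for h: {0,1}^n -> [0,1].

    Returns (x, attempts), with x encoded as an integer in [0, 2^n).
    """
    attempts = 0
    while True:
        attempts += 1
        x = rng.getrandbits(n)        # uniformly random n-bit string
        if rng.random() < h(x):       # accept with probability h(x)
            return x, attempts

def toy_h(x):
    """Toy stand-in for h on 4-bit strings (not derived from any real D)."""
    if x < 4:
        return 1.0
    if x < 8:
        return 0.25
    return 0.0

rng = random.Random(0)
samples = [rejection_sample(toy_h, 4, rng) for _ in range(200)]
print(sum(a for _, a in samples) / len(samples), "attempts on average")
```

In this toy example the acceptance probability per attempt is $(4 \cdot 1 + 4 \cdot 0.25)/16 = 5/16$, so the expected number of attempts is $16/5$.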

The undesirable features of this result are that the complexity of sampling and the quality of indistinguishability depend exponentially on the randomness deficiency, and the sampling circuit is a non-uniform circuit that it’s not clear how to construct without advice. Impagliazzo’s recent results address both these issues.
