We now discuss how to view proofs of certain regularity lemmas as applications of the FTRL methodology.
The Szemerédi Regularity Lemma states (in modern language) that every dense graph is well approximated by a graph with a very simple structure, made of the (edge-disjoint) union of a constant number of weighted complete bipartite subgraphs. The notion of approximation is a bit complicated to describe, but it enables the proof of counting lemmas, which show that, for example, the number of triangles in the original graph is well approximated by the (appropriately weighted) number of triangles in the approximating graph.
Analogous regularity lemmas, in which an arbitrary object is approximated by a low-complexity object, have been proved for hypergraphs, for subsets of abelian groups (for applications to additive combinatorics), in an analytic setting (for applications to graph limits) and so on.
The weak regularity lemma of Frieze and Kannan provides, as the name suggests, a weaker kind of approximation than the one promised by Szemerédi's lemma, but one that is achievable with a graph that has a much smaller number of pieces. If $\epsilon$ is the "approximation error" that one is willing to tolerate, Szemerédi's lemma constructs a graph that is the union of a number of weighted complete bipartite subgraphs given by a tower of exponentials whose height is polynomial in $1/\epsilon$. In the Frieze-Kannan construction, that number is cut down to a single exponential $2^{O(1/\epsilon^2)}$. This result too can be generalized to graph limits, subsets of groups, and so on.
With Tulsiani and Vadhan, we proved an abstract version of the Frieze-Kannan lemma (which can be applied to graphs, functions, distributions, etc.) in which the "complexity" of the approximation is $O(1/\epsilon^2)$. In the graph case, the approximating graph is still the union of $2^{O(1/\epsilon^2)}$ complete bipartite subgraphs, but it has a more compact representation. One consequence of this result is that for every high-min-entropy distribution $D$, there is an efficiently samplable distribution $M$ with the same min-entropy as $D$, that is indistinguishable from $D$. Such a result could be taken to be a proof that what GANs attempt to achieve is possible in principle, except that our result requires an unrealistically high entropy (and we achieve "efficient samplability" and "indistinguishability" only in a weak sense).
All these results are proved with a similar strategy: one starts from a trivial approximator, for example the empty graph, and then repeats the following iteration: if the current approximator achieves the required approximation, then we are done; otherwise take a counterexample, and modify the approximator using the counterexample. Then one shows that:
- The number of iterations is bounded, by keeping track of an appropriate potential function;
- The “complexity” of the approximator does not increase too much from iteration to iteration.
Typically, the number of iterations is $O(1/\epsilon^2)$, and the difference between the various results is given by whether at each iteration the "complexity" increases exponentially, or by a multiplicative factor, or by an additive term.
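To make the template concrete, here is a minimal Python sketch of the generic loop; the names `find_counterexample`, `update` and `initial` are illustrative placeholders rather than anything from the papers, and each specific regularity lemma corresponds to a particular choice of them.

```python
def build_approximator(g, initial, find_counterexample, update):
    """Generic iterative scheme: improve a trivial approximator until
    no counterexample to the required approximation exists."""
    h = initial                        # trivial approximator, e.g. the empty graph
    while True:
        f = find_counterexample(g, h)  # None if h already approximates g well
        if f is None:
            return h                   # required approximation achieved
        h = update(h, f)               # modify the approximator using f
```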
Like in the post on pseudorandom constructions, one can view such constructions as an online game between a "builder" and an "inspector," except that now the online optimization algorithm will play the role of the builder, and the inspector is the one acting as an adversary. The bound on the number of rounds comes from the fact that the online optimization algorithms that we have seen so far achieve amortized error per round $O(1/\sqrt{T})$ after $T$ rounds, so it takes $O(1/\epsilon^2)$ rounds for the error bound to go below $\epsilon$.
We will see that the abstract weak regularity lemma of my paper with Tulsiani and Vadhan (and hence the graph weak regularity lemma of Frieze and Kannan) can be immediately deduced from the theory developed in the previous post.
When I was preparing these notes, I was asked by several people if the same can be done for Szemeredi’s lemma. I don’t see a natural way of doing that. For such results, one should maybe use the online optimization techniques as a guide rather than as a black box. In general, iterative arguments (in which one constructs an object through a series of improvements) require the choice of a potential function, and an argument about how much the potential function changes at every step. The power of the FTRL method is that it creates the potential function and a big part of the analysis automatically and, even where it does not work directly, it can serve as an inspiration.
One could imagine a counterfactual history in which people first proved the weak regularity lemma using online optimization out of the box, as we do in this post, and then tried to use an L2 potential function and an iterative method to get the Szemerédi lemma. Subsequently, one would try to see what happens if the potential function is entropy, thus discovering Jacob Fox's major improvement on the "triangle removal lemma," which involves the construction of an approximator that just approximates the number of triangles.
1. A “vanilla” weak regularity lemma
Frieze and Kannan proved the following basic result about graph approximations, which has a number of algorithmic applications. If $V$ is a set of vertices which is understood from the context, and $A, B \subseteq V$ are disjoint subsets of vertices, then let $K_{A,B} := {\bf 1}_A \cdot {\bf 1}_B^T$, that is, the boolean matrix such that $K_{A,B}(u,v) = 1$ iff $u \in A$ and $v \in B$.

The cut norm of a matrix $M$ is

$$\| M \|_\square := \max_{A, B \subseteq V} \ \left| \sum_{u \in A, v \in B} M(u,v) \right|$$

In the following we will identify a graph with its adjacency matrix.
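As an aside, the definition of the cut norm is easy, if expensive, to evaluate directly; the following brute-force Python sketch (the function name is mine, and the running time is exponential in $n$) just spells out the maximization over pairs of vertex subsets:

```python
from itertools import product

def cut_norm(M):
    """Brute-force cut norm of an n-by-n matrix M (a list of lists):
    maximize |sum of the entries in a combinatorial rectangle A x B|.
    Exponential time; for illustrating the definition only."""
    n = len(M)
    best = 0.0
    for A_mask, B_mask in product(range(2 ** n), repeat=2):
        total = sum(M[u][v]
                    for u in range(n) if (A_mask >> u) & 1
                    for v in range(n) if (B_mask >> v) & 1)
        best = max(best, abs(total))
    return best
```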
Theorem 1 Let $G$ be a graph on $n$ vertices and $\epsilon > 0$ be an approximation parameter. Then there are sets $A_1, B_1, \ldots, A_k, B_k \subseteq V$ and scalars $c_1, \ldots, c_k$, where $k = O(\epsilon^{-2})$, such that if we define

$$H := \sum_{i=1}^k c_i K_{A_i, B_i}$$

we have

$$\| G - H \|_\square \leq \epsilon n^2$$
We will prove the following more general version.
Theorem 2 Let $X$ be a set, $g: X \rightarrow [-1,1]$ be a bounded function, $F$ be a family of functions mapping $X$ to $[-1,1]$, and $\epsilon > 0$ be an approximation parameter. Then there are functions $f_1, \ldots, f_k$ in $F$ and scalars $c_1, \ldots, c_k$, with $k = O(\epsilon^{-2})$, such that if we define

$$h := \sum_{i=1}^k c_i f_i$$

we have

$$\forall f \in F: \ \ | \langle g - h , f \rangle | \leq \epsilon |X|$$

where $\langle u, v \rangle := \sum_{x \in X} u(x) v(x)$.
We could also, with the same proof, argue about a possibly infinite set $X$ with a measure $\mu$ such that $\mu(X)$ is finite, and, after defining the inner product

$$\langle u , v \rangle := \int_X u \cdot v \, d\mu$$

we could prove the same conclusion of the theorem, with $\epsilon \cdot \mu(X)$ instead of $\epsilon \cdot |X|$ as an error bound.
Here is the proof: run the FTRL algorithm with L2-squared regularizer in the setup in which the space of solutions is the set of all functions $h : X \rightarrow {\mathbb R}$ and the loss functions are linear. Every time the algorithm proposes a solution $h_t$, if there is a function $f \in F$ such that either $\langle g - h_t , f \rangle > \epsilon |X|$ or $\langle g - h_t , f \rangle < - \epsilon |X|$, the adversary will pick, respectively, $-f$ or $f$ as a loss function $\ell_t$, so that in either case $\langle \ell_t , h_t - g \rangle > \epsilon |X|$. When the adversary has no such choice, we stop and the function $h_t$ is our desired approximation.
First of all, let us analyze the number of rounds. Here the maximum norm of the functions in $F$ is at most $\sqrt{|X|}$, so after $T$ rounds we have the regret bound

$$\sum_{t=1}^T \langle \ell_t , h_t \rangle - \sum_{t=1}^T \langle \ell_t , h \rangle \leq \frac{ || h ||^2}{2c} + \frac c2 \sum_{t=1}^T || \ell_t ||^2 \leq \frac{ || h ||^2}{2c} + \frac c2 \cdot T \cdot |X|$$

for every fixed $h$. Now let us consider $g$ to be our offline solution: we have $||g||^2 \leq |X|$ and, by the choice of the loss functions, $\langle \ell_t , h_t - g \rangle > \epsilon |X|$ at every round, so that, choosing the scaling constant of the regularizer to be $c = \epsilon$,

$$\epsilon \cdot |X| \cdot T < \frac{|X|}{2\epsilon} + \frac \epsilon 2 \cdot T \cdot |X|$$

which implies $T < 1/\epsilon^2$.
Finally, recall that

$$h_{t+1} := - c \cdot \sum_{i=1}^{t} \ell_i$$

where $c$ is the scaling constant in the definition of the regularizer ($c$ is of the order of $\epsilon$ when $T$ is of the order of $1/\epsilon^2$), and so our final approximator $h$ computed at the last round is a weighted sum of functions from $F$.
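For concreteness, here is the whole proof written as an algorithm in Python, in the simplest finite setting; the function names are mine, $F$ is assumed to be given as a finite list of vectors, and the closed-form update $h_{t+1} = -c \sum_{i \leq t} \ell_i$ is the one recalled above:

```python
import numpy as np

def weak_regularity(g, F, eps, c):
    """FTRL with the L2-squared regularizer, playing against the
    "inspector": g is a vector in [-1,1]^m, F a finite list of
    vectors in [-1,1]^m, c the regularizer's scaling constant
    (of the order of eps)."""
    g = np.asarray(g, dtype=float)
    m = len(g)
    losses = []                  # loss vectors chosen by the adversary so far
    h = np.zeros(m)              # FTRL's first proposal
    while True:
        # the "inspector" looks for an f in F witnessing a bad approximation
        bad = next((f for f in F if abs(np.dot(g - h, f)) > eps * m), None)
        if bad is None:
            return h             # h is a (+/- c)-weighted sum of members of F
        # pick f or -f as the loss, so that <loss, h_t - g> > eps * m
        sign = 1.0 if np.dot(g - h, bad) > 0 else -1.0
        losses.append(-sign * np.asarray(bad, dtype=float))
        h = -c * np.sum(losses, axis=0)   # the closed-form FTRL update
```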
2. The weak regularity lemma
Frieze and Kannan’s weak regularity lemma has the following form.
Theorem 3 Let $G$ be a graph on $n$ vertices and $\epsilon > 0$ be an approximation parameter. Then there is a partition of $V$ into $m = 2^{O(1/\epsilon^2)}$ sets $V_1, \ldots, V_m$, and there are bounded weights $p_{i,j} \in [0,1]$ for $1 \leq i \leq j \leq m$, such that if we define the weighted graph $H$, where the weight of the edge $(u,v)$ in $H$ is $p_{i,j}$, where $u \in V_i$ and $v \in V_j$, then we have

$$\| G - H \|_\square \leq \epsilon n^2$$
Notice that if we did not require the weights to be between 0 and 1 then the result of the previous section can also be cast in the above language, because we can take the partition to be the "Sigma-algebra generated by" the sets $A_1, B_1, \ldots, A_k, B_k$.
For a scalar $x \in {\mathbb R}$, let $t(x)$ be defined as

$$t(x) := \min \{ 1, \max \{ 0 , x \} \}$$

where $t$ stands for truncation. Note that $t(x)$ is the L2 projection of $x$ on $[0,1]$.
Theorem 3 is a special case of the following result, proved in our paper with Tulsiani and Vadhan.
Theorem 4 Let $X$ be a set, $g: X \rightarrow [0,1]$ be a bounded function, $F$ be a family of functions mapping $X$ to $[-1,1]$, and $\epsilon > 0$ be an approximation parameter. Then there are functions $f_1, \ldots, f_k$ in $F$ and scalars $c_1, \ldots, c_k$, with $k = O(\epsilon^{-2})$, such that if we define

$$h := t \left( \sum_{i=1}^k c_i f_i \right)$$

we have

$$\forall f \in F: \ \ | \langle g - h , f \rangle | \leq \epsilon |X|$$
To prove Theorem 4 we play the same online game as in the previous section: the online algorithm proposes a solution $h_t$; if

$$\forall f \in F: \ \ | \langle g - h_t , f \rangle | \leq \epsilon |X|$$

then we stop and output $h_t$, otherwise we let the loss function be a function $\ell_t$ such that either $\ell_t$ or $-\ell_t$ is in $F$ and

$$\langle \ell_t , h_t - g \rangle > \epsilon |X|$$

The only difference is that we use the FTRL algorithm with L2 regularizer that has the set of feasible solutions defined to be the set of all functions $h : X \rightarrow [0,1]$ rather than the set of all functions $h: X \rightarrow {\mathbb R}$. Then each function $h_t$ is the projection to $\{ h : X \rightarrow [0,1] \}$ of $- c \sum_{i=1}^{t-1} \ell_i$, and the projection to $\{ h : X \rightarrow [0,1] \}$ is just composition with $t$, that is, pointwise truncation. The bound on the number of steps is the same as the one in the previous section.
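In code, under the same illustrative assumptions as the sketch in the previous section, the only change is that each iterate is clipped to $[0,1]$:

```python
import numpy as np

def weak_regularity_bounded(g, F, eps, c):
    """Variant for Theorem 4: the feasible set is functions into [0,1],
    and the L2 projection onto it is pointwise truncation (np.clip)."""
    g = np.asarray(g, dtype=float)
    m = len(g)
    losses = []
    h = np.zeros(m)
    while True:
        bad = next((f for f in F if abs(np.dot(g - h, f)) > eps * m), None)
        if bad is None:
            return h             # h = truncation of a (+/- c)-weighted sum
        sign = 1.0 if np.dot(g - h, bad) > 0 else -1.0
        losses.append(-sign * np.asarray(bad, dtype=float))
        h = np.clip(-c * np.sum(losses, axis=0), 0.0, 1.0)  # project each iterate
```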
Looking at the case in which $X$ is the set of edges of a clique on $n$ vertices, $F$ is the set of graphs of the form $K_{A,B}$, and considering the Sigma-algebra generated by $A_1, B_1, \ldots, A_k, B_k$ gives Theorem 3 from Theorem 4.
3. Sampling High-Entropy Distributions
Finally we discuss the application to sampling high-entropy distributions.
Suppose that $D$ is a distribution over $\{ 0,1 \}^n$ of min-entropy $n - d$, meaning that for every $x$ we have

$$D(x) \leq 2^{-(n-d)}$$

where we think of the entropy deficiency $d$ as being small, such as a constant or $O(\log n)$.
Let $F$ be a class of functions $f : \{ 0,1 \}^n \rightarrow \{ 0,1 \}$ that we think of as being "efficient." For example, $F$ could be the set of all functions computable by circuits of size $\leq S$ for some size bound $S$, such as, for example, $S = n^2$. We will assume that the constant function ${\bf 1}$ is in $F$. Define $g$ to be the bounded function $g(x) := 2^{n-d} \cdot D(x)$, so that $g : \{ 0,1 \}^n \rightarrow [0,1]$. Fix an approximation parameter $\epsilon > 0$.
Then from Theorem 4 we have that there are functions $f_1, \ldots, f_k \in F$, and scalars $c_1, \ldots, c_k$, all equal to $\pm c$ for a certain parameter $c = \Theta(\epsilon)$, such that if we define

$$h := t \left( \sum_{i=1}^k c_i f_i \right)$$

we have

$$\forall f \in F: \ \ \left| \sum_x (g(x) - h(x)) \cdot f(x) \right| \leq \epsilon \cdot 2^n \ \ \ \ \ (1)$$

Now define the probability distribution

$$M(x) := \frac{h(x)}{\sum_y h(y)} \ \ \ \ \ (2)$$
Applying (1) to the case $f = {\bf 1}$, we have

$$\left| \sum_x g(x) - \sum_x h(x) \right| \leq \epsilon \cdot 2^n$$

and we know that $\sum_x g(x) = 2^{n-d} \cdot \sum_x D(x) = 2^{n-d}$, so

$$\sum_x h(x) = 2^{n-d} \cdot ( 1 \pm \epsilon \cdot 2^d )$$

and we can rewrite (2) as

$$M(x) = \frac{h(x)}{2^{n-d} \cdot ( 1 \pm \epsilon \cdot 2^d )}$$

and, finally,

$$\forall f \in F: \ \ \left| \sum_x (D(x) - M(x)) \cdot f(x) \right| \leq O( \epsilon \cdot 2^d )$$

that is

$$\forall f \in F: \ \ \left| \Pr_{x \sim D} [ f(x) = 1 ] - \Pr_{x \sim M} [ f(x) = 1 ] \right| \leq O( \epsilon \cdot 2^d )$$

which says that $D$ and $M$ are $O(\epsilon \cdot 2^d)$-indistinguishable by functions in $F$. If we chose $F$, for example, to be the class of functions computable by circuits of size $\leq S$, then $D$ and $M$ are $O(\epsilon \cdot 2^d)$-indistinguishable by circuits of size $\leq S$.
But $M$ is also samplable in a relatively efficient way using rejection sampling: pick a random $x \sim \{ 0,1 \}^n$, then output $x$ with probability $h(x)$ and fail with probability $1 - h(x)$. Repeat the above until the procedure does not fail. At each step, the probability of success is $2^{-n} \cdot \sum_x h(x) \geq 2^{-d} \cdot (1 - \epsilon \cdot 2^d)$, so, assuming (because otherwise all of the above makes no sense) that, say, $\epsilon \leq 2^{-d-1}$, the procedure succeeds on average in at most $O(2^d)$ attempts. And if each $f_i$ is computable by a circuit of size $\leq S$, then $h$ is computable by a circuit of size $O(S/\epsilon^2)$.
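A sketch of the sampler, assuming $h$ is given as a Python function from $n$-bit strings (encoded as integers) to $[0,1]$, e.g. the truncated sum built by the previous sketches:

```python
import random

def sample_from_M(h, n):
    """Rejection sampling from M: each attempt succeeds with probability
    2^{-n} * sum_x h(x), which is at least about 2^{-d}."""
    while True:
        x = random.getrandbits(n)      # uniform n-bit string
        if random.random() < h(x):     # accept with probability h(x)
            return x
```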
The undesirable features of this result are that the complexity of sampling and the quality of indistinguishability depend exponentially on the randomness deficiency $d$, and that the sampling circuit is a non-uniform circuit which we do not know how to construct without advice. Impagliazzo's recent results address both these issues.