Online Optimization Post 4: Regularity Lemmas

We now discuss how to view proofs of certain regularity lemmas as applications of the FTRL methodology.

The Szemeredi Regularity Lemma states (in modern language) that every dense graph is well approximated by a graph with a very simple structure, made of the (edge-disjoint) union of a constant number of weighted complete bipartite subgraphs. The notion of approximation is a bit complicated to describe, but it enables the proof of counting lemmas, which show that, for example, the number of triangles in the original graph is well approximated by the (appropriately weighted) number of triangles in the approximating graph.

Analogous regularity lemmas, in which an arbitrary object is approximated by a low-complexity object, have been proved for hypergraphs, for subsets of abelian groups (for applications to additive combinatorics), in an analytic setting (for applications to graph limits) and so on.

The weak regularity lemma of Frieze and Kannan provides, as the name suggests, a weaker kind of approximation than the one promised by Szemeredi’s lemma, but one that is achievable with a graph that has a much smaller number of pieces. If {\epsilon} is the “approximation error” that one is willing to tolerate, Szemeredi’s lemma constructs a graph that is the union of {2^{2^{\vdots}}} weighted complete bipartite subgraphs, where the height of the tower of exponentials is polynomial in {1/\epsilon}. In the Frieze-Kannan construction, that number is cut down to a single exponential {2^{O(1/\epsilon^2)}}. This result too can be generalized to graph limits, subsets of groups, and so on.

With Tulsiani and Vadhan, we proved an abstract version of the Frieze-Kannan lemma (which can be applied to graphs, functions, distributions, etc.) in which the “complexity” of the approximation is {O(1/\epsilon^2)}. In the graph case, the approximating graph is still the union of {2^{O(1/\epsilon^2)}} complete bipartite subgraphs, but it has a more compact representation. One consequence of this result is that for every high-min-entropy distribution {\cal D}, there is an efficiently samplable distribution with the same min-entropy as {\cal D}, that is indistinguishable from {\cal D}. Such a result could be taken to be a proof that what GANs attempt to achieve is possible in principle, except that our result requires an unrealistically high entropy (and we achieve “efficient samplability” and “indistinguishability” only in a weak sense).

All these results are proved with a similar strategy: one starts from a trivial approximator, for example the empty graph, and then repeats the following iteration: if the current approximator achieves the required approximation, then we are done; otherwise take a counterexample, and modify the approximator using the counterexample. Then one shows that:

  • The number of iterations is bounded, by keeping track of an appropriate potential function;
  • The “complexity” of the approximator does not increase too much from iteration to iteration.

Typically, the number of iterations is {O(1/\epsilon^2)}, and the difference between the various results is given by whether at each iteration the “complexity” increases exponentially, or by a multiplicative factor, or by an additive term.

Like in the post on pseudorandom constructions, one can view such constructions as an online game between a “builder” and an “inspector,” except that now the online optimization algorithm will play the role of the builder, and the inspector is the one acting as an adversary. The {O(1/\epsilon^2)} bound on the number of rounds comes from the fact that the online optimization algorithms that we have seen so far achieve amortized error per round {O(1/\sqrt T)} after {T} rounds, so it takes {O(1/\epsilon^2)} rounds for the error bound to go below {\epsilon}.
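
To spell out the arithmetic: if the regret after {T} rounds is at most {c \sqrt T} for some constant {c}, then the amortized error per round is at most {c/\sqrt T}, and this falls below {\epsilon} as soon as

\displaystyle \frac{c}{\sqrt T} \leq \epsilon \ \ \Longleftrightarrow \ \ T \geq \frac{c^2}{\epsilon^2}

which is where the {O(1/\epsilon^2)} iteration bounds come from.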

We will see that the abstract weak regularity lemma of my paper with Tulsiani and Vadhan (and hence the graph weak regularity lemma of Frieze and Kannan) can be immediately deduced from the theory developed in the previous post.

When I was preparing these notes, I was asked by several people if the same can be done for Szemeredi’s lemma. I don’t see a natural way of doing that. For such results, one should maybe use the online optimization techniques as a guide rather than as a black box. In general, iterative arguments (in which one constructs an object through a series of improvements) require the choice of a potential function, and an argument about how much the potential function changes at every step. The power of the FTRL method is that it creates the potential function and a big part of the analysis automatically and, even where it does not work directly, it can serve as an inspiration.

One could imagine a counterfactual history in which people first proved the weak regularity lemma using online optimization out of the box, as we do in this post, and then decided to try and use an L2 potential function and an iterative method to get the Szemeredi lemma. They would subsequently have tried to see what happens if the potential function is entropy, thus discovering Jacob Fox’s major improvement on the “triangle removal lemma,” which involves the construction of an approximator that just approximates the number of triangles.


The Triangle Removal Lemma

[At the end of a survey paper on additive combinatorics and computational complexity which is to appear in SIGACT News, I list three major open questions in additive combinatorics which might be amenable to a “computer science proof.” They are all questions that have been studied extensively, by very smart people, for several years, so they are all very long shots. I don’t recommend that anybody start working on them, but I think it is good that as many people as possible know about these questions, because when the right technique comes along, its applicability can be recognized more quickly.]

The first question is to improve the Triangle Removal Lemma. I have talked here about what the triangle removal lemma is, how one can prove it from the Szemerédi Regularity Lemma, and how it implies the length-3 case of Szemerédi’s Theorem.

As a short recap, the Triangle Removal Lemma states that if {G} is an {n}-vertex graph with {o(n^3)} triangles, then there is a set of {o(n^2)} edges such that the removal of those edges eliminates all the triangles. Equivalently, it says that if a graph has {\Omega(n^2)} triangles which are all pair-wise edge-disjoint, then there must be {\Omega(n^3)} triangles overall.

The connection with Szemerédi’s Theorem is that if {H} is an abelian group with {n} elements, and {A} is a subset of {H} with no length-3 arithmetic progressions (i.e., {A} is such that there are no three distinct elements {a,b,c} in {A} such that {b-a = c-b}), then we can construct a graph {G=(V,E)} that has {3n} vertices, {|A| \cdot n} pair-wise edge-disjoint triangles, and no other triangles. This contradicts the triangle removal lemma if {|A| = \Omega(n)}, and so we must have {|A| = o(n)}.

This is great, until we start looking at the relationships between the constants hidden by the {o(\cdot )} notation. Quantitatively, the Triangle Removal Lemma states that for every {\epsilon} there is a {\delta = \delta(\epsilon)} such that if a graph has at least {\epsilon \cdot n^2} pair-wise edge-disjoint triangles, then it has at least {\delta \cdot n^3} triangles. The only known proof, however, has {\delta} incredibly small: {1/\delta} grows like a tower of exponentials of height polynomial in {1/\epsilon}. The proof uses the Szemerédi Regularity Lemma, and the Regularity Lemma is known to require such very bad dependencies.

63 years ago, Behrend showed that {{\mathbb Z}/N{\mathbb Z}}, {N} prime, has a subset {A} that contains no length-3 arithmetic progression and whose size is {N/2^{O(\sqrt {\log N})}}. (Last year, Elkin gave the first improvement in 62 years to Behrend’s bound, but the improvement is only a multiplicative polylog {N} factor.) Combined with the graph construction mentioned above, this gives a graph with {3N} vertices, {N^2/2^{O(\sqrt {\log N})}} edge-disjoint triangles, and no other triangle. Thus, the graph has {\leq \delta N^3} triangles where {\delta < 1/N}, but one needs to remove {> \epsilon N^2} edges to make it triangle-free, where {\epsilon > 2^{-O(\sqrt{\log N})}}. This shows that, in the Triangle Removal Lemma, {1/\delta} must grow super-polynomially in {1/\epsilon}, and be at least {(1/\epsilon)^{\log 1/\epsilon}}.

The question is to shorten the gap between the tower-of-exponential relationship between {1/\delta} and {1/\epsilon} coming from the proof via the Szemerédi Regularity Lemma and the mildly super-polynomial lower bound coming from the above argument.


Applications of Low-Complexity Approximations

In the last post, I stated the following generalization of the weak regularity lemma:

Theorem (Low Complexity Approximation, TTV Version) Let (X,\mu) be a probability space, g:X \rightarrow [-1,1] a bounded function, F a collection of bounded functions f:X \rightarrow [ -1,1], and \epsilon an approximation parameter.

Then there is a function h: X \rightarrow [-1,1] such that

  • h has low complexity relative to F: there are k= O( \epsilon^{-2}) functions f_i\in F and coefficients c_i such that

    \displaystyle h(x) := \max \{ -1 , \min \{ 1, \sum_{i=1}^k c_i f_i (x)  \}\}

  • h is \epsilon-indistinguishable from g by F, that is,

    \forall f\in F \ \ \left| {\mathbb E}_{x \sim \mu} f(x) g(x) - {\mathbb E}_{x \sim \mu} f(x) h(x) \right|  \leq \epsilon

(Last time, I mentioned that our proof handled only boolean functions f; now we can handle arbitrary bounded functions, and with an “energy-decrease” style proof, this will appear in the next online revision of the paper.)

This seems to be a useful tool, limited only by one’s creativity in choosing the functions F and then making use of the properties of h.

As already discussed,

  • if one takes X to be the edges of a complete graph, and F the set of indicator functions of cuts, then the existence of h gives the weak regularity lemma of Frieze and Kannan (a toy sketch of this instantiation appears right after this list); and
  • if one takes F to be the set of circuits of size at most S(n), and normalizes g and h to be probability distributions, one gets that for every probability distribution D of high entropy there is a (non-uniformly) efficiently samplable and computable distribution M that is indistinguishable from D by circuits of size \leq S(n).
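
To make the first bullet concrete, here is a toy sketch (my own illustration of the definitions, not code from the paper): X is the set of ordered pairs of distinct vertices, g is the adjacency function, F contains the cut indicators f_{S,T}, and the distinguishing advantage of a cut is exactly the quantity that the weak regularity lemma asks to be small.

```python
import random

# Toy illustration (not code from the paper): in the graph case, X is the set
# of ordered pairs of distinct vertices, g is the 0/1 adjacency function, and
# F contains the cut indicators f_{S,T}(u,v) = 1 iff u in S and v in T.
# Indistinguishability of g and h by F is then the Frieze-Kannan guarantee
# that h approximates the number of edges across every cut up to eps * n^2.

random.seed(0)
n = 30
adj = {(u, v): random.randint(0, 1) for u in range(n) for v in range(n) if u != v}

def g(u, v):
    return adj[(u, v)]

# A deliberately crude candidate approximator: the constant function equal to
# the overall edge density (for a truly random graph, this is already hard to
# distinguish from g by cuts).
density = sum(adj.values()) / len(adj)

def h(u, v):
    return density

def cut_advantage(S, T):
    """E_x f_{S,T}(x) * (g(x) - h(x)), with x uniform over ordered pairs."""
    diff = sum(g(u, v) - h(u, v) for u in S for v in T if u != v)
    return diff / (n * (n - 1))

for _ in range(5):
    S = {u for u in range(n) if random.random() < 0.5}
    T = {u for u in range(n) if random.random() < 0.5}
    print(round(cut_advantage(S, T), 4))
```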

In this post I’ll show how to use it to give short proofs of the Hardcore Lemma of Impagliazzo and the Dense Model Theorem of Green, Tao and Ziegler. Both proofs also have, at least in hindsight, a sense of “inevitability,” meaning that given the Low-Complexity Approximation Theorem, and given what we want to prove, both proofs get to the point in a most economical and natural way.

  1. The Impagliazzo Hardcore Lemma. We have already mentioned that if g is “hard-on-average” for F, then h cannot be an approximation in the sense of being close to g on most inputs. What, then, about the points on which g and h differ? They form an Impagliazzo Hardcore Set for g, as described next. (A toy numerical sketch of the construction appears right after this list.)

    Let g: X \rightarrow \{-1,1\} be a function that is weakly hard on average for a class of algorithms F. Suppose, specifically, that for every algorithm h of complexity O(\epsilon^{-2}\delta^{-2}) relative to F we have

    {\mathbb P} [ g(x) = h(x) ] \leq 1-\delta

    and, more generally, for fractional h, we have

    (1) \ \ \ {\mathbb E} | g(x) - h(x) | \geq 2\delta

    Then, construct an approximating function h of complexity
    \leq \epsilon^{-2}\delta^{-2} relative to F and such that h and g are \epsilon\delta-indistinguishable by F. Note that, even though h is “indistinguishable” from g, it is also “far” from g, as in (1).

    Define a probability distribution H that assigns to point x a probability proportional to |g(x)-h(x)|. (If h were boolean, this would be the uniform distribution over points on which h and g differ.) Then this distribution is \delta-dense in the uniform distribution, meaning that every point has probability at most (\delta |X|)^{-1}. Observe also that we have

    g(x)\cdot |g(x)-h(x)| = g(x)-h(x)

    for every x, because g(x) and g(x)-h(x) have the same sign and |g(x)|=1, so we have

    \begin{array}{ll} \displaystyle {\mathbb E}_H g(x)f(x) & \displaystyle = \sum_x \frac {|g(x)-h(x)|}{\sum_y |g(y)-h(y)|} \cdot g(x)f(x) \\ & \displaystyle = \frac {|X|}{\sum_y |g(y)-h(y)|} \cdot {\mathbb E}_X |g(x)-h(x)|g(x)f(x) \\ & \displaystyle \leq \frac {1}{\delta} {\mathbb E}_X  (g(x)-h(x))f(x)  \\ & \displaystyle \leq \epsilon \end{array}

    and so H is a hardcore distribution, because the above expression is equivalent to

    {\mathbb P}_H [ g(x) = f(x) ] \leq \frac 12 + \frac \epsilon 2

  2. The Dense Model Theorem. Suppose that R\subseteq X is a pseudorandom set with respect to functions that have bounded complexity relative to F, and let D\subseteq R be a dense subset of R, |D| = \delta |R|.

    To find a dense model of D, we take g to be the characteristic function of D, and we let h be the low-complexity approximation, but using the uniform distribution on R as \mu. Now suppose for simplicity that h is boolean, and that M is the set of inputs of X on which h is 1. We want to argue that M is a dense model of D. By assuming without loss of generality that F contains the all-one function, we get from the indistinguishability of g and h that

    \delta = {\mathbb E}_R g \approx {\mathbb E}_R h

    and from the pseudorandomness of R we have

    {\mathbb E}_R h \approx {\mathbb E}_X h = |M|/|X|

    and so |M| \approx \delta |X| and M is indeed dense.

    For the indistinguishability of M and D, take any function f, and observe that

    {\mathbb E}_D f = \delta^{-1} {\mathbb E}_R fg \approx \delta^{-1} {\mathbb E}_R fh \approx \delta^{-1} {\mathbb E}_X fh \approx {\mathbb E}_M f

    where we use both the indistinguishability of g and h under distribution R, and the fact that the distributions R and X are indistinguishable by functions of bounded complexity.

    This proof is appealingly intuitive, in the sense that if we expect D to be indistinguishable from a large set, then when we try to approximate the characteristic function of D we will end up with a low complexity function that is spread around much of X, thus defining a dense model. It also shows that “relative” versions of the Regularity Lemma, such as the Regularity Lemma for subgraphs of an expander, may be derived from the ordinary Regularity Lemma by the above argument. A disadvantage of the argument is that it does not establish the stronger form of the Dense Model Theorem suggested by Impagliazzo, in which there is no set R, but we only require D to satisfy the “pseudo-density” requirement that for every low-complexity bounded function f,

    {\mathbb E}_{x\in D} |f(x)| \leq \frac 1 {\delta} {\mathbb E}_{X} |f(x)| + \epsilon

    which follows immediately if D has density \delta in a pseudorandom set R, but that is a seemingly weaker property. (The relative Regularity Lemma in graphs had long been known to hold under such a pseudo-density assumption.)
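
Going back to item 1, here is a toy numerical check (my own illustration, with an artificial g and h, not an implementation from any paper) of the hardcore construction: weighting each point by |g(x)-h(x)| gives a distribution that is \delta-dense, and a function f that has small correlation with g-h under the uniform distribution has small advantage in predicting g under H.

```python
import random

# Toy check of the hardcore construction (artificial g and h of my own making).
random.seed(1)
N, delta = 10000, 0.2

g = [random.choice([-1.0, 1.0]) for _ in range(N)]
# A fake approximator h: it equals g except on about a 2*delta fraction of
# points, where it is 0, so that E|g - h| is about 2*delta.
h = [gx if random.random() > 2 * delta else 0.0 for gx in g]

w = [abs(g[i] - h[i]) for i in range(N)]   # unnormalized weights |g(x) - h(x)|
Z = sum(w)
H = [wi / Z for wi in w]                   # the candidate hardcore distribution

# delta-density: every point should have probability at most 1 / (delta * N).
print("density check (should be <= 1):", max(H) * delta * N)

# Advantage of a bounded f under H versus its correlation with (g - h) under
# the uniform distribution: E_H[g f] = (N / Z) * E_X[(g - h) f].
f = [random.uniform(-1, 1) for _ in range(N)]
corr_uniform = sum(f[i] * (g[i] - h[i]) for i in range(N)) / N
adv_H = sum(H[i] * g[i] * f[i] for i in range(N))
print(corr_uniform, adv_H, corr_uniform * N / Z)
```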

Boosting and the Regularity Lemma

In a previous post, I described abstract forms of the weak regularity lemma, in which we start from an arbitrary bounded function g: X \rightarrow [-1,1] and an arbitrary set F of “structured” bounded functions f: X \rightarrow [-1,1], and we construct an “approximating” function h that has “low complexity” relative to F and is “indistinguishable” from g by functions from F. We had two proofs, both generalizations of arguments due to Frieze and Kannan: in one proof h is not bounded, but its complexity is O(\epsilon^{-2}), where \epsilon is the approximation parameter; in the second proof, h is bounded but it has exponential complexity exp(O(\epsilon^{-2})) in the approximation parameter.

In a new paper by Madhur Tulsiani, Salil Vadhan, and me, we give a new proof that gives an approximating function that, at the same time, is bounded and has low complexity. This has a number of applications, which I will describe in a future post.

(Note that, in the statement below, F is required to contain Boolean functions, rather than bounded ones. It is possible, however, to “reduce” the bounded case to the boolean case via the following observation: if F is a family of bounded functions, F' is the family of boolean functions \{ 1_{f(x) \geq t} \ : \ f\in F, t\in [-1,1] \}, and g and h are \epsilon-indistinguishable according to F', then they are also \epsilon-indistinguishable according to F.)

Theorem (Low Complexity Approximation, TTV Version) Let (X,\mu) be a probability space, g:X \rightarrow [-1,1] a bounded function, F a collection of boolean functions f:X \rightarrow \{ -1,1\}, and \epsilon an approximation parameter.

Then there is a function h: X \rightarrow [-1,1] such that

  • h has low complexity relative to F: there are k= O( \epsilon^{-2}) functions f_i\in F and coefficients c_i such that

    \displaystyle h(x) := \max \{ -1 , \min \{ 1, \sum_{i=1}^k c_i f_i (x)  \}\}

  • h is \epsilon-indistinguishable from g by F, that is,

    \forall f\in F \ \ \left| {\mathbb E}_{x \sim \mu} f(x) g(x) - {\mathbb E}_{x \sim \mu} f(x) h(x) \right|  \leq \epsilon

That is, h is simply a linear combination of functions from F, whose value is truncated to be between -1 and 1.

Recall that, when we do not require h to be bounded, h can be constructed via the following algorithm:

Algorithm FK

  • h_0 := 0; t= 0
  • while \exists f_{t+1} \in F such that {\mathbb E} f_{t+1} \cdot (g-h_t) \geq \epsilon
    • h_{t+1} := \sum_{i=1}^{t+1} \gamma f_{i}
    • t:=t+1
  • return h_t

And the analysis shows that, if we call \Delta_t := g - h_t, then {\mathbb E} \Delta_t^2 - \Delta_{t+1}^2 \geq  2\epsilon\gamma - \gamma^2. Setting \gamma := \epsilon shows that the algorithm stops within \epsilon^{-2} steps.
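
For completeness, the energy-decrease bound is the following one-line computation (using h_{t+1} = h_t + \gamma f_{t+1}, the termination condition {\mathbb E}\, f_{t+1}(g-h_t) \geq \epsilon, and f_{t+1}^2 \leq 1):

\displaystyle {\mathbb E}\, \Delta_t^2 - \Delta_{t+1}^2 = {\mathbb E} \left[ (g-h_t)^2 - (g-h_t-\gamma f_{t+1})^2 \right] = 2\gamma\, {\mathbb E}\, f_{t+1}(g-h_t) - \gamma^2\, {\mathbb E}\, f_{t+1}^2 \geq 2\epsilon\gamma - \gamma^2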

Our proof proceeds by doing exactly the same, but making sure at every step that h is bounded.

Algorithm TTV

  • h_0 := 0; t= 0
  • while \exists f_{t+1} \in F such that {\mathbb E} f_{t+1} \cdot (g-h_t) \geq \epsilon
    • h_{t+1} := \max \{ -1, \min \{ 1, \sum_{i=1}^{t+1} \gamma f_i \} \}
    • t:=t+1
  • return h_t
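
As a sanity check, here is a minimal Python sketch of Algorithm TTV on a toy instance (a finite domain with the uniform distribution and a small, arbitrary family F of \pm 1 functions; the instance is made up by me and is only meant to illustrate the loop above):

```python
import random

# Toy instance of Algorithm TTV: X = {0,...,N-1} with the uniform measure,
# F a small arbitrary family of +/-1 functions given as tables, g an
# arbitrary bounded function.  Each round finds a distinguisher with
# correlation >= eps, adds gamma times it, and truncates to [-1, 1].

random.seed(0)
N, eps = 500, 0.1
gamma = eps

g = [random.uniform(-1, 1) for _ in range(N)]
F = [[random.choice([-1, 1]) for _ in range(N)] for _ in range(20)]

def clip(v):
    return max(-1.0, min(1.0, v))

def corr(f, h):
    """E_x f(x) * (g(x) - h(x)) under the uniform distribution."""
    return sum(f[x] * (g[x] - h[x]) for x in range(N)) / N

lin = [0.0] * N          # the untruncated linear combination sum_i gamma * f_i
h = [0.0] * N            # h_t = truncation of lin to [-1, 1]
rounds = 0

while True:
    best = max(F, key=lambda f: corr(f, h))
    if corr(best, h) < eps:
        break            # no remaining distinguisher: h is eps-indistinguishable
    lin = [lin[x] + gamma * best[x] for x in range(N)]
    h = [clip(v) for v in lin]
    rounds += 1

print("rounds:", rounds, "(the theorem promises O(1/eps^2))")
```

Replacing the truncation by the identity turns this into Algorithm FK.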

The problem with the old analysis is that now we could conceivably have a step in which \sum_{i=1}^t \gamma f_i (x) > 1+\gamma for all x, and so h_t = h_{t+1}, and thus \Delta_t = \Delta_{t+1}, and we have no energy decrease. To get to such a state, however, there must have been about 1/\gamma steps in which h changed in the same way as in the Frieze-Kannan case. I think there should be a sort of “amortized analysis” that would “charge” the lack of change of h at certain steps to other steps in which it indeed changes, and establish that, for every time step T,

{\mathbb E} \Delta_0^2 - \Delta_T^2 \geq \Omega(T \cdot \gamma \cdot (\epsilon - O(\gamma)))

Unfortunately I don’t know how to construct such a proof. Our proof, instead, follows step by step Impagliazzo’s proof of the Impagliazzo Hardcore Set Lemma, which employs a very similar algorithm (in a quite different context). As shown by Klivans and Servedio, the algorithm in Impagliazzo’s proof can be seen as a boosting algorithm in learning theory (and every other known boosting algorithm can be used to prove Impagliazzo’s Hardcore Lemma); this is why we think of our proof as a “boosting” proof.

I must confess, I have never really understood Impagliazzo’s proof, and so I can’t say I really understand ours, other than being able to reproduce the steps.

The idea is to consider the quantity f_{t+1}(x) \Delta_{t}(x), which depends on t and x, and to see how it behaves summed over t, for a fixed x, and how it behaves for a fixed t, summed over x.

Suppose the algorithm is still running after T steps. Then, because of the way we define the termination condition, we have, for every t \in \{ 0,\ldots, T-1\}

\displaystyle (1) \ \ \ {\mathbb E}_x f_{t+1}(x)  (g(x)-h_t(x)) \geq  \epsilon

and the crux of the proof is to show that for every x we have

(2) \displaystyle \ \ \ \sum_{t=0}^{T-1} f_{t+1}(x)  (g(x)-h_t(x)) \leq \frac \gamma 2 \cdot T  + \frac 4 {\gamma}

So if we sum (1) over t and average (2) over x, we get

\displaystyle  \epsilon T \leq  \frac  \gamma 2 \cdot T  + \frac 4 {\gamma}

and setting \gamma := \epsilon gives T \leq 8 \epsilon^{-2}.

Inequality (2), the main step of the proof, is the part for which I have little intuition. One breaks the summation into groups of time steps, depending on the value of \Delta_t; there are 2/\gamma groups (because the value of h_t changes by discrete increments of \gamma, and is between -1 and +1) and each one is shown to contribute O(\gamma T') + O(1), where T' is the number of time steps in the group.

It is perhaps instructive to translate the Frieze-Kannan proof to this set-up. In the Frieze-Kannan algorithm, we have

\displaystyle  f_{t+1}(x)  (g(x)-h_t(x))  = \frac 1 {2\gamma} \cdot \left( \Delta_{t}^2 - \Delta^2_{t+1} +\gamma^2 f_{t+1}^2 \right)

and so

\begin{array}{ll} \displaystyle  \sum_{t=0}^{T-1} f_{t+1}(x)  (g(x)-h_t(x)) & = \displaystyle \frac 1{2\gamma} (\Delta_0^2 - \Delta_{T}^2) + \frac {\gamma}2 \cdot \sum_{t=1}^{T} f_t^2(x) \\  & \leq \displaystyle  \frac 1 {2\gamma}  + \frac \gamma 2 \cdot  T \end{array}

which is analogous to our inequality (2).

Decomposition Results and Regularity Lemmas

Several results in additive combinatorics have the flavor of decomposition results in which one shows that an arbitrary combinatorial object can be approximated by a “sum” of a “low complexity” part and a “pseudorandom” part. This is helpful because it reduces the task of proving that an arbitrary object has a certain property to the easier task of proving that low-complexity objects and pseudorandom objects have the property. (Provided that the property is preserved by the approximation and the sum.)

Many examples of such results are given in the paper accompanying Terry Tao’s tutorial at FOCS’07.

Usually, one needs a strong notion of approximation, and in such cases the complexity of the “low complexity” part is fantastically large (a tower of exponentials in the approximation parameter), as in the Szemeredi Regularity Lemma, which can be seen as a prototypical such “decomposition” result. In some cases, however, weaker notions of approximation are still useful, and one can have more reasonable complexity parameters, as in the Weak Regularity Lemma of Frieze and Kannan.

A toy example of a “weak decomposition result” is the following observation:

Best Tutorials Ever

FOCS 2007 started yesterday in Providence with a series of tutorials.

Terry Tao gave a talk similar to the one he gave in Madrid, discussing the duality between pseudorandomness and efficiency which is a way to give a unified view of techniques coming from analysis, combinatorics and ergodic theory.

In typical such results, one has a set $F$ of “simple” functions (for example linear, or low-degree polynomials, or, in conceivable complexity-theoretic applications, functions of low circuit complexity) and one wants to write an arbitrary function $g$ as

$ g(x) = g_{pr} (x) + g_{str} (x) + g_{err} (x) $

where $g_{pr}$ is pseudorandom with respect to the “distinguishers” in $F$, $g_{str}$ is a “simple combination” of functions from $F$, and $g_{err}$ accounts for a possible small approximation error. There are a number of ways to instantiate this general template, as can be seen in the accompanying notes, and it is nice to see how even the Szemeredi regularity lemma can be fit into this template. (The “functions” are adjacency matrices of graphs, and the “efficient” functions are complete bipartite subgraphs.)

Dan Boneh spoke on pairing-based cryptography, an idea that has grown into a whole, rich, area, with specialized conferences and, according to Google Scholar, 1,200+ papers published so far. In this setting one has a group $G$ (for example points on an elliptic curve) such that there is a mapping $e: G \times G \rightarrow G_T$ that takes pairs of elements of $G$ into an element of another group $G_T$ satisfying a bilinearity condition. (Such a mapping is a “pairing,” hence the name of the area.) Although such a mapping can lead to attacks on the discrete log problem in $G$, if the mapping is chosen carefully one may still assume intractability of discrete log in $G$, and the pairing can be very useful in constructing cryptographic protocols and proving their security. In particular, one can get “identity-based encryption,” a type of public key cryptography where a user’s public key can be her own name (or email address, or any deterministically chosen name), which in turn can be used as a primitive in other applications.

Dan Spielman spoke on spectral graph theory, focusing on results and problems that aren’t quite studied enough by theoreticians. He showed some remarkable examples of graph drawings obtained by simply plotting each vertex $i$ at the point $(v(i),w(i))$, where $v$ and $w$ are the eigenvectors corresponding to the second and third smallest eigenvalues of the Laplacian of the graph. The sparse cut promised by Cheeger’s inequality is, in such a drawing, just the cut given by a vertical line across the drawing, and there are nice algebraic explanations for why the drawing looks intuitively “nice” for many graphs but not for all. Spectral partitioning has been very successful for image segmentation problems, but it has some drawbacks and it would be nice to find theoretically justified algorithms that would do better.

Typically, I don’t go to an Italian restaurant in the US unless I have been there before and liked it, a rule that runs into certain circularity problems. I was happy that yesterday I made an exception to go to Al Forno, which proved to be truly exceptional.

The unreasonable effectiveness of additive combinatorics in computer science

As I have written several times on these pages, techniques from additive combinatorics seem to be very well suited to attack problems in computer science, and a good number of applications have already been found. For example, “sum-product theorems” originally developed in a combinatorial approach to the Kakeya problem have been extremely valuable in recent constructions of randomness extractors. The central theorem of additive combinatorics, Szemeredi’s theorem, now has four quite different proofs, one based on graph theory and Ramsey theory, one based on analytical methods, one based on ergodic theory and one based on hypergraph theory. The first proof introduced the Szemeredi regularity lemma, which is a fixture of algorithmic work on property testing. The analytical proof of Gowers introduced the notion of Gowers uniformity that, so far, has found application in PCP constructions, communication complexity, and pseudorandomness. There is also work in progress on complexity-theoretic applications of some of the ergodic-theoretic techniques.

Why is it the case that techniques developed to study the presence of arithmetic progressions in certain sets are so useful to study such unrelated notions as sub-linear time algorithms, PCP systems, pseudorandom generators, and multi-party protocols? This remains, in part, a mystery. A unifying theme in the recent advances in additive combinatorics is the notion that every large combinatorial object can be “decomposed” into a “pseudorandom” part and a “small-description” part, and that many questions that we might be interested in are easy to answer, at least approximately, on pseudorandom and on small-description objects. Since computer scientists almost always deal with worst-case scenarios, and are typically comfortable with approximations, it is reasonable that we can take advantage of techniques that reduce the analysis of arbitrary worst cases to the analysis of much simpler scenarios.

Whatever the reason for their effectiveness, it is worthwhile for any theoretical computer scientist to learn more about this fascinating area of math. One of the tutorials in FOCS 2007 will be on additive combinatorics, with a celebrity speaker. More modestly, following Random-Approx 2007, in Princeton, there will be a course on additive combinatorics for (and by) computer scientists. (If you want to go, you have to register by August 1 and reserve the hotel by this weekend.)

Entropy and the weak Regularity Lemma

If you remember my discussion of Szemeredi’s proof of the Szemeredi Regularity Lemma, the proof goes by progressively refining an initial arbitrary partition until one gets an eps-regular partition. A potential function is used to prove that the number of refinement steps is at most poly(1/eps). I said the potential function measures “variance” or “disorder” in the densities of edges between blocks of the partition. I used this terminology because I was aware of an alternate proof by Tao where the Regularity Lemma is reformulated as a statement about random variables and the proof uses conditional entropy (but not in a straightforward way) as a potential function. Tao’s proof gives a stronger version of the Regularity Lemma, similar to another strong version that was proved by Alon, Fischer, Krivelevich and Szegedy, motivated by applications to property testing.

I still cannot understand the intuition of the stronger statement, the role of the parameter “F” in Tao’s proof, and the role of the “fine” partition. The probabilistic interpretation, however, leads to an amusing three-line proof of the weak Regularity Lemma. (As we shall see, unfortunately, one of the three lines is too hard for me to figure out.)

Before stating the weak Regularity Lemma, let us define the notion of weakly regular partition. Let G=(V,E) be a graph, (V1,…,Vk) be a partition of the vertices, and for any two blocks of the partition define the density

d(Vi,Vj) := e(Vi,Vj) / ( |Vi| * |Vj| )

where e(Vi,Vj) is the number of edges between Vi and Vj; it is also convenient to have the notation

d(Vi,Vi) := 2*e(Vi) / |Vi|^2

where e(Vi) is the number of edges within Vi.

We say that the partition (V1,…,Vk) is weakly eps-regular if for every two disjoint sets S,T of vertices we have that the quantity

sum_{i,j} |Vi intersection S| * |Vj intersection T| * d(Vi,Vj)

approximates e(S,T) up to an additive error of at most eps*|V|^2.
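
To make the definition concrete, here is a toy sketch (my own code, checking the quantity above on a random graph with an arbitrary partition; it is not part of any of the proofs discussed here):

```python
import random

# Compare the quantity in the definition of weak regularity,
#   sum_{i,j} |Vi ∩ S| * |Vj ∩ T| * d(Vi,Vj),
# with the true count e(S,T), on a random graph with an arbitrary partition.

random.seed(0)
n, k = 80, 4
adj = [[0] * n for _ in range(n)]
for u in range(n):
    for v in range(u + 1, n):
        adj[u][v] = adj[v][u] = random.randint(0, 1)

parts = [set(range(i * n // k, (i + 1) * n // k)) for i in range(k)]

def e(A, B):
    return sum(adj[u][v] for u in A for v in B)

def d(A, B):
    return e(A, B) / (len(A) * len(B))

S = set(random.sample(range(n), n // 3))
T = set(range(n)) - S                     # disjoint from S

approx = sum(len(Vi & S) * len(Vj & T) * d(Vi, Vj)
             for Vi in parts for Vj in parts)
print(e(S, T), approx)                    # for a random graph the error is o(n^2)
```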

Finally we can state the weak Regularity Lemma

Weak Szemeredi Regularity Lemma. For every eps there is a c(eps) such that every graph G=(V,E) admits a weakly eps-regular partition (V1,…,Vk) where k = 2^{poly(1/eps)}.

Note that the number of elements of the partition is singly exponential in 1/eps, instead of being a tower of exponentials. This is a considerable improvement, but the weaker property of the partition makes it unsuitable for many applications, including property testing and additive combinatorics.

The observant reader should see how to modify the original proof to fit this new setting. (Start from an arbitrary partition, if it is not good, use the sets S and T to refine, and so on; now each refinement step increases the size of the partition only by a constant factor, and, as before, we only have poly(1/eps) refinements.)

What about the information-theoretic proof? It will be a simplification of Tao’s proof to the case of the weak Regularity Lemma. Define the following random variables: X and Y are two independent and uniformly distributed vertices, and E is the 0/1 indicator of the event that (X,Y) is an edge.

We use the notation H(Z) for the entropy of a random variable Z, H(Z|W) as the entropy of Z conditioned on W, I(Z,W):= H(Z)-H(Z|W) for the mutual information of Z and W and I(Z,W|R):= H(Z|R)-H(Z|R,W) for the mutual information of Z and W conditioned on R.

Finally, we come to the proof. We ask the question: Are there boolean functions f1,g1:V -> {0,1} such that I(E, (f1(X),g1(Y))) > eps? That is, after someone picks X and Y at random, is there one bit of information we can ask about X and one bit of information we can ask about Y that helps us guess whether there is an edge between X and Y? Equivalently, are there sets S and T such that knowing whether X is in S and whether Y is in T helps us guess whether there is an edge between X and Y? If not, it can be checked that the number of edges between any two sets S and T essentially depends only on the size of S and T, and V itself is a weakly poly(eps)-regular partition. (It is a “partition” with only one block.)

Otherwise, we ask whether there are functions f2,g2 such that

I(E, (f1(X),g1(X),f1(Y),g1(Y),f2(X),g2(Y))) > 2*eps

If not, then for every two boolean functions f2,g2, once we know f1(X),g1(X),f1(Y),g1(Y), it does not help to know f2(X) and g2(Y) in order to guess whether (X,Y) is an edge. This means that if we partition V into 4 blocks (corresponding to possible values of f1,g1), then for every two sets S,T (corresponding to f2,g2), the number of edges between S and T essentially only depends on the sizes of the intersections of S and T with the four blocks of the partition. Hence, the partition is weakly regular.

Alternatively, we keep going, but, if we increase the mutual information by eps at each step, we cannot go for more than 1/eps steps. So there must be a k < 1/eps and 2k functions f1,…,fk,g1,…,gk such that, for all functions f,g, if we know the 4k bits f1(X),f1(Y),…,gk(X),gk(Y), then we gain less than another eps bits of information about E if we are also told f(X) and g(Y). Someone familiar with inequalities about entropy (this is the step I am missing) should be able to infer that if we partition V into the 2^{2k} subsets corresponding to all possible values of fi and gi, such a partition must be eps’-weakly regular with eps’=poly(eps). (Probably about eps^{1/2}.)
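
The mutual information quantity driving this argument is easy to compute explicitly on small examples. Here is a toy sketch (my own code, with a random graph and arbitrary sets S,T playing the role of f1,g1) of the quantity I(E, (f1(X),g1(Y))) from the first step:

```python
import math
import random
from collections import Counter

# Compute I(E; (1_S(X), 1_T(Y))) for a toy graph: X, Y are independent uniform
# vertices, E is the edge indicator, and S, T are candidate distinguisher sets.
# Uses I(Z;W) = H(Z) + H(W) - H(Z,W).

random.seed(0)
n = 40
adj = [[0] * n for _ in range(n)]
for x in range(n):
    for y in range(x + 1, n):
        adj[x][y] = adj[y][x] = random.randint(0, 1)

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def mutual_information(S, T):
    joint = Counter()
    for x in range(n):
        for y in range(n):
            joint[(adj[x][y], x in S, y in T)] += 1.0 / (n * n)
    pE, pB = Counter(), Counter()
    for (e, a, b), p in joint.items():
        pE[e] += p
        pB[(a, b)] += p
    return entropy(pE) + entropy(pB) - entropy(joint)

S = set(range(n // 2))
T = set(range(n // 4, 3 * n // 4))
print(round(mutual_information(S, T), 5))   # tiny for a random graph
```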

Maybe this is not really three lines (in fact it is possibly more complicated than the “variance” argument), but I feel that one can see what is going on much better than in the “variance” argument.

The Szemeredi Regularity Lemma

This semester I will mostly be securing the cyberspace, but I also plan to learn more about additive combinatorics and its applications to computer science.

In the last few days I have been trying to understand the proof of the Szemeredi Regularity Lemma. It is actually not as hard as I thought.

The Szemeredi Regularity Lemma states that every graph can be approximately described by an object whose size is a constant that depends only on the accuracy of the approximation, and not on the number of vertices of the graph.

This remarkable “approximate classification theorem for graphs” has several applications in extremal combinatorics, additive combinatorics, and computer science.

(Indeed, besides such applications, one may imagine using the Regularity Lemma as follows: to prove an approximate quantitative statement about arbitrary graphs, we only need to prove a related statement for the finite set of objects that approximate arbitrary graphs up to a certain level of accuracy. The latter, completely finite, goal may be performed via a computer search. Unfortunately, the size of the objects arising in the Regularity Lemma grow so astronomically fast in the approximation that this approach is completely impractical.)

Roughly speaking, the Lemma says that for every graph G=(V,E) and every parameter eps, we can find a partition of V into equal-size subsets V1,…,Vk such that, for almost all pairs i,j, the edges between Vi and Vj form a dense bipartite “expander.” Furthermore, k depends only on eps.

Let us define the notion of “expansion” that we will use. Let G=(U,V,E) be a bipartite graph, where U and V are sets of vertices and E is the set of edges. Let A be a subset of U and B be a subset of V. Define the density d(A,B) of the edges between A and B as e(A,B)/(|A|*|B|), that is, the number of edges between A and B divided by the maximum number of possible edges. If G were a random graph, we would expect the density d(A,B) to be approximately the same for all large enough sets A and B, and to always be approximately |E|/(|U|*|V|). We say that U and V are eps-regular if for every A of size at least eps*|U| and every B of size at least eps*|V| the density d(A,B) is equal to |E|/(|U|*|V|) plus or minus eps.

Szemeredi Regularity Lemma. For every eps there is a constant c(eps) such that for every t and every graph G=(V,E) we can find a partition of V into at least t and at most t*c(eps) sets of equal size V1,…,Vk, such that for at least a 1-eps fraction of the pairs i,j, the sets Vi and Vj are eps-regular.

[A minor technical point: the number of vertices of the graph may not be divisible by the number of sets in the partition, so we allow one of the sets in the partition to be slightly smaller than the other k-1 sets.]

The idea of the proof is to start with an arbitrary partition of the set of vertices into t sets. If the partition satisfies the lemma, we are done. Otherwise, for many pairs Vi,Vj we can find a subset Aij of Vi and a subset Aji of Vj such that both subsets are large but the density of the edges between them is not close to the density of the edges between Vi and Vj. We then refine our partition using these sets. This means that we take every set in the partition and we further subdivide it into subsets (in a way that we describe shortly). The refinement is chosen in such a way that the densities in the new partition are more “disordered” or have more “variance” than they had before. In particular, after the refinement, we show that a function that measures such “disorder” increases by at least poly(eps). Such a function can never be more than 1, so we never need to refine the partition more than poly(1/eps) times. At the end, we are left with a partition that satisfies the Lemma and that contains a number of sets that depends only on eps.

It now remains to: (i) define the function that measures disorder; (ii) describe the refinement step; (iii) prove that the refinement increases the disorder function by at least poly(eps).

Let V1,…,Vk be a partition of V, and let pi=|Vi|/|V| be the fraction of elements in Vi. We define the variance of the partition as

f(V1,…,Vk) := sum_{i,j} pi*pj*(d(Vi,Vj))^2

We can interpret this quantity as follows. Consider the random variable X defined by the following experiment: we pick at random, independently, two vertices u,v, we let Vi be the set containing u and Vj be the set containing v, and we define X as d(Vi,Vj). Then f(V1,…,Vk) is the expectation of X^2. Also, observe that the expectation of X is simply |E|/|V|^2, independent of the partition, and so f(V1,…,Vk) indeed measures, up to additive factors, the variance of X.

It is now an exercise in the use of Cauchy-Schwarz to prove that f() cannot decrease if we refine the partition by splitting a set into two or more subsets.
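
Here is a small numerical illustration of that fact (a toy check of my own, not part of the proof): compute the variance of a partition of a random graph, refine the partition arbitrarily, and observe that the variance does not decrease.

```python
import random
from itertools import product

# Toy check: the potential f(V1,...,Vk) = sum_{i,j} pi*pj*d(Vi,Vj)^2 does not
# decrease when the partition is refined.

random.seed(0)
n = 60
adj = [[0] * n for _ in range(n)]
for u in range(n):
    for v in range(u + 1, n):
        adj[u][v] = adj[v][u] = random.randint(0, 1)

def density(A, B):
    return sum(adj[u][v] for u in A for v in B) / (len(A) * len(B))

def variance(parts):
    return sum((len(A) / n) * (len(B) / n) * density(A, B) ** 2
               for A, B in product(parts, parts))

coarse = [list(range(0, n // 2)), list(range(n // 2, n))]
# Refine by splitting each block into two arbitrary halves.
fine = [B[:len(B) // 2] for B in coarse] + [B[len(B) // 2:] for B in coarse]

print(variance(coarse), "<=", variance(fine))
```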

Now, suppose that Vi and Vj are not an eps-regular pair, and let A and B be subsets witnessing this fact. Split Vi into A,Vi-A and Vj into B,Vj-B. This is now a more “disordered” partition, because the density d(A,B) is noticeably more (or noticeably less) than d(Vi,Vj), and so it also has to be noticeably different from at least one of the three other densities d(A,Vj-B), d(Vi-A,B), d(Vi-A,Vj-B). In fact, we can show that the contribution of the pair Vi,Vj to the variance grows from pi*pj*d^2(Vi,Vj) to at least pi*pj*d^2(Vi,Vj) + const*pi*pj*eps^4.

If our partition is not eps-regular, then at least an eps fraction of all pairs is not eps-regular, and, by splitting each pair we can increase the variance by at least const*eps^5.

There is, however, a difficulty. Let’s say that the pair V1 and V2 is not eps-regular (as witnessed by sets A,B), and neither is the pair V1 and V3 (as witnessed by sets A’ and C). What do we do if A and A’ are different? Do we split V1 according to A or to A’?

The answer is both, that is, we split V1 into four sets so that A is the union of two of them, and A’ is also the union of two of them. In general, if V1 participates in k-1 non-eps-regular pairs, then we split V1 into 2^{k-1} subsets. For each non-eps-regular pair Vi,Vj, where A,B are the sets witnessing the non-regularity, this refinement is finer than the refinement that simply breaks Vi into A,Vi-A and Vj into B,Vj-B, and so it increases the contribution of Vi,Vj to the variance by at least as much.

This refinement creates subsets of different sizes, while we want the sets to have the same size. We can adjust the size by further refining the large sets and possibly merging small sets, a process that can slightly reduce the variance but, if done properly, by less than half of the amount we gained previously.

Overall, the number of sets in the partition increases exponentially, and the variance increases by at least const*eps^5. This means that we reach an eps-regular partition after no more than O(1/eps^5) steps. The number of sets in the partition, unfortunately, can now be as large as a tower of exponentials of height O(1/eps^5).

Gowers has shown that such a tower-of-exponentials growth is necessary.

Property testing and Szemeredi’s Theorem

After discussing Szemeredi’s Theorem and the analytical approaches of Roth and Gowers, let’s see some ideas and open problems in the combinatorial approach.

The following result is a good starting point.

Triangle Removal Lemma. For every \delta >0 there is a constant \epsilon(\delta) > 0 such that if G is an n-vertex graph with at most \epsilon(\delta)n^3 triangles, then it is possible to make G triangle-free by removing at most \delta n^2 edges.

This result follows easily from the Szemeredi Regularity Lemma, and it is a prototype of several results in the field of property testing.

Indeed, consider the problem of distinguishing triangle-free graphs from graphs that are not even close to being triangle-free. For the purpose of this example, we think of a graph as “close to being triangle-free” if it can be made triangle-free by removing at most \delta n^2 edges, for some small constant \delta > 0. Then the Lemma tells us that in a not-even-close graph we have at least \epsilon(\delta) n^3 triangles, and so a sample of O(1/\epsilon(\delta)) vertices is likely to contain a triangle. So here is an algorithm for the problem: pick at random O(1/\epsilon(\delta)) vertices and see if they induce a triangle. We note that:

  • The algorithm satisfies our requirement;
  • The running time is a constant independent of n and dependent only on \delta;
  • This all makes sense only if 1/\epsilon(\delta) grows moderately as a function of 1/\delta.
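
Here is a sketch of the sampling tester just described (toy code of mine; the sample size uses the constant \epsilon(\delta) promised by the Lemma, which is exactly what makes the "constant time" claim so fragile quantitatively):

```python
import random
from itertools import combinations

# Sampling tester sketched above: sample O(1/eps_delta) vertices and accept
# iff they induce no triangle.  "adj" is an adjacency matrix (list of lists).

def test_triangle_free(adj, eps_delta, seed=0):
    random.seed(seed)
    n = len(adj)
    k = min(n, int(3 / eps_delta))           # O(1/eps_delta) sampled vertices
    sample = random.sample(range(n), k)
    for u, v, w in combinations(sample, 3):
        if adj[u][v] and adj[v][w] and adj[u][w]:
            return False                      # found a triangle: reject
    return True                               # plausibly triangle-free
```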

Let us now see that the Triangle Removal Lemma proves Szemeredi’s Theorem for sequences of length 3. As we discussed earlier, it is enough to prove the Theorem in groups of the type {\mathbf Z}_N, with N prime; we will do better and prove the Theorem in any abelian group. So let G be an abelian group of size N, and let A be a subset of G of size \delta N. Construct a tri-partite graph (X,Y,Z,E) where X,Y,Z are sets of vertices, each of size N, and each a copy of the group G. The set of edges is defined as follows:

  1. for x \in X and y \in Y, (x,y) is an edge if there is an a \in A such that x+a=y
  2. for x \in X and z \in Z, (x,z) is an edge if there is a b\in A such that x + b + b = z
  3. for y \in Y and z \in Z, (y,z) is an edge if there is a c \in A such that y+c=z

Now we notice that if x,y,z is a triangle, then there are a,b,c in A such that (after some cancellations) a+c = b+b, which means that a,b,c is an arithmetic progression of length 3 in the group G. In fact, we see that the number of triangles in the graph is precisely N times the number of triples (a,b,c) in arithmetic progression in A.

Consider now the N \cdot |A| = \delta N^2 triangles corresponding to the “trivial” progressions of the form a,a,a. (These are the triangles x,x+a,x+a+a.) We can see that these triangles are edge-disjoint, and so in order just to remove such triangles we have to remove from the graph at least \delta N^2 edges. So the Triangle Removal Lemma implies that there are at least \epsilon(\delta)N^3 triangles in the graph, and so at least \epsilon(\delta)N^2 triples in arithmetic progression in A. If N > \delta/\epsilon(\delta), then some of those arithmetic progressions must be non-trivial, and we have the length-3 case of Szemeredi’s theorem.
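
The construction is easy to carry out explicitly for the group {\mathbf Z}_N. Here is a toy sketch (my own code, assuming N odd so that b is determined by a and c) that builds the tripartite graph from a set A and checks that the number of triangles is exactly N times the number of triples in arithmetic progression:

```python
from itertools import product

# Build the tripartite graph of the reduction for G = Z_N and a set A, and
# check that #triangles = N * #{(a,b,c) in A^3 : a + c = 2b}.  (This toy code
# assumes N is odd, so that b is determined by a and c.)

N = 17
A = {1, 3, 9, 11}                        # an arbitrary subset of Z_N

doubles = {(2 * b) % N for b in A}       # possible values of b + b

def edge_xy(x, y):
    return (y - x) % N in A              # y = x + a

def edge_xz(x, z):
    return (z - x) % N in doubles        # z = x + b + b

def edge_yz(y, z):
    return (z - y) % N in A              # z = y + c

triangles = sum(1 for x, y, z in product(range(N), repeat=3)
                if edge_xy(x, y) and edge_xz(x, z) and edge_yz(y, z))

aps = sum(1 for a, b, c in product(A, repeat=3) if (a + c) % N == (2 * b) % N)

print(triangles, "==", N * aps)
```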

We see that both for the “property testing” application and for the Szemeredi Theorem application it is important to have good quantitative bounds on \epsilon(\delta). We know that, in Szemeredi’s theorem, N must be super-polynomial in 1/\delta, and so, in the Triangle Removal Lemma, 1/\epsilon(\delta) must be super-polynomial in 1/\delta.

The known bounds, however, are quite unreasonable: we only know how to bound 1/\epsilon(\delta) by a tower of exponentials whose height is poly(1/\delta). Unfortunately, such unreasonable bounds are unavoidable in any proof of the Triangle Removal Lemma based on Szemeredi’s Regularity Lemma. This is because, as proved by Gowers, such tower-of-exponentials bounds are necessary in the Regularity Lemma.

Can we find a different proof of the Triangle Removal Lemma that has only a singly- or doubly-exponential \epsilon(\delta)? That would be wonderful, both for property testing (because the proof would probably apply to other sub-graph homomorphism problems) and for additive combinatorics. Over the last year, I have found at least three such proofs. Unfortunately they were all fatally flawed.

Or, is there a more-than-exponential lower bound for the Triangle Removal Lemma? This would also be a major result: it would show that several results in property testing for dense graphs have no hope of being practical, and that there is a “separation” between the quantitative version of Szemeredi’s Theorem provable with analytical methods versus what can be proved with the above reduction. Besides, it’s not often that we have super-exponential lower bounds for natural problems, with Gowers’s result being a rare exception.

By the way, what about Szemeredi’s Theorem for longer sequences? For sequences of length k one needs a “k-clique removal lemma” for (k-1)-uniform hypergraphs, which in turn can be derived from the proper generalization of the Regularity Lemma to hypergraphs. This turns out to be quite complicated, and it has been accomplished only very recently in independent work by Nagle, Rödl and Schacht; and by Gowers. An alternative proof has been given by Tao. The interested reader can find more in expository papers by Gowers and Tao.

What about Szemeredi’s own proof? It does use the Regularity Lemma, which was conceived and proved specifically for this application, and it involves a reduction to a graph-theoretic problem. I have to admit that I don’t know much else.