Ellenberg’s announcement of a solution to the cap-set problem

Jordan Ellenberg has just announced a resolution of the “cap problem” using techniques of Croot, Lev and Pach, in a self-contained three-page paper. This is a quite unexpected development for a long-standing open problem at the core of additive combinatorics.

Perhaps the right starting point for this story is 1936, when Erdos and Turan conjectured that, for every {k}, if {A} is a subset of {\{1,\ldots, N\}} without {k}-term arithmetic progressions, then {|A|= o_k(N)}, or, equivalently, that if {A} is a subset of the integers of positive density, then it must have arbitrarily long arithmetic progressions. Their goal in stating this conjecture was that resolving it would be a stepping stone to proving that the prime numbers have arbitrarily long arithmetic progressions. This vision came true several decades later. Szemeredi proved the conjecture in 1975, and Green and Tao proved that the primes contain arbitrarily long arithmetic progressions in 2004, with Szemeredi’s theorem being a key ingredient in their proof.

Rewinding a bit, the first progress on the Erdos-Turan conjecture came from Roth, who proved the {k=3} case in 1955. Roth’s proof establishes that if {A \subseteq \{ 1,\ldots, N\}} does not have length-3 arithmetic progressions, then {|A|} is at most, roughly, {N/\log\log N}. Erdos also conjectured that the bound should be {o(N/\log N)}, and if this were true it would imply that the primes have infinitely many length-3 arithmetic progressions simply because of their density.

Roth’s proof uses Fourier analysis, and Meshulam, in 1995, noted that the proof becomes much cleaner, and leads to better bounds, if one looks at the analogous problem in {{\mathbb F}_p^n}, where {{\mathbb F}_p} is a finite field (of characteristic different from 2). In this case, the question is how big can {A\subseteq {\mathbb F}_p^n} be if it does not have three points on a line. An adaptation of Roth’s techniques gives an upper bound of the order of {p^n/n}, which, for constant {p}, is of the order of {N/\log N} if {N} is the size of the universe of which {A} is a subset.

Bourgain introduced a technique to work on {{\mathbb Z}/N{\mathbb Z}} “as if” it were a vector space over a finite field, and proved upper bounds of the order of {N/\sqrt {\log N}} and then {N/(\log N)^{3/4}} on the size of a subset of {\{ 1,\ldots , N\}} without length-3 arithmetic progressions. The latest result in this line is by Sanders, who proved a bound of {(N poly\log\log N)/\log N}, very close to Erdos’s stronger conjecture.

How far can these results be pushed? A construction of Behrend’s shows that there is a set {A\subseteq \{ 1,\ldots, N\}} with no length-3 arithmetic progression and size roughly {N/2^{\sqrt {\log N}}}. The construction is simple (it is a discretization of a sphere in {\sqrt {\log N}} dimensions) and it has some other unexpected applications. This means that the right bound in Roth’s theorem is of the form {N^{1-o(1)}} and that the “only” question is what is the {N^{-o(1)}} term.
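To make the construction concrete, here is a minimal sketch in Python (with a small fixed dimension rather than the optimal {\approx \sqrt{\log N}} dimensions, and with no attempt at optimizing parameters; the function name and the choices d=3, M=10 are just illustrative):

```python
# Behrend's construction, sketched: vectors in {0,...,M-1}^d lying on a common
# sphere, read as base-2M digit strings, give a 3-AP-free set of integers:
# if a + b = 2c with a != b, the absence of carries would force the corresponding
# vectors to satisfy v_a + v_b = 2 v_c, which is impossible for two distinct
# points of a sphere.
from itertools import product

def behrend_set(d, M):
    """A 3-AP-free subset of {0, ..., (2M)^d - 1} from the most popular sphere."""
    shells = {}
    for v in product(range(M), repeat=d):
        shells.setdefault(sum(x * x for x in v), []).append(v)
    best = max(shells.values(), key=len)           # the most populated radius
    base = 2 * M
    return {sum(x * base ** i for i, x in enumerate(v)) for v in best}

d, M = 3, 10
A, N = behrend_set(d, M), (2 * M) ** d
# sanity check: no nontrivial length-3 arithmetic progression inside A
assert not any(a != b and (a + b) % 2 == 0 and (a + b) // 2 in A for a in A for b in A)
print(len(A), "progression-free elements below", N)
```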

In the finite vector space case, there is no analog of Behrend’s construction, and so the size of, say, the largest subset of {{\mathbb F}_3^n} without three points on a line was completely open, with an upper bound of the order of {3^n/n} and lower bounds of the order of {c^n} for some constant {c<3}. The cap problem was the question of whether the right bound is of the form {3^{(1-o(1))n}} or not.

Two weeks ago, Croot, Lev and Pach proved that if {A} is a subset of {({\mathbb Z}/4{\mathbb Z})^n} without length-3 arithmetic progressions, then {|A|} is at most of the order of {4^{.926 \cdot n}}. This was a strong indication that the right answer to the cap problem should be of the form {c^n} for some {c<3}, that is, exponentially smaller than {3^n}.

This was done a couple of days ago by Ellenberg, who proved that an upper bound of the form {(2.756)^n} holds in {{\mathbb F}_3^n}. The proof is not specific to {{\mathbb F}_3} and generalizes to all finite fields.

Both proofs use the polynomial method. Roughly speaking, the method is to associate a polynomial to a set of interest (for example, by finding a non-zero low-degree polynomial that is zero for all points in the set), and then to proceed with the use of simple properties of polynomials (such as the fact that the space of polynomials of a certain degree has a bounded dimension, or that the number of zeroes of a univariate non-zero polynomial is at most its degree) applied either to the polynomial that we constructed or to the terms of its factorization.

Let {P_d} be the vector space of {n}-variate polynomials over {{\mathbb F}_3} of total degree at most {d} that are cube-free (that is, such that every variable occurs in each monomial with degree 0, 1, or 2), and let {m_d} be its dimension.
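As an aside, the “find a vanishing polynomial” step of the polynomial method described above is plain linear algebra: if {|A| < m_d}, the evaluation map from {P_d} to {{\mathbb F}_3^A} has a nontrivial kernel. Here is a small self-contained sketch (toy parameters and function names of my own; this is only the generic first step of the method, not Ellenberg’s argument):

```python
# Toy illustration of the first step of the polynomial method: given A with
# |A| < m_d, find a nonzero cube-free polynomial of degree <= d over F_3 that
# vanishes on all of A, via a null vector of the evaluation matrix.
# (Hand-rolled linear algebra mod 3; the points below are arbitrary.)
from itertools import product

def cube_free_monomials(n, d):
    """Exponent vectors in {0,1,2}^n of total degree at most d."""
    return [e for e in product(range(3), repeat=n) if sum(e) <= d]

def evaluate_monomial(point, exponents, p=3):
    r = 1
    for a, e in zip(point, exponents):
        r = r * pow(a, e, p) % p
    return r

def null_vector_mod_p(rows, ncols, p):
    """A nonzero v with rows . v = 0 (mod p); assumes fewer rows than columns."""
    M, pivots, r = [row[:] for row in rows], [], 0
    for c in range(ncols):
        pivot_row = next((i for i in range(r, len(M)) if M[i][c] % p), None)
        if pivot_row is None:
            continue
        M[r], M[pivot_row] = M[pivot_row], M[r]
        inv = pow(M[r][c], -1, p)
        M[r] = [x * inv % p for x in M[r]]
        for i in range(len(M)):
            if i != r and M[i][c] % p:
                M[i] = [(x - M[i][c] * y) % p for x, y in zip(M[i], M[r])]
        pivots.append(c)
        r += 1
    free = next(c for c in range(ncols) if c not in pivots)   # exists: rank < ncols
    v = [0] * ncols
    v[free] = 1
    for i, c in enumerate(pivots):
        v[c] = (-M[i][free]) % p
    return v

def vanishing_polynomial(A, n, d, p=3):
    mons = cube_free_monomials(n, d)
    rows = [[evaluate_monomial(a, e, p) for e in mons] for a in A]
    return null_vector_mod_p(rows, len(mons), p), mons

# 6 points in F_3^3 versus a 10-dimensional space (cube-free, degree <= 2)
A = [(0, 0, 0), (1, 1, 1), (2, 2, 2), (0, 1, 2), (2, 1, 0), (1, 0, 2)]
coeffs, mons = vanishing_polynomial(A, n=3, d=2)
assert any(coeffs) and all(
    sum(c * evaluate_monomial(a, e) for c, e in zip(coeffs, mons)) % 3 == 0 for a in A)
```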

If {A\subseteq {\mathbb F}_3^n} is a set such that there are no distinct {a,b,c} such that {a+b+c=0} (for distinct points of {{\mathbb F}_3^n}, this is the same as not having three points on a line), then Ellenberg shows that

\displaystyle  m_d - 3^n + |A| \leq 3 m_{d/2}

and the bound follows, taking {d = 4n/3}, from computing that {m_{4n/3} \geq 3^n - c^n} and {m_{2n/3} \leq c^n} for {c \approx 2.756\cdots}.
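Where the constant comes from: {m_{2n/3}} counts exponent vectors in {\{0,1,2\}^n} with coordinate sum at most {2n/3}, and a standard large-deviation computation gives growth rate about {2.755}. A quick numerical check (my own back-of-the-envelope script, not part of the proof):

```python
# Counting cube-free monomials of degree at most d by dynamic programming,
# and watching m_{2n/3}^{1/n} approach the constant c above (about 2.755).
import math

def m(n, d):
    """Number of exponent vectors in {0,1,2}^n with coordinate sum at most d."""
    counts = [1]                 # counts[s] = vectors of the variables so far with sum s
    for _ in range(n):
        new = [0] * (len(counts) + 2)
        for s, c in enumerate(counts):
            for e in (0, 1, 2):
                new[s + e] += c
        counts = new
    return sum(counts[: d + 1])

for n in (30, 90, 270, 810):
    print(n, math.exp(math.log(m(n, 2 * n // 3)) / n))
```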

The finite field Kakeya problem is another example of a problem that had resisted attacks from powerful Fourier-analytic proofs, and was solved by Zeev Dvir with a relatively simple application of the polynomial method. One may hope that the method has not yet exhausted its applicability.

Gil Kalai has posted about further consequences of the results of Croot, Lev, Pach and Ellenberg.

The Triangle Removal Lemma

[At the end of a survey paper on additive combinatorics and computational complexity which is to appear in SIGACT News, I list three major open questions in additive combinatorics which might be amenable to a “computer science proof.” They are all questions that have been extremely well studied, by very smart people, for the past several years, so they are all very long shots. I don’t recommend that anybody start working on them, but I think it is good that as many people as possible know about these questions, because when the right technique comes along its applicability can be more quickly realized.]

The first question is to improve the Triangle Removal Lemma. I have talked here about what the triangle removal lemma is, how one can prove it from the Szemerédi Regularity Lemma, and how it implies the length-3 case of Szemerédi’s Theorem.

As a short recap, the Triangle Removal Lemma states that if {G} is an {n}-vertex graph with {o(n^3)} triangles, then there is a set of {o(n^2)} edges such that the removal of those edges eliminates all the triangles. Equivalently, it says that if a graph has {\Omega(n^2)} triangles which are all pair-wise edge-disjoint, then there must be {\Omega(n^3)} triangles overall.

The connection with Szemerédi’s Theorem is that if {H} is an abelian group with {n} elements, and {A} is a subset of {H} with no length-3 arithmetic progressions (i.e., {A} is such that there are no three distinct elements {a,b,c} in {A} such that {b-a = c-b}), then we can construct a graph {G=(V,E)} that has {3n} vertices, {|A| \cdot n} pair-wise edge-disjoint triangles, and no other triangles. This contradicts the triangle removal lemma if {|A| = \Omega(n)}, and so we must have {|A| = o(n)}.
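Here is a minimal sketch of that construction, for the special case {H = {\mathbb Z}/N{\mathbb Z}} with {N} odd (so that 2 is invertible); the small progression-free set {A} below is just an arbitrary example:

```python
# Ruzsa-Szemeredi-style construction, sketched: vertex classes X, Y, Z are copies
# of Z/NZ; a triangle (x, y, z) forces y-x, (z-x)/2, z-y to be a 3-term AP inside
# A, so if A is progression-free the only triangles are the |A|*N "trivial"
# (and pairwise edge-disjoint) ones.
from itertools import product

def triangles(N, A):
    """All triangles (x, y, z) of the tripartite graph built from A in Z/NZ."""
    inv2 = pow(2, -1, N)                     # 2 is invertible since N is odd
    A = {a % N for a in A}
    return [(x, y, z) for x, y, z in product(range(N), repeat=3)
            if (y - x) % N in A and (z - y) % N in A and ((z - x) * inv2) % N in A]

N, A = 11, {1, 3, 8}                         # a small progression-free set mod 11
tri = triangles(N, A)
print(len(tri), len(A) * N)                  # both 33: only the trivial triangles survive
```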

This is great, until we start looking at the relationships between the constants hidden by the {o(\cdot )} notation. Quantitatively, the Triangle Removal Lemma states that for every {\epsilon} there is a {\delta = \delta(\epsilon)} such that if a graph has at least {\epsilon \cdot n^2} pair-wise edge-disjoint triangles, then it has at least {\delta \cdot n^3} triangles. The only known proof, however, has {\delta} incredibly small: {1/\delta} grows like a tower of exponentials of height polynomial in {1/\epsilon}. The proof uses the Szemerédi Regularity Lemma, and the Regularity Lemma is known to require such very bad dependencies.

63 years ago, Behrend showed that {{\mathbb Z}/N{\mathbb Z}}, {N} prime, has a subset {A} that contains no length-3 arithmetic progression and whose size is {N/2^{O(\sqrt {\log N})}}. (Last year, Elkin gave the first improvement in 62 years to Behrend’s bound, but the improvement is only a multiplicative polylog {N} factor.) Combined with the graph construction mentioned above, this gives a graph with {3N} vertices, {N^2/2^{O(\sqrt {\log N})}} edge-disjoint triangles, and no other triangles. Thus, the graph has {\leq \delta N^3} triangles where {\delta < 1/N}, but one needs to remove {> \epsilon N^2} edges to make it triangle-free, where {\epsilon > 2^{-O(\sqrt{\log N})}}. This shows that, in the Triangle Removal Lemma, {1/\delta} must grow super-polynomially in {1/\epsilon}, and must be at least {(1/\epsilon)^{\Omega(\log (1/\epsilon))}}.
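Spelling out the last computation (a back-of-the-envelope check, writing {c} for the constant in the exponent of the construction above): from {\epsilon = 2^{-c\sqrt{\log N}}} we get {\log (1/\epsilon) = c\sqrt{\log N}}, hence {N = 2^{(\log (1/\epsilon))^2/c^2}}, and therefore

\displaystyle \frac 1 \delta \geq N \cdot 2^{c\sqrt{\log N}} = 2^{(\log (1/\epsilon))^2/c^2 + \log (1/\epsilon)} = \left( \frac 1 \epsilon \right)^{\Omega(\log (1/\epsilon))}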

The question is to shorten the gap between the tower-of-exponential relationship between {1/\delta} and {1/\epsilon} coming from the proof via the Szemerédi Regularity Lemma and the mildly super-polynomial lower bound coming from the above argument.

Continue reading

Earliest Connections of Additive Combinatorics and Computer Science

I am writing a short survey on connections between additive combinatorics and computer science for SIGACT News and I have been wondering about the “history” of the connections. (I will be writing as little as possible about history in the SIGACT article, because I don’t have the time to research it carefully, but if readers help me out, I would write about it in a longer article that I plan to write sometime next year.)

Now, of course, there is always that Russian paper from the 1950s . . .

Then there is the Coppersmith-Winograd matrix multiplication algorithm, which uses Behrend’s construction of large sets of integers with no length-3 arithmetic progressions. (They don’t need the full strength of Behrend’s construction, and the weaker construction of Salem and Spencer suffices.) As far as I know, this was an isolated application.

My understanding (which is quite possibly mistaken) is that there are three main threads to the current interactions: one concerning the Szemeredi Regularity Lemma and the “triangle removal lemma” of Ruzsa and Szemeredi; one concerning “sum-product theorems” and the Kakeya problem; and one concerning the Gowers uniformity norms. The earliest connection is the first one, and, depending on how one is counting, started in 1992 or 2001, and is due, respectively, to Alon et al. or just to Alon.
Continue reading

Applications of Low-Complexity Approximations

In the last post, I stated the following generalization of the weak regularity lemma:

Theorem (Low Complexity Approximation, TTV Version) Let (X,\mu) be a probability space, g:X \rightarrow [-1,1] a bounded function, F a collection of bounded functions f:X \rightarrow [ -1,1], and \epsilon an approximation parameter.

Then there is a function h: X \rightarrow [-1,1] such that

  • h has low complexity relative to F: there are k= O( \epsilon^{-2}) functions f_i\in F and coefficients c_i such that

    \displaystyle h(x) := \max \{ -1 , \min \{ 1, \sum_{i=1}^k c_i f_i (x)  \}\}

  • h is \epsilon-indistinguishable from g by F, that is,

    \forall f \in F \ \ \left| {\mathbb E}_{x \sim \mu} f(x) g(x) - {\mathbb E}_{x \sim \mu} f(x) h(x) \right|  \leq \epsilon

(Last time, I mentioned that our proof handled only boolean functions f; now we can handle arbitrary bounded functions, with an “energy-decrease” style proof. This will appear in the next online revision of the paper.)

This seems to be a useful tool, limited only by one’s creativity in choosing the functions F and then making use of the properties of h.

As already discussed,

  • if one takes X to be the edges of a complete graph, and F the set of indicator variables of cuts, then the existence of h gives the weak regularity lemma of Frieze and Kannan; and
  • if one takes F to be the set of circuits of size at most S(n), and normalizes g and h to be probability distributions, one gets that for every probability distribution D of high entropy there is a (non-uniformly) efficiently samplable and computable distribution M that is indistinguishable from D by circuits of size \leq S(n).

In this post I’ll show how to use it to give short proofs of the Hardcore Lemma of Impagliazzo and the Dense Model Theorem of Green, Tao and Ziegler. Both proofs also have, at least in hindsight, a sense of “inevitability,” meaning that given the Low-Complexity Approximation Theorem, and given what we want to prove, both proofs get to the point in a most economical and natural way.

  1. The Impagliazzo Hardcore Lemma. We have already mentioned that if g is “hard-on-average” for F, then h cannot be an approximation in the sense of being close to g on most inputs. What, then, about the points on which g and h differ? They form an Impagliazzo Hardcore Set for g, as described next.

    Let g: X \rightarrow \{-1,1\} be a function that is weakly hard on average for a class of algorithms F. Suppose, specifically, that for every algorithm h of complexity O(\epsilon^{-2}\delta^{-2}) relative to F we have

    {\mathbb P} [ g(x) = h(x) ] \leq 1-\delta

    and, more generally, for fractional h, we have

    (1) \ \ \ {\mathbb E} | g(x) - h(x) | \geq 2\delta

    Then, construct an approximating function h of complexity
    \leq \epsilon^{-2}\delta^{-2} relative to F and such that h and g are \epsilon\delta-indistinguishable by F. Note that, even though h is “indistinguishable” from g, it is also “far” from g, as in (1).

    Define a probability distribution H that assigns to point x a probability proportional to |g(x)-h(x)|. (If h were boolean, this would be the uniform distribution over points on which h and g differ.) Then this distribution is \delta-dense in the uniform distribution, meaning that every point has probability at most (\delta |X|)^{-1}. Observe also that we have

    g(x)\cdot |g(x)-h(x)| = g(x)-h(x)

    for every x, because g(x) and g(x)-h(x) have the same sign and |g(x)|=1, so we have

    \begin{array}{ll} \displaystyle {\mathbb E}_H g(x)f(x) & \displaystyle = \sum_x \frac {|g(x)-h(x)|}{\sum_y |g(y)-h(y)|} \cdot g(x)f(x) \\ & \displaystyle = \frac {|X|}{\sum_y |g(y)-h(y)|} \cdot {\mathbb E}_X |g(x)-h(x)|g(x)f(x) \\ & \displaystyle \leq \frac {1}{\delta} {\mathbb E}_X  (g(x)-h(x))f(x)  \\ & \displaystyle \leq \epsilon \end{array}

    and so H is a hardcore distribution, because the above expression is equivalent to

    {\mathbb P}_H [ g(x) = f(x) ] \leq \frac 12 + \frac \epsilon 2

  2. The Dense Model Theorem. Suppose that R\subseteq X is a pseudorandom set with respect to functions that have bounded complexity relative to F, and let D\subseteq R be a dense subset of R, |D| = \delta |R|.

    To find a dense model of D, we take g to be the characteristic function of D, and we let h be the low-complexity approximation, but using the uniform distribution on R as \mu. Now suppose for simplicity that h is boolean, and that M is the set of inputs of X on which h is 1. We want to argue that M is a dense model of D. By assuming without loss of generality that F contains the all-one function, we get from the indistinguishability of g and h that

    \delta = {\mathbb E}_R g \approx {\mathbb E}_R h

    and from the pseudorandomness of R we have

    {\mathbb E}_R h \approx {\mathbb E}_X h = |M|/|X|

    and so |M| \approx \delta |X| and M is indeed dense.

    For the indistinguishability of M and D, take any function f, and observe that

    {\mathbb E}_D f = \delta^{-1} {\mathbb E}_R fg \approx \delta^{-1} {\mathbb E}_R fh \approx \delta^{-1} {\mathbb E}_X fh \approx {\mathbb E}_M f

    where we use both the indistinguishability of g and h under distribution R, and the fact that the distributions R and X are indistinguishable by functions of bounded complexity.

    This proof is appealingly intuitive, in the sense that if we expect D to be indistinguishable from a large set, then when we try to approximate the characteristic function of D we will end up with a low complexity function that is spread around much of X, thus defining a dense model. It also shows that “relative” versions of the Regularity Lemma, such as the Regularity Lemma for subgraphs of an expander, may be derived from the Regularity Lemma itself by the above argument. A disadvantage of the argument is that it does not establish the stronger form of the Dense Model Theorem suggested by Impagliazzo, in which there is no set R, but we require D to have the “pseudo-density” requirement that for every low-complexity bounded function f,

    {\mathbb E}_{x\in D} |f(x)| \leq \frac 1 {\delta} {\mathbb E}_{X} |f(x)| + \epsilon

    which follows immediately if D has density \delta in a pseudorandom set R, but which is a seemingly weaker property. (The relative Regularity Lemma in graphs had long been known to hold under such a pseudo-density assumption.)

Boosting and the Regularity Lemma

In a previous post, I described abstract forms of the weak regularity lemma, in which we start from an arbitrary bounded function g: X \rightarrow [-1,1] and an arbitrary set F of “structured” bounded functions f: X \rightarrow [-1,1], and we construct an “approximating” function h that has “low complexity” relative to F and is “indistinguishable” from g by functions from F. We had two proofs, both generalizations of arguments due to Frieze and Kannan: in one proof h is not bounded, but its complexity is O(\epsilon^{-2}), where \epsilon is the approximation parameter; in the second proof, h is bounded but it has exponential complexity exp(O(\epsilon^{-2})) in the approximation parameter.

In a new paper by Madhur Tulsiani, Salil Vadhan, and me, we give a new proof that gives an approximating function that, at the same time, is bounded and has low complexity. This has a number of applications, which I will describe in a future post.

(Note that, in the statement below, F is required to contain Boolean functions, rather than bounded ones. It is possible, however, to “reduce” the bounded case to the boolean case via the following observation: if F is a family of bounded functions, F' is the family of boolean functions \{ 1_{f(x) \geq t} \ : \ f\in F, t\in [-1,1] \}, and g and h are \epsilon-indistinguishable according to F', then they are also \epsilon-indistinguishable according to F.)

Theorem (Low Complexity Approximation, TTV Version) Let (X,\mu) be a probability space, g:X \rightarrow [-1,1] a bounded function, F a collection of boolean functions f:X \rightarrow \{ -1,1\}, and \epsilon an approximation parameter.

Then there is a function h: X \rightarrow [-1,1] such that

  • h has low complexity relative to F: there are k= O( \epsilon^{-2}) functions f_i\in F and coefficients c_i such that

    \displaystyle h(x) := \max \{ -1 , \min \{ 1, \sum_{i=1}^k c_i f_i (x)  \}\}

  • h is \epsilon-indistinguishable from g by F, that is,

    \forall f \in F \ \ \left| {\mathbb E}_{x \sim \mu} f(x) g(x) - {\mathbb E}_{x \sim \mu} f(x) h(x) \right|  \leq \epsilon

That is, h is simply a linear combination of functions from F, whose value is truncated to be between -1 and 1.

Recall that, when we do not require h to be bounded, h can be constructed via the following algorithm:

Algorithm FK

  • h_0 := 0; t= 0
  • while \exists f_{t+1} \in F such that {\mathbb E} f_{t+1} \cdot (g-h_t) \geq \epsilon
    • h_{t+1} := \sum_{i=1}^{t+1} \gamma f_{i}
    • t:=t+1
  • return h_t

And the analysis shows that, if we call \Delta_t := g - h_t, then {\mathbb E} \Delta_t^2 - \Delta_{t+1}^2 \geq  2\epsilon\gamma - \gamma^2. Setting \gamma := \epsilon shows that the algorithm stops within \epsilon^{-2} steps.
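For completeness, the one-line calculation behind this claim, using {\Delta_{t+1} = \Delta_t - \gamma f_{t+1}}, the condition of the while loop, and {f_{t+1}^2 \leq 1}:

\displaystyle {\mathbb E} [ \Delta_t^2 - \Delta_{t+1}^2 ] = {\mathbb E} [ 2\gamma f_{t+1} \Delta_t - \gamma^2 f_{t+1}^2 ] \geq 2\epsilon\gamma - \gamma^2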

Our proof proceeds by doing exactly the same, but making sure at every step that h is bounded.

Algorithm TTV

  • h_0 := 0; t= 0
  • while \exists f_{t+1} \in F such that {\mathbb E} f_{t+1} \cdot (g-h_t) \geq \epsilon
    • h_{t+1} := \max \{ -1, \min \{ 1, \sum_{i=1}^{t+1} \gamma f_i \} \}
    • t:=t+1
  • return h_t
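For concreteness, here is a small runnable sketch of Algorithm TTV in a toy setting ({X} finite, {\mu} uniform, {F} a finite list of functions closed under negation); with truncate=False it degenerates to Algorithm FK. The function name and the particular {g} and {F} at the bottom are mine, purely for illustration:

```python
# A toy implementation of Algorithm TTV: greedily add gamma * f for the most
# correlated test f, truncating to [-1,1] after each step, until no test has
# correlation >= eps with g - h_t.
import numpy as np

def low_complexity_approx(g, F, eps, truncate=True, max_steps=10_000):
    """g: array over X; F: list of +/-1-valued arrays over X (closed under negation)."""
    gamma = eps
    lincomb = np.zeros_like(g)
    h = np.zeros_like(g)
    chosen = []
    for _ in range(max_steps):
        corr = [np.mean(f * (g - h)) for f in F]      # E_x f(x) (g(x) - h_t(x))
        i = int(np.argmax(corr))
        if corr[i] < eps:                             # no test correlates: stop
            break
        chosen.append(i)
        lincomb = lincomb + gamma * F[i]
        h = np.clip(lincomb, -1, 1) if truncate else lincomb
    return h, chosen

# toy usage: X = {0,1}^8, F = the +/- coordinate ("dictator") functions,
# g = a noisy bounded combination of three of them
rng = np.random.default_rng(1)
n = 8
bits = np.array([[(x >> j) & 1 for j in range(n)] for x in range(2 ** n)])
dictators = [1.0 - 2.0 * bits[:, j] for j in range(n)]
F = dictators + [-f for f in dictators]               # closed under negation
g = np.clip(0.5 * (dictators[0] + dictators[1] + dictators[2])
            + 0.3 * rng.standard_normal(2 ** n), -1, 1)
h, chosen = low_complexity_approx(g, F, eps=0.1)
print(len(chosen), max(abs(np.mean(f * (g - h))) for f in F))  # <= 8/eps^2 steps, error < eps
```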

The problem with the old analysis is that now we could conceivably have a step in which \sum_{i=1}^t \gamma f_i (x) > 1+\gamma for all x, and so h_t = h_{t+1}, and thus \Delta_t = \Delta_{t+1}, and we have no energy decrease. To get to such a state, however, there must have been about 1/\gamma steps in which h changed in the same way as in the Frieze-Kannan case. I think there should be a sort of “amortized analysis” that would “charge” the lack of change of h at certain steps to other steps in which it indeed changes, and establish that, for every time step T,

{\mathbb E} \Delta_0^2 - \Delta_T^2 \geq \Omega(T \cdot \gamma \cdot (\epsilon - O(\gamma)))

Unfortunately I don’t know how to construct such a proof. Our proof, instead, follows step by step Impagliazzo’s proof of the Impagliazzo Hardcore Set Lemma, which employs a very similar algorithm (in a quite different context). As shown by Klivans and Servedio, the algorithm in Impagliazzo’s proof can be seen as a boosting algorithm in learning theory (and every other known boosting algorithm can be used to prove Impagliazzo’s Hardcore Lemma); this is why we think of our proof as a “boosting” proof.

I must confess, I have never really understood Impagliazzo’s proof, and so I can’t say I really understand ours, other than being able to reproduce the steps.

The idea is to consider the quantity f_{t+1}(x) \Delta_{t}(x), which depends on t and x, and to see how it behaves summed over t, for a fixed x, and how it behaves for a fixed t, summed over x.

Suppose the algorithm is still running after T steps. Then, because of the way we define the termination condition, we have, for every t \in \{ 0,\ldots, T-1\}

\displaystyle (1) \ \ \ {\mathbb E}_x f_{t+1}(x)  (g(x)-h_t(x)) \geq  \epsilon

and the crux of the proof is to show that for every x we have

(2) \displaystyle \ \ \ \sum_{t=0}^{T-1} f_{t+1}(x)  (g(x)-h_t(x)) \leq \frac \gamma 2 \cdot T  + \frac 4 {\gamma}

So if we sum (1) over t and average (2) over x, we get

\displaystyle  \epsilon T \leq  \frac  \gamma 2 \cdot T  + \frac 4 {\gamma}

and setting \gamma := \epsilon gives T \leq 8 \epsilon^{-2}.

Inequality (2), the main step of the proof, is the part for which I have little intuition. One breaks the summation into groups of time steps, depending on the value of \Delta_t; there are 2/\gamma groups (because the value of h_t changes by discrete increments of \gamma, and is between -1 and +1) and each one is shown to contribute O(\gamma T') + O(1), where T' is the number of time steps in the group.

It is perhaps instructive to translate the Frieze-Kannan proof to this set-up. In the Frieze-Kannan algorithm, we have

\displaystyle  f_{t+1}(x)  (g(x)-h_t(x))  = \frac 1 {2\gamma} \cdot \left( \Delta_{t}^2 - \Delta^2_{t+1} +\gamma^2 f_{t+1}^2 \right)

and so

\begin{array}{ll} \displaystyle  \sum_{t=0}^{T-1} f_{t+1}(x)  (g(x)-h_t(x)) & = \displaystyle \frac 1{2\gamma} (\Delta_0^2 - \Delta_{T}^2) + \frac {\gamma}2 \cdot \sum_{t=1}^{T} f_t^2(x) \\  & \leq \displaystyle  \frac 1 {2\gamma}  + \frac \gamma 2 \cdot  T \end{array}

which is analogous to our inequality (2).

Decomposition Results and Regularity Lemmas

Several results in additive combinatorics have the flavor of decomposition results in which one shows that an arbitrary combinatorial object can be approximated by a “sum” of a “low complexity” part and a “pseudorandom” part. This is helpful because it reduces the task of proving that an arbitrary object has a certain property to the easier task of proving that low-complexity objects and pseudorandom objects have the property. (Provided that the property is preserved by the approximation and the sum.)

Many examples of such results are given in the paper accompanying Terry Tao’s tutorial at FOCS’07.

Usually, one needs a strong notion of approximation, and in such cases the complexity of the “low complexity” part is fantastically large (a tower of exponentials in the approximation parameter), as in the Szemeredi Regularity Lemma, which can be seen as a prototypical such “decomposition” result. In some cases, however, weaker notions of approximation are still useful, and one can have more reasonable complexity parameters, as in the Weak Regularity Lemma of Frieze and Kannan.

A toy example of a “weak decomposition result” is the following observation: Continue reading

Ergodic Theory and Additive Combinatorics

As mentioned before, the MSRI is devoting this semester to connections between additive combinatorics and ergodic theory.

One of the generally exciting things about additive combinatorics is the confluence of techniques from analysis, combinatorics, and ergodic theory. One of the specifically exciting things to me has been the applicability of the analytic and the combinatorial techniques to computer science. It makes sense that we should make some effort to understand the ergodic-theoretic techniques too, just in case.

So what is ergodic theory? Continue reading

The Inverse Conjecture for the Gowers Norms

This semester, the MSRI is having a special semester devoted to additive combinatorics and ergodic theory.

Additive combinatorics is the branch of extremal combinatorics where the objects of interest are sets of integers and, more generally, subsets of abelian groups (rather than graphs or set systems) and where the properties of interest are formulated in terms of linear equations (rather than in terms of cuts, partitions, intersections, and so on). Lately, it has seen a fascinating convergence of methods from “classical” combinatorics, from analysis, and from ergodic theory. I have often written about it here because the combinatorial and analytical techniques of additive combinatorics have been useful in a number of different computer science applications (related to probabilistically checkable proof constructions, property testing, and pseudorandomness), and computer scientists are becoming more and more interested in the subject, contributing their own perspective to it.

In all this, the exact role of ergodic theory (and the possible applications of ergodic-theoretic techniques in complexity theory) has been a bit of a mystery to me, and perhaps to other computer scientists too. Very nicely, the MSRI special program started this week with a series of tutorials to introduce the connections between ergodic theory and additive combinatorics.

All talks are (or will soon be) online, and the talks by Terry Tao are especially recommended, because he explains, using very simple examples, how one goes about converting concrete, finitary and quantitative statements into equivalent abstract infinitary statements, which are in turn amenable to ergodic-theoretic techniques.

Today, Tamar Ziegler discussed the very recent proof, by Vitaly Bergelson, Terry Tao, and herself, of the inverse conjecture for Gowers norms in finite fields, a proof that uses ergodic-theoretic techniques.

But, you may object, didn’t we know that the inverse conjecture for Gowers norms is false? Well, the counterexample of Lovett, Meshulam, and Samorodnitsky (independently discovered by Green and Tao) refutes the following conjecture: “Suppose f: {\mathbb F}_p^n \rightarrow {\mathbb C} is a bounded function such that || f||_{U^k} \geq \delta; then there is an \epsilon = \epsilon(\delta,p,k) independent of n and an n-variate degree-(k-1) polynomial q over {\mathbb F}_p such that f(x) and e^{-2\pi i q(x)/p} have correlation at least \epsilon.”

This is refuted in the p=2,k=4 case by taking f(x_1,\ldots,x_n) = (-1)^{S_4(x_1,\ldots,x_n)}, where S_4 is the elementary symmetric polynomial of degree 4. While f has o(1) (indeed, exponentially small) correlation with all functions of the form (-1)^{q(x_1,\ldots,x_n)}, where q is a degree-3 polynomial, the norm || f||_{U^4} is bounded below by an absolute constant.
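As an aside, for very small n the Gowers norm can be computed by brute force from its recursive definition, || f||_{U^1}^2 = ({\mathbb E} f)^2 and || f||_{U^k}^{2^k} = {\mathbb E}_h || f \cdot f(\cdot + h) ||_{U^{k-1}}^{2^{k-1}}. A rough numerical sketch (exponential time, so tiny n only; it merely illustrates the definition, since the statement above is asymptotic in n):

```python
# ||f||_{U^k} computed recursively for f = (-1)^{S_4} on F_2^n; n = 5 already
# takes several seconds with this brute-force approach.
import numpy as np
from itertools import combinations

def gowers_norm(f, k):
    """||f||_{U^k} for a real-valued f given as a length-2^n array indexed by x in F_2^n."""
    x = np.arange(len(f))
    def u(f, k):                                   # returns ||f||_{U^k}^{2^k}
        if k == 1:
            return f.mean() ** 2
        return np.mean([u(f * f[x ^ h], k - 1) for h in range(len(f))])
    return u(f, k) ** (1.0 / 2 ** k)

def minus_one_to_S4(n):
    """f(x) = (-1)^{S_4(x)}, S_4 the degree-4 elementary symmetric polynomial over F_2."""
    bits = (np.arange(2 ** n)[:, None] >> np.arange(n)) & 1
    s4 = sum(bits[:, list(c)].prod(axis=1) for c in combinations(range(n), 4))
    return 1.0 - 2.0 * (s4 % 2)

for n in (4, 5):
    print(n, gowers_norm(minus_one_to_S4(n), 4))
```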

The statement proved by Bergelson, Tao, and Ziegler is “Suppose f: {\mathbb F}_p^n \rightarrow {\mathbb C} is a bounded function such that || f||_{U^k} \geq \delta; then there is an \epsilon = \epsilon(\delta,p,k) independent of n and a bounded n-variate degree-(k-1) ‘polynomial function’ Q: {\mathbb F}_p^n \rightarrow {\mathbb C} such that f(x) and Q(x) have correlation at least \epsilon.”

What is, then, a ‘polynomial function’ of degree d? It is a function bounded by 1 in absolute value, and such that for every d+1 directions h_1,\ldots, h_{d+1}, if one takes d+1 Gowers derivatives in such directions one always gets the constant-1 function. In other words, Q is a ‘polynomial function’ of degree k-1 if |Q(x)| \leq 1 for every x, and one has || Q||_{U^k}=1. Interestingly, these functions are a proper superclass of the functions of the form e^{2\pi i q(x)/p} with q being a polynomial over {\mathbb F}_p.

In the concrete p=2 case, one may construct such a function by letting q be a polynomial over, say, the ring {\mathbb Z}_8, and then having Q(x) = \omega^{q(x)}, where \omega is a primitive eighth root of unity. Indeed, this is the type of degree-3 polynomial function that is correlated with (-1)^{S_4(x)}.

[Apologies for not defining all the technical terms and the context; the reader can find some background in this post and following the links there.]

What is, then, ergodic theory, and what does it have to do with finitary combinatorial problems? I am certainly the wrong person to ask, but I shall try to explain the little that I have understood in the next post(s).

Dense Subsets of Pseudorandom Sets: The Paper(s)

Green, Tao and Ziegler, in their works on patterns in the primes, prove a general result of the following form: if X is a set, R is a, possibly very sparse, “pseudorandom” subset of X, and D is a dense subset of R, then D may be “modeled” by a large set M which has the same density in X as the density of D in R.

They use this result with X being the integers in a large interval \{ 1,\ldots, N\}, R being the “almost-primes” in X (integers with no small factor), and D being the primes in X. Since the almost-primes can be proved to be “pseudorandom” in a fairly strong sense, and since the density of the primes in the almost-primes is at least an absolute constant, it follows that the primes are “indistinguishable” from a large set M containing a constant fraction of all integers. Since such large sets are known to contain arbitrarily long arithmetic progressions, as proved by Szemeredi, Green and Tao are able to prove that the primes too must contain arbitrarily long arithmetic progressions. Such large sets are also known to contain arbitrarily long “polynomial progressions,” as proved by Bergelson and Leibman, and this allows Tao and Ziegler to argue that the primes too must contain arbitrarily long polynomial progressions.

(The above account is not completely accurate, but it is not lying too much.)

As announced last October here, and here, Omer Reingold, Madhur Tulsiani, Salil Vadhan and I found a new proof of this “dense model” theorem, which uses the min-max theorem of game theory (or, depending on the language that you prefer to use, the duality of linear programming or the Hahn-Banach theorem) and was inspired by Nisan’s proof of the Impagliazzo hard-core set theorem. In complexity-theoretic applications of the theorem, our reduction has polynomial complexity, while the previous work incurred an exponential loss.

As discussed here and here, we also show how to use the Green-Tao-Ziegler techniques of “iterative partitioning” to give a different proof of Impagliazzo’s theorem.

After long procrastination, we recently wrote up a paper about these results.

In the Fall, we received some feedback from additive combinatorialists that while our proof of the Green-Tao-Ziegler result was technically simpler than the original one, the language we used was hard to follow. (That’s easy to believe, because it took us a while to understand the language in which the original proof was written.) We then wrote an expository note of the proof in the analyst’s language. When we were about to release the paper and the note, we were contacted by Tim Gowers, who, last Summer, had independently discovered a proof of the Green-Tao-Ziegler results via the Hahn-Banach theorem, essentially with the same argument. (He also found other applications of the technique in additive combinatorics. The issue of polynomial complexity, which does not arise in his applications, is not considered.)

Gowers announced his results in April at a talk at the Fields institute in Toronto. (Audio and slides are available online.)

Gowers’ paper already contains the proof presented in the “analyst language,” making our expository note not so useful any more; we have still posted it anyway because, by explaining how one translates from one notation to the other, it can be a short starting point for the computer scientist who is interested in trying to read Gowers’ paper, or for the combinatorialist who is interested in trying to read our paper.

Dense Subsets of Pseudorandom Sets

The Green-Tao theorem states that the primes contain arbitrarily long arithmetic progressions; its proof can be, somewhat inaccurately, broken up into the following two steps:

Thm 1: Every constant-density subset of a pseudorandom set of integers contains arbitrarily long arithmetic progressions.

Thm 2: The primes have constant density inside a pseudorandom set.

Of those, the main contribution of the paper is the first theorem, a “relative” version of Szemeredi’s theorem. In turn, its proof can be (even more inaccurately) broken up as

Thm 1.1: For every constant density subset D of a pseudorandom set there is a “model” set M that has constant density among the integers and is indistinguishable from D.

Thm 1.2 (Szemeredi) Every constant density subset of the integers contains arbitrarily long arithmetic progressions, and many of them.

Thm 1.3 A set with many long arithmetic progressions cannot be indistinguishable from a set with none.

Following this scheme is, of course, easier said than done. One wants to work with a definition of pseudorandomness that is weak enough that (2) is provable, but strong enough that the notion of indistinguishability implied by (1.1) is in turn strong enough that (1.3) holds. From now on I will focus on (1.1), which is a key step in the proof, though not the hardest.

Recently, Tao and Ziegler proved that the primes contain arbitrarily long “polynomial progressions” (progressions where the increments are given by polynomials rather than linear functions, as in the case of arithmetic progressions). Their paper contains a very clean formulation of (1.1), which I will now (accurately, this time) describe. (It is Theorem 7.1 in the paper. The language I use below is very different but equivalent.)

We fix a finite universe \Sigma; this could be \{ 0,1\}^n in complexity-theoretic applications or {\mathbb Z}/N{\mathbb Z} in number-theoretic applications. Instead of working with subsets of \Sigma, it will be more convenient to refer to probability distributions over \Sigma; if S is a set, then U_S is the uniform distribution over S. We also fix a family F of “easy” functions f: \Sigma \rightarrow [0,1]. In complexity-theoretic applications, this could be the set of boolean functions computed by circuits of bounded size. We think of two distributions X,Y as being \epsilon-indistinguishable according to F if for every function f\in F we have

| E [f(X)] - E[f(Y)] | \leq \epsilon

and we think of a distribution as pseudorandom if it is indistinguishable from the uniform distribution U_\Sigma. (This is all standard in cryptography and complexity theory.)

Now let’s define the natural analog of “dense subset” for distributions. We say that a distribution A is \delta-dense in B if for every x\in \Sigma we have

Pr [ B=x] \geq \delta Pr [A=x]

Note that if B=U_T and A=U_S for some sets S,T, then A is \delta-dense in B if and only if S\subseteq T and |S| \geq \delta |T|.
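In code, these two notions are one-liners; here is a tiny sketch (toy definitions only, with distributions represented as dictionaries over \Sigma; the example at the end just re-checks the remark about U_S and U_T):

```python
# epsilon-indistinguishability and delta-density for distributions over a finite set.
def indistinguishable(X, Y, F, eps):
    """|E[f(X)] - E[f(Y)]| <= eps for every test f in F."""
    return all(abs(sum(p * f(x) for x, p in X.items())
                   - sum(p * f(y) for y, p in Y.items())) <= eps for f in F)

def dense(A, B, delta):
    """A is delta-dense in B: Pr[B = x] >= delta * Pr[A = x] for every x."""
    return all(B.get(x, 0.0) >= delta * p for x, p in A.items())

def uniform(S):
    return {x: 1.0 / len(S) for x in S}

S, T, delta = {0, 1, 2}, set(range(10)), 0.25
print(dense(uniform(S), uniform(T), delta),            # True ...
      S <= T and len(S) >= delta * len(T))             # ... and so is the set criterion
f = lambda x: 1.0 if x in S else 0.0                   # a single "easy" test
print(indistinguishable(uniform(S), uniform(T), [f], eps=0.5))   # False: f tells them apart
```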

So we want to prove the following:

Theorem (Green, Tao, Ziegler)
Fix a family F of tests and an \epsilon>0; then there is a “slightly larger” family F' and an \epsilon'>0 such that if R is an \epsilon'-pseudorandom distribution according to F' and D is \delta-dense in R, then there is a distribution M that is \delta-dense in U_\Sigma and that is \epsilon-indistinguishable from D according to F.

[The reader may want to go back to (1.1) and check that this is a meaningful formalization of it, up to working with arbitrary distributions rather than sets. This is in fact the “inaccuracy” that I referred to above.]

In a complexity-theoretic setting, we would like to say that if F is defined as all functions computable by circuits of size at most s, then \epsilon' should be poly (\epsilon,\delta) and F' should contain only functions computable by circuits of size s\cdot poly(1/\epsilon,1/\delta). Unfortunately, if one follows the proof and makes some simplifications assuming F contains only boolean functions, one sees that F' contains functions of the form g(x) = h(f_1(x),\ldots,f_k(x)), where f_i \in F, k = poly(1/\epsilon,1/\delta), and h could be arbitrary and, in general, have circuit complexity exponential in 1/\epsilon and 1/\delta. Alternatively one may approximate h() as a low-degree polynomial and take the “most distinguishing monomial.” This will give a version of the Theorem (which leads to the actual statement of Thm 7.1 in the Tao-Ziegler paper) where F' contains only functions of the form \Pi_{i=1}^k f_i(x), but then \epsilon' will be exponentially small in 1/\epsilon and 1/\delta. This means that one cannot apply the theorem to “cryptographically strong” notions of pseudorandomness and indistinguishability, and in general to any setting where 1/\epsilon and 1/\delta are super-logarithmic (not to mention super-linear).

This seems like an unavoidable consequence of the “finitary ergodic theoretic” technique of iterative partitioning and energy increment used in the proof, which always yields at least a singly exponential complexity.

Omer Reingold, Madhur Tulsiani, Salil Vadhan and I have recently come up with a different proof where both \epsilon' and the complexity of F' are polynomial. This gives, for example, a new characterization of the notion of pseudoentropy. Our proof is quite in the spirit of Nisan’s proof of Impagliazzo’s hard-core set theorem, and it is relatively simple. We can also deduce a version of the theorem where, as in Green-Tao-Ziegler, F' contains only bounded products of functions in F. In doing so, however, we too incur an exponential loss, but the proof is somewhat simpler and demonstrates the applicability of complexity-theoretic techniques in arithmetic combinatorics.

Since we can use (ideas from) a proof of the hard core set theorem to prove the Green-Tao-Ziegler result, one may wonder whether one can use the “finitary ergodic theory” techniques of iterative partitioning and energy increment to prove the hard-core set theorem. Indeed, we do this too. In our proof, the reduction loses a factor that is exponential in certain parameters (while other proofs are polynomial), but one also gets a more “constructive” result.

If readers can stomach it, a forthcoming post will describe the complexity-theory-style proof of the Green-Tao-Ziegler result as well as the ergodic-theory-style proof of the Impagliazzo hard core set theorem.