CS276 Lecture 3: Pseudorandom Generators

Scribed by Bharath Ramsundar

Summary

Last time we introduced the setting of one-time symmetric key encryption, defined the notion of semantic security, and proved its equivalence to message indistinguishability.

Today we complete the proof of equivalence (found in the notes for last class), discuss the notion of pseudorandom generator, and see that it is precisely the primitive that is needed in order to have message-indistinguishable (and hence semantically secure) one-time encryption. Finally, we shall introduce the basic definition of security for protocols which send multiple messages with the same key.

1. Pseudorandom Generators And One-Time Encryption

Intuitively, a Pseudorandom Generator is a function that takes a short random string and stretches it to a longer string that is almost random, in the sense that no algorithm of bounded complexity can distinguish the output from a truly random string with more than negligible probability.

Definition 1 [Pseudorandom Generator] A function {G: \{ 0,1 \}^k \rightarrow \{ 0,1 \}^m} is a {(t,\epsilon)}-secure pseudorandom generator if for every boolean function {T} of complexity at most {t} we have

\displaystyle   \left | {\mathbb P}_{x\sim U_k } [ T(G(x)) = 1] - {\mathbb P} _{x\sim U_m} [ T(x) = 1] \right| \leq \epsilon \ \ \ \ \ (1)

(We use the notation {U_n} for the uniform distribution over {\{ 0,1 \}^n}.)

The definition is interesting when {m> k} (otherwise the generator can simply output the first m bits of the input, and satisfy the definition with {\epsilon=0} and arbitrarily large {t}). Typical parameters we may be interested in are {k=128}, {m=2^{20}}, {t=2^{60}} and {\epsilon = 2^{-40}}, that is we want {k} to be very small, {m} to be large, {t} to be huge, and {\epsilon} to be tiny. There are some unavoidable trade-offs between these parameters.

Lemma 2 If {G: \{ 0,1 \}^k \rightarrow \{ 0,1 \}^m} is {(t,2^{-k-1})} pseudorandom with {t = O(m)}, then {k\geq m-1}.

Proof: Pick an arbitrary {y \in \{ 0,1 \}^k}. Define

\displaystyle  T_y(x) = 1 \Leftrightarrow x = G(y)

It is clear that we may implement {T_y} with an algorithm of complexity {O(m)}: all this algorithm has to do is store the value of {G(y)} (which takes space {O(m)}) and compare its input to the stored value (which takes time {O(m)}), for total complexity {O(m)}. Now, note that

\displaystyle {\mathbb P}_{x\sim U_k } [ T_y(G(x)) = 1] \geq \frac{1}{2^k}

since {G(x) = G(y)} at least when {x = y}. Similarly, note that {{\mathbb P} _{x\sim U_m} [ T_y(x) = 1] = \frac{1}{2^m}}, since {T_y(x) = 1} only when {x = G(y)}. Now, by the pseudorandomness of {G}, we have {\frac{1}{2^k} - \frac{1}{2^m} \leq \frac{1}{2^{k+1}}}. Rearranging, this implies

\displaystyle \frac{1}{2^{k+1}} \leq \frac{1}{2^m}

which gives {m \leq k + 1} and consequently {k \geq m - 1}. ◻
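For concreteness, the distinguisher used in the proof can be written in a few lines of Python; the generator is passed in as an arbitrary function, and the helper name and the representation of strings below are illustrative choices, not part of the lemma.

    # Toy sketch of the distinguisher T_y from the proof of Lemma 2.
    # G is any candidate generator, given as a function from k-bit strings
    # to m-bit strings (represented here as Python byte strings).

    def make_distinguisher(G, y):
        """Return T_y, which accepts exactly the string G(y)."""
        target = G(y)                        # stored value, space O(m)
        def T_y(x):
            return 1 if x == target else 0   # comparison, time O(m)
        return T_y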

Exercise 1 Prove that if {G: \{ 0,1 \}^k \rightarrow \{ 0,1 \}^m} is {(t,\epsilon)} pseudorandom, and {k < m}, then

\displaystyle  t \cdot \frac 1 \epsilon \leq O( m \cdot 2^k)

Suppose we have a pseudorandom generator as above. Consider the following encryption scheme:

  • Given a key {K\in \{ 0,1 \}^k} and a message {M \in \{ 0,1 \}^m},

    \displaystyle  Enc(K,M) := M \oplus G(K)

  • Given a ciphertext {C\in \{ 0,1 \}^m} and a key {K\in \{ 0,1 \}^k},

    \displaystyle  Dec(K,C) = C \oplus G(K)

(The XOR operation is applied bit-wise.)
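As a concrete illustration, here is a minimal Python sketch of this scheme; SHAKE-256 is used only as a convenient stand-in for the generator {G} (the lemma below assumes nothing more than some {(t,\epsilon)}-secure pseudorandom generator).

    # One-time encryption from a pseudorandom generator, as defined above.
    # SHAKE-256 stands in for G; any (t,eps)-secure generator could be used.
    import hashlib
    import os

    def G(key: bytes, out_len: int) -> bytes:
        """Stand-in generator stretching `key` to out_len bytes."""
        return hashlib.shake_256(key).digest(out_len)

    def xor_bytes(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    def Enc(key: bytes, message: bytes) -> bytes:
        # Enc(K, M) := M xor G(K)
        return xor_bytes(message, G(key, len(message)))

    def Dec(key: bytes, ciphertext: bytes) -> bytes:
        # Dec(K, C) := C xor G(K); correct because xor is an involution
        return xor_bytes(ciphertext, G(key, len(ciphertext)))

    key = os.urandom(16)   # k = 128 bits
    assert Dec(key, Enc(key, b"attack at dawn")) == b"attack at dawn"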

It’s clear by construction that the encryption scheme is correct. Regarding the security, we have

Lemma 3 If {G} is {(t,\epsilon)}-pseudorandom, then {(Enc,Dec)} as defined above is {(t-m,2\epsilon)}-message indistinguishable for one-time encryption.

Proof: Suppose that {(Enc,Dec)} is not {(t-m, 2\epsilon)}-message indistinguishable for one-time encryption. Then there exist messages {M_1, M_2} and an algorithm {T} of complexity at most {t - m} such that

\displaystyle  \left | {\mathbb P}_{K \sim U_k} [T(Enc(K, M_1)) = 1] - {\mathbb P}_{K \sim U_k} [T(Enc(K, M_2)) = 1] \right | > 2\epsilon

By using the definition of {Enc} we obtain

\displaystyle  \left | {\mathbb P}_{K \sim U_k} [T(G(K) \oplus M_1) = 1] - {\mathbb P}_{K \sim U_k} [T(G(K) \oplus M_2) = 1] \right | > 2\epsilon

Now, we can add and subtract the term {{\mathbb P}_{R \sim U_m} [T(R) = 1]} and use the triangle inequality to obtain

\displaystyle  \left | {\mathbb P}_{K \sim U_k} [T(G(K) \oplus M_1) = 1] - {\mathbb P}_{R \sim U_m} [T(R) = 1] \right | + \left | {\mathbb P}_{R \sim U_m} [T(R) = 1] - {\mathbb P}_{K \sim U_k} [T(G(K) \oplus M_2) = 1] \right | > 2\epsilon

At least one of the two terms must be greater than {\epsilon}. Suppose, without loss of generality, that it is the first:

\displaystyle  \left | {\mathbb P}_{K \sim U_k} [T(G(K) \oplus M_1) = 1] - {\mathbb P}_{R \sim U_m} [T(R) = 1] \right | > \epsilon

Now define {T'(X) = T(X \oplus M_1)}. Then since {H(X) = X \oplus M_1} is a bijection, {{\mathbb P}_{R \sim U_m} [T'(R) = 1] = {\mathbb P}_{R \sim U_m} [T(R) = 1]}. Consequently,

\displaystyle  \left | {\mathbb P}_{K \sim U_k} [T'(G(K)) = 1] - {\mathbb P}_{R \sim U_m} [T'(R) = 1] \right | > \epsilon

Thus, since the complexity of {T} is at most {t - m} and {T'} is {T} plus an xor operation (which takes time {m}), {T'} is of complexity at most {t}. Therefore {G} is not {(t, \epsilon)}-pseudorandom, since there exists an algorithm {T'} of complexity at most {t} that can distinguish between {G}'s output and random strings with probability greater than {\epsilon}. Contradiction. Hence {(Enc, Dec)} is {(t-m, 2\epsilon)}-message indistinguishable. ◻

2. Security for Multiple Encryptions: Plain Version

In the real world, we often need to send more than just one message. Consequently, we have to create new definitions of security for such situations, where we use the same key to send multiple messages. There are in fact multiple possible definitions of security in this scenario. Today we shall only introduce the simplest definition.

Definition 4 [Message indistinguishability for multiple encryptions] {(Enc,Dec)} is {(t,\epsilon)}-message indistinguishable for {c} encryptions if for every {2c} messages {M_1,\ldots,M_c}, {M'_1,\ldots,M'_c} and every {T} of complexity {\leq t} we have

\displaystyle  | {\mathbb P} [ T(Enc(K,M_1), \ldots,Enc(K,M_c)) = 1]

\displaystyle  -{\mathbb P} [ T(Enc(K,M'_1), \ldots,Enc(K,M'_c)) = 1] | \leq \epsilon

Similarly, we define semantic security, and the asymptotic versions.

Exercise 2 Prove that no encryption scheme {(Enc,Dec)} in which {Enc()} is deterministic (such as the scheme for one-time encryption described above) can be secure even for 2 encryptions.

Encryption in some versions of Microsoft Office is deterministic and thus fails to satisfy this definition. (This is just a symptom of bigger problems; the schemes in those versions of Office are considered completely broken.)

If we allow the encryption algorithm to keep state information, then a pseudorandom generator is sufficient to meet this definition. Indeed, usually pseudorandom generators designed for such applications, including RC4, are optimized for this kind of “stateful multiple encryption.”
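To illustrate what “stateful multiple encryption” looks like, here is a rough Python sketch in which the two parties share a key and a counter recording how much of the pseudorandom stream has already been consumed, so that no portion of {G(K)} is ever reused. SHAKE-256 again stands in for the generator; this is only meant to convey the idea, not to describe RC4 or any deployed scheme.

    # Rough sketch of stateful multiple encryption with a shared stream position.
    import hashlib

    class StatefulStreamCipher:
        def __init__(self, key: bytes):
            self.key = key
            self.offset = 0   # bytes of the pseudorandom stream already used

        def _next_pad(self, n: int) -> bytes:
            # Take the next n unused bytes of the stream G(K).
            stream = hashlib.shake_256(self.key).digest(self.offset + n)
            pad = stream[self.offset:]
            self.offset += n
            return pad

        def process(self, data: bytes) -> bytes:
            # Encryption and decryption are the same xor, provided both parties
            # process messages in the same order (and hence at the same offsets).
            return bytes(d ^ p for d, p in zip(data, self._next_pad(len(data))))

    alice = StatefulStreamCipher(b"shared key")
    bob = StatefulStreamCipher(b"shared key")
    c1, c2 = alice.process(b"first message"), alice.process(b"second message")
    assert bob.process(c1) == b"first message"
    assert bob.process(c2) == b"second message"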

Next time, we shall consider a stronger model of multiple message security which will be secure against Chosen Plaintext Attacks.

Bounded Independence, AC0, and Random 3SAT

As reported here, here and here, Mark Braverman has just announced a proof of a 1990 conjecture by Linial and Nisan.

Mark proves that if {C} is an AC0 boolean circuit (with NOT gates and with AND gates and OR gates of unbounded fan-in) of depth {d} and size {S}, and if {X} is any {k}-wise independent distribution with {k = (\log S)^{O(d^2)}}, then

\displaystyle  {\mathbb E} C(U_n) \approx {\mathbb E} C(X)

that is, {X} “fools” the circuit {C} into thinking that {X} is the uniform distribution {U_n} over {\{0,1\}^n}. Plausibly, this might be true even for {k = O((\log S)^{d-1})}.

Nothing was known for depth 3 or more, and the depth-2 case was settled only recently by Bazzi, with a proof that, as you may remember, has been significantly simplified by Razborov about six months ago.

Mark’s proof relies on approximating {C} via low-degree polynomials. The point is that if {p} is an {n}-variate (real valued) polynomial of degree {\leq k}, and {X} is a {k}-wise independent distribution ranging over {\{0,1\}^n}, then

\displaystyle  {\mathbb E} p(X) = {\mathbb E} p(U_n)
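One way to see this equality: over {\{ 0,1 \}^n} we may take {p} to be multilinear, each monomial of {p} involves at most {k} variables, and a {k}-wise independent {X} has the same distribution as {U_n} on any {k} coordinates, so by linearity of expectation

\displaystyle  {\mathbb E}\, p(X) = \sum_{S : |S| \leq k} c_S \, {\mathbb E} \prod_{i\in S} X_i = \sum_{S : |S| \leq k} c_S \, {\mathbb E} \prod_{i\in S} (U_n)_i = {\mathbb E}\, p(U_n)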

Now if we could show that {p} approximates {C} both under {X} and under {U_n}, in the sense that {{\mathbb E} p(X) \approx {\mathbb E} C(X)}, and also {{\mathbb E} p(U_n) \approx {\mathbb E} C(U_n)}, then we would be done.

The Razborov-Smolensky lower bound technique gives a probabilistic construction of a polynomial {p} such that for every input {x}, with high probability over the choice of {p}, one has {p(x) = C(x)}. In particular, one gets a single polynomial {p} such that both

\displaystyle   {\mathbb P}_{x\sim X} [p(x) = C(x) ] = 1-o(1) \ \ \ \ \ (1)

and

\displaystyle   {\mathbb P}_{x\sim U_n} [p(x) = C(x) ] = 1-o(1) \ \ \ \ \ (2)

Unfortunately this is not sufficient, because the polynomial {p} might be very large at a few points, and so even if {p} agrees with {C} with high probability there is no guarantee that the average of {p} is close to the average of {C}.

Using a result of Linial, Mansour and Nisan (developed in the context of learning theory), one can construct a different kind of low-degree approximating polynomial {p}, which is such that

\displaystyle   {\mathbb E}_{x\sim U_n} | C(x) - p(x)|^2 = o(1) \ \ \ \ \ (3)

The Linial-Mansour-Nisan approximation, however, says nothing about the relation between {p} and {C} under the distribution {X}.

Using ideas of Bazzi’s, however, if we had a single polynomial {p} such that properties (1), (2) and (3) are satisfied simultaneously, then we could construct another low-degree polynomial {p'} such that {{\mathbb E} p'(X) \approx {\mathbb E} C(X)}, and also {{\mathbb E} p'(U_n) \approx {\mathbb E} C(U_n)}, giving us that {C} is fooled by {X}.

As far as I understand, Mark constructs a polynomial satisfying properties (1), (2) and (3) by starting from the Razborov-Smolenski polynomial {p}, and then observing that the indicator function {E} of the points on which {p \neq C} is itself a boolean function admitting a Linial-Mansour-Nisan approximation {e}. Defining {p' := p\cdot (1-e)}, we have that {p'} has all the required properties, because multiplying by {1-e} “zeroes out” the points on which {p} is excessively large.

I have been interested in this problem for some time because of a connection with the complexity of 3SAT on random instances.


Decomposition Results and Regularity Lemmas

Several results in additive combinatorics have the flavor of decomposition results in which one shows that an arbitrary combinatorial object can be approximated by a “sum” of a “low complexity” part and a “pseudorandom” part. This is helpful because it reduces the task of proving that an arbitrary object has a certain property to the easier task of proving that low-complexity objects and pseudorandom objects have the property. (Provided that the property is preserved by the approximation and the sum.)

Many examples of such results are given in the paper accompanying Terry Tao’s tutorial at FOCS’07.

Usually, one needs a strong notion of approximation, and in such cases the complexity of the “low complexity” part is fantastically large (a tower of exponentials in the approximation parameter), as in the Szemeredi Regularity Lemma, which can be seen as a prototypical such “decomposition” result. In some cases, however, weaker notions of approximation are still useful, and one can have more reasonable complexity parameters, as in the Weak Regularity Lemma of Frieze and Kannan.

A toy example of a “weak decomposition result” is described in the rest of the post.

Bounded Independence and DNFs

In 1990, Linial and Nisan conjectured that every AC0 function is “fooled” by polylog-wise independence. This means that if f: \{ 0,1 \}^n \rightarrow \{ 0,1\} is computed by a circuit of size S and depth d, and X is a distribution over \{ 0,1 \}^n such that every k bits are mutually independent and uniformly distributed (such a distribution is called k-wise independent), then

(1) \ \ \ {\mathbb E} f (X) \approx {\mathbb E} f(U_n)

where U_n is the uniform distribution over \{0,1\}^n, provided that k is around (\log S)^{O(d)}. (Maybe even around (\log S)^{d-1}.)

Nearly 20 years later, not much is known about this conjecture, except for a breakthrough result by Louay Bazzi from last year. Bazzi proves that (1) holds if f is computed by a size-S depth-2 circuit (that is, a CNF or a DNF) and X is a O( (\log S)^2)-wise independent distribution. This settles the depth-2 case of the Linial-Nisan conjecture.

A result of Linial, Mansour and Nisan plays an important role in the proof. The result is that if f is computed by a DNF of size S, then there is a function g: \{0,1\}^n \rightarrow {\mathbb R} that can be written as a degree-d polynomial with d= O((\log S)^2) and that is “close” to f, in the sense that

(2) \ \ \ {\mathbb E}  [ |f-g|^2]

is small.

Now, if g is a degree-d polynomial, and X is a d-wise independent distribution, we have

{\mathbb E} g(X) = {\mathbb E} g(U_n)

and we also have {\mathbb E} g(U_n) \approx  {\mathbb E} f(U_n) from (2). If we also had {\mathbb E} g(X) \approx {\mathbb E} f(X), then we would be done.

Unfortunately, we have no way to relate {\mathbb E} g(X) and {\mathbb E} f(X), because the fact that f and g are close does not imply that they have similar expectations under the distribution X, which could be concentrated on the few inputs on which f and g are very different.

Although the Linial-Mansour-Nisan result does not immediately give us the desired result, it is intuitively useful to be able to approximate a DNF by low-degree polynomials, and to know that low-degree polynomials are perfectly “fooled” by d-wise independent distributions.

Bazzi observes that the following stronger polynomial approximation property is sufficient to prove that a distribution fools a given function.

Suppose that f:\{0,1\}^n \rightarrow \{0,1\} is a function, and that we are able to find two degree-d polynomials g_L,g_U such that for every x \in \{0,1\}^n we have

(3) \ \ \ g_L(x) \leq f(x) \leq g_U(x)

and

(4) \ \ \ {\mathbb E} [ g_U - g_L ]  \leq \epsilon

where the expectation in (4) is taken under the uniform distribution. Then it is easy to see that if X is d-wise independent, then

{\mathbb E} g_L(X) \leq {\mathbb E} f(X) \leq {\mathbb E} g_U(X)

and

{\mathbb E} g_L(U_n) \leq {\mathbb E} f(U_n) \leq {\mathbb E} g_U(U_n)

so that

(5) \ \ \ | {\mathbb E} f(U_n)  - {\mathbb E} f(X) | \leq \epsilon
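To spell out how (5) follows: since g_L and g_U have degree d and X is d-wise independent, we also have {\mathbb E} g_L(X) = {\mathbb E} g_L(U_n) and {\mathbb E} g_U(X) = {\mathbb E} g_U(U_n), and therefore

{\mathbb E} f(X) \leq {\mathbb E} g_U(X) = {\mathbb E} g_U(U_n) \leq {\mathbb E} g_L(U_n) + \epsilon \leq {\mathbb E} f(U_n) + \epsilon

{\mathbb E} f(U_n) \leq {\mathbb E} g_U(U_n) \leq {\mathbb E} g_L(U_n) + \epsilon = {\mathbb E} g_L(X) + \epsilon \leq {\mathbb E} f(X) + \epsilon

which together give (5).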

It is also possible to use linear programming duality to argue that if f is fooled by d-wise independent distributions, then degree-d polynomials g_L,g_U as above must exist.

Bazzi then shows that, given a size-S DNF f, it is possible to find degree-d polynomials g_L,g_U as above with d = O((\log S)^2).

Finding such polynomials, in turn, is reduced rather cleverly to the task of finding a single polynomial g such that {\mathbb E} |f-g|^2 is small, as in Linial-Mansour-Nisan and, in addition, g has the property that whenever f(x)=0 then we always must have g(x)=0.
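One reason the one-sided property helps (this is only an easy observation, not Bazzi’s actual reduction): given such a g, define g_L := 1 - (1-g)^2, a polynomial of degree 2d, which is still O((\log S)^2). Whenever f(x)=0 the one-sided property gives g(x)=0 and hence g_L(x)=0, while if f(x)=1 then g_L(x) \leq 1; so g_L \leq f pointwise, and

{\mathbb E} [ f - g_L ] = {\mathbb E} [ 1_{\{f=1\}} \cdot (f-g)^2 ] \leq {\mathbb E} [ |f-g|^2 ]

which is small. Obtaining the matching upper polynomial g_U requires more than this simple observation.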

The construction of this “one sided” low-degree approximation of f is the bulk of Bazzi’s paper. A couple of days ago, Razborov posted an exceedingly neat construction, which he explains in a page and a half! (Theorem 7 in the link above.)

It is rather appropriate that this has happened during the Banff Workshop on Analytic Tools in Computational Complexity, and tomorrow Scott Aaronson will give an impromptu talk on Razborov’s proof, hot from the presses.

Dense Subsets of Pseudorandom Sets: The Paper(s)

Green, Tao and Ziegler, in their works on patterns in the primes, prove a general result of the following form: if X is a set, R is a, possibly very sparse, “pseudorandom” subset of X, and D is a dense subset of R, then D may be “modeled” by a large set M which has the same density in X as the density of D in R.

They use this result with X being the integers in a large interval \{ 1,\ldots, N\}, R being the “almost-primes” in X (integers with no small factor), and D being the primes in X. Since the almost-primes can be proved to be “pseudorandom” in a fairly strong sense, and since the density of the primes in the almost-primes is at least an absolute constant, it follows that the primes are “indistinguishable” from a large set M containing a constant fraction of all integers. Since such large sets are known to contain arbitrarily long arithmetic progressions, as proved by Szemeredi, Green and Tao are able to prove that the primes too must contain arbitrarily long arithmetic progressions. Such large sets are also known to contain arbitrarily long “polynomial progressions,” as proved by Bergelson and Leibman, and this allows Tao and Ziegler to argue that the primes too must contain arbitrarily long polynomial progressions.

(The above account is not completely accurate, but it is not lying too much.)

As announced last October here, and here, Omer Reingold, Madhur Tulsiani, Salil Vadhan and I found a new proof of this “dense model” theorem, which uses the min-max theorem of game theory (or, depending on the language that you prefer to use, the duality of linear programming or the Hahn-Banach theorem) and was inspired by Nisan’s proof of the Impagliazzo hard-core set theorem. In complexity-theoretic applications of the theorem, our reduction has polynomial complexity, while the previous work incurred an exponential loss.

As discussed here and here, we also show how to use the Green-Tao-Ziegler techniques of “iterative partitioning” to give a different proof of Impagliazzo’s theorem.

After long procrastination, we recently wrote up a paper about these results.

In the Fall, we received some feedback from additive combinatorialists that while our proof of the Green-Tao-Ziegler result was technically simpler than the original one, the language we used was hard to follow. (That’s easy to believe, because it took us a while to understand the language in which the original proof was written.) We then wrote an expository note of the proof in the analyst’s language. When we were about to release the paper and the note, we were contacted by Tim Gowers, who, last Summer, had independently discovered a proof of the Green-Tao-Ziegler results via the Hahn-Banach theorem, essentially with the same argument. (He also found other applications of the technique in additive combinatorics. The issue of polynomial complexity, which does not arise in his applications, is not considered.)

Gowers announced his results in April at a talk at the Fields institute in Toronto. (Audio and slides are available online.)

Gowers’ paper already contains the proof presented in the “analyst language,” making our expository note not so useful any more; we have still posted it anyway because, by explaining how one translates from one notation to the other, it can be a short starting point for the computer scientist who is interested in trying to read Gowers’ paper, or for the combinatorialist who is interested in trying to read our paper.

Pseudorandomness for Polynomials

I am currently in Hong Kong for my second annual winter break visit to the Chinese University of Hong Kong. If you are around, come to CUHK on Tuesday afternoon for a series of back-to-back talks by Andrej Bogdanov and me.

First, I’d like to link to this article by Gloria Steinem. (It’s old but I have been behind with my reading.) I believe this presidential campaign will bring up serious reflections on issues of gender and race, and I look forward to the rest of it.

Secondly, I’d like to talk about pseudorandomness against low-degree polynomials.

Naor and Naor constructed in 1990 a pseudorandom generator whose output is pseudorandom against tests that compute affine functions in \mathbb F_2. Their construction maps a seed of length O(\log (n /\epsilon)) into an n-bit string in {\mathbb F}_2^n such that if L: {\mathbb F}_2^n \to {\mathbb F}_2 is an arbitrary affine function, X is the distribution of outputs of the generator, and U is the uniform distribution over {\mathbb F}_2^n, we have

(1) | Pr [ L(X)=1] - Pr [ L(U)=1] | \leq \epsilon

This has numerous applications, and it is related to other problems. For example, if C\subseteq {\mathbb F}_2^m is a linear error-correcting code with 2^k codewords, and if it is such that any two codewords differ in at least a \frac 12 - \epsilon fraction of coordinates, and in at most a \frac 12 + \epsilon fraction, then one can derive from the code a Naor-Naor generator mapping a seed of length \log m into an output of length k. (It is a very interesting exercise to figure out how.)

Here is another connection: Let S be the (multi)set of outputs of a Naor-Naor generator over all possible seeds, and consider the Cayley graph constructed over the additive group of {\mathbb F}_2^n using S as a set of generators. (That is, take the graph that has a vertex for every element of \{0,1\}^n, and an edge between u and u+s for every s\in S, where operations are mod 2 and componentwise.) Then this graph is an expander: the largest eigenvalue is |S|, the degree, and all other eigenvalues are at most \epsilon |S| in absolute value. (Here too it’s worth figuring out the details by oneself. The hint is that in a Cayley graph of an abelian group the eigenvectors are always the characters, regardless of what generators are chosen.)

In turn this means that if we pick X uniformly and Y according to a Naor-Naor distribution, and if A\subseteq {\mathbb F}_2^n is a reasonably large set, then the events X\in A and X+Y \in A are nearly independent. This wouldn’t be easy to argue directly from the definition (1), and it is an example of the advantages of this connection.
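To expand the hint a little: the eigenvalue of the Cayley graph corresponding to the character \chi_a(x) = (-1)^{\langle a,x \rangle} is

\lambda_a = \sum_{s\in S} (-1)^{\langle a,s \rangle} = |S| \cdot {\mathbb E}_{s\sim S} [(-1)^{\langle a,s \rangle}]

which equals |S| when a=0; bounding it by \epsilon |S| in absolute value for every a \neq 0 is precisely where the small-bias property of S enters (up to a factor of 2, depending on whether \epsilon bounds the bias {\mathbb E}_{s}[(-1)^{\langle a,s \rangle}] or the distinguishing probability in (1)).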

There is more. If f: \{0,1\}^n \rightarrow \{0,1\} is such that the sum of the absolute values of the Fourier coefficients is t, X is a Naor-Naor distribution, and U is uniform, we have
| Pr [ f(X)=1] - Pr [ f(U)=1] | \leq t \epsilon
and so a Naor-Naor distribution is pseudorandom against f too, if t is not too large. This has a number of applications: Naor-Naor distributions are pseudorandom against tests that look only at a bounded number of bits, they are pseudorandom against functions computable by read-once branching programs of width 2, and so on.

Given all these wonderful properties, it is natural to ask whether we can construct generators that are pseudorandom against quadratic polynomials over {\mathbb F}_2^n, and, in general, low-degree polynomials. This question has been open for a long time. Luby, Velickovic, and Wigderson constructed such a generator with seed length 2^{(\log n)^{1/2}}, using the Nisan-Wigderson methodology, and this was not improved upon for more than ten years.

When dealing with polynomials, several difficulties arise that are not present when dealing with linear functions. One is the analytical machinery: pseudorandomness against linear functions corresponds to Fourier analysis, while until the development of Gowers uniformity there was no analogous analytical tool to reason about pseudorandomness against polynomials (and even Gowers uniformity is unsuitable to reason about very small sets). Another difference is that, in Equation (1), we know that Pr [L(U)=1] = \frac 12, except for constant functions (against which pseudorandomness is trivial). This means that in order to prove (1) it suffices to show that Pr[L(X)=1] \approx \frac 12 for every non-constant L. When we deal with a quadratic polynomial p, the value Pr [p(U)=1] can be all over the place between 1/4 and 3/4 (for non-constant polynomials), and so we cannot simply prove that Pr[p(X)=1] is close to a certain known value.

A first breakthrough with this problem came with the work of Bogdanov on the case of large fields. (Above I stated the problem for {\mathbb F}_2, but it is well defined for every finite field.) I don’t completely understand his paper, but one of the ideas is that if p is an absolutely irreducible polynomial (meaning it does not factor even in the algebraic closure of {\mathbb F}), then p(U) is close to uniform over the field {\mathbb F}; so to analyze his generator construction in this setting one “just” has to show that p(X) is nearly uniform, where X is the output of his generator. If p factors then somehow one can analyze the construction “factor by factor,” or something to this effect. This approach, however, is not promising for the case of small fields, where the absolutely irreducible polynomial x_1 + x_2 x_3 has noticeable bias.

The breakthrough for the boolean case came with the recent work of Bogdanov and Viola. Their starting point is the proof that if X and Y are two independent Naor-Naor generators, then X+Y is pseudorandom for quadratic polynomials. To get around the unknown bias problem, they divide the analysis into two cases. First, it is known that, up to affine transformations, a quadratic polynomial can be written as x_1x_2 + x_3x_4 + \cdots + x_{k-1} x_k, so, since applying an affine transformation to a Naor-Naor generator gives a Naor-Naor generator, we may assume our polynomial is in this form.

  • Case 1: if k is small, then the polynomial depends on few variables, and so even just one Naor-Naor distribution is going to be pseudorandom against it;
  • Case 2: if k is large, then the polynomial has very low bias, that is, Pr[p(U)=1] \approx \frac 12 (see the quick calculation after this list). This means that it is enough to prove that Pr[p(X+Y)=1] \approx \frac 12, which can be done using (i) Cauchy-Schwarz, (ii) the fact that U and U+X are nearly independent if U is uniform and X is Naor-Naor, and (iii) the fact that for fixed x the function y \rightarrow p(x+y) - p(x) is linear.
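A quick calculation behind Case 2: for p(x) = x_1x_2 + \cdots + x_{k-1}x_k and uniform bits,

{\mathbb E} [(-1)^{p(U)}] = \prod_{i=1}^{k/2} {\mathbb E} [(-1)^{x_{2i-1}x_{2i}}] = 2^{-k/2}

so Pr[p(U)=1] = \frac 12 - 2^{-k/2-1}: when k is small the polynomial depends on few variables, and when k is large its bias is exponentially small.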

Now, it would be nice if every degree-3 polynomial could be written, up to affine transformations, as x_1x_2 x_3 + x_4x_5x_6 + \cdots, but there is no such characterization, so one has to find the right way to generalize the argument.

In the Bogdanov-Viola paper, they prove

  • Case 1: if p of degree d is correlated with a degree d-1 polynomial, and if R is a distribution that is pseudorandom against degree d-1 polynomials, then R is also pseudorandom against p;
  • Case 2: if p of degree d has small Gowers uniformity norm of dimension d, then Pr [p(U)=1] \approx \frac 12, which was known, and if R is pseudorandom for degree d-1 and X is a Naor-Naor distribution, then Pr[p(R+X)=1] \approx \frac 12 too.

There is a gap between the two cases, because Case 1 requires correlation with a polynomial of degree d-1 and Case 2 requires small Gowers uniformity U^d. The Gowers norm inverse conjecture of Green and Tao is that a noticeably large U^d norm implies a noticeable correlation with a degree d-1 polynomial, and so it fills the gap. The conjecture was proved by Samorodnitsky for d=3 in the boolean case and for larger fields and d=3 by Green and Tao. Assuming the conjecture, the two cases combine to give an inductive proof that if X_1,\ldots,X_d are d independent Naor-Naor distributions then X_1+\ldots+X_d is pseudorandom for every degree-d polynomial.

Unfortunately, Green and Tao and Lovett, Meshulam, and Samorodnitsky prove that the Gowers inverse conjecture fails (as stated above) for d\geq 4 in the boolean case.

Lovett has given a different argument to prove that the sum of Naor-Naor generators is pseudorandom for low-degree polynomials. His analysis also breaks down in two cases, but the cases are defined based on the largest Fourier coefficient of the polynomial, rather than based on its Gowers uniformity. (Thus, his analysis does not differ from the Bogdanov-Viola analysis for quadratic polynomials, because the dimension-2 Gowers uniformity measures the largest Fourier coefficient, but it differs when d\geq 3.) Lovett’s analysis only shows that X_1 +\cdots + X_{2^{d-1}} is pseudorandom for degree-d polynomials, where X_1,\ldots,X_{2^{d-1}} are 2^{d-1} independent Naor-Naor generators, compared to the d that would have sufficed in the conjectural analysis of Bogdanov and Viola.

The last word on this problem (for now) is this paper by Viola, where he shows that the sum of d independent Naor-Naor generators is indeed pseudorandom for degree-d polynomials.

Again, there is a case analysis, but this time the cases depend on whether or not Pr [p(U)=1] \approx \frac 12.

If p(U) is noticeably biased (this corresponds to a small k in the quadratic model case), then it follows from the previous Bogdanov-Viola analysis that a distribution that is pseudorandom against degree d-1 polynomials will also be pseudorandom against p.

The other case is when p(U) is nearly unbiased, and we want to show
that p(X_1+\ldots +X_d) is nearly unbiased. Note how weak this assumption is, compared to the assumption that p has small dimension-d Gowers norm (in Bogdanov-Viola) or that all Fourier coefficients of p are small (in Lovett). The same three tools that work in the quadratic case, however, work here too, in a surprisingly short proof.

Terminology

Different communities have different traditions for terminology. Mathematicians appropriate common words, like ring, field, scheme, ideal,… and the modern usage of the term bears no connection with the everyday meaning of the word. Physicists have a lot of fun with their sparticles and their strangeness and charm and so on. Theoretical computer scientists, like the military, and NASA, prefer acronyms.

We have some isolated examples of felicitous naming. Expander, for example, is great: it sounds right and it is suggestive of the technical meaning. Extractor is my favorite, combining a suggestion of the meaning with a vaguely threatening sound. I think it’s too bad that seedless extractor has come to pass, because it evokes some kind of device to get grape juice. (I was on the losing side that supported deterministic extractor.)

Unfortunate namings are of course more common. Not only is the complexity class PP embarrassing to pronounce, but its name, derived from Probabilistic Polynomial time, is a poor description of it. By analogy with #P and $\oplus$P, it should be called MajP.

I heard the story of a famous (and famously argumentative) computer scientist complaining to one of the authors of the PCP theorem about the term PCP, which stands for Probabilistically Checkable Proof. “I too can define a probabilistic checker for SAT certificates,” he supposedly said, “with probability half check the certificate, with probability half accept without looking at it.” The point being that the terminology emphasizes a shortcoming of the construction (the probabilistic verification) instead of the revolutionary feature (the constant query complexity). Personally, I would prefer Locally Testable Proof.

Of course we will never change the name of PP or PCP, and the seedless extractors are here to stay, but there is one terminology change for which I’d like to start a campaign.

Naor and Naor constructed in 1990 a pseudorandom generator whose output is secure against linear tests. They called such a generator $\epsilon$-biased if the distinguishing probability of every linear test is at most $\epsilon$. Such generators have proved to be extremely useful in a variety of applications, most recently in the Bogdanov-Viola construction of pseudorandom generators against degree-2 polynomials.

Shall we start calling such generators $\epsilon$-unbiased? Seeing as it is the near lack of bias, rather than its presence, which is the defining feature of such generators?

(I know the reason for the Naor-Naor terminology: zero-bias generator makes perfect sense, while zero-unbiased makes no sense. But how about the fact that it is technically correct to say that the uniform distribution is $\frac {1}{10}$-biased?)

[Update: earlier posts on the same topic here and here]

The "Complexity Theory" Proof of a Theorem of Green-Tao-Ziegler

We want to prove that a dense subset of a pseudorandom set is indistinguishable from a truly dense set.

Here is an example of what this implies: take a pseudorandom generator of output length n, choose in an arbitrary way a 1% fraction of the possible seeds of the generator, and run the generator on a random seed from this restricted set; then the output of the generator is indistinguishable from being a random element of a set of size \frac 1 {100} \cdot 2^n.

(Technically, the theorem states the existence of a distribution of min-entropy n - \log_2 100, but one can also get the above statement by standard “rounding” techniques.)

As a slightly more general example, if you have a generator G mapping a length-t seed into an output of length n, and Z is a distribution of seeds of min-entropy at least t-d, then G(Z) is indistinguishable from a distribution of min-entropy n-d. (This, however, works only if d = O(\log n).)

It’s time to give a formal statement. Recall that we say that a distribution D is \delta-dense in a distribution R if

\forall x. Pr[R=x] \geq \delta \cdot Pr [D=x]

(Of course I should say “random variable” instead of “distribution,” or write things differently, but we are between friends here.)

We want to say that if F is a class of tests, R is pseudorandom according to a moderately larger class F', and D is \delta-dense in R, then there is a distribution M that is indistinguishable from D according to F and that is \delta-dense in the uniform distribution.

The Green-Tao-Ziegler proof of this result becomes slightly easier in our setting of interest (where F contains boolean functions) and gives the following statement:

Theorem (Green-Tao-Ziegler, Boolean Case)
Let \Sigma be a finite set, F be a class of functions f:\Sigma \to \{0,1\}, R be a distribution over \Sigma, D be a \delta-dense distribution in R, \epsilon>0 be given.

Suppose that for every M that is \delta-dense in U_\Sigma there is an f\in F such that
| Pr[f(D)=1] - Pr[f(M)=1] | >\epsilon

Then there is a function h:\Sigma \rightarrow \{0,1\} of the form h(x) = g(f_1(x),\ldots,f_k(x)) where k = poly(1/\epsilon,1/\delta) and f_i \in F such that
| Pr [h(R)=1] - Pr [ h(U_\Sigma) =1] | > poly(\epsilon,\delta)

Readers should take a moment to convince themselves that the above statement is indeed saying that if R is pseudorandom then D has a model M, by equivalently saying that if no model M exists then R is not pseudorandom.

The problem with the above statement is that g can be arbitrary and, in particular, it can have circuit complexity exponential in k, and hence in 1/\epsilon.

In our proof, instead, g is a linear threshold function, realizable by a circuit of size O(k). Another improvement is that k=poly(1/\epsilon,\log 1/\delta).

Here is the proof by Omer Reingold, Madhur Tulsiani, Salil Vadhan, and me. Assume that F is closed under complement (otherwise work with the closure of F); then the assumption of the theorem can be restated without absolute values:

for every M that is \delta-dense in U_\Sigma there is an f\in F such that
Pr[f(D)=1] - Pr[f(M) = 1] >\epsilon

We begin by finding a “universal distinguisher.”

Claim
There is a function \bar f:\Sigma \rightarrow [0,1] which is a convex combination of functions from F and such that for every M that is \delta-dense in U_\Sigma,
E[\bar f(D)] - E[\bar f(M)]  >\epsilon

This can be proved via the min-max theorem for two-player games or, equivalently, via the duality of linear programming or, as an analyst would say, via the Hahn-Banach theorem.

Let now S be the set of \delta |\Sigma| elements of \Sigma where \bar f is largest. We must have
(1) E[\bar f(D)] - E[\bar f(U_S)] >\epsilon
which implies that there must be a threshold t such that
(2) Pr[\bar f(D)\geq t] - Pr[\bar f(U_S) \geq t]  >\epsilon
So we have found a boolean distinguisher between D and U_S. Next,
we claim that the same distinguisher works between R and U_\Sigma.

By the density assumption, we have
Pr[\bar f(R)\geq t] \geq \delta \cdot Pr[\bar f(D) \geq t]

and since S contains exactly a \delta fraction of \Sigma, and since the condition \bar f(x) \geq t always fails outside of S (why?), we then have
Pr[\bar f(U_\Sigma)\geq t] = \delta \cdot Pr[\bar f(U_S) \geq t]
and so
(3) Pr[\bar f(R)\geq t] - Pr[\bar f(U_\Sigma) \geq t]  >\delta \epsilon

Now, it’s not clear what the complexity of \bar f is: it could be a convex combination involving all the functions in F. However, by Chernoff bounds, there must be functions f_1,\ldots,f_k with k=poly(1/\epsilon,\log 1/\delta) such that \bar f(x) is well approximated by \sum_i f_i(x) / k for all x but for an exceptional set having density less than, say, \delta\epsilon/10, according to both R and U_\Sigma.
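One standard way to carry out this sampling step (the constants here are only indicative): pick f_1,\ldots,f_k independently, each distributed according to the convex combination defining \bar f. For every fixed x, Hoeffding’s inequality gives

Pr_{f_1,\ldots,f_k} [ | \frac 1k \sum_i f_i(x) - \bar f(x) | > \epsilon/4 ] \leq 2 e^{-k\epsilon^2/8}

and, averaging over x \sim R and x \sim U_\Sigma and then over the random choice of f_1,\ldots,f_k, there is a fixed choice for which the exceptional set has density at most \delta\epsilon/10 under both distributions, once k is a suitable poly(1/\epsilon,\log 1/\delta).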

Now R and U_\Sigma are distinguished by the predicate \sum_{i=1}^k f_i(x) \geq tk, which is just a linear threshold function applied to a small set of functions from F, as promised.

Actually I have skipped an important step: outside of the exceptional set, \sum_i f_i(x)/k is going to be close to \bar f(x) but not identical, and this could lead to problems. For example, in (3) \bar f(R) might typically be larger than t only by a tiny amount, and \sum_i f_i(x)/k might consistently underestimate \bar f in R. If so, Pr [ \sum_{i=1}^k f_i(R) \geq tk ] could be a completely different quantity from Pr [\bar f(R)\geq t].

To remedy this problem, we note that, from (1), we can also derive the more “robust” distinguishing statement
(2′) Pr[\bar f(D)\geq t+\epsilon/2] - Pr[\bar f(U_S) \geq t]  >\epsilon/2
from which we get
(3′) Pr[\bar f(R)\geq t+\epsilon/2] - Pr[\bar f(U_\Sigma) \geq t]  >\delta \epsilon/2

And now we can be confident that even replacing \bar f with an approximation we still get a distinguisher.

The statement needed in number-theoretic applications is stronger in a couple of ways. One is that we would like F to contain bounded functions f:\Sigma \rightarrow [0,1] rather than boolean-valued functions. Looking back at our proof, this makes no difference. The other is that we would like h(x) to be a function of the form h(x) = \Pi_{i=1}^k f_i(x) rather than a general composition of functions f_i. This we can achieve by approximating a threshold function by a polynomial of degree poly(1/\epsilon,1/\delta) using the Weierstrass theorem, and then choosing the most distinguishing monomial. This gives a proof of the following statement, which is equivalent to Theorem 7.1 in the Tao-Ziegler paper.

Theorem (Green-Tao-Ziegler, General Case)
Let \Sigma be a finite set, F be a class of functions f:\Sigma \to [0,1], R be a distribution over \Sigma, D be a \delta-dense distribution in R, \epsilon>0 be given.

Suppose that for every M that is \delta-dense in U_\Sigma there is an f\in F such that
| E[f(D)] - E[f(M)] | >\epsilon

Then there is a function h:\Sigma \rightarrow [0,1] of the form h(x) = \Pi_{i=1}^k f_i(x) where k = poly(1/\epsilon,1/\delta) and f_i \in F such that
| E[h(R)] - E[h(U_\Sigma)] | > exp(-poly(1/\epsilon,1/\delta))

In this case, we too lose an exponential factor. Our proof, however, has some interest even in the number-theoretic setting because it is somewhat simpler than and genuinely different from the original one.

Dense Subsets of Pseudorandom Sets

The Green-Tao theorem states that the primes contain arbitrarily long arithmetic progressions; its proof can be, somewhat inaccurately, broken up into the following two steps:

Thm1: Every constant-density subset of a pseudorandom set of integers contains arbitrarily long arithmetic progressions.

Thm2: The primes have constant density inside a pseudorandom set.

Of those, the main contribution of the paper is the first theorem, a “relative” version of Szemeredi’s theorem. In turn, its proof can be (even more inaccurately) broken up as

Thm 1.1: For every constant density subset D of a pseudorandom set there is a “model” set M that has constant density among the integers and is indistinguishable from D.

Thm 1.2 (Szemeredi) Every constant density subset of the integers contains arbitrarily long arithmetic progressions, and many of them.

Thm 1.3 A set with many long arithmetic progressions cannot be indistinguishable from a set with none.

Following this scheme is, of course, easier said than done. One wants to work with a definition of pseudorandomness that is weak enough that Thm 2 is provable, but strong enough that the notion of indistinguishability implied by Thm 1.1 is in turn strong enough that Thm 1.3 holds. From now on I will focus on Thm 1.1, which is a key step in the proof, though not the hardest.

Recently, Tao and Ziegler proved that the primes contain arbitrarily long “polynomial progressions” (progressions where the increments are given by polynomials rather than linear functions, as in the case of arithmetic progressions). Their paper contains a very clean formulation of (1.1), which I will now (accurately, this time) describe. (It is Theorem 7.1 in the paper. The language I use below is very different but equivalent.)

We fix a finite universe \Sigma; this could be \{ 0,1\}^n in complexity-theoretic applications or {\mathbb Z}/N{\mathbb Z} in number-theoretic applications. Instead of working with subsets of \Sigma, it will be more convenient to refer to probability distributions over \Sigma; if S is a set, then U_S is the uniform distribution over S. We also fix a family F of “easy” functions f: \Sigma \rightarrow [0,1]. In complexity-theoretic applications, this could be the set of boolean functions computed by circuits of bounded size. We think of two distributions X,Y as being \epsilon-indistinguishable according to F if for every function f\in F we have

| E [f(X)] - E[f(Y)] | \leq \epsilon

and we think of a distribution as pseudorandom if it is indistinguishable from the uniform distribution U_\Sigma. (This is all standard in cryptography and complexity theory.)

Now let’s define the natural analog of “dense subset” for distributions. We say that a distribution A is \delta-dense in B if for every x\in \Sigma we have

Pr [ B=x] \geq \delta Pr [A=x]

Note that if B=U_T and A=U_S for some sets S,T, then A is \delta-dense in B if and only if S\subseteq T and |S| \geq \delta |T|.

So we want to prove the following:

Theorem (Green, Tao, Ziegler)
Fix a family F of tests and an \epsilon>0; then there is a “slightly larger” family F' and an \epsilon'>0 such that if R is an \epsilon'-pseudorandom distribution according to F' and D is \delta-dense in R, then there is a distribution M that is \delta-dense in U_\Sigma and that is \epsilon-indistinguishable from D according to F.

[The reader may want to go back to (1.1) and check that this is a meaningful formalization of it, up to working with arbitrary distributions rather than sets. This is in fact the “inaccuracy” that I referred to above.]

In a complexity-theoretic setting, we would like to say that if F is defined as all functions computable by circuits of size at most s, then \epsilon' should be poly (\epsilon,\delta) and F' should contain only functions computable by circuits of size s\cdot poly(1/\epsilon,1/\delta). Unfortunately, if one follows the proof and makes some simplifications assuming F contains only boolean functions, one sees that F' contains functions of the form g(x) = h(f_1(x),\ldots,f_k(x)), where f_i \in F, k = poly(1/\epsilon,1/\delta), and h could be arbitrary and, in general, have circuit complexity exponential in 1/\epsilon and 1/\delta. Alternatively one may approximate h() as a low-degree polynomial and take the “most distinguishing monomial.” This will give a version of the Theorem (which leads to the actual statement of Thm 7.1 in the Tao-Ziegler paper) where F' contains only functions of the form \Pi_{i=1}^k f_i(x), but then \epsilon' will be exponentially small in 1/\epsilon and 1/\delta. This means that one cannot apply the theorem to “cryptographically strong” notions of pseudorandomness and indistinguishability, and in general to any setting where 1/\epsilon and 1/\delta are super-logarithmic (not to mention super-linear).

This seems like an unavoidable consequence of the “finitary ergodic theoretic” technique of iterative partitioning and energy increment used in the proof, which always yields at least a singly exponential complexity.

Omer Reingold, Madhur Tulsiani, Salil Vadhan and I have recently come up with a different proof where both \epsilon' and the complexity of F' are polynomial. This gives, for example, a new characterization of the notion of pseudoentropy. Our proof is quite in the spirit of Nisan’s proof of Impagliazzo’s hard-core set theorem, and it is relatively simple. We can also deduce a version of the theorem where, as in Green-Tao-Ziegler, F' contains only bounded products of functions in F. In doing so, however, we too incur an exponential loss, but the proof is somewhat simpler and demonstrates the applicability of complexity-theoretic techniques in arithmetic combinatorics.

Since we can use (ideas from) a proof of the hard core set theorem to prove the Green-Tao-Ziegler result, one may wonder whether one can use the “finitary ergodic theory” techniques of iterative partitioning and energy increment to prove the hard-core set theorem. Indeed, we do this too. In our proof, the reduction loses a factor that is exponential in certain parameters (while other proofs are polynomial), but one also gets a more “constructive” result.

If readers can stomach it, a forthcoming post will describe the complexity-theory-style proof of the Green-Tao-Ziegler result as well as the ergodic-theory-style proof of the Impagliazzo hard core set theorem.

The primes are random except when they are not

The next issue of the Bulletin of the AMS will have an article by Jaikumar Radhakrishnan and Madhu Sudan on Dinur’s proof of the PCP theorem. The article introduces the notion of PCP and its applications, as well as Irit’s proof.

There will also be an article on the work of Goldston-Pintz-Yildirim on small gaps between primes. Their main result is that for every eps>0 there are infinitely many pairs of consecutive primes whose difference is less than eps times the average distance between consecutive primes of that magnitude. That is, we can find infinitely many pairs of primes p,p’, p’ > p such that p’-p < eps log p. This is a step in the direction of proving that there are infinitely many pairs p,p’ such that p’-p =2, the twin primes conjecture.

This result is fascinating in various ways. One is the way it was discovered: Goldston and Yildirim had been working on gaps between primes since 1999, and announced the above result in 2003. But then a gap was found in the proof, which seemed to doom the whole approach. They kept working on it for two more years, however, until they made their breakthrough. You can read about it in Daniel Goldston’s amusing article “my 30 minutes of fame.”

Technically, their results improve on previous “sieve” techniques whose spirit is to show that the “primes are ‘pseudo-random’ except in the obvious ways in which they aren’t,” as Green and Tao more or less put it, a statement that is made precise in the Hardy-Littlewood conjecture. The idea is to start by thinking of the following probabilistic “model” for prime numbers: a number N is “prime” with probability 1/ln N, and to observe that certain properties of the primes (for example, the prime number theorem) hold in this model. Faced with a conjecture, one can check if it is true in the model, then see if it is true for all sets of integers that have some “pseudorandomness” property, and finally verify that the primes have such a pseudorandomness property, usually via Fourier analysis. (This would be the “circle method” in number theory.)

The model, of course, has some major shortcomings. For example, it predicts infinitely many even numbers to be primes, and infinitely many pairs of primes that differ only by one. We may correct the model by, more realistically, saying that if N>2 is even, then it is never prime, and if it is odd then it is prime with probability 2/ln N. This is already better, and it predicts some true asymptotic properties of the primes, but it still has problems. For example, it predicts infinitely many triples of primes p,p+2,p+4, while 3,5,7 is the only such triple. (Why?) An even better model is that N>6 is prime with probability 3/ln N if N mod 6 is either 1 or 5, and N is never prime otherwise.

We see that we can define a hierarchy of more and more accurate models: in general, for a fixed W, we can look at the distribution that picks a 1/ln N fraction of the numbers from 1 to N, picking only numbers that share no common factor with W. As we pick W to be a product of more and more of the first few primes, we get a more and more accurate model. (This would be the issue of “major arcs” versus “minor arcs” in the circle method.)

Now look at a conjecture about primes, for example the twin primes conjecture, and see what these models predict. They all predict const*N/(ln N)^2 twin primes between 1 and N, with the constant depending on W; for larger W, the constant tends to a limit c_twin. It is conjectured that there are in fact about c_twin N/(ln N)^2 twin primes between 1 and N.
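For readers who like to experiment, here is a toy Python simulation of the W-corrected random model described above; the functional form and constants are of course purely heuristic, and the point is only to watch the const*N/(ln N)^2 behaviour emerge.

    # Toy simulation of the "random model with a W correction" for the primes:
    # a number n coprime to W is declared "prime" with probability
    # (W/phi(W)) / ln n, and all other numbers are never "prime".
    import math
    import random

    def random_model_twin_count(N=10**6, W=2*3*5, seed=0):
        rng = random.Random(seed)
        phi_W = sum(1 for r in range(1, W + 1) if math.gcd(r, W) == 1)
        boost = W / phi_W                 # keeps the overall density near 1/ln n
        fake_prime = [False] * (N + 1)
        for n in range(3, N + 1):
            if math.gcd(n, W) == 1:
                fake_prime[n] = rng.random() < min(1.0, boost / math.log(n))
        # Count the "twin" pairs (n, n+2) produced by this sample of the model.
        return sum(1 for n in range(3, N - 1) if fake_prime[n] and fake_prime[n + 2])

    print(random_model_twin_count())      # compare with c * N / (ln N)^2

Taking W to be a product of more of the first few primes makes the model’s constant approach the limit c_twin mentioned above.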

More or less, the Hardy-Littlewood conjecture implies that the prediction given by this model is always right for questions that involve “linear equations over primes.”

Apparently (I haven’t read the papers) Goldston-Pintz-Yildirim approach the pseudorandomness of primes by looking at “higher moments.” Green and Tao have been working on a different program based on studying the “Gowers uniformity” of the primes. (Technically, they study the Gowers uniformity of the Mobius function rather than of the characteristic function of the primes, but it’s close enough in spirit.)

In the paper Quadratic Uniformity of the Mobius Function they show that the primes have small dimension-3 Gowers uniformity, and, together with their earlier result on Gowers uniformity and polynomials, this is used in a new paper to prove that the predictions of the Hardy-Littlewood conjecture hold for every question about linear equations over primes of “bounded complexity.” For example, if you write certain pairs of linear equations over four prime unknowns, then the fraction of solutions will be as predicted by the limit of the random model. This includes the case of length-4 arithmetic progressions over primes, where you look at solutions p_1,p_2,p_3,p_4 to the equations p_4-p_3=p_3-p_2 and p_3-p_2=p_2-p_1.

These results and conjectures are part of an even bigger set of results whose spirit is that “multiplicative structure is pseudorandom with respect to addition,” that can be seen in various results that have applications to combinatorial constructions. This comes up most directly in sum-product results in finite fields and over the integers, used to construct extractors, in Weil’s result on character sums, which is used to construct $\epsilon$-biased generators, in certain expander constructions related to Ramanujan graphs, and so on.

It is hard to believe that, until recently, analytic number theory was considered an unfashionable area, what with its quantitative results and its lack of connections to algebraic geometry. For example, I understand that the Berkeley math department, which is quite large, has no analytic number theorist.