The Impagliazzo hard-core set theorem is one of the bits of magic of complexity theory. Say you have a function $g:\{ 0, 1 \}^n \rightarrow \{ 0,1\}$ such that every efficient algorithm makes errors at least $1\%$ of the times when computing $g$ on a random input. (We’ll think of $g$ as exhibiting a weak form of average-case complexity.) Clearly, different algorithms will fail on a different $1\%$ of the inputs, and it seems that, intuitively, there should be functions for which no particular input is harder than any particular other input, per se. It’s just that whenever you try to come up with an algorithm, some set of mistakes, dependent on the algorithmic technique, will arise.

As a good example, think of the process of generating $g$ at random, by deciding for every input $x$ to set $g(x)=1$ with probability $99\%$ and $g(x)=0$ with probability $1\%$. (Make the choices independently for different inputs.) With very high probability, every efficient algorithm fails with probability at least about $1\%$, but, if we look at every efficiently recognizable large set $H$, we see that $g$ takes the value 1 on approximately $99\%$ of the elements of $H$, and so the trivial algorithm that always outputs 1 has a pretty good success probability.

Consider, however, the set $H$ of size $\frac {2}{100} 2^n$ that you get by taking the $\approx \frac{1}{100} 2^n$ inputs $x$ such that $g(x)=0$ plus a random sample of $\frac{1}{100} 2^n$ inputs $x$ such that $g(x)=1$. Then we can see that no efficient algorithm can compute $g$ on much better than $50\%$ of the inputs of $H$. This is the highest form of average-case complexity for a boolean function: on such a set $H$ no algorithm does much better in computing $g$ than an algorithm that makes a random guess.

The Impagliazzo hard-core theorem states that it is always possible to find such a set $H$ where the average-case hardness is “concentrated.” Specifically, it states that if every efficient algorithm fails to compute $g$ on a $\geq \delta$ fraction of inputs, then there is a set $H$ of size $\geq \delta 2^n$ such that every efficient algorithm fails to compute $g$ on at least a $\frac 12 - \epsilon$ fraction of the elements of $H$. This is true for every $\epsilon,\delta$, and if “efficient” is quantified as “circuits of size $s$” in the premise, then “efficient” is quantified as “circuits of size $poly(\epsilon,\delta) \cdot s$” in the conclusion.

The example of the biased random function given above implies that, if one wants to prove the theorem for arbitrary $g$, then the set $H$ cannot be efficiently computable itself. (The example does not forbid, however, that $H$ be efficiently computable given oracle access to $g$, or that a random element of $H$ be samplable given a sampler for the distribution $(x,g(x))$ for uniform $x$.)

A number of proofs of the hard core theorem are known, and connections have been found with the process of boosting in learning theory and with the construction and the decoding of certain error-correcting codes. Here is a precise statement.

Impagliazzo Hard-Core Set Theorem
Let $g:\{0,1\}^n \rightarrow \{0,1\}$ be a boolean function, $s$ be a size parameter, $\epsilon,\delta>0$ be given. Then there is a $c(\epsilon,\delta) = poly(1/\epsilon,1/\delta)$ such that the following happens.

Suppose that for every function $f:\{0,1\}^n \rightarrow \{0,1\}$ computable by a circuit of size $\leq c\cdot s$ we have

$\displaystyle Pr_{x \in \{0,1\}^n} [ f(x) = g(x) ] \leq 1-\delta$

Then there is a set $H$ of size $\geq \delta 2^n$ such that for every function $f$ computable by a circuit of size $\leq s$ we have

$\displaystyle Pr_{x\in H} [ f(x) = g(x) ] \leq \frac 12 + \epsilon$

Using the “finitary ergodic theoretic” approach of iterative partitioning, we (Omer Reingold, Madhur Tulsiani, Salil Vadhan and I) are able to prove the following variant.

Impagliazzo Hard-Core Set Theorem, “Constructive Version”
Let $g:\{0,1\}^n \rightarrow \{0,1\}$ be a boolean function, $s$ be a size parameter, $\epsilon,\delta>0$ be given. Then there is a $c(\epsilon,\delta) = exp(poly(1/\epsilon,1/\delta))$ such that the following happens.

Suppose that for every function $f:\{0,1\}^n \rightarrow \{0,1\}$ computable by a circuit of size $\leq c\cdot s$ we have

$\displaystyle Pr_{x \in \{0,1\}^n} [ f(x) = g(x) ] \leq 1-\delta$

Then there is a set $H$ such that: (i) $H$ is recognizable by circuits of size $\leq c\cdot s$; (ii) $|H| \geq \delta 2^n$, and in fact the number of $x$ in $H$ such that $g(x)=0$ is at least $\frac 12 \delta 2^n$, and so is the number of $x$ in $H$ such that $g(x)=1$; and (iii) for every $f$ computable by a circuit of size $\leq s$,

$\displaystyle Pr_{x\in H} [ g(x) = f(x) ] \leq max \{ Pr_{x\in H}[ g(x) = 0] , Pr_{x\in H} [g(x)=1] \} + \epsilon$

The difference is that $H$ is now an efficiently recognizable set (which is good), but we are not able to derive the same strong average-case complexity of $g$ in $H$ (which, as discussed as the beginning, is impossible in general). Instead of proving that a “random guess algorithm” is near-optimal on $H$, we prove that a “fixed answer algorithm” is near-optimal on $H$. That is, instead of saying that no algorithm can do better than a random guess, we say that no algorithm can do better than either always outputting 0 or always outputting 1. Note that this conclusion is meaningless if $g$ is, say, always equal to 1 on $H$, but in our construction we have that $g$ is not exceedingly biased on $H$, and if $\epsilon < \delta/2$, say, then the conclusion is quite non-trivial.

One can also find a set $H'$ with the same type of average-case complexity as in the original Impagliazzo result by putting into $H'$ a $\frac 12 \delta 2^n$ size sample of elements $x$ of $H$ such that $g(x)=0$ and an equal size sample of elements of $H$ such that $g$ equals 1. (Alternatively, put in $H'$ all the elements of $H$ on which $g$ achieves the minority value of $g$ in $H$, then add a random sample of as many elements achieving the majority value.) Then we recover the original statement except that $c(\epsilon,\delta)$ is exponential instead of polynomial. [Update: constructing $H'$ is somewhat more complicated than we originally thought, the details are in the paper.]

Coming up next, the proof of the “constructive hard core set theorem” and my attempt at explaining what the techniques have to do with “finitary ergodic theory.”