Online Optimization Post 6: The Impagliazzo Hard-Core Set Lemma

(This is the sixth in a series of posts on online optimization techniques and their “applications” to complexity theory, combinatorics and pseudorandomness. The plan for this series of posts is to alternate one post explaining a result from the theory of online convex optimization and one post explaining an “application.” The first two posts were about the technique of multiplicative weight updates and its application to “derandomizing” probabilistic arguments based on combining a Chernoff bound and a union bound. The third and fourth posts were about the Follow-the-Regularized-Leader framework, how it unifies multiplicative weights and gradient descent, and a “gradient descent view” of the Frieze-Kannan Weak Regularity Lemma. The fifth post was about the constrained version of the Follow-the-Regularized-Leader framework, and today we shall see how to apply that to a proof of the Impagliazzo Hard-Core Lemma.)

1. The Impagliazzo Hard-Core Lemma

The Impagliazzo Hard-Core Lemma is a striking result in the theory of average-case complexity. Roughly speaking, it says that if {g: \{ 0,1 \}^n \rightarrow \{ 0,1 \}} is a function that is “weakly” hard on average for a class {\cal F} of “efficiently computable” functions {f}, that is, if, for some {\delta>0}, we have that

\displaystyle  \forall f \in {\cal F}: \ \ \Pr_{x\sim \{ 0,1\}^n} [f(x) = g(x) ] \leq 1 -\delta

then there is a subset {H\subseteq \{ 0,1 \}^n} of cardinality {\geq 2\delta 2^n} such that {g} is “strongly” hard-on-average on {H}, meaning that

\displaystyle  \forall f \in {\cal F}: \ \ \Pr_{x\sim H} [f(x) = g(x) ] \leq \frac 12 + \epsilon

for a small {\epsilon >0}. Thus, the reason why functions from {\cal F} make a mistake in predicting {g} at least a {\delta} fraction of the times is that there is a “hard-core” set {H} of inputs such that every function from {\cal F} makes a mistake about 1/2 of the times for the {2\delta} fraction of inputs coming from {H}.
The result is actually not literally true as stated above, and it is useful to understand a counterexample, in order to motivate the correct statement. Suppose that {\cal F} contains just {1/\delta} functions, and that each function {f\in \cal F} differs from {g} in exactly a {\delta} fraction of inputs from {\{ 0,1 \}^n}, and that these sets of mistakes are disjoint. Thus, for every set {H\subseteq \{ 0,1 \}^n}, no matter its size, there is a function {f\in \cal F} that agrees with {g} on at least a {1-\delta} fraction of inputs from {H}. The reason is that the sets of inputs on which the functions of {\cal F} differ from {g} form a partition of {\{ 0,1 \}^n}, and so their intersections with {H} form a partition of {H}. By an averaging argument, one of those intersections must then contain at most {\delta |H|} elements of {H}.

In the above example, however, if we choose any three distinct functions {f_1,f_2,f_3} from {\cal F}, we have

\displaystyle  \forall x\in \{ 0,1 \}^n: \ \ \ g(x) = {\rm majority} (f_1(x), f_2(x),f_3(x))

So, although {g} is weakly hard on average with respect to {\cal F}, we have that {g} is not even worst-case hard for a slight extension of {\cal F} in which we allow functions obtained by simple compositions of a small number of functions of {\cal F}.
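
To make the counterexample concrete, here is a small Python sketch (all names and parameters are my own illustrative choices, not anything from the post): it builds {1/\delta} functions whose mistake sets partition the inputs, checks that each one errs on exactly a {\delta} fraction of inputs yet agrees with {g} on at least a {1-\delta} fraction of any fixed set {H}, and checks that the majority of any three of them computes {g} everywhere.

    # Hedged toy check of the counterexample; the partition into "mistake blocks"
    # and the random g below are illustrative assumptions.
    import itertools, random

    n = 10
    delta = 0.25                                   # 1/delta = 4 functions
    inputs = list(itertools.product([0, 1], repeat=n))
    random.seed(0)
    g = {x: random.randint(0, 1) for x in inputs}

    block_size = int(delta * len(inputs))
    blocks = [set(inputs[j * block_size:(j + 1) * block_size])
              for j in range(int(1 / delta))]
    # f_j agrees with g everywhere except on the j-th block, so mistake sets are disjoint
    F = [(lambda x, B=B: 1 - g[x] if x in B else g[x]) for B in blocks]

    # each f errs on exactly a delta fraction of all inputs
    for f in F:
        assert sum(f(x) != g(x) for x in inputs) == block_size

    # yet the majority of any three of them computes g on every input,
    # since any x lies in the mistake set of at most one of the three
    f1, f2, f3 = F[0], F[1], F[2]
    assert all((f1(x) + f2(x) + f3(x) >= 2) == (g[x] == 1) for x in inputs)

    # and for any H, some f in F agrees with g on at least a (1-delta) fraction of H
    H = set(random.sample(inputs, 200))
    best = max(sum(f(x) == g(x) for x in H) for f in F)
    assert best >= (1 - delta) * len(H)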

Theorem 1 (Impagliazzo Hard-Core Lemma) Let {\cal F} be a collection of functions {f: \{ 0,1 \}^n \rightarrow \{ 0,1 \}}, let {g: \{ 0,1 \}^n \rightarrow \{ 0,1 \}} a function, and let {\epsilon>0} and {\delta >0} be positive reals. Then at least one of the following conditions is true:

  • ({g} is not weakly hard-on-average over {\{ 0,1 \}^n} with respect to a slight extension of {\cal F}) There is a {k= O(\epsilon^{-2} \log \delta^{-1} )}, an integer {b}, and {k} functions {f_1,\ldots,f_k \in \cal F}, such that

    \displaystyle  h(x) := I \{ f_1(x) + \ldots + f_k(x)\geq b \}

    satisfies

    \displaystyle  \Pr_{x\in \{ 0,1 \}^n} [ g(x) = h(x) ] \geq 1-\delta

  • ({g} is strongly hard-on-average over a set {H} of density {2\delta}) There is a set {H\subseteq \{ 0,1 \}^n} such that {|H| \geq 2\delta \cdot 2^n} and

    \displaystyle  \forall f\in {\cal F}: \ \ \Pr_{x\in H} [ g(x) = f(x) ] \leq \frac 12 + \epsilon

Here {I \{ {\rm boolean\ expression} \}} is equal to {1} or {0} depending on whether the boolean expression is true or false (the letter “{I}” stands for the “indicator” function of the truth of the expression).
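
(As a reading aid for the first alternative: {h} is just a threshold applied to the sum of the votes {f_1(x),\ldots,f_k(x)}. A minimal sketch, with made-up voter functions:)

    # Hedged sketch of h(x) = I{ f_1(x) + ... + f_k(x) >= b }; fs and b are illustrative.
    def threshold_predictor(fs, b):
        """Return h where h(x) = 1 iff at least b of the functions in fs output 1."""
        return lambda x: 1 if sum(f(x) for f in fs) >= b else 0

    # Example: majority of three 0/1 voters, i.e. b = 2.
    fs = [lambda x: x & 1, lambda x: (x >> 1) & 1, lambda x: (x >> 2) & 1]
    h = threshold_predictor(fs, b=2)
    assert h(0b111) == 1 and h(0b001) == 0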

2. Proving the Lemma

Impagliazzo’s proof had {k} polynomial in both {1/\epsilon} and {1/\delta}, and an alternative proof discovered by Nisan has a stronger bound on {k} of the order of {\epsilon^{-2} \log \epsilon^{-1} \delta^{-1}}. The proofs of Impagliazzo and Nisan did not immediately give a set of size {2\delta 2^n} (the set had size {\delta 2^n}), although this could be achieved by iterating their argument. An idea of Holenstein makes it possible to prove the above statement in a more direct way.

Today we will see how to obtain the Impagliazzo Hard-Core Lemma from online optimization, as done by Barak, Hardt and Kale. Their proof achieves all the parameters claimed above, once combined with Holenstein’s ideas.


The Impagliazzo Hard-Core Lemma for the Mathematician

At the conference I was at last week, I was asked by a mathematician about the exact statement of the Impagliazzo Hard-Core Lemma, and whether I could state it without saying “algorithm.”

So I gave the following statement:

Theorem 1 (Impagliazzo Hard-Core Lemma — Basic Version) Let {\cal F} be a class of boolean functions {f: \{ 0,1 \}^n \rightarrow \{ -1,1\}}, let {g: \{ 0,1 \}^n \rightarrow \{ -1,1\}} be an arbitrary function, and let {\epsilon,\delta>0} be arbitrary parameters.

Then at least one of the following conditions is true:

  1. {g} is globally well approximated by a simple composition of a few functions from {\cal F}.

    That is, there are functions {f_1,\ldots,f_k \in \cal F}, {k = \epsilon^{-O(1)} \cdot \delta^{-O(1)}}, and a “simple” function {h: \{ -1,1 \}^k \rightarrow \{ -1,1\}} such that

    \displaystyle  \mathop{\mathbb P}_{x \in \{ 0,1 \}^n} [ g(x) = h(f_1(x),\ldots,f_k(x)) ] \geq 1-\delta

  2. {g} is locally almost orthogonal to {\cal F}.

    That is, there is a subset {H\subseteq \{ 0,1 \}^n} of density at least {\delta} such that for every {f\in \cal F},

    \displaystyle  \left| \mathop{\mathbb E}_{x\in H} f(x)g(x) \right| := \left| \frac 1{|H|} \sum_{x\in H} f(x) g(x) \right| \leq \epsilon

The theorem has been strengthened in a couple of ways in work of Klivans and Servedio and Holenstein. The “complexity” parameter {k} needs to be only {O(\epsilon^{-2} \log \delta^{-1})} which is conjectured (or known?) to be tight, and the density of the set {H} can be made {2\delta}, which is tight. In all proofs {h} is really simple, being more or less a threshold function applied to the sum of the {f_i}.

Stating the result in this way (which is, of course, standard) raises a few questions that I find rather interesting.


Applications of Low-Complexity Approximations

In the last post, I stated the following generalization of the weak regularity lemma:

Theorem (Low Complexity Approximation, TTV Version) Let (X,\mu) be a probability space, g:X \rightarrow [-1,1] a bounded function, F a collection of bounded functions f:X \rightarrow [ -1,1], and \epsilon an approximation parameter.

Then there is a function h: X \rightarrow [-1,1] such that

  • h has low complexity relative to F: there are k= O( \epsilon^{-2}) functions f_i\in F and coefficients c_i such that

    \displaystyle h(x) := \max \{ -1 , \min \{ 1, \sum_{i=1}^k c_i f_i (x)  \}\}

  • h is \epsilon-indistinguishable from g by F, that is,

    \forall f \in F \ \ \left| {\mathbb E}_{x \sim \mu} f(x) g(x) - {\mathbb E}_{x \sim \mu} f(x) h(x) \right|  \leq \epsilon

(Last time, I mentioned that our proof handled only boolean functions f; now we can handle arbitrary bounded functions, and with an “energy-decrease” style proof. This will appear in the next online revision of the paper.)

This seems to be a useful tool, limited only by one’s creativity in choosing the functions F and then making use of the properties of h.

As already discussed,

  • if one takes X to be the edges of a complete graph, and F the set of indicator variables of cuts, then the existence of h gives the weak regularity lemma of Frieze and Kannan; and
  • if one takes F to be the set of circuits of size at most S(n), and normalizes g and h to be probability distributions, one gets that for every probability distribution D of high entropy there is a (non-uniformly) efficiently samplable and computable distribution M that is indistinguishable from D by circuits of size \leq S(n).

In this post I’ll show how to use it to give short proofs of the Hardcore Lemma of Impagliazzo and the Dense Model Theorem of Green, Tao and Ziegler. Both proofs also have, at least in hindsight, a sense of “inevitability,” meaning that given the Low-Complexity Approximation Theorem, and given what we want to prove, both proofs get to the point in a most economical and natural way.

  1. The Impagliazzo Hardcore Lemma. We have already mentioned that if g is “hard-on-average” for F, then h cannot be an approximation in the sense of being close to g on most inputs. What, then, about the points on which g and h differ? They form an Impagliazzo Hardcore Set for g, as described next.

    Let g: X \rightarrow \{-1,1\} be a function that is weakly hard on average for a class of algorithms F. Suppose, specifically, that for every algorithm h of complexity O(\epsilon^{-2}\delta^{-2}) relative to F we have

    {\mathbb P} [ g(x) = h(x) ] \leq 1-\delta

    and, more generally, for fractional h, we have

    (1) \ \ \ {\mathbb E} | g(x) - h(x) | \geq 2\delta

    Then, construct an approximating function h of complexity
    \leq \epsilon^{-2}\delta^{-2} relative to F and such that h and g are \epsilon\delta-indistinguishable by F. Note that, even though h is “indistinguishable” from g, it is also “far” from g, as in (1).

    Define a probability distribution H that assigns to point x a probability proportional to |g(x)-h(x)|. (If h were boolean, this would be the uniform distribution over points on which h and g differ.) Then this distribution is \delta-dense in the uniform distribution, meaning that every point has probability at most (\delta |X|)^{-1}; a toy version of this construction is sketched in code right after this list. Observe also that we have

    g(x)\cdot |g(x)-h(x)| = g(x)-h(x)

    for every x, because g(x) and g(x)-h(x) have the same sign and |g(x)|=1, so we have

    \begin{array}{ll} \displaystyle {\mathbb E}_H g(x)f(x) & \displaystyle = \sum_x \frac {|g(x)-h(x)|}{\sum_y |g(y)-h(y)|} \cdot g(x)f(x) \\ & \displaystyle = \frac {|X|}{\sum_y |g(y)-h(y)|} \cdot {\mathbb E}_X |g(x)-h(x)|g(x)f(x) \\ & \displaystyle \leq \frac {1}{\delta} {\mathbb E}_X  (g(x)-h(x))f(x)  \\ & \displaystyle \leq \epsilon \end{array}

    and so H is a hardcore distribution, because the above expression is equivalent to

    {\mathbb P}_H [ g(x) = f(x) ] \leq \frac 12 + \frac \epsilon 2

  2. The Dense Model Theorem. Suppose that R\subseteq X is a pseudorandom set with respect to functions that have bounded complexity relative to F, and let D\subseteq R be a dense subset of R, |D| = \delta |R|.

    To find a dense model of D, we take g to be the characteristic function of D, and we let h be the low-complexity approximation, but using the uniform distribution on R as \mu. Now suppose for simplicity that h is boolean, and that M is the set of inputs of X on which h is 1. We want to argue that M is a dense model of D. By assuming without loss of generality that F contains the all-one function, we get from the indistinguishability of g and h that

    \delta = {\mathbb E}_R g \approx {\mathbb E}_R h

    and from the pseudorandomness of R we have

    {\mathbb E}_R h \approx {\mathbb E}_X h = |M|/|X|

    and so |M| \approx \delta |X| and M is indeed dense.

    For the indistinguishability of M and D, take any function f, and observe that

    {\mathbb E}_D f = \delta^{-1} {\mathbb E}_R fg \approx \delta^{-1} {\mathbb E}_R fh \approx \delta^{-1} {\mathbb E}_X fh \approx {\mathbb E}_M f

    where we use both the indistinguishability of g and h under distribution R, and the fact that the distributions R and X are indistinguishable by functions of bounded complexity.

    This proof is appealingly intuitive, in the sense that if we expect D to be indistinguishable from a large set, then when we try to approximate the characteristic function of D we will end up with a low complexity function that is spread around much of X, thus defining a dense model. It also shows that “relative” versions of the Regularity Lemma, such as the Regularity Lemma for subgraphs of an expander, may be derived from the standard Regularity Lemma by the above argument. A disadvantage of the argument is that it does not establish the stronger form of the Dense Model Theorem suggested by Impagliazzo, in which there is no set R, but we require D to satisfy only the “pseudo-density” condition that for every low-complexity bounded function f,

    {\mathbb E}_{x\in D} |f(x)| \leq \frac 1 {\delta} {\mathbb E}_{X} |f(x)| + \epsilon

    which follows immediately if D has density \delta in a pseudorandom set R, but which is a seemingly weaker property. (The relative Regularity Lemma in graphs had long been known to hold under such a pseudo-density assumption.)
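
Here is a small numerical sketch of the hardcore-distribution construction from item 1 above. Everything in it (the toy domain X, the choices of g, h, f and \delta) is my own illustration; it only mirrors the two facts used in the argument, namely that H(x) \propto |g(x)-h(x)| and that the distribution is \delta-dense once {\mathbb E}|g-h| \geq 2\delta. It does not, of course, verify the indistinguishability hypothesis itself.

    # Toy sketch of the hardcore distribution H(x) proportional to |g(x) - h(x)|.
    def hardcore_distribution(X, g, h):
        """Return a dict x -> H(x) with H(x) proportional to |g(x) - h(x)|."""
        weights = {x: abs(g(x) - h(x)) for x in X}
        total = sum(weights.values())
        return {x: w / total for x, w in weights.items()}

    def is_delta_dense(H, X, delta):
        # delta-dense in uniform: every point has probability at most 1/(delta |X|)
        return max(H.values()) <= 1.0 / (delta * len(X)) + 1e-12

    def correlation(H, f, g):
        # E_H [ f(x) g(x) ] under the distribution H
        return sum(p * f(x) * g(x) for x, p in H.items())

    # Illustrative instance: g is a sign function, h a clamped affine shift of it,
    # chosen so that E|g - h| = 0.6 >= 2 * delta with delta = 0.3.
    X = range(1000)
    g = lambda x: 1 if (x * 37) % 100 < 50 else -1
    h = lambda x: max(-1.0, min(1.0, 0.4 * g(x) + 0.3))
    H = hardcore_distribution(X, g, h)

    delta = 0.3
    assert is_delta_dense(H, X, delta)
    print(correlation(H, lambda x: 1 if x % 2 == 0 else -1, g))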

Boosting and the Regularity Lemma

In a previous post, I described abstract forms of the weak regularity lemma, in which we start from an arbitrary bounded function g: X \rightarrow [-1,1] and an arbitrary set F of “structured” bounded functions f: X \rightarrow [-1,1], and we construct an “approximating” function h that has “low complexity” relative to F and is “indistinguishable” from g by functions from F. We had two proofs, both generalizations of arguments due to Frieze and Kannan: in one proof h is not bounded, but its complexity is O(\epsilon^{-2}), where \epsilon is the approximation parameter; in the second proof, h is bounded but it has complexity exponential in the approximation parameter, exp(O(\epsilon^{-2})).

In a new paper by Madhur Tulsiani, Salil Vadhan, and me, we give a new proof that gives an approximating function that, at the same time, is bounded and has low complexity. This has a number of applications, which I will describe in a future post.

(Note that, in the statement below, F is required to contain Boolean functions, rather than bounded ones. It is possible, however, to “reduce” the bounded case to the boolean case via the following observation: if F is a family of bounded functions, F' is the family of boolean functions \{ 1_{f(x) \geq t} \ : \ f\in F, t\in [-1,1] \}, and g and h are \epsilon-indistinguishable according to F', then they are also \epsilon-indistinguishable according to F.)

Theorem (Low Complexity Approximation, TTV Version) Let (X,\mu) be a probability space, g:X \rightarrow [-1,1] a bounded function, F a collection of boolean functions f:X \rightarrow \{ -1,1\}, and \epsilon an approximation parameter.

Then there is a function h: X \rightarrow [-1,1] such that

  • h has low complexity relative to F: there are k= O( \epsilon^{-2}) functions f_i\in F and coefficients c_i such that

    \displaystyle h(x) := \max \{ -1 , \min \{ 1, \sum_{i=1}^k c_i f_i (x)  \}\}

  • h is \epsilon-indistinguishable from g by F, that is,

    \forall f \in F \ \ \left| {\mathbb E}_{x \sim \mu} f(x) g(x) - {\mathbb E}_{x \sim \mu} f(x) h(x) \right|  \leq \epsilon

That is, h is simply a linear combination of functions from F, whose value is truncated to be between -1 and 1.

Recall that, when we do not require h to be bounded, h can be constructed via the following algorithm:

Algorithm FK

  • h_0 := 0; t= 0
  • while \exists f_{t+1} \in F such that {\mathbb E} f_{t+1} \cdot (g-h_t) \geq \epsilon
    • h_{t+1} := \sum_{i=1}^{t+1} \gamma f_{i}
    • t:=t+1
  • return h_t

And the analysis shows that, if we call \Delta_t := g - h_t, then {\mathbb E} [ \Delta_t^2 - \Delta_{t+1}^2 ] \geq  2\epsilon\gamma - \gamma^2. Setting \gamma := \epsilon shows that the algorithm stops within \epsilon^{-2} steps.
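
(Here is a hedged Python rendering of Algorithm FK; the toy domain, the class F and the target g in the example are mine, and only the loop structure follows the pseudocode above.)

    # Sketch of Algorithm FK on a toy instance; domain, F and g are illustrative.
    def algorithm_fk(X, F, g, eps, gamma=None):
        """Greedily add gamma * f whenever some f in F correlates with g - h."""
        gamma = eps if gamma is None else gamma
        chosen = []
        h = {x: 0.0 for x in X}
        while True:
            best = None
            for f in F:                            # find f with E[f * (g - h)] >= eps
                corr = sum(f(x) * (g(x) - h[x]) for x in X) / len(X)
                if corr >= eps:
                    best = f
                    break
            if best is None:
                return h, chosen
            chosen.append(best)
            for x in X:
                h[x] += gamma * best(x)            # h_{t+1} = sum_i gamma * f_i

    # Tiny example: X = {0,...,99}, F = three +/-1 "divisibility" functions.
    X = range(100)
    F = [lambda x, m=m: 1.0 if x % m == 0 else -1.0 for m in (2, 3, 5)]
    g = lambda x: 1.0 if x % 2 == 0 else -1.0
    h, used = algorithm_fk(X, F, g, eps=0.1)
    print(len(used), max(abs(g(x) - h[x]) for x in X))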

Our proof proceeds by doing exactly the same, but making sure at every step that h is bounded.

Algorithm TTV

  • h_0 := 0; t= 0
  • while \exists f_{t+1} \in F such that {\mathbb E} f_{t+1} \cdot (g-h_t) \geq \epsilon
    • h_{t+1} := \max \{ -1, \min \{ 1, \sum_{i=1}^{t+1} \gamma f_i \} \}
    • t:=t+1
  • return h_t
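
(And the same toy setup with the truncation step: again only a sketch under my illustrative choices, differing from the FK sketch above only in that the running linear combination is clamped to [-1,1] before being used as h.)

    # Sketch of Algorithm TTV: identical loop, but h is kept bounded at every step.
    def clamp(v):
        return max(-1.0, min(1.0, v))

    def algorithm_ttv(X, F, g, eps, gamma=None):
        gamma = eps if gamma is None else gamma
        linear = {x: 0.0 for x in X}               # untruncated sum of gamma * f_i
        h = {x: 0.0 for x in X}
        while True:
            best = next((f for f in F
                         if sum(f(x) * (g(x) - h[x]) for x in X) / len(X) >= eps),
                        None)
            if best is None:
                return h
            for x in X:
                linear[x] += gamma * best(x)
                h[x] = clamp(linear[x])            # truncate to [-1, 1]

    X = range(100)
    F = [lambda x, m=m: 1.0 if x % m == 0 else -1.0 for m in (2, 3, 5)]
    g = lambda x: 0.8 if x % 2 == 0 else -0.8
    h = algorithm_ttv(X, F, g, eps=0.1)
    print(max(abs(sum(f(x) * (g(x) - h[x]) for x in X)) / len(X) for f in F))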

The problem with the old analysis is that now we could conceivably have a step in which \sum_{i=1}^t \gamma f_i (x) > 1+\gamma for all x, and so h_t = h_{t+1}, and thus \Delta_t = \Delta_{t+1}, and we have no energy decrease. To get to such a state, however, there must have been about 1/\gamma steps in which h changed in the same way as in the Frieze-Kannan case. I think there should be a sort of “amortized analysis” that would “charge” the lack of change of h at certain steps to other steps in which it indeed changes, and establish that, for every time step T,

{\mathbb E} [ \Delta_0^2 - \Delta_T^2 ] \geq \Omega(T \cdot \gamma \cdot (\epsilon - O(\gamma)))

Unfortunately I don’t know how to construct such a proof. Our proof, instead, follows step by step Impagliazzo’s proof of the Impagliazzo Hardcore Set Lemma, which employs a very similar algorithm (in a quite different context). As shown by Klivans and Servedio, the algorithm in Impagliazzo’s proof can be seen as a boosting algorithm in learning theory (and every other known boosting algorithm can be used to prove Impagliazzo’s Hardcore Lemma); this is why we think of our proof as a “boosting” proof.

I must confess, I have never really understood Impagliazzo’s proof, and so I can’t say I really understand ours, other than being able to reproduce the steps.

The idea is to consider the quantity f_{t+1}(x) \Delta_{t}(x), which depends on t and x, and to see how it behaves summed over t, for a fixed x, and how it behaves for a fixed t, summed over x.

Suppose the algorithm is still running after T steps. Then, because of the way we define the termination condition, we have, for every t \in \{ 0,\ldots, T-1\}

\displaystyle (1) \ \ \ {\mathbb E}_x f_{t+1}(x)  (g(x)-h_t(x)) \geq  \epsilon

and the crux of the proof is to show that for every x we have

(2) \displaystyle \ \ \ \sum_{t=0}^{T-1} f_{t+1}(x)  (g(x)-h_t(x)) \leq \frac \gamma 2 \cdot T  + \frac 4 {\gamma}

So if we sum (1) over t and average (2) over x, we get

\displaystyle  \epsilon T \leq  \frac  \gamma 2 \cdot T  + \frac 4 {\gamma}

and setting \gamma := \epsilon gives T \leq 8 \epsilon^{-2}.

Inequality (2), the main step of the proof, is the part for which I have little intuition. One breaks the summation into groups of time steps, depending on the value of \Delta_t; there are 2/\gamma groups (because the value of h_t changes by discrete increments of \gamma, and is between -1 and +1) and each one is shown to contribute O(\gamma T') + O(1), where T' is the number of time steps in the group.

It is perhaps instructive to translate the Frieze-Kannan proof to this set-up. In the Frieze-Kannan algorithm, we have

\displaystyle  f_{t+1}(x)  (g(x)-h_t(x))  = \frac 1 {2\gamma} \cdot \left( \Delta_{t}^2 - \Delta^2_{t+1} +\gamma^2 f_{t+1}^2 \right)

and so

\begin{array}{ll} \displaystyle  \sum_{t=0}^{T-1} f_{t+1}(x)  (g(x)-h_t(x)) & = \displaystyle \frac 1{2\gamma} (\Delta_0^2 - \Delta_{T}^2) + \frac {\gamma}2 \cdot \sum_{t=1}^{T} f_t^2(x) \\  & \leq \displaystyle  \frac 1 {2\gamma}  + \frac \gamma 2 \cdot  T \end{array}

which is analogous to our inequality (2).

Impagliazzo Hard-Core Sets via "Finitary Ergodic-Theory"

In the Impagliazzo hard-core set theorem we are given a function g:\{ 0, 1 \}^n \rightarrow \{ 0,1\} such that every algorithm in a certain class makes errors at least a \delta fraction of the times when given a random input. We think of \delta as small, and so of g as exhibiting a weak form of average-case complexity. We want to find a large set H\subseteq \{ 0,1 \}^n such that g is average-case hard in a stronger sense when restricted to H. This stronger form of average-case complexity will be that no efficient algorithm can make noticeably fewer errors while computing g on H than a trivial algorithm that always outputs the same value regardless of the input. The formal statement of what we are trying to do (see also the discussion in this previous post) is:

Impagliazzo Hard-Core Set Theorem, “Constructive Version”
Let g:\{0,1\}^n \rightarrow \{0,1\} be a boolean function, s be a size parameter, \epsilon,\delta>0 be given. Then there is a size parameter s' = poly(1/\epsilon,1/\delta) \cdot s +  exp(poly(1/\epsilon,1/\delta)) such that the following happens.

Suppose that for every function f:\{0,1\}^n \rightarrow \{0,1\} computable by a circuit of size s' we have

Pr_{x \in \{0,1\}^n} [ f(x) = g(x) ] \leq 1-\delta

Then there is a set H such that: (i) H is recognizable by circuits of size \leq s'; (ii) |H| \geq \delta 2^n, and in fact the number of x in H such that g(x)=0 is at least \frac 12 \delta 2^n, and so is the number of x in H such that g(x)=1; and (iii) for every f computable by a circuit of size \leq s,

Pr_{x\in H} [ g(x) = f(x) ] \leq max \{ Pr_{x\in H}[ g(x) = 0] , Pr_{x\in H} [g(x)=1] \} + \epsilon

Our approach will be to look for a “regular partition” of \{0,1\}^n. We shall construct a partition P= (B_1,\ldots,B_m) of \{0,1\}^n such that: (i) given x, we can efficiently compute what is the block B_i that x belongs to; (ii) the number m of blocks does not depend on n; (iii) g restricted to most blocks B_i behaves like a random function of the same density. (By “density” of a function we mean the fraction of inputs on which the function evaluates to one.)

In particular, we will use the following form of (iii): for almost all the blocks B_i, no algorithm has advantage more than \epsilon over a constant predictor in computing g in B_i.

Let M_0 be the union of all majority-0 blocks (that is, of blocks B_i such that g takes the value 0 on a majority of elements of B_i) and let M_1 be the union of all majority-1 blocks.

I want to claim that no algorithm can do noticeably better on M_0 than the constant algorithm that always outputs 0. Indeed, we know that within (almost) all of the blocks that compose M_0 no algorithm can do noticeably better than the always-0 algorithm, so this must be true for a stronger reason for the union. The same is true for M_1, with reference to the constant algorithm that always outputs 1. Also, if the partition is efficiently computable, then (in a non-uniform setting) M_0 and M_1 are efficiently recognizable. It remains to argue that either M_0 or M_1 is large and not completely unbalanced.

Recalling that we are in a non-uniform setting (where by “algorithms” we mean “circuits”) and that the partition is efficiently computable, the following is a well defined efficient algorithm for attempting to compute g:

Algorithm. Local Majority
On input x:
determine the block B_i that x belongs to;
output 1 if Pr_{z\in B_i} [g(z)=1] \geq \frac 12;
otherwise output 0

(The majority values of g in the various blocks are just a set of m bits that can be hard-wired into the circuit.)
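
(A hedged sketch of the Local Majority predictor, assuming the partition is given as a function mapping an input to the index of its block; the toy g and partition below are my own illustrations.)

    # Sketch of Local Majority: on input x, output the majority value of g
    # inside the block that x belongs to; block_of and g are illustrative.
    from collections import defaultdict

    def local_majority_predictor(inputs, g, block_of):
        """Precompute the m hard-wired majority bits and return the predictor."""
        ones, sizes = defaultdict(int), defaultdict(int)
        for x in inputs:
            i = block_of(x)
            sizes[i] += 1
            ones[i] += g(x)
        majority = {i: 1 if ones[i] >= sizes[i] / 2 else 0 for i in sizes}
        return lambda x: majority[block_of(x)]

    # Toy example: partition {0,...,255} into 4 blocks by the top two bits.
    inputs = range(256)
    g = lambda x: (x * 73) % 5 < 3                 # an arbitrary 0/1 function
    block_of = lambda x: x >> 6
    predict = local_majority_predictor(inputs, g, block_of)
    errors = sum(predict(x) != g(x) for x in inputs)
    print("error fraction:", errors / 256)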

We assumed that every efficient algorithm must make at least a \delta fraction of errors. The set of \geq \delta 2^n inputs where the Local Majority algorithm makes mistakes is the union, over all blocks B_i, of the “minority inputs” of the block B_i. (If b is the majority value of g in a block B, then the “minority inputs” of B are the set of inputs x such that g(x) = 1-b.)

Let E_0 be the set of minority inputs (those where our algorithm makes a mistake) in M_0 and E_1 be the set of minority inputs in M_1. Then at least one of E_0 and E_1 must have size at least \frac {\delta}{2} 2^n, because the size of their union is at least \delta 2^n. If E_b has size at least \frac {\delta}{2} 2^n, then M_b has all the properties of the set H we are looking for.

It remains to construct the partition. We describe an iterative process to construct it. We begin with the trivial partition P = (B_1) where B_1 = \{ 0,1\}^n. At a generic step of the construction, we have a partition P = (B_1,\ldots,B_m), and we consider M_0, M_1,E_0,E_1 as above. Let b be such that |E_b| \geq \frac 12 \delta 2^n. If there is no algorithm that has noticeable advantage in computing g over M_b, we are done. Otherwise, if there is such an algorithm f, we refine the partition by splitting each block according to the values that f takes on the elements of the block.

After k steps of this process, the partition has the following form: there are k functions f_1,\ldots,f_k and each of the (at most) 2^k blocks of the partition corresponds to a bit string b_1,\ldots,b_k and it contains all inputs x such that f_1(x)=b_1,\ldots,f_k(x)=b_k. In particular, the partition is efficiently computable.

We need to argue that this process terminates with k=poly(1/\epsilon,1/\delta). To this end, we define a potential function that measures the “imbalance” of g inside the blocks of the partition

\Psi(B_1,\ldots,B_m) := \sum_{i=1}^m \frac {|B_i|}{2^n} \left( Pr_{x\in B_i} [g(x) = 1] \right)^2

and we can show that this potential function increases by at least poly(\epsilon,\delta) at each step of the iteration. Since the potential function can be at most 1, the bound on the number of iterations follows.
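
(The skeleton of this iterative refinement, together with the potential function \Psi, can be sketched as follows. Everything here is an illustrative assumption on my part: in particular, find_counterexample is a stand-in for the algorithm f that the hardness assumption would supply, and the toy usage just feeds in two "feature" functions. The assert only checks the easy fact that refining a partition never decreases \Psi, not the poly(\epsilon,\delta) increase used in the actual argument.)

    # Sketch of iterative partitioning: blocks are indexed by the value pattern
    # (f_1(x),...,f_k(x)); each refinement comes from a "counterexample" function.
    from collections import defaultdict

    def partition(inputs, fs):
        """Blocks indexed by the tuple of values of the functions chosen so far."""
        blocks = defaultdict(list)
        for x in inputs:
            blocks[tuple(f(x) for f in fs)].append(x)
        return list(blocks.values())

    def psi(blocks, g, N):
        """Potential: sum over blocks of (|B|/N) * Pr_{x in B}[g(x)=1]^2."""
        return sum((len(B) / N) * (sum(g(x) for x in B) / len(B)) ** 2
                   for B in blocks)

    def refine(inputs, g, find_counterexample, max_rounds=20):
        fs, N = [], len(inputs)
        for _ in range(max_rounds):
            blocks = partition(inputs, fs)
            f = find_counterexample(blocks, g)     # None when no advantage is left
            if f is None:
                return fs, blocks
            assert psi(partition(inputs, fs + [f]), g, N) >= psi(blocks, g, N) - 1e-12
            fs.append(f)
        return fs, partition(inputs, fs)

    # Toy usage: g is the XOR of two bits; the stand-in "distinguisher" proposes
    # those two bit functions once each and then reports that it has no advantage.
    inputs = list(range(256))
    g = lambda x: ((x >> 3) & 1) ^ ((x >> 5) & 1)
    proposals = [lambda x: (x >> 3) & 1, lambda x: (x >> 5) & 1]
    find_counterexample = lambda blocks, g: proposals.pop(0) if proposals else None
    fs, blocks = refine(inputs, g, find_counterexample)
    print(len(blocks), psi(blocks, g, len(inputs)))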

A reader familiar with the proof of the Szemeredi Regularity Lemma will recognize the main ideas of iterative partitioning, of using a “counterexample” to the regularity property required of the final partition to do a refinement step, and of using a potential function argument to bound the number of refinement steps.

In which way can we see them as “finitary ergodic theoretic” techniques? As somebody who does not know anything about ergodic theory, I may not be in an ideal position to answer this question. But this kind of difficulty has not stopped me before, so I may attempt to answer this question in a future post.

The Impagliazzo Hard-Core-Set Theorem

The Impagliazzo hard-core set theorem is one of the bits of magic of complexity theory. Say you have a function g:\{ 0, 1 \}^n \rightarrow \{ 0,1\} such that every efficient algorithm makes errors at least 1\% of the times when computing g on a random input. (We’ll think of g as exhibiting a weak form of average-case complexity.) Clearly, different algorithms will fail on a different 1\% of the inputs, and it seems that, intuitively, there should be functions for which no particular input is harder than any particular other input, per se. It’s just that whenever you try to come up with an algorithm, some set of mistakes, dependent on the algorithmic technique, will arise.

As a good example, think of the process of generating g at random, by deciding for every input x to set g(x)=1 with probability 99\% and g(x)=0 with probability 1\%. (Make the choices independently for different inputs.) With very high probability, every efficient algorithm fails with probability at least about 1\%, but, if we look at every efficiently recognizable large set H, we see that g takes the value 1 on approximately 99\% of the elements of H, and so the trivial algorithm that always outputs 1 has a pretty good success probability.

Consider, however, the set H of size \frac {2}{100} 2^n that you get by taking the \approx \frac{1}{100} 2^n inputs x such that g(x)=0 plus a random sample of \frac{1}{100} 2^n inputs x such that g(x)=1. Then we can see that no efficient algorithm can compute g on much better than 50\% of the inputs of H. This is the highest form of average-case complexity for a boolean function: on such a set H no algorithm does much better in computing g than an algorithm that makes a random guess.
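
(A quick simulation of this example, with everything below being my own toy setup at small n: a 99%-biased random g, the set H made of all the zero-inputs plus an equal-size random sample of one-inputs, and a check that the constant predictor does very well globally but only 50% on H. The simulation says nothing, of course, about algorithms other than the two constant ones.)

    # Toy simulation of the biased-random-function example; parameters are illustrative.
    import random

    random.seed(1)
    n = 16
    N = 2 ** n
    g = [1 if random.random() < 0.99 else 0 for _ in range(N)]

    zeros = [x for x in range(N) if g[x] == 0]
    ones = [x for x in range(N) if g[x] == 1]
    H = zeros + random.sample(ones, len(zeros))    # |H| is about (2/100) * 2^n

    always_one_globally = sum(g) / N               # close to 0.99
    always_one_on_H = sum(g[x] for x in H) / len(H)  # exactly 1/2 by construction
    print(always_one_globally, always_one_on_H)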

The Impagliazzo hard-core theorem states that it is always possible to find such a set H where the average-case hardness is “concentrated.” Specifically, it states that if every efficient algorithm fails to compute g on a \geq \delta fraction of inputs, then there is a set H of size \geq \delta 2^n such that every efficient algorithm fails to compute g on at least a \frac 12 - \epsilon fraction of the elements of H. This is true for every \epsilon,\delta, and if “efficient” is quantified as “circuits of size s” in the premise, then “efficient” is quantified as “circuits of size poly(\epsilon,\delta) \cdot s” in the conclusion.

The example of the biased random function given above implies that, if one wants to prove the theorem for arbitrary g, then the set H cannot be efficiently computable itself. (The example does not forbid, however, that H be efficiently computable given oracle access to g, or that a random element of H be samplable given a sampler for the distribution (x,g(x)) for uniform x.)

A number of proofs of the hard core theorem are known, and connections have been found with the process of boosting in learning theory and with the construction and the decoding of certain error-correcting codes. Here is a precise statement.

    Impagliazzo Hard-Core Set Theorem
    Let g:\{0,1\}^n \rightarrow \{0,1\} be a boolean function, s be a size parameter, \epsilon,\delta>0 be given. Then there is a c(\epsilon,\delta) = poly(1/\epsilon,1/\delta) such that the following happens.

    Suppose that for every function f:\{0,1\}^n \rightarrow \{0,1\} computable by a circuit of size \leq c\cdot s we have

    \displaystyle Pr_{x \in \{0,1\}^n} [ f(x) = g(x) ] \leq 1-\delta

    Then there is a set H of size \geq \delta 2^n such that for every function f computable by a circuit of size \leq s we have

    \displaystyle Pr_{x\in H} [ f(x) = g(x) ] \leq \frac 12 + \epsilon

Using the “finitary ergodic theoretic” approach of iterative partitioning, we (Omer Reingold, Madhur Tulsiani, Salil Vadhan and I) are able to prove the following variant.

    Impagliazzo Hard-Core Set Theorem, “Constructive Version”
    Let g:\{0,1\}^n \rightarrow \{0,1\} be a boolean function, s be a size parameter, \epsilon,\delta>0 be given. Then there is a c(\epsilon,\delta) = exp(poly(1/\epsilon,1/\delta)) such that the following happens.

    Suppose that for every function f:\{0,1\}^n \rightarrow \{0,1\} computable by a circuit of size \leq c\cdot s we have

    \displaystyle Pr_{x \in \{0,1\}^n} [ f(x) = g(x) ] \leq 1-\delta

    Then there is a set H such that: (i) H is recognizable by circuits of size \leq c\cdot s; (ii) |H| \geq \delta 2^n, and in fact the number of x in H such that g(x)=0 is at least \frac 12 \delta 2^n, and so is the number of x in H such that g(x)=1; and (iii) for every f computable by a circuit of size \leq s,

    \displaystyle Pr_{x\in H} [ g(x) = f(x) ] \leq max \{ Pr_{x\in H}[ g(x) = 0] , Pr_{x\in H} [g(x)=1] \} + \epsilon

The difference is that H is now an efficiently recognizable set (which is good), but we are not able to derive the same strong average-case complexity of g in H (which, as discussed at the beginning, is impossible in general). Instead of proving that a “random guess algorithm” is near-optimal on H, we prove that a “fixed answer algorithm” is near-optimal on H. That is, instead of saying that no algorithm can do better than a random guess, we say that no algorithm can do better than either always outputting 0 or always outputting 1. Note that this conclusion is meaningless if g is, say, always equal to 1 on H, but in our construction we have that g is not exceedingly biased on H, and if \epsilon < \delta/2, say, then the conclusion is quite non-trivial.

One can also find a set H' with the same type of average-case complexity as in the original Impagliazzo result by putting into H' a \frac 12 \delta 2^n size sample of elements x of H such that g(x)=0 and an equal size sample of elements of H such that g equals 1. (Alternatively, put in H' all the elements of H on which g achieves the minority value of g in H, then add a random sample of as many elements achieving the majority value.) Then we recover the original statement except that c(\epsilon,\delta) is exponential instead of polynomial. [Update: constructing H' is somewhat more complicated than we originally thought, the details are in the paper.]

Coming up next, the proof of the “constructive hard core set theorem” and my attempt at explaining what the techniques have to do with “finitary ergodic theory.”