# Distinguishers from linear functions

In the last post we introduced the following problem: we are given a length-increasing function, the hardest case being a function ${G: \{ 0,1 \}^{n-1} \rightarrow \{ 0,1 \}^n}$ whose output is one bit longer than the input, and we want to construct a distinguisher ${D}$ such that the advantage or distinguishing probability of ${D}$

$\displaystyle \left| \mathop{\mathbb P}_{z \in \{ 0,1 \}^{n-1}} [D(G(z)) =1 ] - \mathop{\mathbb P}_{x \in \{ 0,1 \}^{n}} [D(x) =1 ] \right| \ \ \ \ \ (1)$

is as large as possible relative to the circuit complexity of ${D}$.

I will show how to achieve advantage ${\epsilon}$ with a circuit of size ${O(\epsilon^2 n 2^n)}$. Getting rid of the suboptimal factor of ${n}$ is a bit more complicated. These results are in this paper.

Let us review some well known facts about distinguishability of distributions. If ${X}$ and ${Y}$ are two distributions over ${\{ 0,1 \}^n}$, then the advantage of the best distinguisher between ${X}$ and ${Y}$ is called the statistical distance (or total variation distance) between ${X}$ and ${Y}$, and is denoted ${||X-Y||_{SD}}$, hence

$\displaystyle ||X-Y||_{SD} := \max_{D:\{ 0,1 \}^n \rightarrow \{ 0,1 \}} \left| \mathop{\mathbb P}_{x\sim X} [ D(x)=1] - \mathop{\mathbb P}_{y\sim Y} [ D(y)=1] \right| \ \ \ \ \ (2)$

and it is easy to see that the statistical distance between two distributions is half their ${\ell_1}$ distance, if we think of them as vectors, that is

$\displaystyle ||X-Y||_1 := \sum_a \left| X(a) - Y(a) \right| = 2 || X-Y||_{SD} \ \ \ \ \ (3)$
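This relation is easy to check numerically. The following sketch (my own illustration, on an arbitrarily chosen pair of distributions, not part of the argument) computes both sides on a small example; the optimal distinguisher accepts exactly the points where ${X}$ puts more mass than ${Y}$:

```python
import itertools

# Check that the best distinguisher's advantage (statistical distance)
# equals half the l1 distance, on a small arbitrary example.
n = 3
points = list(itertools.product([0, 1], repeat=n))

X = {a: 1.0 / len(points) for a in points}       # uniform distribution
Y = {a: 0.5 / (len(points) - 1) for a in points}
Y[points[0]] = 0.5                               # Y puts half its mass on one point

l1 = sum(abs(X[a] - Y[a]) for a in points)

# The optimal distinguisher: D(a) = 1 iff X(a) > Y(a).
sd = sum(X[a] - Y[a] for a in points if X[a] > Y[a])

assert abs(sd - l1 / 2) < 1e-12
```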

If ${G:\{ 0,1 \}^{n-1} \rightarrow \{ 0,1 \}^n}$ is a length-increasing function, then the distribution of outputs ${G(z)}$ of the function (given a random input seed ${z}$) has statistical distance at least ${1/2}$ from the uniform distribution, as seen by using the distinguisher ${D}$ defined as

$\displaystyle D(x) =1 \Leftrightarrow \exists z . G(z)=x$

because ${D}$ outputs 1 with probability one given an output of ${G()}$, but with probability at most ${2^{n-1}/2^n = 1/2}$ given a uniformly sampled element of ${\{ 0,1 \}^n}$.
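As a toy illustration (mine, not from the argument above), here is this inefficient image-membership distinguisher run explicitly on a small, arbitrarily chosen ${G}$:

```python
import itertools

n = 4
seeds = list(itertools.product([0, 1], repeat=n - 1))

def G(z):
    # An arbitrary toy length-increasing function: append a copy of the first bit.
    return z + (z[0],)

image = {G(z) for z in seeds}                   # the (inefficient) test set

p_output = sum(1 for z in seeds if G(z) in image) / len(seeds)  # always 1
p_uniform = len(image) / 2 ** n                                 # at most 1/2
advantage = p_output - p_uniform
assert advantage >= 0.5
```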

We are asking whether there must be a ${D}$, computable by a relatively small circuit, that has non-trivially good advantage in distinguishing the output of ${G}$ from the uniform distribution.

It has been well known since the 1980s (see the introduction of this survey paper by Oded Goldreich) that for every distribution ${X}$

$\displaystyle || X- U_n ||_{SD} = O(\sqrt{2^n}) \cdot \max_L \left| \mathop{\mathbb P}_{x\sim X} [L(x) = 1] - \mathop{\mathbb P}_{x\sim U_n} [L(x)=1] \right| \ \ \ \ \ (4)$

where the maximum is taken over all the ${GF(2)}$-linear functions ${L: \{ 0,1 \}^n \rightarrow \{ 0,1 \}}$, and where ${U_n}$ is the uniform distribution over ${\{ 0,1 \}^n}$. This fact, which has an easy proof using Fourier analysis, is useful in showing that certain distributions over ${n}$-bit strings are nearly uniform, because (up to some loss which is tolerable if ${n}$ is small) it reduces this task to the task of showing that the xors of all subsets of the bits of the string are nearly unbiased.
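For small ${n}$, fact (4) can be checked by brute force over all ${2^n}$ linear tests. The sketch below (my own, on a randomly generated distribution) verifies the inequality; the Fourier-analytic proof in fact gives (4) with constant 1, which is what is asserted here:

```python
import itertools, math, random

n = 4
N = 2 ** n
points = list(itertools.product([0, 1], repeat=n))

random.seed(0)
w = [random.random() for _ in range(N)]
X = [wi / sum(w) for wi in w]                 # a random distribution on {0,1}^n
U = 1.0 / N

sd = 0.5 * sum(abs(X[i] - U) for i in range(N))   # statistical distance from uniform

def bias(c):
    # Advantage of the linear test L(x) = <c, x> mod 2 at telling X from uniform.
    on = [i for i, a in enumerate(points)
          if sum(ci * ai for ci, ai in zip(c, a)) % 2 == 1]
    return abs(sum(X[i] for i in on) - len(on) / N)

max_bias = max(bias(c) for c in points)

assert max_bias <= sd + 1e-12                 # a linear test is just one distinguisher
assert sd <= math.sqrt(N) * max_bias + 1e-12  # fact (4), with constant 1
```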

Here we are interested in the implication that if ${G}$ is a length-increasing function then there exists a linear function ${L}$ such that

$\displaystyle \left| \mathop{\mathbb P}_{z\sim \{ 0,1 \}^{n-1}} [ L(G(z)) = 1] - \mathop{\mathbb P}_{x\sim \{ 0,1 \}^n} [ L(x)=1] \right| \geq \Omega \left( \frac 1 {\sqrt{2^n}} \right)$

This means that one can achieve distinguishing probability ${\epsilon = 2^{-n/2}}$ with a circuit of size ${O(n)}$, which has the promised size ${O(\epsilon^2 n 2^n)}$.

How do we “scale up” this result? For example, how do we achieve distinguishing probability ${2^{-n/3}}$ with a circuit of size ${O(n 2^{n/3})}$?

By analogy with our simplified proof of a result of Andreev, Clementi and Rolim that I described last week, we might try the following approach:

• partition ${\{ 0,1\}^n}$ into ${\epsilon^2 2^n}$ sets, each of size ${\epsilon^{-2}}$;
• find the linear function that is the best distinguisher within each set;
• create a distinguisher that, on input ${x}$, determines the set that ${x}$ belongs to, and then applies to ${x}$ the appropriate linear function.

The problem with this approach is how to define “best distinguisher within a set.” Our initial idea was to consider, for each set ${S}$ in the partition, the distribution ${G(z) | G(z) \in S}$ versus the uniform distribution over ${S}$, and then find the best linear distinguisher for each of these pairs of distributions. Unfortunately this approach runs into various difficulties if the distribution ${G(z)}$ does not hit the various sets in the partition with approximately the same probability. Our solution was to define the partition using pairwise independent hashing, which would have the effect that ${G}$ would hit the various sets with approximately the same probability. (This is the solution that I presented when I gave talks about this result in Princeton and New York.)

Fortunately, there is a much easier and more attractive way to proceed. Just partition ${\{ 0,1 \}^n}$ into ${\epsilon^2 2^n}$ sets, each of size ${\epsilon^{-2}}$, based on the value of the first ${\log (\epsilon^2 2^n)}$ bits. Then, for each set ${S}$ in the partition ${\cal P}$, find the affine function ${L_S}$ that maximizes

$\displaystyle \mathop{\mathbb P}_{z\sim \{ 0,1 \}^{n-1}} [ L_S(G(z)) = 1 \wedge G(z) \in S ] - \mathop{\mathbb P}_{x\sim \{ 0,1 \}^{n}} [ L_S(x) = 1 \wedge x \in S ]$

and then define the distinguisher

$\displaystyle D(x) := \sum_{S \in {\cal P}} L_S(x) \cdot {\bf 1}_S(x)$

(where ${{\bf 1}_S}$ is the characteristic function of ${S}$) that, given ${x}$, determines which set ${S}$ of the partition ${x}$ belongs to and then returns ${L_S(x)}$.
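Here is a toy end-to-end sketch of this construction (my own illustration; the block size, the function ${G}$, and the use of brute-force search for the best affine function per block are all choices of mine made for the sake of a small runnable example):

```python
import itertools

n = 5
k = 2                                       # partition on the first k bits: 2^k blocks
points = list(itertools.product([0, 1], repeat=n))

def G(z):
    # Arbitrary toy length-increasing function: append the xor of the first two bits.
    return z + (z[0] ^ z[1],)

seeds = list(itertools.product([0, 1], repeat=n - 1))
pG = {a: 0.0 for a in points}
for z in seeds:
    pG[G(z)] += 1.0 / len(seeds)            # output distribution of G
pU = 1.0 / len(points)

def affine_tests():
    # All affine functions L(a) = <c, a> + b mod 2.
    for c in itertools.product([0, 1], repeat=n):
        for b in (0, 1):
            yield lambda a, c=c, b=b: (sum(ci * ai for ci, ai in zip(c, a)) + b) % 2

best = {}
for prefix in itertools.product([0, 1], repeat=k):
    block = [a for a in points if a[:k] == prefix]
    # Pick L_S maximizing P[L_S(G(z))=1, G(z) in S] - P[L_S(x)=1, x in S].
    best[prefix] = max(affine_tests(),
                       key=lambda L: sum(pG[a] - pU for a in block if L(a) == 1))

def D(a):
    return best[a[:k]](a)                   # look up the block, apply its affine test

adv = (sum(pG[a] for a in points if D(a) == 1)
       - sum(pU for a in points if D(a) == 1))
assert adv > 0
```

For this particular toy ${G}$ the block-wise affine tests happen to recover the full statistical distance of ${1/2}$: inside each block, the image of ${G}$ is exactly an affine condition on the last bit, so the per-block maximization captures it perfectly.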

The distinguisher ${D}$ can clearly be evaluated by a circuit of size ${O(\epsilon^2 n 2^n)}$, so it remains to understand its distinguishing probability. The main claim is that for every set ${S}$ there is an affine function ${L_S}$ such that

$\displaystyle \mathop{\mathbb P}_{z\sim \{ 0,1 \}^{n-1}} [ L_S(G(z)) = 1 \wedge G(z) \in S ] - \mathop{\mathbb P}_{x\sim \{ 0,1 \}^{n}} [ L_S(x) = 1 \wedge x \in S ]$

$\displaystyle \geq \frac 12 \mathop{\mathbb P}_{z\sim \{ 0,1 \}^{n-1}} [ G(z) \in S] - \frac 12 \mathop{\mathbb P}_{x\sim \{ 0,1 \}^n} [x\in S]$

$\displaystyle +\Omega\left( \frac 1 {\sqrt{|S|}} \right ) \cdot \sum_{a\in S} \left| \mathop{\mathbb P}_{z\sim \{ 0,1 \}^{n-1}} [G(z)=a] - \frac 1 {2^n} \right| \ \ \ \ \ (5)$

and if we sum these inequalities over all the sets ${S}$ we get the distinguishing probability of ${D}$ on the left-hand side, while on the right-hand side the first two terms cancel (each sums to ${1/2}$ over the partition) and we are left with ${\Omega(1/\sqrt{|S|})}$ times the ${\ell_1}$ distance of ${G(z)}$ from the uniform distribution. Since this ${\ell_1}$ distance is at least 1, the distinguishing probability of ${D}$ is at least ${\Omega(1/{\sqrt{|S|}})}$, and the sets ${S}$ were chosen to have size ${\epsilon^{-2}}$.

How does one prove (5)? More importantly, how does one begin to guess that an equation like (5) is true and that it provides the right “amortization” between the advantage within a block and the relative probabilities of hitting the block? And how is this any simpler than what we were doing before?

The answer is that everything becomes much simpler (and one does not have to reason about anything as messy as (5)) if we represent bits with ${\{-1,1\}}$ instead of ${\{0,1\}}$ and we observe that our original goal is to find an efficient distinguisher ${D: \{ 0,1 \}^n \rightarrow \{-1,1\}}$ such that

$\displaystyle \mathop{\mathbb E} D(G(z)) - \mathop{\mathbb E} D(x) \geq 2 \epsilon$

where ${z\sim \{ 0,1 \}^{n-1}}$ and ${x\sim \{ 0,1 \}^n}$, because

$\displaystyle \begin{array}{rcl} \mathop{\mathbb E} D(G(z)) & = & \mathop{\mathbb P} [ D(G(z)) =1] - \mathop{\mathbb P} [ D(G(z)) = -1 ] \\ & = & 2 \mathop{\mathbb P} [D(G(z))=1] - 1 \end{array}$

and similarly for ${D(x)}$, so

$\displaystyle \mathop{\mathbb E} D(G(z)) - \mathop{\mathbb E} D(x) = 2\cdot\left( \mathop{\mathbb P} [ D(G(z)) = 1] - \mathop{\mathbb P} [ D(x) = 1] \right)$

Our goal can also be written as proving

$\displaystyle \sum_{a\in \{ 0,1 \}^n} D(a)\cdot \left( \mathop{\mathbb P} [G(z)=a] - \frac 1 {2^n} \right) \geq 2\epsilon$

and if we write ${g(a):= \left( \mathop{\mathbb P} [G(z)=a] - \frac 1 {2^n} \right)}$ and recall what we said about statistical distance and ${\ell_1}$ distance (which gives ${\sum_a |g(a)| \geq 1}$), it suffices to prove

$\displaystyle \sum_{a\in \{ 0,1 \}^n} D(a) g(a) \geq 2\epsilon \sum_{a \in \{ 0,1 \}^n} |g(a)|$

Recalling that we are working with a partition of ${\{ 0,1 \}^n}$ into sets ${S}$, each of size ${\epsilon^{-2}}$, one would optimistically hope to do so by proving that for every set ${S}$

$\displaystyle \sum_{a\in S} D(a)g(a) \geq \frac{1}{\sqrt{|S|}} \sum_{a \in S} |g(a)| \ \ \ \ \ (6)$

Our proposed distinguisher computes an affine function within each set, so we would need to show that for every ${S}$ there is an affine function ${L: \{ 0,1 \}^n \rightarrow \{ 0,1 \}}$ such that

$\displaystyle \sum_{a\in S} (-1)^{L(a)} g(a) \geq \frac{1}{\sqrt{|S|}} \sum_{a \in S} |g(a)| \ \ \ \ \ (7)$

Equation (7) is now in a form that is easy to prove via standard Fourier-analytic arguments. Here we give a different proof which uses only the fact that affine functions form a pairwise independent family of functions and are closed under complementation (closure under complementation is what removes the absolute value in the lemma below: if the sum comes out negative, replace ${L}$ by its complement).

Lemma 1 Let ${g: S \rightarrow {\mathbb R}}$ be any function, and ${H}$ be a pairwise independent family of functions ${h: S \rightarrow \{-1,1\}}$. Then there is a function ${h\in H}$ such that

$\displaystyle \left| \sum_{a\in S} h(a) \cdot g(a) \right| \geq \frac {1}{\sqrt{|S|}} \cdot \sum_{a\in S} |g(a)|$

Proof: Using the pairwise independence of the family ${H}$ we compute

$\displaystyle \mathop{\mathbb E}_{h \sim H} \left( \sum_{a\in S} h(a) g(a) \right)^2 = \sum_{a\in S} g^2(a)$

so there must exist an ${h\in H}$ such that

$\displaystyle \left| \sum_{a\in S} h(a) \cdot g(a) \right| \geq \sqrt{\sum_{a\in S} g^2(a)}$

and we have

$\displaystyle \sqrt{ \sum_{a\in S} g^2(a)} \geq \frac 1 {\sqrt{|S|}} \cdot \sum_{a\in S} |g(a)|$

from Cauchy-Schwarz. $\Box$
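The identity at the heart of the proof, and the conclusion of the lemma, can be checked numerically for the affine family (a sketch of mine, with a random ${g}$ and with ${S}$ taken to be all of ${\{0,1\}^n}$ for simplicity):

```python
import itertools, math, random

n = 4
S = list(itertools.product([0, 1], repeat=n))   # take S = {0,1}^n for simplicity
random.seed(1)
g = {a: random.uniform(-1, 1) for a in S}       # an arbitrary real-valued function

# The family h(a) = (-1)^{<c,a> + b} is pairwise independent on S.
sums = []
for c in itertools.product([0, 1], repeat=n):
    for b in (0, 1):
        sums.append(sum((-1) ** ((sum(ci * ai for ci, ai in zip(c, a)) + b) % 2) * g[a]
                        for a in S))

# E_h [ (sum_a h(a) g(a))^2 ] = sum_a g(a)^2, by pairwise independence.
avg_sq = sum(s * s for s in sums) / len(sums)
assert abs(avg_sq - sum(v * v for v in g.values())) < 1e-9

# Hence some h attains the bound in the lemma (via Cauchy-Schwarz for the last step).
best = max(abs(s) for s in sums)
assert best >= math.sqrt(sum(v * v for v in g.values())) - 1e-9
assert best >= sum(abs(v) for v in g.values()) / math.sqrt(len(S)) - 1e-9
```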