On the Unique Games Conjecture (part 1)

[I am preparing a survey talk on Unique Games for a mathematics conference, and writing a survey paper for a booklet that will be distributed at the conference. My first instinct was to write a one-line paper that would simply refer to Subhash’s own excellent survey paper. Fearing that I might come off as lazy, I am instead writing my own paper. Here is part 1 of some fragments. Comments are very welcome.]

Khot formulated the Unique Games Conjecture in a remarkably influential 2002 paper. In the subsequent eight years, the conjecture has motivated and enabled a large body of work on the computational complexity of approximating combinatorial optimization problems (the original context of the conjecture) and on the quality of approximation provided by “semidefinite programming” convex relaxations (a somewhat unexpected byproduct). Old and new questions in analysis, probability and geometry have played a key role in this development. Representative questions are:

  • The central limit theorem explains what happens when we sum several independent random variables. What happens if, instead of summing them, we apply a low degree polynomial function to them?
  • In Gaussian space, what is the body of a given volume with the smallest boundary?
  • What are the balanced boolean functions whose value is most likely to be the same when evaluated at two random correlated inputs?
  • What conditions on the Fourier spectrum of a function of several variables imply that the function essentially depends on only a few variables?
  • With what distortion is it possible to embed various classes of finite metric spaces into L1?

1. The Computational Complexity of Optimization Problems

The “Unique Games Conjecture” is a statement about the computational complexity of certain computational problems.

To briefly introduce computational complexity theory, consider the 3-Coloring problem. In this computational problem we are given as input an undirected graph {G=(V,E)}, and the goal is to determine whether there is a proper 3-coloring of the vertices, that is, a function {c: V \rightarrow \{0,1,2\}} such that {c(u)\neq c(v)} for every {(u,v) \in E}. (If such proper colorings exist, we are also interested in finding one.) This is a problem that is easily solvable in finite time: just consider in some order all possible {3^{|V|}} colorings {c: V \rightarrow \{0,1,2\}} and check each of them to see if it is a proper coloring. While finite, this is an unfeasible amount of computation even for very small graphs: for example a computer that examines 10 trillion colorings per second (comparable to the ability of the current fastest supercomputers) would take more than 10 billion years to consider {3^{64}} colorings. It is easy to improve the running time to about {2^{|V|}}, and there are non-trivial ways to achieve further speed-ups, but all the known algorithms have a worst-case running time that grows like {c^{|V|}} for a constant {c>1}, and they are unfeasible even on graphs with a few hundred vertices. Is there an algorithm whose worst-case running time is bounded above by a polynomial function of {|V|}?
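To make the brute-force bound concrete, here is a minimal Python sketch of the exhaustive-search procedure just described (the graph encoding and the function name are just for illustration):

```python
from itertools import product

def find_proper_3_coloring(vertices, edges):
    """Brute force: try all 3^|V| colorings and check each one."""
    for colors in product(range(3), repeat=len(vertices)):
        c = dict(zip(vertices, colors))
        if all(c[u] != c[v] for (u, v) in edges):
            return c          # a proper 3-coloring
    return None               # no proper 3-coloring exists

# A 5-cycle is 3-colorable; the complete graph K_4 is not.
print(find_proper_3_coloring(range(5), [(i, (i + 1) % 5) for i in range(5)]))
print(find_proper_3_coloring(range(4), [(u, v) for u in range(4) for v in range(u + 1, 4)]))
```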

This is an open question, equivalent to the {P} versus {NP} problem, one of the six unsolved Millennium Prize Problems. One thing that we know, however, is that the 3-Coloring problem is NP-complete. This means that every computational problem in {NP}, that is, every problem in which one searches an exponentially big list of possibilities for an object satisfying a polynomial-time checkable property, can be “encoded” as a 3-Coloring problem; hence a polynomial-time algorithm for 3-Coloring would imply a polynomial-time algorithm for every such exponential-size-list-searching problem, that is, we would have {P=NP}.

As a concrete example, the following is true:

There is an algorithm that given an integer {n} runs in time polynomial in {n} and constructs a graph with {O(n^2)} vertices such that the graph has a proper 3-Coloring if and only if the Riemann Hypothesis has a proof of at most {n} pages.

In fact, there is nothing special about the Riemann Hypothesis: for every mathematical statement {S} and every integer {n} it is possible to efficiently construct a graph of size polynomial in the length of {S} and in {n} such that the graph is 3-colorable if and only if {S} has a proof of at most {n} pages.

The theory of NP-completeness can also be used to reason about combinatorial optimization problems, that is problems in which one wants to pick from an exponentially long list an item that maximizes or minimizes a given cost function.

A well-known example of a combinatorial optimization problem is the Traveling Salesman Problem. This is the problem encountered, for example, by a shuttle driver who picks up eight people at the airport and has to decide in which order to take them to their homes and go back to the airport so that he is done in the shortest possible time. Formally, one is given a finite metric space with {n} points and wants to find an ordering of the points so that the sum of pairwise distances of consecutive pairs of points in the order is minimized. This problem can be solved in time growing roughly like {n!} by simply considering each possibility; there is a cleverer dynamic-programming algorithm whose running time grows like {2^n}, but all the known algorithms take exponential time. In fact, the problem of deciding, given an instance of TSP and a number {\ell}, whether there is a solution of cost at most {\ell} is an NP-complete problem.
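The brute-force approach is easy to state in code; the following hedged Python sketch (the distance-matrix format is an assumption made for the example) tries all orderings of the points, fixing the starting point:

```python
from itertools import permutations

def tsp_brute_force(dist):
    """dist[i][j] is the distance between points i and j.
    Try all (n-1)! closed tours that start and end at point 0."""
    n = len(dist)
    best = (float("inf"), None)
    for perm in permutations(range(1, n)):
        tour = (0,) + perm + (0,)
        cost = sum(dist[tour[i]][tour[i + 1]] for i in range(n))
        best = min(best, (cost, tour))
    return best

# A toy symmetric 4-point metric (the triangle inequality holds).
dist = [[0, 2, 9, 10],
        [2, 0, 8, 9],
        [9, 8, 0, 3],
        [10, 9, 3, 0]]
print(tsp_brute_force(dist))
```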

Another example is the Max Cut problem. In this problem we are given an undirected graph {G=(V,E)} and we want to find a partition {(S,V-S)} of the set of vertices maximizing the number of edges that are cut by the partition, that is the number of edges that have one endpoint in {S} and one endpoint in {V-S}. Given a graph {G} and a number {c}, it is an NP-complete problem to determine if there is a partition such that at least {c} edges have endpoints on different sides of the partition.

A third example is the Sparsest Cut Problem. Given a {d}-regular graph {G=(V,E)} (that is, a graph in which each vertex is an endpoint of precisely {d} edges), we again want to find a partition {(S,V-S)}, but this time we want to minimize the number of edges cut by the partition relative to how balanced the partition is. Namely, we want to find the partition that minimizes the ratio

\displaystyle  \phi(S) := \frac{|E(S,V-S)|}{d\cdot |S| \cdot |V-S|/|V|}

where {E(S,V-S)} is the set of cut edges. The normalization is chosen so that the ratio in the optimal solution is always between 0 and 1. It is an NP-complete problem to decide, given a graph {G=(V,E)} and a number {r}, whether there is a set {S\subseteq V} such that {\phi(S) \leq r}.
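As a sanity check on the normalization, here is a short Python sketch (names are illustrative) that evaluates {\phi(S)} for a {d}-regular graph directly from the formula above:

```python
def phi(S, vertices, edges, d):
    """The sparsest-cut objective phi(S) for a d-regular graph G = (V, E)."""
    S = set(S)
    T = set(vertices) - S
    cut = sum(1 for (u, v) in edges if (u in S) != (v in S))
    return cut / (d * len(S) * len(T) / len(vertices))

# Example: cutting a 6-cycle (2-regular) into two paths of three vertices
# cuts 2 edges, giving phi = 2 / (2 * 3 * 3 / 6) = 2/3.
V = range(6)
E = [(i, (i + 1) % 6) for i in range(6)]
print(phi({0, 1, 2}, V, E, d=2))
```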

Such NP-completeness results rule out (assuming {P\neq NP}) the possibility of algorithms of polynomial running time computing optimal solutions for the above problems. What about the computational complexity of finding approximate solutions?

The reductions that establish the above NP-completeness results do not offer much insight into the complexity of computing approximations. For example, the NP-completeness result for the Traveling Salesman Problem, relating it again to the task of finding a proof of the Riemann Hypothesis, gives the following implication:

There is an algorithm that, given an integer {n}, runs in time polynomial in {n} and constructs a finite metric space {X} with {N = n^{O(1)}} points such that all pairwise distances are either 1 or 2 and such that:

  • If there is a proof of the Riemann Hypothesis of at most {n} pages, then there is a tour in {X} of total length {N};
  • A tour of total length {N} in {X} can be efficiently converted into a valid proof of the Riemann Hypothesis of at most {n} pages.

Looking more carefully into the argument, however, one sees that the following two stronger implications are true:

  • If there is a proof of the Riemann Hypothesis of at most {n} pages with at most {k} mistakes, then it can be efficiently converted into a tour in {X} of total length {N+k};
  • A tour of total length {N+k} in {X} can be efficiently converted into a valid proof of the Riemann Hypothesis of at most {n} pages with at most {k} mistakes.

This means that if we had, for example, an algorithm that finds in polynomial time solutions to the Traveling Salesman Problem that are at most {1\%} worse than the optimal, then we could find an {n}-page “proof” such that at most {1\%} of the steps are wrong. Since it is always easy to come up with a proof that contains at most one mistake (“trivially, we have {0=1}, hence {\ldots}”), this doesn’t cause any contradiction.

This doesn’t mean that approximating the Traveling Salesman Problem is easy: it just means that the instances produced by the NP-completeness proof are easy to approximate, and if one wants to prove a statement of the form “if there is a polynomial time algorithm for the Traveling Salesman Problem that finds solutions at most {1\%} worse than the optimum, then {P=NP},” then such a result requires reductions of a rather different form from the ones employed in the classical theory of NP-completeness.

Indeed, with few exceptions, proving intractability results for approximation problems remained an open question for two decades, until the proof of the PCP Theorem in the early 1990s. The PCP Theorem (PCP stands for Probabilistically Checkable Proofs) can be thought of as describing a format for writing mathematical proofs such that even a “proof” in which up to, say, {1\%} of the steps are erroneous implies the validity of the statement that it is supposed to prove.

Theorem 1 (The PCP Theorem) There is a constant {\epsilon_0} and a polynomial time algorithm that on input a graph {G=(V,E)} outputs a graph {G'=(V',E')} such that

  • If {G} has a proper 3-coloring then so does {G'}
  • If there is a coloring {c': V' \rightarrow \{1,2,3\}} such that at least a {1-\epsilon_0} fraction of the edges of {G'} are properly colored by {c'}, then {G} has a proper 3-coloring.

The contrapositive of the second property is that if {G} is not a 3-colorable graph then {G'} is a graph that is not even approximately 3-colorable, that is, {G'} is a graph such that every 3-coloring of the vertices leaves more than an {\epsilon_0} fraction of the edges monochromatic.

To see how this leads to “probabilistically checkable proofs,” let us return to our initial example of whether, for a given {n}, there is an {n}-page proof of the Riemann Hypothesis. For a given {n}, we can construct in time polynomial in {n} a graph {G} such that an {n}-page proof exists if and only if there is a proper 3-coloring of {G}. From {G} we can construct, again in time polynomial in {n}, a graph {G'} as in the PCP theorem. Now, an {n}-page proof of the Riemann Hypothesis can be encoded (at the cost of a polynomial blow-up in size) as a proper coloring of {G'}. Given a candidate proof, presented as a coloring of {G'}, we can think of it as having {|E'|} “steps,” each being the verification that one of the edges of {G'} has indeed endpoints of different colors. If an {n}-page proof of the Riemann Hypothesis exists, then there is a proof, in this format, all of whose “steps” are correct; if there is no {n}-page proof of the Riemann Hypothesis, however, every “proof” is now such that at least an {\epsilon_0} fraction of the “steps” are wrong. If we sample at random {100/\epsilon_0} edges of {G'}, and check the validity of the given coloring just on those edges, we will find a mistake with extremely high probability. Thus the PCP theorem gives a way to write down mathematical proofs, together with a probabilistic verification procedure that reads only a constant number of bits of an alleged proof, such that valid proofs pass the probabilistic test with probability 1, and such that if the test passes with probability higher than {.001\%} then a valid proof exists.
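A hedged sketch of this spot-checking procedure, assuming the edge list of {G'} and the candidate coloring (the encoded “proof”) are given explicitly:

```python
import random

def spot_check(edges, coloring, eps0):
    """Probabilistic verification: sample 100/eps0 edges of G' and check that
    the candidate coloring assigns different colors to their endpoints.
    If more than an eps0 fraction of the edges are improperly colored, all
    checks pass with probability at most (1 - eps0)^(100/eps0), about e^-100."""
    for _ in range(int(100 / eps0)):
        u, v = random.choice(edges)
        if coloring[u] == coloring[v]:
            return False      # found an erroneous "step"
    return True               # accept
```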

While this application to proof checking is mostly interesting as a way to visualize the meaning of the result, it might have applications to cryptographic protocols. In any case, the main application and motivation of the PCP Theorem is the study of the complexity of finding approximations to combinatorial optimization problems.

Various forms of the PCP Theorem are known, which are tailored to the study of specific optimization problems. A very versatile form of the Theorem, which was proved by Ran Raz, refers to the Label Cover problem.

2. The Unique Games Conjecture

Definition 2 (Label Cover) An input to the label cover problem with range {\Sigma} is a set of equations of the form

\displaystyle  X_i = \sigma_{i,j} (Y_j)

where {\sigma_{i,j} : \Sigma \rightarrow \Sigma} are functions specified as part of the input.

The goal is to find an assignment to the variables {X_i} and {Y_j} that satisfies as many equations as possible.

For example, the following is an instance of label cover with range {{\mathbb Z}/5{\mathbb Z}}:

\displaystyle  \begin{array}{l} X_1 = Y_1^2 -1 \bmod 5 \\ X_2 = Y_1 - 1 \bmod 5\\ X_1 = Y_2^4 + 1 \bmod 5 \end{array}

The first and third equations are not simultaneously satisfiable, and so an optimal solution to the above instance is to satisfy two of the equations, for example the first and the second with the assignment {X_1:= 4}, {X_2:=4}, {Y_1:=0}, {Y_2:=0}.
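This can be confirmed by exhaustive search over the {5^4} possible assignments; the following Python sketch encodes the three equations of the example directly:

```python
from itertools import product

# The three equations over Z/5Z, applied to an assignment (X1, X2, Y1, Y2).
equations = [
    lambda X1, X2, Y1, Y2: X1 == (Y1 * Y1 - 1) % 5,
    lambda X1, X2, Y1, Y2: X2 == (Y1 - 1) % 5,
    lambda X1, X2, Y1, Y2: X1 == (Y2 ** 4 + 1) % 5,
]

best = max(sum(eq(*a) for eq in equations) for a in product(range(5), repeat=4))
print(best)   # 2: no assignment satisfies all three equations
```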

Notice that while the equations were of an algebraic nature in the example above, any function can be used to construct an equation.

Theorem 3 (Raz) For every {\epsilon >0} there is a {\Sigma}, {|\Sigma | \leq (1/\epsilon)^{O(1)}}, and a polynomial time algorithm that on input a graph {G} outputs an instance {C} of label cover with range {\Sigma} such that

  • If {G} has a proper 3-coloring then in {C} there is an assignment to the variables that satisfies all constraints;
  • If {G} is not properly 3-colorable, then every assignment to the variables of {C} satisfies at most an {\epsilon} fraction of the equations.

This form of the PCP Theorem is particularly well suited as a starting point for reductions, because in the second case we have the very strong guarantee that it is impossible to satisfy even just an {\epsilon} fraction of the equations. For technical reasons, it is also very useful that each equation involves only two variables.

The approach to derive intractability, for example for a graph problem, from Theorem 3 is to encode each variable as a small graph, and to lay out edges in such a way that the only way to have a good solution in the graph problem is for it to define a good solution for the label cover problem. If we are studying a cut problem, for example, and we have a collection of vertices {v_{X,1},\ldots,v_{X,k}} corresponding to each variable {X} in the label cover instance, then a cut {(S,V-S)} in the graph gives a {k}-bit string {(b_{X,1},\ldots,b_{X,k})} for every variable {X} of label cover, recording which of the {k} vertices do and do not belong to {S}.

Intuitively, one might try to set up such a reduction by associating each variable {X} with {k=\lceil \log_2 |\Sigma| \rceil} vertices, with each of the {2^k} bit strings being associated to one element of {\Sigma}. This, however, would not work, because it would be sufficient to change a {1/k} fraction of the cut in order to change the value encoded for {X} arbitrarily, and potentially we could start from a highly unsatisfiable instance of label cover and produce an instance of a graph problem in which there is a solution of very high quality.

Instead, the bit string associated to each variable encodes the value of the variable with an error-correcting code.

The problem then becomes: (1) to make sure that only bit strings close to a valid codeword can occur in a near-optimal solution, and (2) to make sure that in near-optimal solutions the decoded values satisfy a large number of equations. Task (2) has proved to be considerably harder than task (1), especially in reductions to graph problems. Indeed most NP-completeness results for approximating graph optimization problems have gone by first reducing label cover to an intermediate simpler problem, and then reducing the intermediate problem to the graph problem, but at the cost of weaker intractability results than the conjectured ones.

In 2002, Khot formulated a conjecture that considerably simplifies (2), essentially making it of difficulty comparable to (1).

Definition 4 (Unique Game) A unique game with range {\Sigma} is a set of equations of the form

\displaystyle  X_i = \sigma_{i,j} (Y_j)

where {\sigma_{i,j} : \Sigma \rightarrow \Sigma} are bijective functions specified as part of the input.

The goal is to find an assignment to the variables {X_i} and {Y_j} that satisfies as many equations as possible.

For example, the following is a unique game with range {{\mathbb Z} / 5{\mathbb Z}}:

\displaystyle  \begin{array}{l} X_1 = Y_1 + 3 \bmod 5\\ X_2 = Y_1 + 1 \bmod 5\\ X_1 = Y_2 -1 \bmod 5\\ X_2 = Y_2 - 1 \bmod 5 \end{array}

In the above example, it is not possible to satisfy all four equations, but the optimal solution {X_1 := 3, X_2:=1, Y_1 := 0, Y_2 := 2} satisfies three of the equations.

Notice that the only difference between a Label Cover instance and a Unique Game is that, in a Unique Game, the functions that define the equations have to be bijective. This is, however, a substantial difference. In particular, given a Unique Game that has a solution that satisfies all equations, such a solution can be found very quickly, in time linear in the number of equations. But what if we are given a Unique Game in which there is a solution that satisfies, say, a {99\%} fraction of the equations?
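Before turning to that question, it is worth seeing why the fully satisfiable case is easy. The following is a hedged Python sketch of the propagation idea (the input format, with each equation {X = \sigma(Y)} given as a triple whose permutation is a list, is an assumption of the example): in each connected component of the constraint graph, guess the value of one variable and propagate it through the bijections; for a fixed range {\Sigma} this takes time linear in the number of equations.

```python
from collections import defaultdict, deque

def solve_satisfiable_unique_game(variables, equations, sigma):
    """Find a fully satisfying assignment of a unique game that is promised to
    have one.  `equations` is a list of triples (X, Y, pi) encoding X = pi[Y],
    where pi is a permutation of range(sigma) given as a list."""
    adj = defaultdict(list)
    for X, Y, pi in equations:
        inv = [0] * sigma
        for i, p in enumerate(pi):
            inv[p] = i
        adj[Y].append((X, pi))    # value[X] = pi[value[Y]]
        adj[X].append((Y, inv))   # value[Y] = pi^{-1}[value[X]]

    assignment = {}
    for root in variables:
        if root in assignment:
            continue
        for guess in range(sigma):                 # try each value for the root
            trial, queue, ok = {root: guess}, deque([root]), True
            while queue and ok:
                u = queue.popleft()
                for v, f in adj[u]:
                    forced = f[trial[u]]
                    if v not in trial:
                        trial[v] = forced
                        queue.append(v)
                    elif trial[v] != forced:       # inconsistent guess
                        ok = False
                        break
            if ok:                                 # the promise guarantees that some
                assignment.update(trial)           # guess works in each component
                break
    return assignment
```

When only, say, a {99\%} fraction of the equations is satisfiable, every choice of a value for the root can run into contradictions, the propagation gives no useful information, and it is not clear how to do better; this is what the following conjecture is about.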

Conjecture 1 (Unique Games Intractability Conjecture) For every {1/2 > \epsilon >0}, there is a {\Sigma} such that there is no polynomial time algorithm that, given an instance of Unique Games with range {\Sigma} in which it is possible to satisfy at least a {1-\epsilon} fraction of equations, finds a solution that satisfies at least an {\epsilon} fraction of equations.

If {P=NP} then Conjecture 1 is false; this means that proving Conjecture 1 would require first proving {P\neq NP}, which is beyond the reach of current techniques. The strongest evidence that we can currently hope for in favor of Conjecture 1 is:

Conjecture 2 (Unique Games NP-Hardness Conjecture) For every {1/2 > \epsilon > 0} there is a {\Sigma} and a polynomial time algorithm that, on input a graph {G}, outputs a unique games instance {U} with range {\Sigma}, such that

  • If {G} is properly 3-colorable then there is an assignment that satisfies at least a {1-\epsilon} fraction of equations in {U};
  • If {G} is not properly 3-colorable then every assignment to the variables of {U} satisfies at most an {\epsilon} fraction of equations.

If Conjecture 2 is true, then every inapproximability result proved via a reduction from Unique Games establishes an NP-hardness of approximation, in the same way as a reduction starting from label cover.

Consequence for Max Cut

In the following we let

\displaystyle  \alpha := \min_{1/2 < \rho < 1} \frac {\frac 1\pi \cdot \arccos (1-2\rho)} {\rho} \approx 0.878567 \ \ \ \ \ (1)

The above constant comes up in the remarkable algorithm of Goemans and Williamson.
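As a quick sanity check, the numerical value of {\alpha} in Equation (1) can be recovered with a crude grid search (illustrative code only):

```python
import math

# Minimize ((1/pi) * arccos(1 - 2*rho)) / rho over 1/2 < rho < 1 by grid search.
alpha, rho_star = min(
    (math.acos(1 - 2 * rho) / math.pi / rho, rho)
    for rho in (0.5 + 0.5 * k / 100000 for k in range(1, 100000))
)
print(alpha, rho_star)   # about 0.878567, attained near rho = 0.8446
```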

Theorem 5 (Goemans and Williamson) There is a polynomial time algorithm that on input a graph {G=(V,E)} in which the optimal cut cuts {c} edges finds a cut that cuts at least {\alpha \cdot c} edges.

Furthermore, for sufficiently small {\epsilon}, given a graph {G=(V,E)} in which the optimal cut cuts {(1-\epsilon)\cdot |E|} edges, the algorithm finds a solution that cuts at least {\frac 1 \pi \cdot \arccos (-1+2\epsilon)\cdot |E|} edges, which is approximately {\left( 1- \frac 2 \pi \sqrt \epsilon \right) \cdot |E|} edges.

It is known that an approximation better than {16/17} implies that {P=NP}, but no NP-hardness result is known in the range between {\alpha} and {16/17}.

Work of Khot, Kindler, Mossel and O’Donnell, together with later work of Mossel, O’Donnell and Oleszkiewicz, proves that no improvement is possible over the Goemans-Williamson algorithm assuming the Unique Games Intractability Conjecture.

Theorem 6 Suppose that there is a {\delta >0}, a {\rho > 0} and a polynomial time algorithm that given a graph {G=(V,E)} in which an optimal cut cuts {\rho \cdot |E|} edges finds a solution that cuts at least {\frac 1 \pi \cdot ( \arccos (1- 2\rho) + \delta )\cdot |E|} edges.

Then the Unique Games Intractability Conjecture is false.

In particular, by taking {\rho^*} to be the minimizer of Equation (1) we have that, for every {\delta >0} the existence of a polynomial time algorithm that, on input a graph in which the optimum is {c} finds a solution that cuts more than {(\alpha + \delta) \cdot c} edges would contradict the Unique Games Intractability Conjecture. So, assuming the conjecture, the constant {\alpha} is precisely the best achievable ratio between the value of polynomial-time constructible solutions and optimal solutions in the Max Cut problem.

In Section 3 we will present a detailed overview of the proof of Theorem 6.

Consequence for Sparsest Cut

The algorithm achieving the best ratio between the quality of an optimal solution and the quality of the solution found in polynomial time is due to Arora, Rao and Vazirani.

Theorem 7 (Arora-Rao-Vazirani) There is a polynomial time algorithm that given a graph {G=(V,E)} finds a cut {C} such that

\displaystyle  \phi(C) \leq O(\sqrt { \log |V| }) \cdot \phi(C^*)

where {C^*} is an optimal solution to the sparsest cut problem.

A classical algorithm based on spectral graph theory achieves a better approximation in graphs in which the optimum is large.

Theorem 8 (Spectral Partitioning) There is a polynomial time algorithm that given a graph {G=(V,E)} finds a cut {C} such that

\displaystyle  \phi(C) \leq O( \sqrt{ \phi(C^*) })

where {C^*} is an optimal solution to the sparsest cut problem.

Theorem 9 There is an absolute constant {c>0} such that if there is a {\delta >0}, an {\epsilon > 0} and a polynomial time algorithm that given a graph {G=(V,E)} in which the sparsest cut {C^*} satisfies {\phi(C^*) \leq \epsilon} finds a cut {C} such that

\displaystyle  \phi(C) \leq c \cdot \sqrt {\epsilon} - \delta

Then the Unique Games Intractability Conjecture is false.

In particular, assuming the conjecture, the trade-off between optimum and approximation in the spectral partitioning algorithm cannot be improved, and the approximation ratio in the Arora-Rao-Vazirani algorithm cannot be improved to a constant.

3. Reducing Unique Games to Max Cut

A general approach to reducing Unique Games (and, more generally, Label Cover) with range {\Sigma} to other problems is to ensure that a solution in the target problem associates to each variable {X} of the unique game a function {f_X : \{ 0,1 \}^{\Sigma} \rightarrow \{ 0,1 \}}. Then we define a way to “decode” a function {f_X : \{ 0,1 \}^\Sigma \rightarrow \{ 0,1 \}} to a value {a_X \in \Sigma}, and we aim to prove that if we have a good solution in the target problem, then the assignment {X:= a_X} to each variable {X} defines a good solution in the Unique Games instance. The general idea is that if a function “essentially depends” on one of its variables, then we decode it to the index of the variable that it depends on.

3.1. The Reduction from Unique Games to Max Cut

We outline this method by showing how it works to prove Theorem 6. To prove the theorem, we start from a Unique Games instance {U} such that a {1-\epsilon'} fraction of equations can be satisfied, where {\epsilon'} is a very small positive parameter determined by {\rho} and {\delta}, and whose range is {\Sigma}. We show how to use the assumption of the theorem to find a solution that satisfies at least an {\epsilon'} fraction of equations. We do so by constructing a graph {G}, applying the algorithm to find a good approximation to Max Cut in the graph, and then converting the cut into a good solution for the Unique Games instance.

If {U} has {N} variables, then {G} has {N\cdot 2^{|\Sigma|}} vertices, a vertex {v_{X,a}} for every variable {X} of {U} and every value {a\in \{ 0,1 \}^\Sigma}.

We define {G} as a weighted graph, that is, a graph in which edges have a positive real-valued weight. In such a case, the value of a cut is the total weight (rather than the number) of edges that are cut. There is a known reduction from Max Cut in weighted graphs to Max Cut in unweighted simple graphs as defined above, so there is no loss of generality in working with weights.

We introduce the following notation: if {x\in \{ 0,1 \}^\Sigma} is an array of {|\Sigma|} bits indexed by the elements of {\Sigma}, and {\pi : \Sigma \rightarrow \Sigma} is a bijection, we denote by {x\circ \pi} the vector {x\circ \pi \in \{ 0,1 \}^\Sigma} such that {(x \circ \pi)_i := x_{\pi(i)}}.

We also define the noise operator {N_\rho} as follows: if {x\in \{ 0,1 \}^\Sigma} is a boolean vector, then {N_\rho(x)} is the random variable generated by changing each coordinate of {x} independently with probability {\rho}, and leaving it unchanged with probability {1-\rho}.

The edge set of {G} is defined so that its total weight is 1, and we describe it as a probability distribution:

  • Pick two random equations {X= \pi(Y)} and {X=\pi'(Y')} in {U} conditioned on having the same left-hand side.
  • Pick a random element {a\in \{ 0,1 \}^{\Sigma}} and sample {b} according to {N_{\rho} (a)}
  • Generate the edge {(v_{Y,a\circ \pi},v_{Y',b \circ \pi'})}
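The following hedged Python sketch samples one edge from this distribution, with the unique game given as triples {(X, Y, \pi)} as in the earlier sketch; the exact way of picking two equations with a common left-hand side is one reasonable reading of the description above.

```python
import random
from collections import defaultdict

def sample_edge(equations, sigma, rho):
    """Sample one edge of the weighted Max Cut instance G.  A vertex is a pair
    (Y, a) with a a tuple in {0,1}^Sigma; an equation (X, Y, pi) means X = pi[Y]."""
    by_lhs = defaultdict(list)
    for eq in equations:
        by_lhs[eq[0]].append(eq)
    X, Y, pi = random.choice(equations)          # first equation, X = pi(Y)
    _, Y2, pi2 = random.choice(by_lhs[X])        # second equation with the same X
    a = tuple(random.randrange(2) for _ in range(sigma))
    b = tuple(bit ^ (random.random() < rho) for bit in a)    # b ~ N_rho(a)
    a_pi = tuple(a[pi[i]] for i in range(sigma))             # a composed with pi
    b_pi2 = tuple(b[pi2[i]] for i in range(sigma))           # b composed with pi'
    return (Y, a_pi), (Y2, b_pi2)
```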

Let {A} be an optimal assignment for the Unique Games instance {U}. Consider the cut of {G} in which {S= \{ v_{Y,a} : a_{A(Y)} = 1 \}}. This cut cuts edges of total weight at least {\rho-2\epsilon'}. From our assumption, we can find in polynomial time a cut {S} that cuts a {\frac 1\pi \cdot \arccos (1-2\rho) + \delta} fraction of edges. We want to show how to extract from {S} an assignment for the Unique Games instance that satisfies a reasonably large number of equations.

First we note that {S} assigns a bit to each vertex {v_{Y,a}}, that is, to each variable {Y} and each {a\in \{ 0,1 \}^\Sigma}. Let us call

\displaystyle  f_Y (a) = 1 \mbox{ if } v_{Y,a}\in S

and

\displaystyle  f_Y (a) = 0 \mbox{ if } v_{Y,a}\not\in S

We want to decode each of these functions {f_Y : \{ 0,1 \}^\Sigma \rightarrow \{ 0,1 \}} into an index {i \in \Sigma}. We describe a probabilistic decoding process {Dec(\cdot)} later.

Some calculations show that the functions that we derive in such a way have the property that

\displaystyle  \mathop{\mathbb P}_{X,Y,Y',a,b} [ f_Y(a\circ \pi) \neq f_{Y'} (b \circ \pi') ] \geq \frac 1 \pi \cdot \arccos (1-2\rho) + \delta

and from this we want to derive that

\displaystyle  \mathop{\mathbb P}_{X,Y,Y'} [ \pi( Dec(f_Y)) = \pi' (Dec(f_{Y'})) ] \geq \Omega_{\rho,\delta} (1)

from which it is easy to see that from the decodings {Dec(f_Y)} we can reconstruct an assignment for all variables that satisfies at least an {\epsilon'} fraction of equations in the unique game.

Some manipulations show that, essentially, it is sufficient to prove the following lemma:

Lemma 10 (Main) There is a probabilistic symmetric algorithm {Dec(\cdot)} that on input a function {f: \{ 0,1 \}^\Sigma \rightarrow \{ 0,1 \}} outputs an element {i \in \Sigma}, and such that the following is true. Suppose that {f: \{ 0,1 \}^\Sigma \rightarrow \{ 0,1 \}} is such that

\displaystyle   \mathop{\mathbb P} [ f(x) \neq f(N_\rho x) ] \geq \frac 1 \pi \cdot \arccos (1- 2\rho) + \delta \ \ \ \ \ (2)

Then there is an index {i\in \Sigma} such that

\displaystyle  \mathop{\mathbb P} [ Dec(f) = i] \geq \Omega_{\delta, \rho} (1)

We say that the decoding is symmetric if the distribution of {Dec(f(\pi(\cdot)))} is the same as the distribution of {\pi (Dec (f(\cdot)))} for every bijection {\pi : \Sigma \rightarrow \Sigma}.

(Technically, the Main Lemma is not sufficient as stated. An extension that deals with all bounded real-valued functions is necessary. The boolean case, which is simpler to state and visualize, captures all the technical difficulties of the general case.)

3.2. The Proof of the Main Lemma

Before discussing the proof of the Main Lemma, we show that it is tight, in the sense that from a weaker assumption in Equation (2) it is not possible to recover the conclusion.

Consider the majority function {Maj : \{ 0,1 \}^\Sigma \rightarrow \{ 0,1 \}} such that {Maj(x)=1} if and only if {x} has at least {|\Sigma|/2} ones. Then {Maj} is a symmetric function, in the sense that {Maj (x\circ \pi) = Maj (x)} for every bijection {\pi}. This implies that for every symmetric decoding algorithm {Dec} we have that {Dec(Maj)} is the uniform distribution over {\Sigma}, and so every index {i} has probability {1/|\Sigma|} which goes to zero even when the other parameters in the Main Lemma are fixed. A standard calculation shows that, for large {\Sigma},

\displaystyle  \mathop{\mathbb P} [ Maj (x) \neq Maj (N_\rho (x)) ] \approx \frac 1 \pi \arccos (1-2\rho)

so we have an example in which Equation (2) is nearly satisfied but the conclusion of the Main Lemma fails.

This example suggests that, if the Main Lemma is true, then the functions that satisfy Equation (2) must be non-symmetric, that is, they must not depend equally on all the input variables, and that the decoding procedure {Dec(\cdot)} must pick out input variables on which the function depends in some special way.

Another example to consider is that of the functions arising in the cut that one derives from an optimal solution in the unique game instance {U}. In that case, for every variable {Y} the corresponding function {f_Y} is of the form {f_Y(x) := x_i} where {i} is the value assigned to {Y} in the optimal solution. In this case, we would expect the decoding algorithm to output the index {i}. In general, if {f} depends only on a small number of variables, we would expect {Dec} to only output the indices of those variables.

These observations suggest the use of the notion of influence of input variables. If {f: \{ 0,1 \}^\Sigma \rightarrow \{ 0,1 \}} is a boolean function, then the influence of variable {i\in \Sigma} on {f} is the probability

\displaystyle  Inf_i (f) := \mathop{\mathbb P}_{x \in \{ 0,1 \}^\Sigma} [ f(x) \neq f(x + e_i) ]

where {x+e_i} is the string obtained from {x} by changing the {i}-th coordinate.

It is then appealing to consider the decoding algorithm that picks an index {i} with probability proportional to {Inf_i(f)}; note that this process is symmetric.

There is, unfortunately, a counterexample. Consider the function

\displaystyle  f(x_1,\ldots,x_n) := Maj ( x_1 , x_2, Maj (x_3,\ldots,x_n))

and take {\rho = 1-\epsilon}. Then {\frac 1\pi \cdot \arccos (1-2\rho) \approx 1-\frac 2\pi \sqrt \epsilon} and one can compute that

\displaystyle  \mathop{\mathbb P} [ f(x) \neq f(N_\rho(x)) ] \approx 1-\epsilon - \frac 1\pi \sqrt \epsilon > 1- \frac 2\pi \sqrt\epsilon + \Omega_\epsilon (1)

This means that we expect the decoding algorithm to select some index with a probability that is at least a fixed constant for every fixed {\epsilon}.

When we compute the influence of the variables of {f}, however, we find out that {x_1} and {x_2} have constant influence {1/2}, while the variables {x_3,\ldots,x_n} have influence approximately {1/\sqrt n}. This means that the sum of the influences is about {\sqrt n}, and so {x_1} and {x_2} would be selected with probability about {1/\sqrt n}, and the remaining variables with probability about {1/n}. In particular, all probabilities go to zero with {n}, and so a decoding algorithm based only on influence does not satisfy the conditions of the Main Lemma.
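These estimates are easy to check numerically; here is a hedged sketch that estimates the influences by sampling, directly from the definition (the parameters are illustrative):

```python
import random

n = 201   # odd, so both majorities are well defined

def f(x):
    """f(x) = Maj(x_1, x_2, Maj(x_3, ..., x_n))."""
    inner = 1 if 2 * sum(x[2:]) > n - 2 else 0
    return 1 if x[0] + x[1] + inner >= 2 else 0

def influence(i, samples=20000):
    """Estimate Inf_i(f) = P[f(x) != f(x + e_i)] by sampling."""
    count = 0
    for _ in range(samples):
        x = [random.randrange(2) for _ in range(n)]
        y = list(x)
        y[i] ^= 1
        count += f(x) != f(y)
    return count / samples

print(influence(0), influence(2))   # roughly 1/2, and of order 1/sqrt(n)
```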

In order to introduce the correct definition, it helps to introduce discrete Fourier analysis over the Hamming cube. For our purposes, only the following facts will be used. One is that if {g: \{ 0,1 \}^\Sigma \rightarrow {\mathbb R}} is a real-valued function, then there is a unique set of real values {\hat g(S)}, one for each subset {S\subseteq \Sigma}, such that

\displaystyle  g(x) = \sum_S \hat g(S) \cdot (-1)^{\sum_{i\in S} x_i}

The values {\hat g(S)} are called the Fourier coefficients of {g}.

The other is that

\displaystyle  \mathop{\mathbb E} g^2(x) = \sum_S \hat g^2(S)

Deviating slightly from the above notation, if {f: \{ 0,1 \}^\Sigma \rightarrow \{ 0,1 \}} is a boolean function, then we let {\hat f(S)} be the Fourier coefficients of {(-1)^f}, that is

\displaystyle  (-1)^{f(x)} = \sum_S \hat f(S) \cdot (-1)^{\sum_{i\in S} x_i}

This guarantees that {\sum_S \hat f^2(S) = 1}.

It is easy to see that

\displaystyle  Inf_i(f) = \sum_{S: i\in S} \hat f^2(S)

The fact that {\sum_S \hat f^2(S) = 1} suggests that {\hat f} naturally defines a probability distribution. Unfortunately, it is a probability distribution over subsets of {\Sigma}, rather than a probability distribution over elements of {\Sigma}. A natural step is to consider the algorithm {Dec} defined as follows: sample a set {S\subseteq \Sigma} with probability equal to {\hat f^2(S)}, then output a random element of {S}. In particular, we have

\displaystyle   \mathop{\mathbb P} [ Dec(f) = i] = \sum_{S: i \in S} \frac {\hat f^2(S)}{|S|} \ \ \ \ \ (3)

which is similar to the expression for the influence of {i}, but weighted to give more emphasis to the Fourier coefficients corresponding to smaller sets.

If we go back to the function {Maj(x_1,x_2,Maj(x_3,\ldots,x_n))}, we see that the algorithm defined in (3) outputs the index of {x_1} or {x_2} with a probability that is at least an absolute constant, and that doesn’t go to zero with {n}.
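This can be verified exactly for small {n} by computing the Fourier coefficients by direct summation; the following sketch computes the distribution of Equation (3) (the bitmask encoding and the parameters are illustrative):

```python
def decoding_distribution(f, n):
    """Compute hat f(S) = E_x[(-1)^(f(x) + sum_{i in S} x_i)] for all S, then
    P[Dec(f) = i] = sum over S containing i of hat f(S)^2 / |S|, as in (3).
    Inputs x and sets S are encoded as n-bit masks."""
    N = 1 << n
    fvals = [f(x) for x in range(N)]
    hat = []
    for S in range(N):
        total = sum(1 if (bin(x & S).count("1") + fvals[x]) % 2 == 0 else -1
                    for x in range(N))
        hat.append(total / N)
    return [sum(hat[S] ** 2 / bin(S).count("1")
                for S in range(1, N) if (S >> i) & 1)
            for i in range(n)]

n = 9
def f(x):
    """Maj(x_1, x_2, Maj(x_3, ..., x_n)) on bitmask input."""
    bits = [(x >> i) & 1 for i in range(n)]
    inner = 1 if 2 * sum(bits[2:]) > n - 2 else 0
    return 1 if bits[0] + bits[1] + inner >= 2 else 0

dist = decoding_distribution(f, n)
print(dist[0], dist[1], max(dist[2:]))   # the first two indices get constant weight
```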

The decoding algorithm described in Equation (3) turns out to be the correct one. Proving the main lemma reduces now to proving the following result.

Lemma 11 (Main Lemma — Restated) Suppose that {f: \{ 0,1 \}^\Sigma \rightarrow \{ 0,1 \}} is such that

\displaystyle   \mathop{\mathbb P} [ f(x) \neq f(N_\rho x) ] \geq \frac 1 \pi \cdot \arccos (1- 2\rho) + \delta \ \ \ \ \ (4)

Then there is an index {i\in \Sigma} such that

\displaystyle  \sum_{S: i \in S} \frac {\hat f^2(S)}{|S|} \geq \Omega_{\delta, \rho} (1)

The proof has two parts:

  • An invariance theorem due to Mossel, O’Donnell and Oleszkiewicz showing that the Main Lemma is true in the boolean setting provided that a “Gaussian version” of the Lemma holds for functions taking real inputs with Gaussian distribution;
  • A 1985 theorem of Borell establishing the Gaussian version of the Lemma.

3.3. The Invariance Theorem and Borell’s Theorem

A starting point to gain intuition about the Invariance Theorem is to consider the Central Limit Theorem. Suppose that {X_1,\ldots,X_n} is a collection of independent boolean random variables, each uniform over {\{ -1,1 \}}, and suppose that {a_1,\ldots,a_n} are arbitrary coefficients. Then the random variable

\displaystyle  \sum_i a_i X_i

is going to be close to a Gaussian distribution of average zero and variance {\sum_i a_i^2}, provided that the coefficients are reasonably smooth. (It is enough that if we scale them so that {\sum_i a_i^2 = 1}, then {\sum_i |a_i|^3} is small.)

Suppose now that, instead of considering a sum, that is, a degree-1 function, we take an {n}-variate low-degree polynomial {p} and we consider the random variable

\displaystyle  p(X_1,\ldots,X_n)

We cannot say any more that it has a distribution close to a Gaussian and, in fact, it does not seem that we can say anything at all. Looking back at the Central Limit Theorem, however, we can note that the “right” way of formulating it is to consider a collection {X_1,\ldots,X_n} of independent boolean random variables each uniform over {\{ -1,1 \}}, and also a collection of independent Gaussian random variables {Z_1,\ldots,Z_n} each with mean zero and variance 1. Then we have that the two random variables

\displaystyle  \sum_i a_i X_i \mbox { and } \sum_i a_i Z_i

are close provided that the {a_i} are smooth.

This is exactly the same statement as before, because the distribution {\sum_i a_i Z_i} happens to be a Gaussian distribution of mean zero and variance {\sum_i a_i^2}.

This formulation, however, has a natural analog for the case of low-degree polynomials. The Invariance Theorem states that if {p} is a sufficiently “smooth” low degree polynomial then the random variables

\displaystyle  p(X_1,\ldots,X_n) \mbox { and } p(Z_1,\ldots,Z_n)

are close.
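As a purely numerical illustration of the statement (not of the proof technique), one can compare a smooth low-degree polynomial evaluated on random signs and on Gaussians; the polynomial and the parameters below are arbitrary choices made for the example.

```python
import random

def p(z):
    """A 'smooth' degree-2 polynomial: sum_{i<j} z_i z_j / n, so that each
    individual coefficient is small (of size 1/n)."""
    s, s2 = sum(z), sum(v * v for v in z)
    return (s * s - s2) / (2 * len(z))

def quantiles(sampler, n=50, trials=20000):
    vals = sorted(p([sampler() for _ in range(n)]) for _ in range(trials))
    return [round(vals[int(q * trials)], 3) for q in (0.1, 0.5, 0.9)]

print(quantiles(lambda: random.choice((-1, 1))))   # uniform {-1,1} inputs
print(quantiles(lambda: random.gauss(0.0, 1.0)))   # Gaussian inputs
```

The two lines of output should be close to each other, even though neither distribution is Gaussian (both are roughly the distribution of {(g^2-1)/2} for a standard Gaussian {g}): the invariance principle compares the boolean and the Gaussian distributions to each other, not to a normal distribution.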

When we apply the Invariance Theorem to a smoothed and truncated version of the Fourier transform of the function {f} in the Main Lemma, we have that either such a function is a “smooth polynomial” to which the Invariance Theorem applies, or else the conclusion holds and there is a coordinate with noticeably high probability of being output by the decoding algorithm. If the Invariance Theorem applies, then the probability that {f} changes value on anti-correlated boolean inputs is approximately the probability that a function changes its value on anti-correlated Gaussian inputs. The latter is bounded by a theorem of Borell.

Theorem 12 (Borell) Suppose {f: {\mathbb R}^n \rightarrow \{0,1\}} is a measurable function according to the standard Gaussian measure in {{\mathbb R}^n} and such that {\mathop{\mathbb E} f = 1/2}. For an element {x \in {\mathbb R}^n} and for {0 \leq \rho \leq 1/2}, let {N_\rho (x)} be the random variable {(1-2\rho) \cdot x + \sqrt{1-(1-2\rho)^2}\cdot z} where {z} is a standard Gaussian random vector in {{\mathbb R}^n}, independent of {x}.

Then

\displaystyle  \mathop{\mathbb P} [ f(x) \neq f(N_\rho (x)) ] \geq \frac 1 \pi \arccos (1-2\rho)

The context of Borell’s theorem is the question of what is the body in {{\mathbb R}^n} of a given Gaussian volume (in the above case, of Gaussian measure {1/2}) whose boundary is the smallest. (The answer is a half-space; in the measure-{1/2} case its boundary is a hyperplane passing through the origin, which is also a tight case for the theorem above.)
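A quick Monte Carlo check of the tight case (an illustrative sketch only): for the half-space {f(x) = 1} iff {x_1 > 0}, only the first coordinate matters, and the probability that {f} changes value under {N_\rho} matches {\frac 1\pi \arccos(1-2\rho)}.

```python
import math
import random

def halfspace_flip_probability(rho, trials=200000):
    """Estimate P[f(x) != f(N_rho(x))] for f(x) = 1 iff x_1 > 0, where
    N_rho(x) = (1 - 2*rho) * x + sqrt(1 - (1 - 2*rho)^2) * z."""
    c = 1 - 2 * rho
    flips = 0
    for _ in range(trials):
        x = random.gauss(0.0, 1.0)
        y = c * x + math.sqrt(1 - c * c) * random.gauss(0.0, 1.0)
        flips += (x > 0) != (y > 0)
    return flips / trials

rho = 0.3
print(halfspace_flip_probability(rho), math.acos(1 - 2 * rho) / math.pi)
```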

There are a few ways in which Borell’s theorem is not the “Gaussian analog” of the Main Lemma. Notably, there is a condition on the expectation of {f} (in the boolean case, it would be the condition {\mathop{\mathbb E} f= 1/2}), there is a lower bound, rather than an upper bound, to the probability that {f} changes value, and the theorem applies to the range {\rho \in [0,1/2]}, while we are interested in the “anti-correlation” case of {\rho \in [1/2,1]}. There is a simple trick (consider only the “odd part” of the Fourier expansion of the boolean function {f}, that is, only the terms corresponding to sets {S} of odd size) that takes care of all these differences.

In the next post:

The sparsest cut problem, its Unique Games-hardness of approximation, and its relation to metric embeddings, plus algorithms for unique games.

5 thoughts on “On the Unique Games Conjecture (part 1)”

  1. Some comments:
    – “This means that every computational problem in {NP}, that is every computational problem that involves searching an exponentially big list of possibilities for an object satisfying a polynomial-time checkable property can be “encoded” as a 3-Coloring problem, and so a polynomial-time algorithm for 3-Coloring would imply a polynomial-time algorithm for every exponential-size-list-searching problem, that is, we would have {P=NP}.” This sentence will be hard to parse for people outside the area.
    – “there is a non-trivial algorithm whose running time grows like {2^n}”. Perhaps “non-trivial” is an overstatement. “More clever”?
    – It keeps saying you’ll prove Theorem 9, but you actually prove Theorem 6.
    – “The context of Borrel’s theorem… The answer is a half-space”. A half space has infinite volume under the uniform measure. Do you mean under Gaussian measure?

    Typos:
    – “counterpositive”
    – “First we not that”
    – “if such the distribution”
    – “hods for functions taking real inputs with Gaussian distribution is true;”
    – “A1985 theorem”
    – “as a natural analog”

  2. P.S. Thanks for keeping your audience entertained and enlightened with this informative post.

  3. you must refer to Rotar for the invariance principle. [MOO] is a variant of Rotar’s work from the 70s using similar methods, with perhaps better bounds. For the qualitative description above, Rotar should suffice. [MOO] refer to Rotar.
