You are currently browsing the monthly archive for May 2007.

In CS70, the Berkeley freshman/sophomore class on discrete mathematics and probability for computer scientists, we conclude the section on probability with a class on how to lie with statistics. The idea is not to teach the students how to lie, but rather how not to be lied to. The lecture focuses on the correlation versus causation fallacy and on Simpson’s paradox.

My favorite way of explaining the correlation versus causation fallacy is to note that there is a high correlation between being sick and having visited a health care professional in the recent past. Hence we should prevent people from seeing doctors in order to make people healthier. Some HMOs in the US are already following this approach.

Today, a post in a New York Times science blog tells the story of a gross misuse of statistics in a Dutch trial that has now become a high-profile case. In the Dutch case two other, and common, fallacies have come up. One is, roughly speaking, neglecting to take a union bound. This is the fallacy of saying ‘I just saw the license plate California 3TDA614, what are the chances of that!’ The other is the computation of probabilities by making unwarranted independence assumptions.

Feynman has written eloquently about both, but I don’t have the references at hand. In particular, when he wrote on his Space Shuttle investigation committee work, he remarked that official documents had given exceedingly low probabilities of a major accident (of the order of one millionth per flight or less), even though past events have shown this probability to be more of the order of 1%. The low number was obtained by summing the probabilities of various scenarios, and the probability of each scenario was obtained by multiplying estimates for the probabilities that the various things that had to go wrong for that scenario to occur would indeed go wrong.

Christos Papadimitriou has the most delightful story on this fallacy. He mentioned in a lecture the Faloutsos-Faloutsos-Faloutsos paper on power law distributions in the Internet graph. One student remarked, wow, what are the chances of all the authors of a paper being called Faloutsos!

In the previous random kSAT post we saw that for every $k$ there is a constant $c_k$ such that

1. A random kSAT formula with $n$ variables and $m$ clauses is conjectured to be almost surely satisfiable when $m/n c_k + \epsilon$;
2. There is an algorithm that is conjectured to find satisfying assignments with high probability when given a random kSAT formula with $n$ variables and fewer than $(c_k – \epsilon) n$ clauses.

So, conjecturally, the probability of satisfiability of a random kSAT formula has a sudden jump at a certain threshold value of the ratio of clauses to variables, and in the regime where the formula is likely to be satisfiable, the kSAT problem is easy-on-average.

What about the regime where the formula is likely to be unsatisfiable? Is the problem still easy on average? And what would that exactly mean? The natural question about average-case complexity is: is there an efficient algorithm that, in the unsatisfiable regime, finds with high probability a certifiably correct answer? In other words, is there an algorithm that efficiently delivers a proof of unsatisfiability given a random formula with $m$ clauses and $n$ variables, $m> (c_k + \epsilon) n$?

Some non-trivial algorithms, that I am going to describe shortly, find such unsatisfiability proofs but only in regimes of fairly high density. It is also known that certain broad classes of algorithms fail for all constant densities. It is plausible that finding unsatisfiability proofs for random kSAT formulas with any constant density is an intractable problem. If so, its intractability has a number of interesting consequences, as shown by Feige.

A first observation is that if we have an unsatisfiable 2SAT formula then we can easily prove its unsatisfiability, and so we may try to come with some kind of reduction from 3SAT to 2SAT. In general, this is of course hopeless. But consider a random 3SAT formula $\phi$ with $n$ variables and $10 n^2$ clauses. Now, set $x_1 \leftarrow 0$ in $\phi$, and consider the resulting formula $\phi’$. The variable $x_1$ occurred in about $30 n$ clauses, positively in about $15 n$ of them (which have now become 2SAT clauses in $\phi’$) and negatively in about $15 n$ clauses, that have now disappeared in $\phi’$. Let’s look at the 2SAT clauses of $\phi’$: there are about $15 n$ such clauses, they are random, so they are extremely likely to be unsatisfiable, and, if so, we can easily prove that they are. If the 2SAT subset of $\phi’$ is unsatisfiable, then so is $\phi’$, and so we have a proof of unsatisfiability for $\phi’$.

Now set $x_1 \leftarrow 1$ in $\phi$, thus constructing a new formula $\phi”$. As before, the 2SAT part of $\phi”$ is likely to be unsatisfiable, and, if so, its unsatisfiability is easily provable in polynomial time.

Overall, we have that we can prove that $\phi$ is unsatisfiable when setting $x_1 \leftarrow 0$, and also unsatisfiable when setting $x_1\leftarrow 1$, and so $\phi$ is unsatisfiable.

This works when $m$ is about $n^2$ for 3SAT, and when $m$ is about $n^{k-1}$ for kSAT. By fixing $O(\log n)$ at a time it is possible to shave another polylog factor. These idea is due to Beame, Karp, Pitassi, and Saks.

A limitation of this approach is that it produces polynomial-size resolution proofs of unsatisfiability and, in fact tree-like resolution proofs. It is known that polynomial-size resolution proofs do not exist for random 3SAT formulas with fewer than $n^{1.5-\epsilon}$ clauses, and tree-like resolution proofs do not exist even when the number of clauses is just less than $n^{2-\epsilon}$. This is a limitation that afflicts all backtracking algorithms, and so all approaches of the form “let’s fix some variables, then apply the 2SAT algorithm.” So something really different is needed to make further progress.

Besides the 2SAT algorithm, what other algorithms do we have to prove that no solution exists for a given problem? There are algorithms for linear and semidefinite programming, and there is Gaussian elimination. We’ll see how they can be applied to random kSAT in the next theory post.

One bedroom, one million, no parking. It’s the `price reduced’ part in the description that gets me.

Pick a random instance of 3SAT by picking at random $m$ of the possible $8 {n\choose 3}$ clauses that can be constructed over $n$ variables. It is easy to see that if one sets $m=cn$, for a sufficiently large constant $c$, then the formula will be unsatisfiable with very high probability (at least $1-2^n \cdot (7/8)^m$), and it is also possible (but less easy) to see that if $c$ is a sufficiently small constant, then the formula is satisfiable with very high probability.

A number of questions come to mind:

1. If I plot, for large $n$, the probability that a random 3SAT formula with $n$ variables and $cn$ clauses is satisfiable, against the density $c$, what does the graph look like? We just said the probability is going to be close to 1 for small $c$ and close to $0$ for large $c$, but does it go down smoothly or sharply?

Here the conjecture, supported by experimental evidence, is that the graph looks like a step function: that there is a constant $c_3$ such that the probability of satisfiability is $1-o_n(1)$ for density $c_3$. A similar behavior is conjectured for kSAT for all $k$, with the threshold value $c_k$ being dependent on $k$.

Friedgut proved a result that comes quite close to establishing the conjecture.

For, say, 3SAT, the statement of the conjecture is that there is a value $c_3$ such that for every interval size $\epsilon$, every confidence $\delta$ and every sufficiently large $n$, if you pick a 3SAT formula with $(c_3+\epsilon)n$ clauses and $n$ variables, the probability of satisfiability is at most $\delta$, but if you pick a formula with $(c_3-\epsilon)n$ clauses then the probability of satisfiability is at least $1-\delta$.

Friedgut proved that for every $n$ there is a density $c_{3,n}$, such that for every interval size $\epsilon$, every confidence $\delta$ and every sufficiently large $n$, if you pick a 3SAT formula with $(c_{3,n}+\epsilon)n$ clauses and $n$ variables, the probability of satisfiability is at most $\delta$, but if you pick a formula with $(c_{3,n}-\epsilon)n$ clauses then the probability of satisfiability is at least $1-\delta$.

So, for larger and larger $n$, the graph of proability of satisfiability versus density does look more and more like a step function, but Friedgut’s proof does not guarantee that the location of the step stays the same. Of course the location is not going to move, but nobody has been able to prove that yet.

Semi-rigorous methods (by which I mean, methods where you make things up as you go along) from statistical physics predict the truth of the conjecture and predict a specific value for $c_3$ (and for $c_k$ for each $k$) that agrees with experiments. It remains a long-term challenge to turn these arguments into a rigorous proof.

For large $k$, work by Achlioptas, Moore, and Peres shows almost matching upper and lower bounds on $c_k$ by a second moment approach. They show that if you pick a random kSAT formula for large $k$ the variance of the number of satisfying assignments of the formula is quite small, and so the formula is likely to be unsatisfiable when the average number of assignments is close to zero (which actually just follows from Markov’s inequality), but also the formula is likely to be satisfiable when the average number of assignments is large. Their methods, however, do not improve previous results for 3SAT. Indeed, it is known that the variance is quite large for 3SAT, and the conjectured location of $c_3$ is not the place where the average number of assignments goes from being small to being large. (The conjectured value of $c_3$ is smaller.)

2. Pick a random formula with a density that makes it very likely that the formula is satisfiable: is this a distribution of inputs that makes 3SAT hard-on-average?

Before addressing the question we need to better specify what we mean by hard-on-average (and, complementarily, easy-on-average) in this case. For example, the algorithm that always says “satisfiable” works quite well; over the random choice of the formula, the error probability of the algorithm is extremely small. In such settings, however, what one would like from an algorithm is to produce an actual satisfying assignment. So far, all known lower bounds for $c_3$ are algorithmic, so in the density range in which we rigorously know that a random 3SAT formula is likely to be satisfiable we also know how to produce, with high probability, a satisfying assignment in polynomial time. The results for large $k$, however, are non-constructive and it remains an open question to match them with an algorithmic approach.

The statistical physics methods that suggest the existence of sharp thresholds also inspired an algorithm (the survey propagation algorithm) that, in experiments, efficiently finds satisfying assignments in the full range of density in which 3SAT formulas are believed to be satisfiable with high probability. It is an exciting, but very difficult, question to rigorously analyze the behavior of this algorithm.

3. Pick a random formula with a density that makes it very likely that the formula is unsatisfiable: is this a distribution of inputs that makes 3SAT hard-on-average?

Again, an algorithm that simply says “unsatisfiable” works with high probability. The interesting question, however, is whether there is an algorithm that efficiently and with high probability delivers certificates of unsatisfiability. (Just like the survey propagation algorithm delivers certificates of satisfiability in the density range in which they exist, or so it is conjectured.) This will be the topic of the next post.