Edited 5/7/2018. Thanks to Sam Hopkins for several corrections and suggestions.
I am revising my notes from the course on “better-than-worst-case” analysis of algorithms. With the benefit of hindsight, in this post (and continuing in future posts) I would like to review again how one applies spectral methods and semidefinite programming to problems that involve a “planted” solution, and what is the role of concentration results for random matrices in the analysis of such algorithms.
In general, a “planted” distribution is one where we first pick a random “planted” solution and then pick an instance of our problem in which the planted solution is good. Such problems are interesting for the average-case analysis of algorithms, because they provide a non-trivial testing ground to understand why certain algorithms perform better in practice than in the worst case. Planted problems are also of great interest in complexity theory, because when they are average-case hard they imply the existence of one-way functions. Finally, planted problems are a rich ground for interdisciplinary work: statisticians recognize them as parametric distributions, where the parameter is the planted solution, and the goal is to approximate the parameter given one sample; information theorists think of them as a channel where the sender sends a planted solution and the receiver gets an instance of the problem, so that a large mutual information between instance and planted solution is a necessary condition for the problem to be solvable.
For the sake of this post, it is helpful to focus on the motivating examples of planted $k$-clique and planted bisection. In planted $k$-clique, we sample a graph from the $G_{n,1/2}$ distribution, then we randomly select $k$ vertices and we add all the necessary edges to make those $k$ vertices a clique. In the planted bisection, or stochastic block model, distribution $G(n,p,q)$, we randomly split a set of $n$ vertices into two subsets $A$ and $B$ of equal size and then we select edges so that edges within $A$ and within $B$ have probability $p$, and edges crossing the cut have probability $q$, with $p > q$.
A graph sampled from the planted clique distribution is a lot like a $G_{n,1/2}$ graph, except for the extra edges that create a large clique, and a graph sampled from the $G(n,p,q)$ distribution is a lot like a $G_{n,(p+q)/2}$ random graph, except that the cut $(A,B)$, which is crossed by roughly $q n^2 / 4$ edges, is sparser than a typical balanced cut, which is crossed by about $(p+q) n^2 / 8$ edges.
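To make the two distributions concrete, here is a small Python sketch of samplers for both (the function names and parameters are my own, for illustration only):

```python
import random

def sample_gnp(n, p, rng):
    """Erdos-Renyi G_{n,p}: include each pair {i,j} independently with probability p."""
    return {(i, j) for i in range(n) for j in range(i + 1, n) if rng.random() < p}

def sample_planted_clique(n, k, rng):
    """Sample G_{n,1/2}, then plant a clique on k randomly chosen vertices."""
    edges = sample_gnp(n, 0.5, rng)
    clique = rng.sample(range(n), k)
    for i, u in enumerate(clique):
        for v in clique[i + 1:]:
            edges.add((min(u, v), max(u, v)))
    return edges, set(clique)

def sample_planted_bisection(n, p, q, rng):
    """Split the vertices into random equal halves A and B; include each internal
    pair with probability p and each crossing pair with probability q."""
    perm = list(range(n))
    rng.shuffle(perm)
    side = {v: (0 if i < n // 2 else 1) for i, v in enumerate(perm)}
    edges = {(i, j) for i in range(n) for j in range(i + 1, n)
             if rng.random() < (p if side[i] == side[j] else q)}
    return edges, side

rng = random.Random(0)
edges, clique = sample_planted_clique(100, 15, rng)
# every pair inside the planted set is now an edge
assert all((min(u, v), max(u, v)) in edges
           for u in clique for v in clique if u < v)
```

The planted solution (the clique, or the side labels of the cut) is returned alongside the instance, which is convenient when testing search algorithms against the ground truth.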
1. The Search Problem, the Decision Problem, and the Certification Problem
Having defined these distributions, our problem of interest is:
- The search problem: given a sample from the planted distribution, find the planted solution, or an approximation, in a useful sense, of the planted solution.
Usually, in planted problems, the solution that is planted is of a kind that exists only with negligible, or even exponentially small, probability in the “non-planted” distribution that we start from. Thus, an algorithm that solves the search problem with high probability also solves the following problem with probability close to $1$:
- The decision problem of distinguishing a sample of the planted distribution from a sample of the analogous “non-planted” distribution.
For small values of $k$, achieving noticeable distinguishing probability in the decision problem can be (seemingly) much easier than solving the search problem with noticeable probability. For example, if $k = o(\sqrt n)$, we do not know any polynomial time algorithm that solves the search problem for planted $k$-clique with noticeable success probability, but checking whether the number of edges is at least $\frac 12 \binom n2 + \frac 14 \binom k2$ solves the distinguishing problem with distinguishing probability of the order of $k^2/n$. Even if we change our definition of our “non-planted” distribution so that it has the same expected number of edges as the planted distribution, the number of triangles can still be used to noticeably distinguish it from the planted distribution.
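As a sanity check on these numbers, here is a small hypothetical experiment (the function and its parameters are mine) that estimates the advantage of the edge-counting threshold test; it models the planted distribution directly as $\binom k2$ forced edges plus a Binomial$\left(\binom n2 - \binom k2, \frac 12\right)$ number of remaining edges:

```python
import random
from math import comb

def edge_count_advantage(n, k, trials=200, seed=0):
    """Estimate the advantage of the test 'accept if the number of edges
    is at least (1/2)C(n,2) + (1/4)C(k,2)'."""
    rng = random.Random(seed)
    threshold = comb(n, 2) / 2 + comb(k, 2) / 4

    def null_edges():
        # G_{n,1/2}: each of the C(n,2) pairs is an edge with probability 1/2
        return sum(rng.random() < 0.5 for _ in range(comb(n, 2)))

    def planted_edges():
        # planting forces all C(k,2) pairs inside the clique to be edges;
        # the remaining pairs are edges with probability 1/2 as before
        return comb(k, 2) + sum(rng.random() < 0.5
                                for _ in range(comb(n, 2) - comb(k, 2)))

    p_accept_planted = sum(planted_edges() >= threshold for _ in range(trials)) / trials
    p_accept_null = sum(null_edges() >= threshold for _ in range(trials)) / trials
    return p_accept_planted - p_accept_null
```

For $k$ of the order of $\sqrt n$ the advantage is close to $1$, while for smaller $k$ it decays roughly like $k^2/n$, matching the back-of-the-envelope calculation above.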
In the stochastic block model, if we call $a$ the expected “internal” degree of a vertex (that is, the expected number of neighbors on the same side of the planted cut) and $b$ the expected “external” degree, and if $a$ and $b$ are absolute constants, then it is known that the search problem is not solvable, even in an approximate sense, when $(a-b)^2 \leq a+b$, as proved by Mossel, Neeman and Sly (note that our definition of $a$ and $b$ is off from theirs by a factor of 2). Provided that $a$ and $b$ are absolute constants and $a \neq b$, however, one can distinguish $G(n,p,q)$ from $G_{n,(a+b)/n}$ with constant distinguishing probability. This is because the number of triangles (counted as ordered sequences of three vertices) in $G(n,p,q)$ is, on average, about $(a+b)^3 + (a-b)^3$, while it is, on average, about $(a+b)^3$ in $G_{n,(a+b)/n}$; furthermore, the variance of both distributions is an absolute constant (those calculations appear in the above paper of Mossel et al.). Now if we have two nonnegative integer random variables $X$ and $Y$ such that the expectations and variances of both are absolute constants and $\mathop{\mathbb E} X \geq \mathop{\mathbb E} Y + \delta$ for an absolute constant $\delta > 0$, then there is a threshold $t$ such that testing whether a number is $\geq t$ is a test that distinguishes $X$ and $Y$ with constant probability.
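The two triangle expectations can be double-checked by splitting ordered triples into cases: with internal edge probability $p = 2a/n$ and external probability $q = 2b/n$, a uniformly random ordered triple has all three vertices on the same side with probability $1/4$ (edge probabilities $p^3$) and a two-one split with probability $3/4$ (edge probabilities $p q^2$). A short check of the resulting algebraic identity, under these assumptions:

```python
# With p = 2a/n and q = 2b/n, the expected number of ordered triangles in the
# planted bisection model is asymptotically
#   (1/4) * n^3 * p^3 + (3/4) * n^3 * p * q^2 = 2*a^3 + 6*a*b^2,
# which should match (a+b)^3 + (a-b)^3; in G_{n,(a+b)/n} it is about (a+b)^3.
def planted_ordered_triangles(a, b):
    return 2 * a**3 + 6 * a * b**2

for a, b in [(3.0, 1.0), (5.0, 2.0), (4.0, 4.0), (1.0, 0.0)]:
    assert abs(planted_ordered_triangles(a, b) - ((a + b)**3 + (a - b)**3)) < 1e-9

# when a = b the two expectations coincide, consistent with the requirement
# a != b for the triangle-counting distinguisher
```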
As far as I am aware, in all the known cases where the decision problem is efficiently solvable with $1-o(1)$ distinguishing probability, it is also known how to efficiently solve the search problem (perhaps approximately) with $1-o(1)$ probability, although algorithms that solve the distinguishing problem don’t always give insights into how to solve the search problem. Mossel et al., for example, show that the decision problem for the stochastic block model is efficiently solvable with $1-o(1)$ distinguishing probability when $(a-b)^2 > a+b$, by counting the number of simple cycles of length of the order of $\log n$, and they could use the approach of counting simple cycles to approximate $a$ and $b$, but they could not show how to solve the search problem in the $(a-b)^2 > a+b$ regime, a problem that they solved later with additional techniques (the problem was also solved independently by Massoulie).
In summary, efficiently solving the decision problem is a necessary but not sufficient condition for being able to efficiently solve the search problem, and it is a good first step to understand the complexity of a given planted problem.
Another approach to distinguishing the $G_{n,1/2}$ distribution from the planted $k$-clique distribution is to be able to certify, given a graph sampled from $G_{n,1/2}$, that its maximum clique size is less than $k$, a fact that is going to be true with high probability if $k \geq 3 \log_2 n$. Assuming that, like in the planted clique problem, we are discussing instances of an optimization problem, we thus have the following problem:
- The certification problem: given an instance sampled from the “non-planted” distribution, produce a certificate that the optimum value for the instance is worse than the optimum value of all (or all but a negligible fraction of) instances from the planted distribution.
Note that if you can solve the certification problem with probability $p$ over the choice of a non-planted instance, then you can also solve the decision problem with distinguishing probability $p$ (or $p$ minus a negligible term, if the certified property is allowed to hold with negligible probability in the planted distribution).
In some interesting cases, the certification problem appears to be harder than the decision problem and the search problem.
In the stochastic block model, for example, if one takes the min bisection problem to be the underlying optimization problem, the search problem is solvable (in the sense that one can recover a partition that correlates with the planted partition) whenever $(a-b)^2 > a+b$, as discussed above, but for $(a-b)^2$ very close to the threshold $a+b$ it is an open question whether even knowing the value of the min bisection exactly can be used to solve the decision problem.
Another interesting example is “planted Max 3SAT.” Suppose that we construct a satisfiable instance of 3SAT in the following way: we start from a random assignment to $n$ variables, and then we pick at random $m$ clauses among the $7 \binom n3$ clauses that are consistent with the “planted” assignment. If $m$ is, say, a sufficiently large constant times $n$, then it is very easy to solve the search problem (and hence the distinguishing problem): the value of every variable in the planted assignment can be deduced by looking at whether the variable appears more often complemented or not complemented. In the non-planted distribution in which we pick $m$ clauses among all the possible $8 \binom n3$ clauses, however, we don’t know any efficient algorithm that certifies that the instance is not satisfiable. Even if we pick clauses that are consistent with both the planted assignment and the negation of the planted assignment (thus eliminating correlations between complementations of variables and the values of the variables in the planted assignment), we still introduce pairwise correlations that can be used to find the planted assignment. See the comment by Ryan O’Donnell below for more information.
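For intuition on why the majority vote works in the singly planted case: for a clause containing variable $x$, four of the eight sign patterns make $x$’s literal true under the planted assignment, and the single rejected pattern (all three literals false) lies among the other four, so each occurrence of $x$ agrees with its planted value with probability $4/7$. A hypothetical sketch of the sampler and the majority-vote decoder (function names are mine):

```python
import random

def plant_3sat(n, m, rng):
    """Pick a random assignment, then m random clauses consistent with it:
    choose 3 distinct variables and signs, rejecting the one sign pattern
    (out of 8) that falsifies the clause under the planted assignment."""
    assignment = [rng.random() < 0.5 for _ in range(n)]
    clauses = []
    while len(clauses) < m:
        vars3 = rng.sample(range(n), 3)
        signs = [rng.random() < 0.5 for _ in range(3)]  # True = positive literal
        # the clause is consistent unless every literal is false under the assignment
        if any(s == assignment[v] for v, s in zip(vars3, signs)):
            clauses.append(list(zip(vars3, signs)))
    return assignment, clauses

def majority_decode(n, clauses):
    """Guess each variable by whether it appears more often positive or negated."""
    score = [0] * n
    for clause in clauses:
        for v, s in clause:
            score[v] += 1 if s else -1
    return [s > 0 for s in score]
```

With $m$ a large constant times $n$, each variable occurs about $3m/n$ times, the $4/7$-vs-$3/7$ bias dominates the fluctuations, and the majority vote recovers almost all planted values.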
When we are dealing with a hard optimization problem, a natural approach is to study efficiently solvable convex relaxations of the problem. One can hope to use a relaxation to solve the certification problem, by showing that the relaxation provides a bound on the optimum of non-planted instances that is sufficient to distinguish them from planted instances. One can also hope to use a relaxation to solve the search problem, by showing that the optimum of the relaxation is close, in an appropriate sense, to the planted solution, or even that the planted solution is the unique optimum of the relaxation for most planted instances.
Interestingly, in all the examples that I am aware of, a relaxation has been successfully used to solve the certification problem if and only if it has been successfully used to solve the search problem. Intuitively, if a relaxation does not solve the certification problem, it means that there are feasible solutions in the non-planted case whose cost is already better than the cost of the planted solution, and so the planted solution cannot be the optimum, or close to the optimum, of the relaxation in the planted case. For the other direction, if a relaxation solves the certification problem, then the proof that it does can usually be lifted to a proof that all solutions that don’t “correlate” (in an appropriate sense) with the planted solution cannot be optimal solutions in the planted case, allowing one to conclude that the optimum of the relaxation in the planted case correlates with the planted solution.
In summary, although the certification problem can be harder than the search problem and the decision problem, if one wants to use a relaxation to solve the search problem then it is good to start understanding whether the relaxation solves the certification problem.
In the next post we will discuss how we can efficiently certify properties of random instances of optimization problems, and how to turn those results into algorithms that find planted solutions. We will see that the key results involve showing that certain matrices associated to the instances are close to their expectations in appropriately defined norms.