Welcome to phase two of in theory, in which we again talk about math. I spent last Fall teaching two courses and getting settled, I mostly traveled in January and February, and I have spent the last two months on my sofa catching up on TV series. Hence I will reach back to last Spring, when I learned about Talagrand’s machinery of generic chaining and majorizing measures from Nikhil Bansal, in the context of our work with Ola Svensson on graph and hypergraph sparsification. Here I would like to record what I understood about the machinery, and in a follow-up post I plan to explain the application to hypergraph sparsification.
1. A Concrete Setting
Starting from a very concrete setting, suppose that we have a subset , we pick a random Gaussian vector from , and we are interested in the random variable
In theoretical computer science, for example, a random variable like (1) comes up often in the study of rounding algorithms for semidefinite programming, but this is a problem of much broader interest.
We will be interested both in bounds on the expectations of (1) and on its tail, but in this post we will mostly reason about its expectation.
A first observation is that each is Gaussian with mean zero and variance . If is finite, we can use a union bound to estimate the tail of as
The above bound can be tight, but it is poor if the points of are densely clustered, and it is useless if is infinite.
because . The latter expression is nicer to work with because it makes it more explicit that what we are trying to compute is invariant under shifts of , and only depends on pairwise distances of the elements of , rather than their norm.
In the cases in which (2) gives a poor bound, a natural approach is to reason about an -net of , that is, a subset such that for every there is an element such that . Then we can say that
which can be a much tighter bound. Notice that we used (2) to bound
but it might actually be better to find an -net of , and so on. In general, a tighter analysis would be to choose a sequence of nested sets , where , , and we have that is an -net of , that is, for every element of there is an element such that . Then, by generalizing the above reasoning, we get
Finally, if the cardinality of the grows sufficiently fast, namely, if we have , it is possible to refine the estimate to
where is the closest element to in . This is done by avoiding to write the expectation of the sup of a sum as a sum of expectations of sups and then using (2), but by bounding the tail of the sup of the sum directly.
At this point, we do not even need to assume that , that is finite, or that the sequence of sets is finite, and we have the following result.
Theorem 1 (Talagrand’s generic chaining inequality) Let be an arbitrary set, let be a countable sequence of finite subsets of such that and . Then
where is the element of closest to .
A short complete proof is in these notes by James Lee.
While the above Theorem has a very simple proof, the amazing thing, which is rather harder to prove, is that it is tight, in the sense that for every there is a sequence of sets such that the bound of the above theorem has a matching lower bound, up to an absolute constant. This is why it is called generic chaining. Chaining because the projection of on is estimated based on the “chain”
of projections of the intermediate steps of a path that goes from to passing through the . Generic because this upper bound technique works as well as any other possible upper bound, up to an absolute constant.
2. An Abstract Setting
Let now be a completely arbitrary set, and suppose that we have a distribution over functions and we want to upper bound
That is, we have a random optimization problem with a fixed feasible set , and we want to know the typical value of the optimum. For example, could be the set of cuts of a vertex set , and describe a distribution of random graphs such that is the number of edges cut in a random graph by the cut . Then the above problem is to estimate the average value of the max cut in the random graphs of the distribution. Or could be the unit sphere and describe a distribution of random Hermitian matrices such that is the quadratic form of a random matrix evaluated at . In this case, the above problem is to estimate the average value of the largest eigenvalue of such a random matrix.
We will call the collection of random variables a random process, where is a random variable distributed according to .
If every , and every finite linear combination , has a Gaussian distribution, then we say that is a Gaussian process, and if, in addition, for every then we say that it is a centered Gaussian process.
If , and we define for a random standard Gaussian , then is a centered Gaussian process and, in this case, upper bounding is precisely the problem we studied before.
If and for a random standard Gaussian , then, for every , we have
and, by analogy, if is a centered Gaussian process, we will define the following distance function on :
If is a centered Gaussian process then one can prove that the above distance function is a semi-metric on .
We will not need this fact, but if is a centered Gaussian process and is finite, then there is an embedding , for some , such that the process can be equivalently defined as picking and setting , so that is also an isometric embedding of the above distance function into .
The arguments of the previous section apply to centered Gaussian processes without change, and so we have.
Theorem 2 Let be an arbitrary set, and be a centered Gaussian process. Let be a countable sequence of finite subsets of such that and . Then
where is the distance function and is the element of closest to according to .
3. Sub-Gaussian Random Processes
This theory does not seem to apply to problems such as bounding the max cut of a unweighted graph, or bounding the largest eigenvalue of a random symmetric matrix with entries, because such problems have a finite sample space and so cannot be modeled as Gaussian processes.
Fortunately, there is a notion of a sub-Gaussian process, which applies to such problems and which reduces their analysis to the analysis of a related Gaussian process.
First, recall that a centered real-valued random variable is sub-Gaussian if there is a centered Gaussian random variable whose tail dominates the tail of , that is, if we have two constants and such that, for all :
An equivalent condition is that there is a such that
In that case, we can define a norm, called the norm as
which is, roughly, the standard deviation of a centered Gaussian that dominates .
Example 1 All bounded random variables are sub-Gaussian.
Example 2 If
where the are independent Rademacher random variables, that is, if each is equally likely to be +1 or -1, then is sub-Gaussian with , which is within a constant factor of its actual standard deviation.
Example 3 If
where the are independent and each has probability of being equal to and probability of being equal to (that is, each is a centered Bernoulli random variable), then , which is much more than the standard deviation of when is small.
Example 4 If
where the are independent Rademacher random variables, and the are arbitrary real scalars, then , which is within a constant factor of the standard deviation that we would get by replacing each with a standard Gaussian.
Let now be a centered random process. We will say that a Gaussian process -dominates if, for every we have
That is, every random variable of the form is sub-Gaussian, and its tail is dominated by the tail of the Gaussian distribution
Theorem 3 (Talagrand’s comparison inequality) There is an absolute constant such that if is a centered random process that is -dominated by a centered Gaussian random process , then
Furthermore, for every ,
where is the diameter of with respect to the distance function .
The way to apply this theory is the following.
Suppose that we want estimate, on average or with high probability, the optimum of an optimization problem with feasible set over the randomness of the choice of a random instance. We model this problem like a centered random process in which is the difference between the cost of solution in a random instance minus the average cost of .
Then we think about a related random experiment, in which the random choices involved in constructing our instance are replaced by Gaussian choices (for example, instead of a random graph we may think of a complete graph with Gaussian weights on the edges chosen with expectation 1/2 and constant variance) and we let be the analogous process in this Gaussian model.
If we can argue that dominates , then it remains to estimate , which we can do either by the generic chaining theorem or by other methods.
4. An Example
We will now use this machinery to show that the largest eigenvalue of a random symmetric matrix with Rademacher entries is . This is certainly not the simplest way of proving such a result, but it will give a sense of how these techniques can be applied.
We let be the unit sphere.
Our Gaussian process will be to pick standard Gaussians , for each , define the matrix and let
for every .
Our “sub-Gaussian” random process is to pick Rademacher random variables , for each , define the matrix and let
for every .
We will argue that is -dominated by and that .
For the first claim, we see that for every , we can write as
So, as noted in one of our examples above, we can say that
and we see that
so that, indeed, is -dominated by .
Now we need to apply generic chaining to . It is very helpful to note that the distance function defined on by the Gaussian process is dominated by Euclidean distance between the vectors and , because
where we used the inequality
We can conclude that an -net over the unit Euclidean sphere is also a -net for the metric space . For the unit Euclidean sphere there is an -net of size at most . To apply generic chaining, let be an arbitrary subset of of cardinality if , and an -net with otherwise. Applying the generic chaining inequality,