Welcome to phase two of in theory, in which we again talk about math. I spent last Fall teaching two courses and getting settled, I mostly traveled in January and February, and I have spent the last two months on my sofa catching up on TV series. Hence I will reach back to last Spring, when I learned about Talagrand’s machinery of generic chaining and majorizing measures from Nikhil Bansal, in the context of our work with Ola Svensson on graph and hypergraph sparsification. Here I would like to record what I understood about the machinery, and in a follow-up post I plan to explain the application to hypergraph sparsification.
1. A Concrete Setting
Starting from a very concrete setting, suppose that we have a subset $T \subseteq \mathbb{R}^n$, we pick a random Gaussian vector $g = (g_1,\ldots,g_n)$ from the standard Gaussian distribution $N(0,I_n)$ over $\mathbb{R}^n$, and we are interested in the random variable

$$\sup_{t \in T} \ \langle g, t \rangle \qquad (1)$$
In theoretical computer science, for example, a random variable like (1) comes up often in the study of rounding algorithms for semidefinite programming, but this is a problem of much broader interest.
We will be interested both in bounds on the expectation of (1) and in bounds on its tail, but in this post we will mostly reason about its expectation.
A first observation is that each $\langle g, t \rangle$ is Gaussian with mean zero and variance $\|t\|^2$. If $T$ is finite, we can use a union bound to estimate the tail of $\sup_{t\in T} \langle g, t\rangle$ as

$$\Pr \left[ \sup_{t\in T} \langle g, t \rangle \geq \ell \right] \ \leq \ |T| \cdot e^{-\ell^2 / (2 \max_{t\in T} \|t\|^2)}$$

and we can compute the upper bound

$$\mathop{\mathbb{E}} \sup_{t\in T} \langle g, t \rangle \ \leq \ O\left( \sqrt{\log |T|} \right) \cdot \max_{t\in T} \|t\| \qquad (2)$$

The above bound can be tight, but it is poor if the points of $T$ are densely clustered, and it is useless if $T$ is infinite.
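As a quick sanity check, here is a small Monte Carlo sketch (Python/NumPy, with the sets, the sample counts, and the concrete estimate $\sqrt{2\ln|T|}\cdot\max_t\|t\|$ behind (2) chosen only for illustration) comparing the empirical value of (1) with the union-bound estimate, both for a well-spread set and for a densely clustered one where the bound is loose.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_sup(T, samples=2000):
    """Monte Carlo estimate of E[ sup_{t in T} <g, t> ] for standard Gaussian g."""
    n = T.shape[1]
    G = rng.standard_normal((samples, n))
    return (G @ T.T).max(axis=1).mean()

def union_bound_estimate(T):
    """The estimate sqrt(2 ln|T|) * max_t ||t|| behind inequality (2)."""
    norms = np.linalg.norm(T, axis=1)
    return np.sqrt(2 * np.log(len(T))) * norms.max()

n, m = 50, 200
spread = rng.standard_normal((m, n))                               # well-spread points
clustered = np.ones((m, n)) + 0.01 * rng.standard_normal((m, n))   # densely clustered points

for name, T in [("spread", spread), ("clustered", clustered)]:
    print(name, "empirical:", empirical_sup(T), "union bound:", union_bound_estimate(T))
```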
It is useful to note that, if we fix, arbitrarily, an element $t_0 \in T$, then we have

$$\mathop{\mathbb{E}} \sup_{t\in T} \langle g, t \rangle \ = \ \mathop{\mathbb{E}} \sup_{t\in T} \langle g, t - t_0 \rangle$$

because $\mathop{\mathbb{E}} \langle g, t_0 \rangle = 0$. The latter expression is nicer to work with because it makes it more explicit that what we are trying to compute is invariant under shifts of $T$, and only depends on the pairwise distances of the elements of $T$, rather than on their norms.
In the cases in which (2) gives a poor bound, a natural approach is to reason about an $\epsilon$-net $T_1$ of $T$, that is, a subset $T_1 \subseteq T$ such that for every $t \in T$ there is an element $\pi_1(t) \in T_1$ such that $\| t - \pi_1(t) \| \leq \epsilon$. Then we can say that

$$\mathop{\mathbb{E}} \sup_{t\in T} \langle g, t \rangle \ \leq \ \mathop{\mathbb{E}} \sup_{t\in T_1} \langle g, t \rangle + \mathop{\mathbb{E}} \sup_{t\in T} \langle g, t - \pi_1(t) \rangle \ \leq \ O\left( \sqrt{\log |T_1|} \right) \cdot \max_{t\in T_1} \|t\| + O\left( \sqrt{\log |T|} \right) \cdot \epsilon$$

which can be a much tighter bound. Notice that we used (2) to bound $\mathop{\mathbb{E}} \sup_{t\in T_1} \langle g, t \rangle$, but it might actually be better to find an $\epsilon'$-net $T_2$ of $T_1$, and so on. In general, a tighter analysis would be to choose a sequence of nested sets $T_0 \subseteq T_1 \subseteq \cdots \subseteq T$, where $|T_0| = 1$, $T_k = T$ for all sufficiently large $k$, and we have that $T_k$ is an $\epsilon_k$-net of $T$, that is, for every element $t$ of $T$ there is an element $\pi_k(t) \in T_k$ such that $\| t - \pi_k(t) \| \leq \epsilon_k$. Then, by generalizing the above reasoning, we get

$$\mathop{\mathbb{E}} \sup_{t\in T} \langle g, t \rangle \ \leq \ \sum_{k \geq 1} \mathop{\mathbb{E}} \sup_{t\in T} \langle g, \pi_k(t) - \pi_{k-1}(t) \rangle \ \leq \ O(1) \cdot \sum_{k \geq 1} \sqrt{\log |T_k|} \cdot \epsilon_{k-1}$$
Finally, if the cardinality of the $T_k$ grows sufficiently fast, namely, if we have $|T_k| \geq |T_{k-1}|^2$, it is possible to refine the estimate to

$$\mathop{\mathbb{E}} \sup_{t\in T} \langle g, t \rangle \ \leq \ O(1) \cdot \sup_{t\in T} \ \sum_{k \geq 1} \sqrt{\log |T_k|} \cdot \| t - \pi_{k-1}(t) \|$$

where $\pi_k(t)$ is the closest element to $t$ in $T_k$. This is done not by writing the expectation of the sup of a sum as a sum of expectations of sups and then using (2), but by bounding the tail of the sup of the sum directly.
At this point, we do not even need to assume that $T_k = T$ for sufficiently large $k$, that $T$ is finite, or that the sequence of sets is finite, and we have the following result.

Theorem 1 (Talagrand's generic chaining inequality) Let $T \subseteq \mathbb{R}^n$ be an arbitrary set, let $T_0, T_1, T_2, \ldots$ be a countable sequence of finite subsets of $T$ such that $|T_0| = 1$ and $|T_k| \leq 2^{2^k}$. Then

$$\mathop{\mathbb{E}} \sup_{t\in T} \langle g, t \rangle \ \leq \ O(1) \cdot \sup_{t\in T} \ \sum_{k \geq 0} 2^{k/2} \cdot \| t - \pi_k(t) \|$$

where $\pi_k(t)$ is the element of $T_k$ closest to $t$.
A short complete proof is in these notes by James Lee.
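To make the statement concrete, here is a small numerical sketch (Python, with an ad hoc greedy farthest-point construction of the sets $T_k$, chosen only for illustration and not claimed to be a good choice) that evaluates the chaining functional $\sup_t \sum_k 2^{k/2} \|t - \pi_k(t)\|$ on a finite random set and compares it, up to constants, with a Monte Carlo estimate of $\mathop{\mathbb{E}} \sup_{t\in T} \langle g, t\rangle$.

```python
import numpy as np

rng = np.random.default_rng(1)

def greedy_nets(T, max_level=8):
    """Nested subsets T_0, T_1, ... with |T_k| <= 2^(2^k), built by greedy
    farthest-point selection (one arbitrary heuristic; Theorem 1 allows any choice)."""
    order = [0]                          # start from an arbitrary point
    dists = np.linalg.norm(T - T[0], axis=1)
    while len(order) < len(T):
        i = int(dists.argmax())
        order.append(i)
        dists = np.minimum(dists, np.linalg.norm(T - T[i], axis=1))
    return [np.array(order[: min(2 ** (2 ** k), len(T))]) for k in range(max_level)]

def chaining_functional(T, nets):
    """sup_t sum_k 2^(k/2) * ||t - pi_k(t)||, with pi_k(t) the closest point of T_k."""
    total = np.zeros(len(T))
    for k, idx in enumerate(nets):
        d_to_net = np.linalg.norm(T[:, None, :] - T[idx][None, :, :], axis=2).min(axis=1)
        total += 2 ** (k / 2) * d_to_net
    return total.max()

def empirical_sup(T, samples=2000):
    G = rng.standard_normal((samples, T.shape[1]))
    return (G @ T.T).max(axis=1).mean()

T = rng.standard_normal((300, 20))
print("E sup <g,t> (Monte Carlo)  :", empirical_sup(T))
print("chaining bound functional  :", chaining_functional(T, greedy_nets(T)))
```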
While the above Theorem has a very simple proof, the amazing thing, which is rather harder to prove, is that it is tight, in the sense that for every $T$ there is a sequence of sets $T_0, T_1, T_2, \ldots$ such that the bound of the above theorem has a matching lower bound, up to an absolute constant. This is why it is called generic chaining. Chaining because the projection $\langle g, t \rangle$ of $g$ on $t$ is estimated based on the "chain" $\langle g, \pi_0(t)\rangle, \langle g, \pi_1(t)\rangle, \langle g, \pi_2(t)\rangle, \ldots$ of projections of the intermediate steps of a path that goes from $t_0$ to $t$ passing through the $T_k$. Generic because this upper bound technique works as well as any other possible upper bound, up to an absolute constant.
2. An Abstract Setting
Let now $T$ be a completely arbitrary set, and suppose that we have a distribution $D$ over functions $f : T \rightarrow \mathbb{R}$ and we want to upper bound

$$\mathop{\mathbb{E}}_{f \sim D} \ \sup_{t\in T} f(t)$$

That is, we have a random optimization problem with a fixed feasible set $T$, and we want to know the typical value of the optimum. For example, $T$ could be the set of cuts of a vertex set $V$, and $D$ could describe a distribution of random graphs over $V$ such that $f(t)$ is the number of edges cut in a random graph by the cut $t$. Then the above problem is to estimate the average value of the max cut in the random graphs of the distribution. Or $T$ could be the unit sphere $\{ x\in \mathbb{R}^n : \|x\| = 1 \}$ and $D$ could describe a distribution of random Hermitian matrices such that $f(x)$ is the quadratic form of a random matrix evaluated at $x$. In this case, the above problem is to estimate the average value of the largest eigenvalue of such a random matrix.
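As a toy illustration of such a random optimization problem (a brute-force Python sketch, feasible only for very small graphs, with the graph size and edge probability picked arbitrarily), one can estimate the average max cut of a random graph $G(10, 1/2)$ by enumerating all cuts.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)

def max_cut_value(adj):
    """Brute-force max cut of a graph given by its adjacency matrix (tiny n only)."""
    n = adj.shape[0]
    best = 0
    for subset in itertools.product([0, 1], repeat=n):
        s = np.array(subset)
        # count edges whose endpoints lie on opposite sides of the cut
        cut = np.sum(adj * np.not_equal.outer(s, s)) // 2
        best = max(best, cut)
    return best

def average_max_cut(n=10, p=0.5, trials=50):
    values = []
    for _ in range(trials):
        upper = np.triu(rng.random((n, n)) < p, k=1)
        adj = (upper + upper.T).astype(int)    # random symmetric 0/1 adjacency matrix
        values.append(max_cut_value(adj))
    return np.mean(values)

print("estimated E[max cut] for G(10, 1/2):", average_max_cut())
```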
We will call the collection of random variables $\{ X_t \}_{t\in T}$ a random process, where $X_t = f(t)$ is a random variable distributed according to the choice of $f \sim D$.

If every $X_t$, and every finite linear combination $\sum_i a_i X_{t_i}$, has a Gaussian distribution, then we say that $\{ X_t \}_{t\in T}$ is a Gaussian process, and if, in addition, $\mathop{\mathbb{E}} X_t = 0$ for every $t$ then we say that it is a centered Gaussian process.
If $T \subseteq \mathbb{R}^n$, and we define $X_t := \langle g, t \rangle$ for a random standard Gaussian $g \sim N(0,I_n)$, then $\{ X_t \}_{t\in T}$ is a centered Gaussian process and, in this case, upper bounding $\mathop{\mathbb{E}} \sup_{t\in T} X_t$ is precisely the problem we studied before.

If $T \subseteq \mathbb{R}^n$ and $X_t = \langle g, t \rangle$ for a random standard Gaussian $g$, then, for every $t_1, t_2 \in T$, we have

$$\mathop{\mathbb{E}} \left( X_{t_1} - X_{t_2} \right)^2 \ = \ \| t_1 - t_2 \|^2$$

and, by analogy, if $\{X_t\}_{t\in T}$ is a centered Gaussian process, we will define the following distance function on $T$:

$$d(t_1, t_2) \ := \ \sqrt{ \mathop{\mathbb{E}} \left( X_{t_1} - X_{t_2} \right)^2 }$$

If $\{X_t\}_{t\in T}$ is a centered Gaussian process then one can prove that the above distance function is a semi-metric on $T$.
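As a quick numerical check of the identity above (a Python sketch with an arbitrary pair of vectors), one can verify by Monte Carlo that $\mathop{\mathbb{E}}(X_{t_1} - X_{t_2})^2$ matches $\|t_1 - t_2\|^2$ for the canonical process $X_t = \langle g, t\rangle$.

```python
import numpy as np

rng = np.random.default_rng(3)

n, samples = 30, 200_000
t1, t2 = rng.standard_normal(n), rng.standard_normal(n)

G = rng.standard_normal((samples, n))     # rows are independent draws of g
diffs = G @ t1 - G @ t2                   # samples of X_{t1} - X_{t2}
print("empirical E (X_t1 - X_t2)^2 :", np.mean(diffs ** 2))
print("||t1 - t2||^2               :", np.sum((t1 - t2) ** 2))
```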
We will not need this fact, but if $\{X_t\}_{t\in T}$ is a centered Gaussian process and $T$ is finite, then there is an embedding $F : T \rightarrow \mathbb{R}^m$, for some $m$, such that the process can be equivalently defined as picking $g \sim N(0,I_m)$ and setting $X_t = \langle g, F(t) \rangle$, so that $F$ is also an isometric embedding of the above distance function into $\mathbb{R}^m$ with the Euclidean distance.
The arguments of the previous section apply to centered Gaussian processes without change, and so we have the following.

Theorem 2 Let $T$ be an arbitrary set, and let $\{X_t\}_{t\in T}$ be a centered Gaussian process. Let $T_0, T_1, T_2, \ldots$ be a countable sequence of finite subsets of $T$ such that $|T_0| = 1$ and $|T_k| \leq 2^{2^k}$. Then

$$\mathop{\mathbb{E}} \sup_{t\in T} X_t \ \leq \ O(1) \cdot \sup_{t\in T} \ \sum_{k \geq 0} 2^{k/2} \cdot d(t, \pi_k(t))$$

where $d$ is the distance function $d(t_1,t_2) = \sqrt{ \mathop{\mathbb{E}} (X_{t_1} - X_{t_2})^2 }$ and $\pi_k(t)$ is the element of $T_k$ closest to $t$ according to $d$.
3. Sub-Gaussian Random Processes
This theory does not seem to apply to problems such as bounding the max cut of a random unweighted graph, or bounding the largest eigenvalue of a random symmetric matrix with $\pm 1$ entries, because such problems have a finite sample space and so cannot be modeled as Gaussian processes.
Fortunately, there is a notion of a sub-Gaussian process, which applies to such problems and which reduces their analysis to the analysis of a related Gaussian process.
First, recall that a centered real-valued random variable $X$ is sub-Gaussian if there is a centered Gaussian random variable whose tail dominates the tail of $X$, that is, if we have two constants $c$ and $\sigma > 0$ such that, for all $\ell > 0$:

$$\Pr [ \, |X| \geq \ell \, ] \ \leq \ c \cdot e^{-\ell^2 / (2\sigma^2)}$$

An equivalent condition is that there is a $\sigma > 0$ such that

$$\mathop{\mathbb{E}} \ e^{X^2 / \sigma^2} \ \leq \ 2$$

In that case, we can define a norm, called the $\psi_2$ norm, as

$$\| X \|_{\psi_2} \ := \ \inf \left\{ \sigma > 0 \ : \ \mathop{\mathbb{E}} e^{X^2/\sigma^2} \leq 2 \right\}$$

which is, roughly, the standard deviation of a centered Gaussian that dominates $X$.
Example 1 All bounded random variables are sub-Gaussian.
Example 2 If $X = r_1 + \cdots + r_n$, where the $r_i$ are independent Rademacher random variables, that is, if each $r_i$ is equally likely to be $+1$ or $-1$, then $X$ is sub-Gaussian with $\|X\|_{\psi_2} \leq O(\sqrt n)$, which is within a constant factor of its actual standard deviation $\sqrt n$.
Example 3 If $X = x_1 + \cdots + x_n$, where the $x_i$ are independent and each $x_i$ has probability $p$ of being equal to $1-p$ and probability $1-p$ of being equal to $-p$ (that is, each $x_i$ is a centered Bernoulli random variable of parameter $p$), then $\|X\|_{\psi_2} \geq \Omega\left( \sqrt{n / \log (1/p)} \right)$, which is much more than the standard deviation $\sqrt{n p (1-p)}$ of $X$ when $p$ is small.
Example 4 If $X = a_1 r_1 + \cdots + a_n r_n$, where the $r_i$ are independent Rademacher random variables, and the $a_i$ are arbitrary real scalars, then $\|X\|_{\psi_2} \leq O\left( \sqrt{ \sum_i a_i^2 } \right)$, which is within a constant factor of the standard deviation that we would get by replacing each $r_i$ with a standard Gaussian.
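These examples can be checked numerically. The sketch below (Python, using an ad hoc Monte Carlo bisection estimate of the $\psi_2$ norm, which is only a rough proxy for the true norm and undersamples rare large values) compares the estimated $\psi_2$ norm with the standard deviation for a Rademacher sum (Example 2) and for a sum of centered Bernoulli variables with small $p$ (Example 3).

```python
import numpy as np

rng = np.random.default_rng(4)

def psi2_estimate(samples, lo=1e-3, hi=1e3, iters=60):
    """Bisection for the smallest sigma with E[exp(X^2/sigma^2)] <= 2, estimated
    from Monte Carlo samples of X (a crude proxy for the true psi_2 norm)."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        with np.errstate(over="ignore"):      # overflow just means "sigma too small"
            ok = np.mean(np.exp((samples / mid) ** 2)) <= 2
        lo, hi = (lo, mid) if ok else (mid, hi)
    return hi

n, trials = 100, 100_000

# Example 2: sum of n independent Rademacher variables
rademacher_sums = (2 * rng.integers(0, 2, size=(trials, n)) - 1).sum(axis=1).astype(float)

# Example 3: sum of n independent centered Bernoulli(p) variables, with small p;
# note that the rare large values that drive its psi_2 norm are undersampled here
p = 0.01
bernoulli_sums = ((rng.random((trials, n)) < p).astype(float) - p).sum(axis=1)

for name, S in [("Rademacher sum", rademacher_sums), ("centered Bernoulli sum", bernoulli_sums)]:
    print(f"{name:22s} std: {S.std():6.2f}   estimated psi_2: {psi2_estimate(S):6.2f}")
```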
Let now $\{ X_t \}_{t\in T}$ be a centered random process. We will say that a centered Gaussian process $\{ G_t \}_{t\in T}$ $K$-dominates $\{ X_t \}_{t\in T}$ if, for every $t_1, t_2 \in T$ we have

$$\| X_{t_1} - X_{t_2} \|_{\psi_2} \ \leq \ K \cdot \sqrt{ \mathop{\mathbb{E}} \left( G_{t_1} - G_{t_2} \right)^2 }$$

That is, every random variable of the form $X_{t_1} - X_{t_2}$ is sub-Gaussian, and its tail is dominated, up to the constant $K$, by the tail of the Gaussian random variable $G_{t_1} - G_{t_2}$.
Theorem 3 (Talagrand's comparison inequality) There is an absolute constant $C$ such that if $\{X_t\}_{t\in T}$ is a centered random process that is $K$-dominated by a centered Gaussian random process $\{G_t\}_{t\in T}$, then

$$\mathop{\mathbb{E}} \sup_{t\in T} X_t \ \leq \ C \cdot K \cdot \mathop{\mathbb{E}} \sup_{t\in T} G_t$$

Furthermore, for every $u > 0$,

$$\Pr \left[ \sup_{t_1,t_2 \in T} \left( X_{t_1} - X_{t_2} \right) \ \geq \ C \cdot K \cdot \left( \mathop{\mathbb{E}} \sup_{t\in T} G_t + u \cdot \mathrm{diam}(T) \right) \right] \ \leq \ 2 e^{-u^2}$$

where $\mathrm{diam}(T)$ is the diameter of $T$ with respect to the distance function $d(t_1,t_2) = \sqrt{ \mathop{\mathbb{E}} (G_{t_1} - G_{t_2})^2 }$.
The way to apply this theory is the following.
Suppose that we want to estimate, on average or with high probability, the optimum of an optimization problem with feasible set $T$ over the randomness of the choice of a random instance. We model this problem as a centered random process $\{ X_t \}_{t\in T}$ in which $X_t$ is the difference between the cost of solution $t$ in a random instance and the average cost of $t$.

Then we think about a related random experiment, in which the random choices involved in constructing our instance are replaced by Gaussian choices (for example, instead of a random graph we may think of a complete graph with Gaussian weights on the edges, chosen with expectation 1/2 and constant variance), and we let $\{ G_t \}_{t\in T}$ be the analogous process in this Gaussian model.

If we can argue that $\{ G_t \}_{t\in T}$ $K$-dominates $\{ X_t \}_{t\in T}$ for some constant $K$, then it remains to estimate $\mathop{\mathbb{E}} \sup_{t\in T} G_t$, which we can do either by the generic chaining theorem or by other methods.
4. An Example
We will now use this machinery to show that the largest eigenvalue of an $n \times n$ random symmetric matrix with Rademacher entries is, on average, $O(\sqrt n)$. This is certainly not the simplest way of proving such a result, but it will give a sense of how these techniques can be applied.
We let $T = \{ x \in \mathbb{R}^n : \|x\| = 1 \}$ be the unit sphere.

Our Gaussian process will be to pick independent standard Gaussians $g_{i,j}$, for each $1 \leq i \leq j \leq n$, define the symmetric matrix $A$ such that $A_{i,j} = A_{j,i} = g_{i,j}$, and let

$$G_x := x^T A x$$

for every $x \in T$.

Our "sub-Gaussian" random process is to pick independent Rademacher random variables $r_{i,j}$, for each $1 \leq i \leq j \leq n$, define the symmetric matrix $R$ such that $R_{i,j} = R_{j,i} = r_{i,j}$, and let

$$X_x := x^T R x$$

for every $x \in T$.
We will argue that $\{ X_x \}_{x\in T}$ is $O(1)$-dominated by $\{ G_x \}_{x\in T}$ and that $\mathop{\mathbb{E}} \sup_{x\in T} G_x \leq O(\sqrt n)$.

For the first claim, we see that for every $x, y \in T$, we can write $X_x - X_y$ as

$$X_x - X_y \ = \ \sum_i r_{i,i} \cdot (x_i^2 - y_i^2) \ + \ 2 \sum_{i<j} r_{i,j} \cdot ( x_i x_j - y_i y_j )$$

So, as noted in one of our examples above, we can say that

$$\| X_x - X_y \|_{\psi_2} \ \leq \ O(1) \cdot \sqrt{ \sum_i (x_i^2 - y_i^2)^2 + 4 \sum_{i<j} ( x_i x_j - y_i y_j )^2 }$$

and we see that

$$\mathop{\mathbb{E}} \left( G_x - G_y \right)^2 \ = \ \sum_i (x_i^2 - y_i^2)^2 + 4 \sum_{i<j} ( x_i x_j - y_i y_j )^2$$

so that, indeed, $\{ X_x \}_{x\in T}$ is $O(1)$-dominated by $\{ G_x \}_{x\in T}$.
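The domination claim can also be checked numerically. The sketch below (Python, with the dimension, the number of samples, and the same crude Monte Carlo $\psi_2$ proxy as before chosen only for illustration) compares the estimated $\|X_x - X_y\|_{\psi_2}$ with the Gaussian increment distance $\sqrt{\mathop{\mathbb{E}}(G_x - G_y)^2}$ for a few random pairs of unit vectors.

```python
import numpy as np

rng = np.random.default_rng(5)

def psi2_estimate(samples, lo=1e-3, hi=1e3, iters=60):
    """Crude Monte Carlo proxy for the psi_2 norm: bisection for the smallest
    sigma such that the empirical mean of exp(X^2/sigma^2) is at most 2."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        with np.errstate(over="ignore"):      # overflow just means "sigma too small"
            ok = np.mean(np.exp((samples / mid) ** 2)) <= 2
        lo, hi = (lo, mid) if ok else (mid, hi)
    return hi

n, trials = 20, 20_000
ii, jj = np.triu_indices(n)                   # pairs i <= j indexing the independent entries

def increment_coeffs(x, y):
    """Coefficients c with X_x - X_y = sum_{i<=j} r_{ij} c_{ij}:
    weight 1 on the diagonal terms and 2 on the off-diagonal ones."""
    C = np.outer(x, x) - np.outer(y, y)
    w = np.where(ii == jj, 1.0, 2.0)
    return w * C[ii, jj]

for _ in range(3):
    x = rng.standard_normal(n); x /= np.linalg.norm(x)
    y = rng.standard_normal(n); y /= np.linalg.norm(y)
    c = increment_coeffs(x, y)
    samples = rng.choice([-1.0, 1.0], size=(trials, len(c))) @ c   # samples of X_x - X_y
    d_xy = np.linalg.norm(c)      # sqrt(E (G_x - G_y)^2) for the Gaussian counterpart
    print(f"estimated psi_2: {psi2_estimate(samples):.3f}   d(x,y): {d_xy:.3f}")
```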
Now we need to apply generic chaining to $\{ G_x \}_{x\in T}$. It is very helpful to note that the distance function defined on $T$ by the Gaussian process is dominated by the Euclidean distance between the vectors $x$ and $y$, because

$$d(x,y)^2 \ = \ \sum_i (x_i^2 - y_i^2)^2 + 4 \sum_{i<j} ( x_i x_j - y_i y_j )^2 \ \leq \ 2 \sum_{i,j} ( x_i x_j - y_i y_j )^2 \ = \ 2 \, \| x x^T - y y^T \|_F^2 \ \leq \ 8 \, \| x - y \|^2$$

where we used the inequality

$$\| x x^T - y y^T \|_F \ \leq \ \| x (x-y)^T \|_F + \| (x-y) y^T \|_F \ = \ \|x\| \cdot \|x-y\| + \|x-y\| \cdot \|y\| \ = \ 2 \, \| x - y \|$$

which holds because $x$ and $y$ are unit vectors.

We can conclude that an $\epsilon$-net over the unit Euclidean sphere is also a $2 \sqrt 2 \, \epsilon$-net for the metric space $(T,d)$. For the unit Euclidean sphere there is an $\epsilon$-net of size at most $(1 + 2/\epsilon)^n$. To apply generic chaining, let $T_k$ be an arbitrary subset of $T$ of cardinality $1$ if $2^k \leq n$, and an $\epsilon_k$-net of cardinality at most $2^{2^k}$, with $\epsilon_k = O( 2^{-2^k/n} )$, otherwise. Applying the generic chaining inequality,

$$\mathop{\mathbb{E}} \sup_{x\in T} G_x \ \leq \ O(1) \cdot \sup_{x\in T} \sum_{k \geq 0} 2^{k/2} \cdot d(x, \pi_k(x)) \ \leq \ O(1) \cdot \sum_{k :\, 2^k \leq n} 2^{k/2} \ + \ O(1) \cdot \sum_{k :\, 2^k > n} 2^{k/2} \cdot 2^{-2^k/n} \ \leq \ O( \sqrt n )$$

and so, by Theorem 3, $\mathop{\mathbb{E}} \sup_{x\in T} X_x$, that is, the average largest eigenvalue of our random matrix with Rademacher entries, is also at most $O(\sqrt n)$.
Very nice!
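As a final sanity check (a Python sketch with arbitrarily chosen matrix sizes), one can verify numerically that the largest eigenvalue of a random symmetric Rademacher matrix indeed grows like $\sqrt n$, and behaves very similarly to its Gaussian counterpart.

```python
import numpy as np

rng = np.random.default_rng(6)

def symmetrize(upper_entries, n):
    """Build a symmetric matrix from independently drawn upper-triangle entries."""
    M = np.zeros((n, n))
    M[np.triu_indices(n)] = upper_entries
    return M + M.T - np.diag(np.diag(M))

def avg_largest_eigenvalue(n, sampler, trials=20):
    m = n * (n + 1) // 2
    vals = [np.linalg.eigvalsh(symmetrize(sampler(m), n))[-1] for _ in range(trials)]
    return np.mean(vals)

rademacher = lambda m: rng.choice([-1.0, 1.0], size=m)
gaussian = lambda m: rng.standard_normal(m)

for n in [100, 200, 400, 800]:
    lam_r = avg_largest_eigenvalue(n, rademacher)
    lam_g = avg_largest_eigenvalue(n, gaussian)
    print(f"n={n:4d}  Rademacher: {lam_r:6.1f}  Gaussian: {lam_g:6.1f}  ratio to sqrt(n): {lam_r / np.sqrt(n):.2f}")
```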