In which we prove Bourgain’s theorem.
Today we prove the following theorem.
Theorem 1 (Bourgain) Let $d: V \times V \rightarrow \mathbb{R}$ be a semimetric defined over a finite set $V$. Then there exists a mapping $F: V \rightarrow \mathbb{R}^m$ such that, for every two elements $u,v \in V$,

$$ \frac{1}{c \cdot \log |V|} \cdot d(u,v) \ \leq \ || F(u) - F(v) ||_1 \ \leq \ d(u,v) $$

where $c$ is an absolute constant. Given $d$, the mapping $F$ can be found with high probability in randomized polynomial time in $|V|$.
Together with the results that we proved in the last lecture, this implies that an optimal solution to the Leighton-Rao relaxation can be rounded to an $O(\log n)$-approximate solution to the sparsest cut problem. This was the best known approximation algorithm for sparsest cut for 15 years, until the Arora-Rao-Vazirani algorithm, which will be our next topic.
The theorem has a rather short proof, but there is an element of “magic” to it. We will discuss several examples and we will see what approaches are suggested by the examples. At the end of the discussion, we will see the final proof as, hopefully, the “natural” outcome of the study of such examples and failed attempts.
1. Preliminary and Motivating Examples
A first observation is that embeddings of finite sets of points into L1 can be equivalently characterized as probabilistic embeddings into the real line.
Fact 2 For every finite set $V$, dimension $m$, and mapping $F: V \rightarrow \mathbb{R}^m$, there is a finitely-supported probability distribution $D_F$ over functions $f: V \rightarrow \mathbb{R}$ such that for every two points $u,v \in V$:

$$ \mathop{\mathbb{E}}_{f \sim D_F} |f(u) - f(v)| = \frac{1}{m} \cdot || F(u) - F(v) ||_1 $$

Conversely, for every finite set $V$ and finitely supported distribution $D$ over functions $f: V \rightarrow \mathbb{R}$, there is a dimension $m$ and a mapping $F: V \rightarrow \mathbb{R}^m$ such that

$$ || F(u) - F(v) ||_1 = \mathop{\mathbb{E}}_{f \sim D} |f(u) - f(v)| $$
Proof: For the first claim, we write $F_i(v)$ for the $i$-th coordinate of $F(v)$, that is $F(v) = (F_1(v), \ldots, F_m(v))$, and we define $D_F$ to be the uniform distribution over the $m$ functions of the form $x \rightarrow F_i(x)$.

For the second claim, if the support of $D$ is the set of functions $\{ f_1, \ldots, f_m \}$, where function $f_i$ has probability $p_i$, then we define $F(v) := (p_1 \cdot f_1(v), \ldots, p_m \cdot f_m(v))$.
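To make the two directions of Fact 2 concrete, here is a small Python sketch (an illustration of mine, not part of the lecture; the function names are made up) that performs both conversions and numerically checks the first identity:

```python
import itertools
import random

def l1_embedding_to_distribution(F):
    """Given F: V -> R^m as a dict v -> tuple of m coordinates, return the
    uniform distribution over the m coordinate functions, represented as a
    list of (probability, function) pairs."""
    m = len(next(iter(F.values())))
    return [(1.0 / m, (lambda i: (lambda v: F[v][i]))(i)) for i in range(m)]

def distribution_to_l1_embedding(dist, V):
    """Given a finitely supported distribution over functions f: V -> R, as a
    list of (p_i, f_i) pairs, return F(v) = (p_1 f_1(v), ..., p_m f_m(v))."""
    return {v: tuple(p * f(v) for p, f in dist) for v in V}

# Sanity check of the first identity on a random embedding of 5 points into R^3.
V = range(5)
F = {v: tuple(random.random() for _ in range(3)) for v in V}
dist = l1_embedding_to_distribution(F)
for u, v in itertools.combinations(V, 2):
    l1 = sum(abs(a - b) for a, b in zip(F[u], F[v]))
    exp = sum(p * abs(f(u) - f(v)) for p, f in dist)
    assert abs(exp - l1 / 3) < 1e-9  # E|f(u)-f(v)| = (1/m) ||F(u)-F(v)||_1
```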
It will be easier to reason about probabilistic mappings into the line, so we will switch to the latter setting from now on.
Our task is to associate a number $f(v)$ to every point $v$, and the information that we have about $v$ is the list of distances $d(v,w)$ to the other points. Probably the first idea that comes to mind is to pick a random reference vertex $r$, and work with the mapping $v \rightarrow d(r,v)$, possibly scaled by a multiplicative constant. (Equivalently, we can think about the deterministic mapping $F: V \rightarrow \mathbb{R}^{|V|}$, in which the vertex $v$ is mapped to the sequence $F(v) = (d(u_1,v), \ldots, d(u_n,v))$, for some enumeration $u_1,\ldots,u_n$ of the elements of $V$.)
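The examples below all analyze the quantity $\mathop{\mathbb{E}}_r |d(r,u) - d(r,v)|$ for a uniformly random reference vertex $r$. A minimal Python helper that computes this expectation exactly from a distance matrix (my own illustration; `d` is assumed to be a dict of dicts):

```python
def expected_gap(d, u, v):
    """E_r |d(r,u) - d(r,v)| over a uniformly random reference vertex r."""
    vertices = list(d.keys())
    return sum(abs(d[r][u] - d[r][v]) for r in vertices) / len(vertices)
```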
This works in certain simple cases.
Example 1 (Cycle) Suppose that $d$ is the shortest-path metric on a cycle; we can see that, for every two points $u,v$ on the cycle, $\mathop{\mathbb{E}}_r |d(r,u) - d(r,v)|$ is within a constant factor of their distance $d(u,v)$. (Try proving it rigorously!)
Example 2 (Simplex) Suppose that $d(u,v) = 1$ for every $u \neq v$, and $d(u,u) = 0$. Then, for every $u \neq v$, we have

$$ \mathop{\mathbb{E}}_r |d(r,u) - d(r,v)| = \frac{2}{n} \cdot d(u,v) $$

where $n = |V|$ (the reference vertex contributes only when $r \in \{u,v\}$), so, up to scaling, the mapping incurs no error at all.
But there are also simple examples in which this works very badly.
Example 3 (1-2 Metric) Suppose that for every $u \neq v$ we have $d(u,v) \in \{1,2\}$ (any distance function satisfying this property is always a metric) and that, in particular, there is a special vertex $z$ at distance 2 from all other vertices, while all other vertices are at distance 1 from each other. Then, for vertices $u,v$ both different from $z$ we have, as before,

$$ \mathop{\mathbb{E}}_r |d(r,u) - d(r,v)| = \frac{2}{n} \cdot d(u,v) $$

but for every $v$ different from $z$ we have

$$ \mathop{\mathbb{E}}_r |d(r,z) - d(r,v)| = \frac{n+2}{n} \geq \frac{1}{2} \cdot d(z,v) $$

and so our error is going to be $\Omega(n)$ instead of the $O(\log n)$ that we are trying to establish.
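As a quick numerical sanity check of Example 3 (and of the $\frac{2}{n}$ scaling from Example 2), the following snippet (my own illustration) evaluates $\mathop{\mathbb{E}}_r |d(r,u)-d(r,v)|$ exactly on a 1-2 metric with $n = 100$ points and special vertex $z$:

```python
n = 100
V = list(range(n))
z = 0  # the special vertex at distance 2 from all other vertices

def d(a, b):
    if a == b:
        return 0
    return 2 if z in (a, b) else 1

def expected_gap(u, v):
    return sum(abs(d(r, u) - d(r, v)) for r in V) / n

print(expected_gap(1, 2))  # 2/n = 0.02 = (2/n) * d(1,2): fine up to scaling
print(expected_gap(z, 1))  # (n+2)/n = 1.02, about half of d(z,1) = 2: scale mismatch
```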
Maybe the next simplest idea is that we should pick at random several reference points $r_1,\ldots,r_k$. But how do we combine the information $d(r_1,u),\ldots,d(r_k,u)$ into a single number to associate to $u$? If we just take the sum of the distances, we are back to the case of sampling a single reference point. (We are just scaling up the expectation by a factor of $k$.)
The next simplest way to combine the information is to take either the maximum or the minimum. If we take the minimum, we see that we have the very nice property that we immediately guarantee that our distances in the L1 embedding are no bigger than the original distances, so that it “only” remains to prove that the distances don’t get compressed too much.
Fact 3 Let $d: V \times V \rightarrow \mathbb{R}$ be a semimetric and $A \subseteq V$ be a non-empty subset of points. Define $d_A: V \rightarrow \mathbb{R}$ as

$$ d_A(v) := \min_{r \in A} d(r,v) $$

Then, for every two points $u,v \in V$ we have

$$ |d_A(u) - d_A(v)| \leq d(u,v) $$
Proof: Let $a$ be the point such that $d(a,u) = d_A(u)$ and $b$ be the point such that $d(b,v) = d_A(v)$. (It's possible that $a = b$.) Then

$$ d_A(u) - d_A(v) = d(a,u) - d(b,v) \leq d(b,u) - d(b,v) \leq d(u,v) $$

and, similarly,

$$ d_A(v) - d_A(u) = d(b,v) - d(a,u) \leq d(a,v) - d(a,u) \leq d(u,v) $$
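In code, the map $v \rightarrow d_A(v)$ of Fact 3 and a brute-force check of the inequality $|d_A(u) - d_A(v)| \leq d(u,v)$ look as follows (a sketch of mine; the test metric is generated by closing random weights under the triangle inequality):

```python
import itertools
import random

def d_A(d, A, v):
    """d_A(v) = min_{r in A} d(r, v), for a distance matrix d (dict of dicts)."""
    return min(d[r][v] for r in A)

# Random metric on 8 points: start from random symmetric weights and close
# them under the triangle inequality (Floyd-Warshall style).
n = 8
d = {i: {j: 0.0 for j in range(n)} for i in range(n)}
for i, j in itertools.combinations(range(n), 2):
    d[i][j] = d[j][i] = random.uniform(1, 10)
for k in range(n):
    for i in range(n):
        for j in range(n):
            d[i][j] = min(d[i][j], d[i][k] + d[k][j])

# Fact 3: |d_A(u) - d_A(v)| <= d(u,v) for every non-empty A and every pair u,v.
for _ in range(1000):
    A = random.sample(range(n), random.randint(1, n))
    u, v = random.sample(range(n), 2)
    assert abs(d_A(d, A, u) - d_A(d, A, v)) <= d[u][v] + 1e-9
```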
Is there a way to sample a set $A$ such that, for every two points $u,v$, the expectation $\mathop{\mathbb{E}}_A |d_A(u) - d_A(v)|$ is not too much smaller than $d(u,v)$? How large should the set $A$ be?
Example 4 (1-2 Metric Again) Suppose that for every $u \neq v$ we have $d(u,v) \in \{1,2\}$, and that we pick a subset $A \subseteq V$ uniformly at random, that is, each event $r \in A$ has probability $1/2$ and the events are mutually independent.

Then for every $u \neq v$:

$$ \mathop{\mathbb{E}}_A |d_A(u) - d_A(v)| \geq \frac{1}{4} \cdot d(u,v) $$

because with probability $1/2$ the set $A$ contains exactly one of the elements $u,v$, and conditioned on that event we have $|d_A(u) - d_A(v)| \geq 1$ (because one of $d_A(u), d_A(v)$ is zero and the other is at least one), which is at least $\frac{1}{2} \cdot d(u,v)$.
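A small Monte Carlo check of this bound on the 1-2 metric from Example 3 (my own sketch; the constants are only estimated, not proved, by this experiment):

```python
import random

n = 50
z = 0  # special vertex at distance 2 from everyone else

def d(a, b):
    if a == b:
        return 0
    return 2 if z in (a, b) else 1

def d_A(A, v):
    return min(d(r, v) for r in A)

def estimate(u, v, trials=20000):
    total = 0.0
    for _ in range(trials):
        A = [r for r in range(n) if random.random() < 0.5]
        if A:  # skip the (vanishingly unlikely) empty set
            total += abs(d_A(A, u) - d_A(A, v))
    return total / trials

print(estimate(1, 2), "vs d(1,2) =", d(1, 2))  # comfortably above d(u,v)/4
print(estimate(z, 1), "vs d(z,1) =", d(z, 1))  # also above d(u,v)/4
```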
If we pick $A$ uniformly at random, however, we incur an $\Omega(n)$ distortion in the case of the shortest path metric on the cycle. In all the examples seen so far, we can achieve constant distortion if we "mix" the distribution in which $A$ is a random set of size 1 and the one in which $A$ is chosen uniformly at random among all sets, say by sampling from the former with probability $1/2$ and from the latter with probability $1/2$.
Example 5 (Far-Away Clusters) Suppose now that $d$ has the following structure: $V$ is partitioned into clusters $B_1,\ldots,B_k$, where $|B_i| = 2^i$ (so $k = \Theta(\log n)$), and we have $d(u,v) = 1$ for vertices in the same cluster, and $d(u,v) = n$ for vertices in different clusters.

If $u,v$ are in the same cluster, then $d(u,v) = 1$ and, under the mixed distribution above,

$$ \mathop{\mathbb{E}}_A |d_A(u) - d_A(v)| \geq \frac{1}{8} $$

If $u,v$ are in different clusters $B_i, B_j$, then $d(u,v) = n$ and

$$ \mathop{\mathbb{E}}_A |d_A(u) - d_A(v)| \leq O( 2^i + 2^j + 1 ) $$

provided $2^i$ and $2^j$ are at least $\log n$, so that a uniformly random set intersects both clusters with high probability; when $2^i + 2^j$ is also much smaller than $n/\log n$, this is far below the $\Omega(d(u,v)/\log n)$ that we would need for an $O(\log n)$ distortion.
If we want to stick to this approach of picking a set $A$ of reference elements according to a certain distribution, and then defining the map $v \rightarrow d_A(v)$, then the distribution of $A$ must have the property that for every two sets $B_i, B_j$, there is at least a probability $p$ that $A$ intersects exactly one of $B_i, B_j$, and we would like $p$ to be as large as possible, because the distortion caused by the mapping will be at least $\Omega(1/p)$.
This suggests the following distribution $D$:

- Sample $t$ uniformly at random in $\{1,\ldots,\lceil \log_2 n \rceil\}$
- Sample $A \subseteq V$ by selecting each $v \in V$, independently, to be in $A$ with probability $2^{-t}$ and to be in $V - A$ with probability $1 - 2^{-t}$.

This distribution guarantees the above property with $p = \Omega(1/\log n)$.
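A direct Python sketch of the sampling procedure (my own naming; `vertices` is the point set $V$):

```python
import math
import random

def sample_reference_set(vertices):
    """Sample A from D: pick a scale t uniformly in {1, ..., ceil(log2 n)},
    then include each vertex in A independently with probability 2^(-t)."""
    n = len(vertices)
    t = random.randint(1, max(1, math.ceil(math.log2(n))))
    return {v for v in vertices if random.random() < 2.0 ** (-t)}
```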
Indeed, the above distribution guarantees a distortion at most $O(\log n)$ in all the examples encountered so far, including the tricky example of the clusters of different size. In each example, in fact, we can prove the following claim: for every two vertices $u,v$, there is a scale $t$, such that conditioned on that scale being chosen, the expectation of $|d_A(u) - d_A(v)|$ is at least a constant times $d(u,v)$. We could try to prove Bourgain's theorem by showing that this is true in every semimetric.
Let us call $D_t$ the conditional distribution of $D$ conditioned on the choice of a scale $t$. We would like to prove that for every semimetric $d$ and every two points $u,v$ there is a scale $t$ such that

$$ \mathop{\mathbb{E}}_{A \sim D_t} |d_A(u) - d_A(v)| \geq \Omega( d(u,v) ) $$

which, recalling that $|d_A(u) - d_A(v)| \leq d(u,v)$ for every set $A$, is equivalent to arguing that

$$ \mathop{\mathbb{P}}_{A \sim D_t} \left[ |d_A(u) - d_A(v)| \geq \Omega( d(u,v) ) \right] \geq \Omega(1) $$
For this to be true, there must be distances $r_1 > r_2$ such that $r_1 - r_2 \geq \Omega(d(u,v))$ and such that, with constant probability according to $D_t$, we have $d_A(u) \geq r_1$ and $d_A(v) \leq r_2$ (or vice-versa). For this to happen, there must be a constant probability that $A$ avoids the set $\{ z : d(u,z) < r_1 \}$ and intersects the set $\{ z : d(v,z) \leq r_2 \}$. For this to happen, both sets must have size roughly $2^t$.
This means that if we want to make this "at least one good scale for every pair of points" argument work, we need to show that for every two vertices $u,v$ there is a "large" distance $r_1$ and a "small" distance $r_2$ (whose difference is a constant times $d(u,v)$) such that a large-radius ball around one of the vertices and a small-radius ball around the other vertex contain roughly the same number of elements of $V$.
Consider, however, the following example.
Example 6 (Tree) Consider a complete binary tree, and the shortest-path metric $d$ in the tree. Take two vertices $u$ and $v$ at distance $\ell$, say both far away from the leaves (at distance at least $\ell$ from them). If we look at the ball of radius $\frac{1}{4} \ell$ around $u$ and the ball of radius $\frac{3}{4} \ell$ around $v$, we see that the former has roughly $2^{\frac{1}{4} \ell}$ points in it, and the latter has roughly $2^{\frac{3}{4} \ell}$ points: it is clearly hopeless to have constant probability of hitting the former and of missing the latter.
For every scale $t \leq \frac{1}{4} d(u,v)$, however, we have

$$ \mathop{\mathbb{E}}_{A \sim D_t} |d_A(u) - d_A(v)| \geq \Omega(1) $$

because there is a constant probability of hitting one of the roughly $2^t$ points at distance $\leq t$ from $u$, so that $d_A(u) \leq t$, and also a constant probability of missing the roughly $2^{t+1}$ points at distance $\leq t+1$ from $v$, in which case $d_A(v) > t+1$; the two balls are disjoint, so the two events are independent and happen simultaneously with constant probability. This is still good, because averaging over all scales we still get

$$ \mathop{\mathbb{E}}_{A \sim D} |d_A(u) - d_A(v)| \geq \Omega\left( \frac{d(u,v)}{\log n} \right) $$

but this example shows that the analysis cannot be restricted to one good scale but, in some cases, we have to take into account the contribution to the expectation coming from all the scales.
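To see the exponential ball growth behind Example 6 concretely, here is a short computation (my own illustration) of ball sizes around a mid-level vertex of a complete binary tree:

```python
from collections import deque

def complete_binary_tree(depth):
    """Adjacency lists of a complete binary tree on nodes 1 .. 2^(depth+1)-1 (heap indexing)."""
    n = 2 ** (depth + 1) - 1
    adj = {i: [] for i in range(1, n + 1)}
    for i in range(1, n + 1):
        for child in (2 * i, 2 * i + 1):
            if child <= n:
                adj[i].append(child)
                adj[child].append(i)
    return adj

def ball_size(adj, center, radius):
    """Number of vertices within shortest-path distance `radius` of `center` (BFS)."""
    dist = {center: 0}
    queue = deque([center])
    while queue:
        x = queue.popleft()
        if dist[x] < radius:
            for y in adj[x]:
                if y not in dist:
                    dist[y] = dist[x] + 1
                    queue.append(y)
    return len(dist)

adj = complete_binary_tree(16)
u = 2 ** 8  # a vertex halfway down the tree
for r in (4, 8, 12):
    print(r, ball_size(adj, u, r))  # sizes grow exponentially with the radius
```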
In the above example, the only way to get a ball around $u$ and a ball around $v$ with approximately the same number of points is to get balls of roughly the same radius. No single scale could then give a large contribution to the expectation $\mathop{\mathbb{E}}_{A \sim D} |d_A(u) - d_A(v)|$; every scale, however, gave a noticeable contribution, and adding them up we had a bounded distortion. The above example will be the template for the full proof, which will do an "amortized analysis" of the contribution to the expectation coming from each scale $t$, by looking at the radii that define a ball around $u$ and a ball around $v$ with approximately $2^t$ elements.
2. The Proof of Bourgain’s Theorem
Given Fact 2 and Fact 3, proving Bourgain’s theorem reduces to proving the following theorem.
Theorem 4 For a finite set of points $V$, consider the distribution $D$ over subsets of $V$ sampled by uniformly picking a scale $t \in \{1,\ldots,\lceil \log_2 |V| \rceil\}$ and then picking independently each $v \in V$ to be in $A$ with probability $2^{-t}$. Let $d: V \times V \rightarrow \mathbb{R}$ be a semimetric. Then for every $u,v \in V$,

$$ \mathop{\mathbb{E}}_{A \sim D} |d_A(u) - d_A(v)| \geq \frac{d(u,v)}{c \cdot \log |V|} $$

where $c$ is an absolute constant.
Proof: For each $t$, let $r^u_t$ be the distance from $u$ to the $2^t$-th closest point to $u$ (counting $u$ itself). That is,

$$ | \{ z : d(u,z) \leq r^u_t \} | \geq 2^t $$

and

$$ | \{ z : d(u,z) < r^u_t \} | < 2^t $$

and define $r^v_t$ similarly. (In particular, $r^u_0 = r^v_0 = 0$.) Let $t^*$ be the scale such that both $r^u_{t^*}$ and $r^v_{t^*}$ are smaller than $d(u,v)/3$, but at least one of $r^u_{t^*+1}$ or $r^v_{t^*+1}$ is $\geq d(u,v)/3$. (Such a scale exists, with $t^* + 1 \leq \lceil \log_2 |V| \rceil$, because the radii start at zero and, at the top scale, the ball around $u$ contains all of $V$ and in particular $v$.)
Define

$$ \hat r^u_t := \min \{ r^u_t , \ d(u,v)/3 \} $$

and similarly

$$ \hat r^v_t := \min \{ r^v_t , \ d(u,v)/3 \} $$
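For concreteness, the radii $r^u_t$ and their truncations $\hat r^u_t$ can be computed from a distance matrix as follows (an illustrative sketch of mine, with `d` again a dict of dicts):

```python
def radius(d, u, t):
    """r^u_t: distance from u to the 2^t-th closest point to u (counting u itself)."""
    dists = sorted(d[u][z] for z in d)  # includes d[u][u] = 0
    k = min(2 ** t, len(dists))         # cap at n for the largest scales
    return dists[k - 1]

def truncated_radius(d, u, v, t):
    """hat r^u_t = min(r^u_t, d(u,v)/3)."""
    return min(radius(d, u, t), d[u][v] / 3.0)
```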
We claim that there is an absolute constant $c_1 > 0$ such that for every scale $t \in \{0,\ldots,t^*\}$, we have

$$ \mathop{\mathbb{E}}_{A \sim D_{t+1}} |d_A(u) - d_A(v)| \geq c_1 \cdot \left( \hat r^u_{t+1} - \hat r^u_t + \hat r^v_{t+1} - \hat r^v_t \right) \ \ \ \ \ (1) $$

We prove the claim by showing that there are two disjoint events, each happening with probability $\Omega(1)$, such that in one event $|d_A(u) - d_A(v)| \geq \hat r^u_{t+1} - \hat r^v_t$, and in the other event $|d_A(u) - d_A(v)| \geq \hat r^v_{t+1} - \hat r^u_t$.
- The first event is that $A$ avoids the set $\{ z : d(u,z) < \hat r^u_{t+1} \}$ and intersects the set $\{ z : d(v,z) \leq \hat r^v_t \}$. The former set has size $< 2^{t+1}$ (it is contained in the open ball of radius $r^u_{t+1}$ around $u$), and the latter set has size $\geq 2^t$ (for $t \leq t^*$ we have $\hat r^v_t = r^v_t$); the sets are disjoint because we are looking at balls of radius $\leq d(u,v)/3$ around $u$ and $v$; so, at sampling rate $2^{-(t+1)}$, the event happens with a probability that is at least an absolute constant. When the event happens,

$$ |d_A(u) - d_A(v)| \geq d_A(u) - d_A(v) \geq \hat r^u_{t+1} - \hat r^v_t $$
- The second event is that $A$ avoids the set $\{ z : d(v,z) < \hat r^v_{t+1} \}$ and intersects the set $\{ z : d(u,z) \leq \hat r^u_t \}$. The former set has size $< 2^{t+1}$, and the latter set has size $\geq 2^t$; the sets are disjoint because we are looking at balls of radius $\leq d(u,v)/3$ around $u$ and $v$; so the event happens with a probability that is at least an absolute constant. When the event happens,

$$ |d_A(u) - d_A(v)| \geq d_A(v) - d_A(u) \geq \hat r^v_{t+1} - \hat r^u_t $$
So we have established (1). Averaging over all scales, and using the fact that the sequences $\hat r^u_t$ and $\hat r^v_t$ are non-decreasing with $\hat r^u_0 = \hat r^v_0 = 0$, we have

$$ \mathop{\mathbb{E}}_{A \sim D} |d_A(u) - d_A(v)| \geq \frac{1}{\lceil \log_2 |V| \rceil} \sum_{t=0}^{t^*} c_1 \left( \hat r^u_{t+1} - \hat r^u_t + \hat r^v_{t+1} - \hat r^v_t \right) = \frac{c_1 \left( \hat r^u_{t^*+1} + \hat r^v_{t^*+1} \right)}{\lceil \log_2 |V| \rceil} \geq \frac{c_1}{3} \cdot \frac{d(u,v)}{\lceil \log_2 |V| \rceil} $$

where the last inequality uses the fact that, by the definition of $t^*$, at least one of $\hat r^u_{t^*+1}, \hat r^v_{t^*+1}$ equals $d(u,v)/3$.
There is one remaining point to address. In Fact 2, we proved that a distribution over embeddings on the line can be turned into an L1 embedding, in which the number of dimensions is equal to the size of the support of the distribution. In our proof, we have used a distribution that ranges over exponentially many possible functions (one for each subset of $V$), so this would give rise to an embedding that uses a superpolynomial number of dimensions.

To fix this remaining problem, we sample $m = O(\log^2 |V|)$ sets $A_1,\ldots,A_m$ and we define the embedding $F(u) := ( d_{A_1}(u), \ldots, d_{A_m}(u) )$. It remains to prove that this randomized mapping has low distortion with high probability, which is an immediate consequence of the Chernoff bounds. Specifically, we use the following form of the Chernoff bound:
Lemma 5 Let $X_1,\ldots,X_m$ be independent nonnegative random variables such that, with probability 1, $X_i \leq M$. Let $X := X_1 + \cdots + X_m$. Then

$$ \mathop{\mathbb{P}} \left[ X \leq \frac{1}{2} \mathop{\mathbb{E}} X \right] \leq e^{- \frac{\mathop{\mathbb{E}} X}{8M} } $$
Let us look at any two vertices $u,v$. Clearly, for every choice of $A_1,\ldots,A_m$, we have

$$ || F(u) - F(v) ||_1 \leq m \cdot d(u,v) $$

so it remains to prove a lower bound to their L1 distance. Let us call $X$ the random variable denoting their L1 distance, that is

$$ X := || F(u) - F(v) ||_1 = \sum_{i=1}^m | d_{A_i}(u) - d_{A_i}(v) | $$

We can write $X = X_1 + \cdots + X_m$ where $X_i := |d_{A_i}(u) - d_{A_i}(v)|$, so that $X$ is the sum of identically distributed nonnegative random variables, such that

$$ X_i \leq d(u,v) \ \ \ \mbox{and} \ \ \ \mathop{\mathbb{E}} X_i \geq \frac{d(u,v)}{c \cdot \log |V|} $$

Applying the Chernoff bound with $M = d(u,v)$ and $\mathop{\mathbb{E}} X \geq \frac{m \cdot d(u,v)}{c \cdot \log |V|}$, we have

$$ \mathop{\mathbb{P}} \left[ X \leq \frac{m \cdot d(u,v)}{2 c \cdot \log |V|} \right] \leq \mathop{\mathbb{P}} \left[ X \leq \frac{1}{2} \mathop{\mathbb{E}} X \right] \leq e^{- \frac{m}{8 c \cdot \log |V|}} $$

which is, say, $\leq 1/|V|^3$ if we choose $m = c' \cdot \log^2 |V|$ for an absolute constant $c'$.
By taking a union bound over all pairs of vertices, we get that, with probability at least $1 - 1/|V|$, every pair $u,v$ satisfies

$$ \frac{m \cdot d(u,v)}{2 c \cdot \log |V|} \leq || F(u) - F(v) ||_1 \leq m \cdot d(u,v) $$

and so, after dividing every coordinate of the embedding by $m$, we obtain Bourgain's theorem.
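Putting the pieces together, a sketch of the whole randomized construction might look as follows (my own code and naming; it samples $m = O(\log^2 n)$ reference sets and maps every vertex to its vector of distances from them):

```python
import math
import random

def bourgain_embedding(vertices, d, c=8):
    """Embed (V, d) into L1 by mapping u to (d_{A_1}(u), ..., d_{A_m}(u)),
    where the A_i are sampled from the distribution D; d is a dict of dicts."""
    n = len(vertices)
    scales = max(1, math.ceil(math.log2(n)))
    m = c * scales * scales  # O(log^2 n) reference sets
    sets = []
    while len(sets) < m:
        t = random.randint(1, scales)
        A = [v for v in vertices if random.random() < 2.0 ** (-t)]
        if A:  # d_A is only defined for non-empty A
            sets.append(A)
    return {u: [min(d[r][u] for r in A) for A in sets] for u in vertices}

def l1_distance(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))
```

Dividing every coordinate by $m$ (or, equivalently, rescaling at the end) then gives, with high probability, an embedding satisfying the guarantee of Theorem 1.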