In which we go over a more powerful (but difficult to compute) alternative to the spectral norm, and discuss how to approximate it.
Today we’ll discuss a solution to the issue of high-degree vertices distorting spectral norms, which will prepare us for next lecture’s discussion on community detection in the stochastic block model using SDP. We’ll discuss a new kind of norm, the infinity-to-one norm, and find an efficient way to approximate it using SDP.
1. Fun with Norms
In the past few lectures, we’ve been heavily relying on the spectral norm,
which is efficiently computable and turns out to be pretty handy in a lot of cases.
Unfortunately, high-degree vertices have a disproportionately large influence on the spectral norm’s value, limiting its usefulness in graphs where such outliers exist. Often (as we did in lecture 9), people will try to modify the input so that there are no vertices of overly high degree or add some regularizing term to ameliorate this issue. Unfortunately, this can lead to less useful results – in lecture 9, for instance, we derived a bound that required that all high-degree vertices be excised from the input graph first.
In this lecture, we’ll attack this problem in a different way by introducing a different norm, the infinity-to-one norm, defined as follows:
It can be shown that
So spectral norm always gives us a bound for the infinity-to-one norm. However, the infinity-to-one norm can come in even more handy than the spectral norm (if we can actually calculate it):
Theorem 1 Let be a symmetric matrix with entries in . Pick a random graph such that is in w.p. . Then whp:
For context, recall that we proved a similar bound in Lecture 9 for the spectral norm, but we required that all nodes with degree greater than twice the average degree be removed first. The zero-to-one norm allows us to bypass that restriction entirely.
Proof: Fix . Examine the expression
We want to make sure that this probability exponentially decreases w.r.t. .
Recall that Bernstein’s inequality states that given independent random variables, absolutely bounded by , with expectation , we have .
is either or and therefore is bounded by . Since , we can have the from Bernstein’s inequality take on value . Combining this with the fact that
Bernstein’s inequality gives us
provided that , as desired.
So this norm allows us to easily sidestep the issue of high-degree vertices. The problem is that it’s NP-hard to compute.
2. Grothendieck’s Inequality
However, it turns out that we can approximate the infinity-to-one norm to within a constant factor using SDP:
Theorem 2 (Grothendieck’s Inequality) There exists some (turns out to be around , but we won’t worry about its exact value here) such that for all ,
So instead of dealing the combinatorially huge problem of optimizing over all vectors, we can just solve an SDP instead to get a good approximation. For convenience, let’s denote the quantity on the left of the above expression as .
Let’s start with a warmup lemma. The proof techniques in this lemma – specifically, the trick of replacing a vector from a continuous distribution with a random vector from some discrete distribution, and then taking the expectation to relate the two quantities, will come in handy later on.
Proof: It is obvious that the expression on the left is at least the right (since it’s just a relaxation). In order to show that the right is at least the left: given some continuous vectors , we can find discrete , vectors such that their expectations are equal to . We can do that by having take on value w.p. and w.p. , likewise for .
That means that:
which gives us the desired result.
Fact 4 is a norm.
Proof: Multiplicative scaling: obvious.
Nonnegativity (except iff ): It is obvious that the SDP norm is zero if is zero.
Now suppose is nonzero.
Notice that we can replace the constraints in the SDP norm requiring that , with , . Why? We’ll use the same trick as we did in the proof of Lemma 3:
Obviously maximizing over , will give us at least as good a result as maximizing over , , since it’s a relaxation, so it suffices to show that if we can obtain some value for using , , we can do at least as well using , .
Let’s suppose we have some vectors with length at most . Now let’s replace them with random vectors of length exactly whose expectation are respectively (just scale the up, and have be either the scaled value or its negative with probability calibrated appropriately so their expectations work out to be ). Then we can just say:
So must take on some values (of length exactly ) that make , as desired.
Since we assumed is nonzero, let , be such that . If we set to be some arbitrary vector of unit length – using a plus if is positive and a minus if it’s negative – and all other to zero (which we can do without affecting the value of the max by the fact we just proved), we can immediately see that
giving a positive lower bound for the maximizer, as desired.
Triangle inequality: Just look at the and that maximize for and , and observe that
since we can always match the quantity on the left hand side by choosing the same and for both terms on the right-hand-side.
2.1. Proof of Grothendieck’s inequality
Now let’s prove Grothendieck’s inequality:
Proof: Observe that, by Lemma 3, maximizing over the choice of in the zero-to-one norm is equivalent to maximizing over the choice of , so we can rewrite our proof goal as
Let be of length and be the optimal choices used in , i.e. the maximizers of . It suffices to prove that there exist vectors (for some fixed constant , since we can just scale the result by changing ) to plug into the right-hand side of the above expression such that .
Pick , with each coordinate being drawn from the normal distribution with mean and variance , and let and . Then
But is just the identity, since the diagonal elements are just the expectation of the square of the Gaussian (which is , its variance) and the off-diagonals are the expectation of the product of two independent Gaussians, which is zero (since the expectation of each individual Gaussian is zero).
At this point we’ve got something that looks a lot like what we want: if the expectation of is equal to , then maximizing over them is definitely going to give us a quantity greater than or equal to . Unfortunately, there’s an issue here: since Gaussian random variables are unbounded, and are unbounded; on the other hand, what we’re trying to do is maximize over vectors whose elements are bounded by .
Our approach will be to “clip” the and to a finite range, and then bound the error introduced by the clipping process. Formally, fix a constant and pick a Gaussian random vector . Now define a ‘clipped’ inner product:
and likewise for . For convenience, let’s define the truncation error as follows, to represent how far the clipped value differs from the actual value.
So we can rewrite as:
and similarly we can define:
So now we have
Clearly is just the SDP norm , since and were the ‘original’ and .
What we will show now is that the remaining three terms are bounded by constant factors of SDP norm, so the entire sum of all four terms getting a constant factor approximation of it. The analysis is the same for all three terms so, for brevity, we’ll just look at the last one:
For convenience, let’s define as:
Now, it’s convenient to think of the and as vectors of infinite dimension indexed by input (we’ll bring the dimensionality down to a finite value at the end of the proof). Let’s define the following inner product and norm in this space:
Now, since the Gaussian distribution is rotation-independent (we can just rotate a Gaussian random variable around without changing its distribution), the squared norms of the and the are all the same (since all the , have length , and dotting with them can be thought of a rotation). That means that the above norm takes on the same value for all , , so all we need to do is figure out a constant bound on it.
Fortunately, this value is pretty easy to bound. Notice that the function is zero if the dot product’s absolute value is smaller than . If we have ,
for sufficiently large , which is a constant.
So if the squared norms are bounded above by , which means we can substitute this and the norm bound into the above expression to get:
Now, armed with this, we can conclude (with similar results for the other two error terms)
which tells us that , is within a constant bound of the SDP norm as desired.
There’s only one slightly fishy bit in the proof we used above, though, and that’s the treatment of functions of infinite-dimensional vectors indexed by a Gaussian vector. Let’s conclude by constructing a (finite-dimensional) solution to the SDP from the functions:
Claim: If , then .
So the matrix comprises a nice finite-dimensional solution to the SDP, and we’re done with the proof.
Also, noticing that we have four terms in the expansion of , each one of which is a feasible value for the SDP, we can figure out how much we deviate from the actual optimum, i.e. . Since each term can’t exceed the value of the SDP optimum at all, is separated from by a factor of four, giving us a bound on the constant in the inequality.
To recap, notice that there were two key “tricks” here:
1) Assuming that we were rounding an optimal solution to our SDP (i.e. starting with as optimizers). We don’t get any bounds otherwise!
2) Treating the rounding error itself as a feasible solution of the SDP.
This proof was communicated to us by James Lee.