CS359G Lecture 7: Computing Eigenvectors

In which we analyze a nearly-linear time algorithm for finding an approximate eigenvector for the second eigenvalue of a graph adjacency matrix, to be used in the spectral partitioning algorithm.

In past lectures, we showed that, if ${G=(V,E)}$ is a ${d}$-regular graph, and ${M}$ is its normalized adjacency matrix with eigenvalues ${1=\lambda_1 \geq \lambda_2 \ldots \geq \lambda_n}$, given an eigenvector of ${\lambda_2}$, the algorithm SpectralPartition finds, in nearly-linear time ${O(|E| + |V|\log |V|)}$, a cut ${(S,V-S)}$ such that ${h(S) \leq 2\sqrt{h(G)}}$.

More generally, if, instead of being given an eigenvector ${{\bf x}}$ such that ${M{\bf x} = \lambda_2 {\bf x}}$, we are given a vector ${{\bf x} \perp {\bf 1}}$ such that ${{\bf x}^T M {\bf x} \geq (\lambda_2 - \epsilon) {\bf x}^T{\bf x}}$, then the algorithm finds a cut such that ${h(S) \leq \sqrt{4h(G) + 2\epsilon}}$. In this lecture we describe and analyze an algorithm that computes such a vector using ${O((|V|+|E|)\cdot \frac 1\epsilon \cdot \log \frac {|V|}{\epsilon})}$ arithmetic operations.

A symmetric matrix is positive semi-definite (abbreviated PSD) if all its eigenvalues are nonnegative. We begin by describing an algorithm that approximates the largest eigenvalue of a given symmetric PSD matrix. This might not seem to help very much because the adjacency matrix of a graph is not PSD, and because we want to compute the second largest, not the largest, eigenvalue. We will see, however, that the algorithm is easily modified to approximate the second eigenvalue of a PSD matrix (if an eigenvector of the first eigenvalue is known), and that the adjacency matrix of a graph can easily be modified to be PSD.

1. The Power Method to Approximate the Largest Eigenvalue

The algorithm works as follows

Algorithm Power

• Input: PSD symmetric matrix ${M\in {\mathbb R}^{n\times n}}$, positive integer ${t}$
• Pick uniformly at random ${{\bf x}_0 \sim \{-1,1\}^n}$
• for ${i:=1}$ to ${t}$
• ${{\bf x}_i := M{\bf x}_{i-1}}$
• return ${{\bf x}_t}$

That is, the algorithm simply picks uniformly at random a vector ${{\bf x}}$ with ${\pm 1}$ coordinates, and outputs ${M^t{\bf x}}$.

Note that the algorithm performs ${O(t\cdot (n+m))}$ arithmetic operations, where ${m}$ is the number of non-zero entries of the matrix ${M}$.

Theorem 1 For every PSD matrix ${M}$, positive integer ${t}$ and parameter ${\epsilon>0}$, with probability ${\geq 3/16}$ over the choice of ${{\bf x}_0}$, the algorithm Power outputs a vector ${{\bf x}_t}$ such that

$\displaystyle \frac{ {\bf x}_t^T M {\bf x}_t }{ {\bf x}_t^T{\bf x}_t}\geq \lambda_1 \cdot (1-\epsilon) \cdot \frac 1 {1+4n (1-\epsilon)^{2t}}$

where ${\lambda_1}$ is the largest eigenvalue of ${M}$.

Note that, in particular, we can have ${t=O(\log n/\epsilon)}$ and ${\frac{{\bf x}_t^TM{\bf x}_t}{{\bf x}_t^T{\bf x}_t} \geq (1-O(\epsilon)) \cdot \lambda_1}$.

Let ${\lambda_1 \geq \cdots \lambda_n}$ be the eigenvalues of ${M}$, with multiplicities, and ${{\bf v}_1,\ldots,{\bf v}_n}$ be a system of orthonormal eigenvectors such that ${M{\bf v}_i = \lambda_i {\bf v}_i}$. Theorem 1 is implied by the following two lemmas

Lemma 2 Let ${{\bf v} \in {\mathbb R}^n}$ be a vector such that ${||{\bf v}||=1}$. Sample uniformly ${{\bf x} \sim \{-1,1\}^n}$. Then

$\displaystyle \mathop{\mathbb P} \left[ |\langle {\bf x},{\bf v}\rangle| \geq \frac 12 \right] \geq \frac 3{16}$

Lemma 3 Let ${{\bf x} \in {\mathbb R}^n}$ be a vector such that ${|\langle {\bf x} , {\bf v}_1\rangle| \geq \frac 12}$. Then, for every positive integer ${t}$ and positive ${\epsilon >0}$, if we define ${{\bf y} := M^t {\bf x}}$, we have

$\displaystyle \frac{{\bf y}^T M {\bf y}}{{\bf y}^T{\bf y}} \geq \lambda_1 \cdot (1-\epsilon) \cdot \frac 1 {1+4 ||{\bf x}||^2 (1-\epsilon)^{2t}}$

It remains to prove the two lemmas.

Proof: (Of Lemma 2) Let ${{\bf v} = (v_1,\ldots,v_n)}$. The inner product ${\langle {\bf x}, {\bf v}\rangle}$ is the random variable

$\displaystyle S:= \sum_i x_i v_i$

Let us compute the first, second, and fourth moment of ${S}$.

$\displaystyle \mathop{\mathbb E} S = 0$

$\displaystyle \mathop{\mathbb E} S^2 = \sum_i v_i^2 = 1$

$\displaystyle \mathop{\mathbb E} S^4 = 3 \left(\sum_i v_i^2 \right) - 2 \sum_i v_i^4 \leq 3$

Recall that the Paley-Zygmund inequality states that if ${Z}$ is a non-negative random variable with finite variance, then, for every ${0\leq \delta \leq 1}$, we have

$\displaystyle \mathop{\mathbb P} [ Z \geq \delta \mathop{\mathbb E} Z ] \geq (1-\delta)^2 \cdot \frac {(\mathop{\mathbb E} Z)^2}{\mathop{\mathbb E} Z^2} \ \ \ \ \ (1)$

which follows by noting that

$\displaystyle \mathop{\mathbb E} Z = \mathop{\mathbb E} [ Z\cdot 1_{Z < \delta \mathop{\mathbb E} Z} ] + \mathop{\mathbb E} [ Z \cdot 1_{Z \geq \delta \mathop{\mathbb E} Z} \ ,$

that

$\displaystyle \mathop{\mathbb E} [ Z\cdot 1_{Z < \delta \mathop{\mathbb E} Z} ] \leq \delta \mathop{\mathbb E} Z \ ,$

and that

$\displaystyle \mathop{\mathbb E} [ Z \cdot 1_{Z \geq \delta \mathop{\mathbb E} Z}] \leq \sqrt {\mathop{\mathbb E} Z^2}\cdot \sqrt{\mathop{\mathbb E} 1_{Z \geq \delta \mathop{\mathbb E} Z}}$

$\displaystyle = \sqrt {\mathop{\mathbb E} Z^2}\sqrt{\mathop{\mathbb P} [ Z \geq \delta \mathop{\mathbb E} Z] }$

We apply the Paley-Zygmund inequality to the case ${Z= S^2}$ and ${\delta = 1/4}$, and we derive

$\displaystyle \mathop{\mathbb P} \left[ S^2 \geq \frac 14 \right] \geq \left( \frac 34 \right)^2 \cdot \frac 13 = \frac 3{16}$

$\Box$

Remark 1 The proof of Lemma 2 works even if ${{\bf x} \sim \{-1,1\}^n}$ is selected according to a 4-wise independent distribution. This means that the algorithm can be derandomized in polynomial time.

Proof: (Of Lemma 3) Let us write ${{\bf x}}$ as a linear combination of the eigenvectors

$\displaystyle {\bf x} = a_1 {\bf v}_1 + \cdots + a_n {\bf v}_n$

where the coefficients can be computed as ${a_i = \langle {\bf x}, {\bf v}_i\rangle}$. Note that, by assumption, ${|a_1| \geq .5}$, and that, by orthonormality of the eigenvectors, ${||{\bf x}||^2 = \sum_i a_i^2}$.

We have

$\displaystyle {\bf y} = a_1\lambda_1^t {\bf v}_1 + \cdots + a_n \lambda_n^t{\bf v}_n$

and so

$\displaystyle {\bf y}^T M {\bf y} = \sum_i a_i^2 \lambda_i^{2t+1}$

and

$\displaystyle {\bf y}^T{\bf y} = \sum_i a_i^2 \lambda_i^{2t}$

We need to prove a lower bound to the ratio of the above two quantities. We will compute a lower bound to the numerator and an upper bound to the denominator in terms of the same parameter.

Let ${k}$ be the number of eigenvalues larger than ${\lambda_1 \cdot (1-\epsilon)}$. Then, recalling that the eigenvalues are sorted in non-increasing order, we have

$\displaystyle {\bf y}^TM{\bf y} \geq \sum_{i=1}^k a_i^2 \lambda_i^{2t+1} \geq \lambda_1(1-\epsilon) \sum_{i=1}^k a_i^2 \lambda_i^{2t}$

We also see that

$\displaystyle \sum_{i=k+1}^n a_i^2\lambda_i^{2t}$

$\displaystyle \leq \lambda_1^{2t} \cdot (1-\epsilon)^{2t} \sum_{i=k+1}^n a_i^2$

$\displaystyle \leq \lambda_1^{2t} \cdot (1-\epsilon)^{2t} \cdot ||{\bf x}||^2$

$\displaystyle \leq 4 a_1^2 \lambda_1^{2t} (1-\epsilon)^{2t} ||{\bf x}||^2$

$\displaystyle \leq 4 ||{\bf x}||^2 (1-\epsilon)^{2t} \sum_{i=1}^k a_i^2 \lambda_i^{2t}$

So we have

$\displaystyle {\bf y}^T{\bf y} \leq (1 +4 ||{\bf x}||^2 (1-\epsilon)^{2t}) \cdot \sum_{i=1}^k a_i^2$

giving

$\displaystyle \frac{{\bf y}^T M {\bf y}}{{\bf y}^T{\bf y}} \geq \lambda_1 \cdot (1-\epsilon) \cdot \frac 1 {1+4 ||{\bf x}||^2 (1-\epsilon)^{2t}}$

$\Box$

Remark 2 Where did we use the assumption that ${M}$ is positive semidefinite? What happens if we apply this algorithm to the adjacency matrix of a bipartite graph?

2. Approximating the Second Eigenvalue

If ${M}$ is a PSD matrix, and if we know a unit-length eigenvector ${{\bf v}_1}$ of the largest eigenvalue of ${M}$, we can approximately find the second eigenvalue with the following adaption of the algorithm from the previous section.

Algorithm Power2

• Input: PSD symmetric matrix ${M\in {\mathbb R}^{n\times n}}$, positive integer ${t}$, vector ${{\bf v}_1}$
• Pick uniformly at random ${{\bf x} \sim \{-1,1\}^n}$
• ${{\bf x}_0:= {\bf x} - \langle {\bf v}_1,{\bf x} \rangle\cdot {\bf v}_1}$
• for ${i:=1}$ to ${t}$
• ${{\bf x}_i := M{\bf x}_{i-1}}$
• return ${{\bf x}_t}$

If ${{\bf v}_1,\ldots,{\bf v}_n}$ is an orthonormal basis of eigenvectors for the eigenvalues ${\lambda_1 \geq \cdots \geq \lambda_n}$ of ${M}$, then, at the beginning, we pick a random vector

$\displaystyle {\bf x} = a_1 {\bf v}_1 + a_2 {\bf v}_2 + \cdots a_n {\bf v}_n$

that, with probability at least ${3/16}$, satisfies ${|a_2| \geq 1/2}$. (Cf. Lemma 2.) Then we compute ${{\bf x}_0}$, which is the projection of ${{\bf x}}$ on the subspace orthogonal to ${{\bf v}_1}$, that is

$\displaystyle {\bf x}_0 = a_2 {\bf v}_2 + \cdots a_n {\bf v}_n$

Note that ${||{\bf x}||^2 = n}$ and that ${||{\bf x}_0||^2 \leq n}$.

The output is the vector ${{\bf x}_t}$

$\displaystyle {\bf x}_t = a_2 \lambda_2^t {\bf v}^2 + \cdots a_n \lambda_n ^t{\bf v}^n$

If we apply Lemma 3 to subspace orthogonal to ${{\bf v}_1}$, we see that when ${|a_2|\geq 1/2}$ we have that, for every ${0< \epsilon< 1}$,

$\displaystyle \frac{ {\bf x}_t^T M {\bf x}_t}{{\bf x}_t^T{\bf x}_t} \geq \lambda_2 \cdot (1-\epsilon) \cdot \frac 1{4 n (1-\epsilon)^{2t}}$

We have thus established the following analysis.

Theorem 4 For every PSD matrix ${M}$, positive integer ${t}$ and parameter ${\epsilon>0}$, if ${{\bf v}_1}$ is a length-1 eigenvector of the largest eigenvalue of ${M}$, then with probability ${\geq 3/16}$ over the choice of ${{\bf x}_0}$, the algorithm Power2 outputs a vector ${{\bf x}_t \perp {\bf v}_1}$ such that

$\displaystyle \frac {{\bf x}_t^T M {\bf x}_t}{ {\bf x}_t^T{\bf x}_t} \geq \lambda_2 \cdot (1-\epsilon) \cdot \frac 1 {1+4n (1-\epsilon)^{2t}}$

where ${\lambda_2}$ is the second largest eigenvalue of ${M}$, counting multiplicities.

Finally, we come to the case in which ${M}$ is the normalized adjacency matrix of a regular graph.

We know that ${M}$ has eigenvalues ${1=\lambda_1 \geq \cdots \lambda_n\geq -1}$ and that ${\frac 1{\sqrt n} \cdot {\bf 1}}$ is an eigenvector of ${\lambda_1}$.

Consider now the matrix ${M+I}$. Every eigenvector of ${M}$ with eigenvalue ${\lambda}$ is clearly also an eigenvector of ${M+I}$ with eigenvalue ${1+\lambda}$, and vice versa, thus ${M+I}$ has eigenvalues ${2=1+\lambda_1 \geq 1+ \lambda_2 \geq \cdots \geq 1+\lambda_n \geq 0}$ and it is PSD.

This means that we can run algorithm Power2 on the matrix ${I+M}$ using the vector ${{\bf v}_1 = \frac 1 {\sqrt n} {\bf 1}}$ and a parameter ${t = O(\epsilon^{-1} \log n/\epsilon))}$. The algorithm finds, with probability ${\geq 3/16}$, a vector ${{\bf x}_t \perp {\bf 1}}$ such that

$\displaystyle \frac{{\bf x}_t^T (M+I) {\bf x}_t}{{\bf x}_t^T{\bf x}_t} \geq (1+\lambda_2)\cdot (1-2\epsilon)$

which is equivalent to

$\displaystyle \frac{{\bf x}_t^T M {\bf x}_t}{{\bf x}_t^T{\bf x}_t} \geq \lambda_2 - 2\epsilon -2\epsilon\lambda_2 \geq \lambda_2 - 4\epsilon$

3 thoughts on “CS359G Lecture 7: Computing Eigenvectors”

1. What is a good reference for this material? I expect that the complexity of something so basic as eigenvalue computation has been analyzed in the literature extensively.

2. The problem has an enormous body of literature in the numerical analysis community, but I am not very familiar with that literature.

The 1992 paper Estimating the largest eigenvalues by the power and Lanczos algorithms with a random start by Kuczyński and Woźniakowski shows that, to compute the largest eigenvalue of a PSD matrix with relative error $\epsilon$, $O(\epsilon^{-1} \log n)$ iterations of the power method are enough (this is better than the $O(\epsilon^{-1} \log n/\epsilon)$ bound that I prove in the post because I only look at one scale of $\langle x,v \rangle$ instead of averaging over all scales); they have an explicit bound on the leading constant, and they show matrices for which there is a $\Omega( \epsilon^{-1} \log n)$ lower bound.

For the Lanczos algorithm, which is known to perform better in practice, they prove that $\sqrt{\epsilon^{-1}} \log n$ iterations are sufficient.

An application of the Spielman-Teng solver for diagonally-dominated systems of linear equations is that, for the normalized adjacency matrix of a graph, one can find a constant-factor approximation of $1- \lambda_2$ in time $n (\log n)^{O(1)}$, independent of the value of $\lambda_2$, while the approach described in this post has complexity that grows nearly linearly in $n/(1-\lambda_2)$, which can be quadratic or even cubic in $n$ in very non-expanding graphs.