Beyond Worst-Case Analysis: Lecture 9

Scribed by Chinmay Nirkhe

In which we explore the Stochastic Block Model.

1. The {G_{n,p,q}} problem

The Stochastic Block Model is a generic model for graphs generated by some parameters. The simplest model and one we will consider today is the {G_{n,p,q}} problem.

Definition 1 ({G_{n,p,q}} graph distribution) The {G_{n,p,q}} distribution is a distribution on graphs on {n} vertices where {V} is partitioned into two subsets of equal size: {V = V_1 \sqcup V_2}. For each pair {\{i,j\}} of vertices in the same subset, {\Pr( \{i,j\} \in E) = p}, and otherwise {\Pr( \{i,j\} \in E) = q}, independently over pairs.
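As a concrete illustration (not part of the original notes), here is a minimal numpy sketch of sampling from this distribution. The function name sample_gnpq and the convention that {V_1} consists of the first {n/2} vertices are our own choices.

    import numpy as np

    def sample_gnpq(n, p, q, rng=None):
        # Sample an adjacency matrix from the G_{n,p,q} distribution.
        # The first n/2 vertices form V_1, the rest V_2 (n assumed even);
        # within-part pairs become edges with probability p, crossing
        # pairs with probability q, independently.
        rng = np.random.default_rng(rng)
        half = n // 2
        sign = np.ones(n)
        sign[half:] = -1                     # +1 on V_1, -1 on V_2
        same = np.outer(sign, sign) > 0      # True iff i, j are in the same part
        prob = np.where(same, p, q)
        upper = np.triu(rng.random((n, n)) < prob, k=1)  # strict upper triangle
        return (upper | upper.T).astype(float)           # symmetric, no self-loops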

We will only consider the regime in which {p > q}. If we want to find the partition {V = V_1 \sqcup V_2}, it is natural to look at the problem of finding the minimum balanced cut. The cut {(V_1, V_2)} has expected size {q n^2 / 4}, and any other balanced cut has greater expected size, since some of the pairs it cuts lie within {V_1} or {V_2} and appear as edges with the larger probability {p}.

Our intuition should be that as {p \rightarrow q}, the problem only gets harder, and that for a fixed ratio {p/q}, the problem only gets easier as {p} and {q} grow. This can be stated rigorously as follows: if we can solve the problem for {p,q}, then we can also solve it for {cp, cq} where {c > 1}, by keeping each edge independently with probability {1/c} and reducing to the case we can solve (a code sketch of this reduction follows below).
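Continuing the numpy sketch above, this subsampling reduction is only a few lines; subsample_edges is our hypothetical name for it.

    def subsample_edges(A, c, rng=None):
        # Keep each edge of A independently with probability 1/c.
        # Applied to a G_{n, cp, cq} sample this yields a G_{n, p, q} sample,
        # since a within-part edge survives with probability (cp)(1/c) = p
        # and a crossing edge with probability (cq)(1/c) = q.
        rng = np.random.default_rng(rng)
        n = A.shape[0]
        keep = np.triu(rng.random((n, n)) < 1.0 / c, k=1)
        keep = (keep | keep.T).astype(float)
        return A * keep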

Recall that for the {k}-planted clique problem, we found the eigenvector {{\bf x}} corresponding to the largest eigenvalue of {A - \frac{1}{2} J}. We then defined {S} as the vertices {i} with the {k} largest values of {|x_i|} and cleaned up {S} a little to get our guess for the planted clique.

In the Stochastic Block Model we are going to follow a similar approach, but we are instead going to find the largest eigenvalue of {A - \left( \frac{p + q}{2} \right) J}. This is intuitive, as the average degree of the graph is {p(n/2 - 1) + q(n/2) \approx \frac{n}{2} (p+q)}. The idea is simple: find the eigenvector {{\bf x}} corresponding to the largest eigenvalue and define

\displaystyle  V_1 = \{ i : x_i > 0\}, \qquad V_2 = \{ i : x_i \leq 0 \} \ \ \ \ \ (1)
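In code, the entire algorithm is only a few lines. Here is a minimal numpy sketch under the same conventions as above; spectral_partition is our naming, and, anticipating the footnote later in these notes, it estimates {\frac{p+q}{2}} from the average degree rather than assuming {p} and {q} are known.

    def spectral_partition(A):
        # Guess the partition from the top eigenvector of A - ((p+q)/2) J.
        # The average degree is about (n/2)(p+q), so
        # A.sum() / n**2 = (average degree) / n ~ (p+q)/2.
        n = A.shape[0]
        shift = A.sum() / n**2
        M = A - shift * np.ones((n, n))
        eigvals, eigvecs = np.linalg.eigh(M)  # eigenvalues in ascending order
        x = eigvecs[:, -1]                    # eigenvector of the largest eigenvalue
        V1 = np.where(x > 0)[0]
        V2 = np.where(x <= 0)[0]
        return V1, V2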

As we proceed to the analysis of this procedure, we fix {V_1, V_2}. Prior to fixing, the average adjacency matrix was {\left( \frac{p+q}{2} \right) J}, since a pair of vertices is roughly equally likely to lie in the same part or in different parts.\footnote{The diagonal should be zeroes, but this is close enough.} Upon fixing {V_1, V_2}, the average adjacency matrix {R} looks different. For ease of notation, when we write a bold constant {{\bf c}} for a block of a matrix, we mean the all-constant block {cJ}. The dimensions will be clear from context.

\displaystyle  R = \left( \begin{array}{c | c} {\bf p} & {\bf q} \\  {\bf q} & {\bf p} \end{array} \right) \ \ \ \ \ (2)

Here we have broken up {R} into blocks according to the partition {V_1, V_2}.

Theorem 2 If {p, q > \log n / n} then with high probability, {\lVert A - R \rVert < O \left(\sqrt{n(p+q)} \right)}.

Proof: Define the graph {G_1} as the union of a {G_{n/2, p}} graph on {V_1} and a {G_{n/2, p}} graph on {V_2}. Define the graph {G_2} as a {G_{n, q}} graph. Note that {G} is distributed according to picking {G_1} and {G_2} independently and adding the partition-crossing edges of {G_2} to {G_1}. Let {A_1} and {A_2} be the respective adjacency matrices and define the following block decompositions:

\displaystyle  A_1 = \left( \begin{array}{c | c} A_1' & \\  & A_1'' \end{array} \right), \qquad A_2 = \left( \begin{array}{c | c} A_2' & A_2''' \\  {A_2'''}^\dagger & A_2'' \end{array} \right). \ \ \ \ \ (3)

Then the adjacency matrix {A} is defined by

\displaystyle  A = A_1 + A_2 - \left( \begin{array}{c | c} A_2' & \\  & A_2'' \end{array} \right) \ \ \ \ \ (4)

Similarly, we can generate a decomposition for {R}:

\displaystyle  R = \left( \begin{array}{c | c} {\bf p} & \\  & {\bf p} \end{array} \right) + \bigg ( {\bf q} \bigg ) - \left( \begin{array}{c | c} {\bf q} & \\  & {\bf q} \end{array} \right). \ \ \ \ \ (5)

Then using triangle inequality we can bound {\lVert A - R \rVert} by bounding the difference in the various terms.

\displaystyle  \begin{aligned} \lVert A - R \rVert &\leq \left \lVert A_1 - \left(\begin{array}{c | c} {\bf p} & \\  & {\bf p} \end{array} \right) \right \rVert + \left \lVert A_2 - ({\bf q}) \right \rVert + \left \lVert \left( \begin{array}{c | c} A_2' & \\  & A_2'' \end{array} \right) - \left( \begin{array}{c | c} {\bf q} & \\  & {\bf q} \end{array} \right) \right \rVert \\ &\leq O(\sqrt{np}) + O(\sqrt{nq}) + O(\sqrt{nq}) \\ &= O \left( \sqrt{n (p+q)} \right) \end{aligned} \ \ \ \ \ (6)

The last line follows because each of the blocks is the adjacency matrix of a {G_{n/2,p}}, {G_{n,q}}, or {G_{n/2,q}} graph, so we can apply the spectral norm bounds we proved for that model, which hold in the regime {p, q > \log n / n}. \Box
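As a quick numerical sanity check of Theorem 2 (ours, not part of the notes), one can compare {\lVert A - R \rVert} against {\sqrt{n(p+q)}} on a random instance, reusing the sample_gnpq sketch from earlier.

    # Sanity check of Theorem 2 on one random instance.
    n, p, q = 1000, 0.10, 0.05
    A = sample_gnpq(n, p, q, rng=0)
    half = n // 2
    R = np.full((n, n), q)
    R[:half, :half] = p                  # within-V_1 block
    R[half:, half:] = p                  # within-V_2 block
    gap = np.linalg.norm(A - R, ord=2)   # spectral norm ||A - R||
    print(gap, np.sqrt(n * (p + q)))     # gap should be O(sqrt(n(p+q)))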

But the difficulty is that we don't know {R}, since {R = R(V_1, V_2)}; if we knew {R}, then we would know the partition. What we can compute is {\left \lVert A - \left(\frac{p+q}{2} \right) J \right \rVert}.\footnote{The rest of this proof actually doesn't even rely on knowing {p} or {q}. We can estimate {p + q} by calculating the average vertex degree.} We can rewrite {R} as

\displaystyle  R = \left( \frac{p+q}{2}\right) J + \frac{p-q}{2} \left( \begin{array}{c | c} {\bf 1} & - {\bf 1} \\  - {\bf 1} & {\bf 1} \end{array} \right) \ \ \ \ \ (7)

Call the {\pm 1} block matrix on the right {C}. It is clearly rank one, as it has the decomposition {C = n \chi \chi^\dagger} where {\chi = \frac{1}{\sqrt{n}}\left( \begin{array}{c} {\bf 1} \\ -{\bf 1} \end{array} \right)}. Therefore
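To check the decomposition entrywise: every coordinate of {\chi} is {\pm \frac{1}{\sqrt{n}}}, positive on {V_1} and negative on {V_2}, so

\displaystyle  (n \chi \chi^\dagger)_{ij} = n \left( \pm \frac{1}{\sqrt{n}} \right) \left( \pm \frac{1}{\sqrt{n}} \right) = \begin{cases} +1 & \text{if } i, j \text{ are in the same part,} \\ -1 & \text{otherwise,} \end{cases}

which are exactly the entries of {C}.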

\displaystyle  \left \lVert \left( A - \left( \frac{p+q}{2} \right) J \right) - \left( \frac{p-q}{2} \right) C \right \rVert = \left \lVert A - R \right \rVert \leq O \left( \sqrt{n (p+q)} \right). \ \ \ \ \ (8)

Then {A - \left( \frac{p+q}{2} \right) J} is close (in operator norm) to the rank-one matrix {\left( \frac{p-q}{2} \right) C}, so their largest eigenvalues are close. And since {\left( \frac{p-q}{2} \right) C} has only one non-zero eigenvalue, with eigenvector {\chi}, the eigenvector corresponding to the largest eigenvalue of {A - \left( \frac{p+q}{2} \right) J} will be close to {\chi}, which describes the ideal partition. This can be formalized with the Davis-Kahan theorem.

Theorem 3 (Davis-Kahan) Given symmetric matrices {M, M'} with {\lVert M - M' \rVert \leq \varepsilon}, where {M} has eigenvalues {\lambda_1 \geq \ldots \geq \lambda_n} and corresponding eigenvectors {{\bf v}_1, \ldots, {\bf v}_n}, and {M'} has eigenvalues {\lambda_1' \geq \ldots \geq \lambda_n'} and corresponding eigenvectors {{\bf v}_1', \ldots, {\bf v}_n'}, then

\displaystyle  \sin \left( \mathrm{angle} \left( \mathrm{span}({\bf v}_1), \mathrm{span}({\bf v}_1') \right) \right) \leq \frac{\varepsilon}{|\lambda_1' - \lambda_2|} \leq \frac{\varepsilon}{|\lambda_1 - \lambda _2 - \varepsilon|}. \ \ \ \ \ (9)

Equivalently,

\displaystyle  \min \left \{ \lVert {\bf v}_1 \pm {\bf v}_1' \rVert \right \} \leq \frac{\sqrt{2} \varepsilon}{\lambda_1 - \lambda_2 - \varepsilon}. \ \ \ \ \ (10)

The Davis-Kahan theorem with {M' = A - \left( \frac{p+q}{2} \right) J}, {M = \left( \frac{p-q}{2} \right) C}, and {\varepsilon = O \left( \sqrt{n (p+q)} \right)} states that

\displaystyle  \min \left \{ \lVert {\bf v}' \pm \chi \rVert \right \} \leq O \left( \frac{\sqrt{a + b}}{a - b - O \left( \sqrt{a + b} \right)} \right) \ \ \ \ \ (11)

where {{\bf v}'} is the eigenvector associated with the largest eigenvalue of {A - \left( \frac{p+q}{2} \right) J}, and {a = pn/2} and {b = qn/2} are the expected number of neighbors a vertex has within its own part and in the other part, respectively. Choose between {\pm {\bf v}'} the one closer to {\chi}. Then

\displaystyle  \lVert {\bf v}' - \chi \rVert^2 \leq O \left( \left( \frac{\sqrt{a + b}}{a - b - O\left(\sqrt{a + b}\right)} \right)^2 \right). \ \ \ \ \ (12)

Recall that {\sum_i (v_i' - \chi_i)^2 = \lVert {\bf v}' - \chi \rVert^2}. Since every coordinate of {\chi} is {\pm 1/\sqrt{n}}, if {v_i'} and {\chi_i} disagree in sign, then the {i}-th term contributes at least {1/n} to {\lVert {\bf v}' - \chi \rVert^2}. Hence the number of misclassified vertices is at most {n \cdot \lVert {\bf v}' - \chi \rVert^2}. It follows that if {a - b \geq c_\varepsilon \sqrt{a + b}} for a large enough constant {c_\varepsilon}, then the number of misclassified vertices is at most {\varepsilon n}. This completes the proof that the proposed algorithm does well at recovering the partition in the Stochastic Block Model; a small numerical experiment along these lines is sketched below.
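Putting the pieces together, here is a hypothetical end-to-end experiment, reusing the sample_gnpq and spectral_partition sketches from earlier, that measures the fraction of misclassified vertices against the planted partition.

    # End to end: sample, run the spectral algorithm, count mistakes.
    n, p, q = 1000, 0.10, 0.05
    A = sample_gnpq(n, p, q, rng=1)
    V1, V2 = spectral_partition(A)
    planted = set(range(n // 2))
    # The global sign of the top eigenvector is arbitrary, so compare the
    # planted part against both output parts and keep the better match.
    err = min(len(planted ^ set(V1)), len(planted ^ set(V2)))
    print("misclassified fraction:", err / n)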
