*Edited 5/7/2018. Thanks to Sam Hopkins for several corrections and suggestions.*

I am revising my notes from the course on “better-than-worst-case” analysis of algorithms. With the benefit of hindsight, in this post (and continuing in future posts) I would like to review again how one applies spectral methods and semidefinite programming to problems that involve a “planted” solution, and what is the role of concentration results for random matrices in the analysis of such algorithms.

In general, a “planted” distribution is one where we start by picking a random “planted” solution and then we pick an instance of our problem in which the planted solution is good. Such problems are interesting for the average-case analysis of algorithms, because they provide a non-trivial testing ground to understand why certain algorithms perform better in practice than in the worst case. Planted problems are also of great interest in complexity theory, because when they are average-case hard they imply the existence of one-way functions. Finally, planted problems are a rich ground for interdisciplinary work: statisticians recognize them as parametric distributions, where the parameter is the planted solution, for which one wants to approximate the parameter given one sample; information theorists think of them as a channel where the sender sends a planted solution and the receiver gets an instance of the problem, so that one recognizes that having a large mutual information between instance and planted solution is a necessary condition for the problem to be solvable.

For the sake of this post, it is helpful to focus on the motivating examples of planted $k$-clique and planted bisection. In planted $k$-clique, we sample a graph from the $G_{n,1/2}$ distribution, then we randomly select $k$ vertices and we add all the necessary edges to make those vertices a clique. In the planted bisection, or stochastic block model, distribution $G_{n,p,q}$, we randomly split a set $V$ of $n$ vertices into two subsets $S$ and $V-S$ of equal size and then we select edges so that edges within $S$ and within $V-S$ have probability $p$, and edges crossing the cut $(S, V-S)$ have probability $q$, with $q < p$.

A graph sampled from the planted clique distribution is a lot like a $G_{n,1/2}$ graph, except for the extra edges that create a large clique, and a graph sampled from the planted bisection distribution $G_{n,p,q}$ is a lot like a $G_{n,(p+q)/2}$ random graph, except that the planted cut, which is crossed by roughly $qn^2/4$ edges, is sparser than a typical balanced cut, which is crossed by about $(p+q)n^2/8$ edges.
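The two planted distributions described above can be sampled in a few lines. The sketch below is illustrative only: the function names and the parameters `n`, `k`, `p`, `q` are labels chosen here (number of vertices, clique size, within-side and across-cut edge probabilities), and the base graph for the planted clique is assumed to be an Erdős-Rényi graph with edge probability 1/2.

```python
import random
from itertools import combinations


def sample_gnp(n, p, rng):
    """Erdos-Renyi graph on vertices 0..n-1, returned as a set of frozenset edges."""
    return {frozenset(e) for e in combinations(range(n), 2) if rng.random() < p}


def sample_planted_clique(n, k, rng):
    """Sample a G(n, 1/2) graph, then turn k random vertices into a clique."""
    edges = sample_gnp(n, 0.5, rng)
    clique = set(rng.sample(range(n), k))
    # Add every edge between two planted vertices (some may already be present).
    edges |= {frozenset(e) for e in combinations(sorted(clique), 2)}
    return edges, clique


def sample_planted_bisection(n, p, q, rng):
    """Split 0..n-1 into two halves; edges inside a half appear with
    probability p, edges crossing the cut appear with probability q."""
    vertices = list(range(n))
    rng.shuffle(vertices)
    side = set(vertices[: n // 2])
    edges = set()
    for u, v in combinations(range(n), 2):
        same_side = (u in side) == (v in side)
        if rng.random() < (p if same_side else q):
            edges.add(frozenset((u, v)))
    return edges, side
```

Both samplers return the planted solution alongside the instance, which is convenient when one wants to measure how well a search algorithm recovers it.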

**1. The Search Problem, the Decision Problem, and the Certification Problem**

Having defined these distributions, our problem of interest is:

- The *search* problem: given a sample from the planted distribution, find the planted solution, or an approximation, in a useful sense, of the planted solution.

Usually, in planted problems, the solution that is planted is of a kind that exists only with negligible, or even exponentially small, probability in the “non-planted” distribution that we start from. Thus, an algorithm that solves the search problem with noticeable probability also solves the following problem with probability close to :

- The *decision* problem of distinguishing a sample of the planted distribution from a sample of the analogous “non-planted” distribution.

For small values of , achieving distinguishing probability in the decision problem can be (seemingly) much easier than solving the search problem with probability . For example, if , we do not know any polynomial time algorithm that solves the search problem for planted $k$-clique with success probability, but checking whether the number of edges is solves the distinguishing problem with distinguishing probability order of . Even if we change our definition of our “non-planted” distribution so that it has edges on average, the number of triangles can still be used to noticeably distinguish it from the planted distribution.

In the stochastic block model, if we call the expected “internal” degree of a vertex (that is, the number of neighbors on the same side of the planted cut) and the expected “external” degree, and if and are absolute constants, then it is known that the search problem is not solvable, even in an approximate sense, when , as proved by Mossel, Neeman and Sly (note that our definition of and is off from theirs by a factor of 2). Provided that and are absolute constants and , however, one can distinguish from with constant distinguishing probability. This is because the number of triangles (counted as ordered sequences of three vertices) in is, on average, , while it is, on average, in ; furthermore, the variance of both distributions is an absolute constant (those calculations appear in the above paper of Mossel et al.). Now if we have two nonnegative integer random variables and such that the expectations and variances of both are absolute constants and , then there is a threshold such that testing whether a number is is a test that distinguishes and with constant probability.
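The triangle statistic used in this argument is easy to compute by brute force. A minimal sketch, assuming the graph is given as a list of vertex pairs on vertices `0..n-1`; the function names are hypothetical, and the threshold is left as a parameter because it has to be chosen between the two expected counts.

```python
from itertools import permutations


def ordered_triangle_count(n, edges):
    """Count triangles as ordered sequences of three distinct vertices,
    so each triangle is counted 3! = 6 times."""
    adj = [[False] * n for _ in range(n)]
    for u, v in edges:
        adj[u][v] = adj[v][u] = True
    return sum(
        1
        for u, v, w in permutations(range(n), 3)
        if adj[u][v] and adj[v][w] and adj[u][w]
    )


def threshold_test(n, edges, threshold):
    """Return True when the triangle count is at least the chosen threshold."""
    return ordered_triangle_count(n, edges) >= threshold
```

The brute-force loop takes cubic time, which is fine for an illustration; the point is only that the statistic is polynomial-time computable.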

As far as I am aware, in all the known cases where the decision problem is efficiently solvable with distinguishing probability then it is also known how to efficiently solve the search problem (perhaps approximately) with probability, although algorithms that solve the distinguishing problem don’t always give insights into how to solve the search problem. Mossel et al., for example, show that the decision problem for the stochastic block model is efficiently solvable with distinguishing probability when , by counting the number of simple cycles of length around , and they could use the approach of counting simple cycles to approximate and , but they could not show how to solve the search problem in the regime, a problem that they solved later with additional techniques (the problem was also solved by Massoulie independently).

In summary, efficiently solving the decision problem is a necessary but not sufficient condition for being able to efficiently solve the search problem, and it is a good first step to understand the complexity of a given planted problem.

Another approach to distinguish the distribution from the planted $k$-clique distribution is to be able to certify, given a graph sampled from , that its maximum clique size is , a fact that is going to be true with high probability if . Assuming that, like in the planted clique problem, we are discussing instances of an optimization problem, we thus have the following problem:

- The *certification* problem: given an instance sampled from the “non-planted” distribution, produce a certificate that the optimum value for the instance is worse than the optimum value of all (or all but a negligible fraction of) instances from the planted distribution.

Note that if you can solve the certification problem with probability , then you can also solve the decision problem with distinguishing probability (or minus a negligible term, if the certified property is allowed to hold with negligible probability in the planted distribution).

In some interesting cases, the certification problem appears to be harder than the decision problem and the search problem.

In the stochastic block model, for example, if one takes the min bisection problem to be the underlying optimization problem, the search problem is solvable (in the sense that one can recover a partition that correlates to the planted partition) whenever , as discussed above, but for very close to it is an open question whether even knowing the value of the min bisection exactly can be used to solve the decision problem.

Another interesting example is “planted Max 3SAT.” Suppose that we construct a satisfiable instance of 3SAT in the following way: we start from a random assignment to variables, and then we pick at random clauses among the clauses that are consistent with the “planted” assignment. If is, say , then it is very easy to solve the search problem (and hence the distinguishing problem): the value of every variable in the planted assignment can be deduced by looking at whether the variable appears more often complemented or not complemented. In the non-planted distribution in which we pick clauses among all the possible clauses, however, we don’t know any efficient algorithm that certifies that the instance is not satisfiable. Even if we pick clauses that are consistent with both the planted assignment and the negation of the planted assignment (thus eliminating correlations between complementations of variables and the values of the variables in the planted assignment), we still introduce pairwise correlations that can be used to find the planted assignment. See the comment by Ryan O’Donnell below for more information.
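The majority-vote recovery described above can be sketched as follows. This is an illustration under assumptions: clauses are drawn by rejection sampling (three distinct variables, random signs, discarded if falsified by the planted assignment), a literal is encoded as a pair `(variable, is_positive)`, and the function names and the instance sizes used in any experiment are hypothetical choices made here.

```python
import random


def sample_planted_3sat(n, m, rng):
    """Pick a random planted assignment on n variables and m random 3-clauses
    that are satisfied by it."""
    planted = [rng.random() < 0.5 for _ in range(n)]
    clauses = []
    while len(clauses) < m:
        variables = rng.sample(range(n), 3)
        clause = [(v, rng.random() < 0.5) for v in variables]
        # A literal (v, is_positive) is true when the sign agrees with planted[v].
        if any(planted[v] == positive for v, positive in clause):
            clauses.append(clause)
    return planted, clauses


def majority_vote(n, clauses):
    """Guess each variable from whether it appears more often
    non-complemented (True) or complemented (False)."""
    balance = [0] * n
    for clause in clauses:
        for v, positive in clause:
            balance[v] += 1 if positive else -1
    return [b > 0 for b in balance]
```

Each literal occurrence agrees with the planted value with probability 4/7 rather than 1/2, so once every variable occurs often enough the vote recovers the whole assignment with high probability.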

When we are dealing with a hard optimization problem, a natural approach is to study efficiently solvable convex *relaxations* of the problem. One can hope to use a relaxation to solve the *certification* problem, by showing that the relaxation provides a bound on the optimum in non-planted instances that is sufficient to distinguish them from planted instances. One can also hope to use a relaxation to solve the *search* problem, by showing that the optimum of the relaxation is close, in an appropriate sense, to the planted solution, or even that the planted solution is the unique optimum of the relaxation for most planted instances.

Interestingly, in all the examples that I am aware of, a relaxation has been successfully used to solve the certification problem if and only if it has been successfully used to solve the search problem. Intuitively, if a relaxation does not solve the certification problem, it means that there are feasible solutions in the non-planted case whose cost is already better than the cost of the planted solution, and so the planted solution cannot be the optimum, or close to the optimum, of the relaxation in the planted case. For the other direction, if a relaxation solves the certification problem, then the proof that it does can usually be lifted to a proof that all solutions that don’t “correlate” (in an appropriate sense) with the planted solution cannot be optimal solutions in the planted case, allowing one to conclude that the optimum of the relaxation in the planted case correlates with the planted solution.

In summary, although the certification problem can be harder than the search problem and the decision problem, if one wants to use a relaxation to solve the search problem then it is good to start by understanding whether the relaxation solves the certification problem.

In the next post we will discuss how we can efficiently certify properties of random instances of optimization problems, and how to turn those results into algorithms that find planted solutions. We will see that the key results involve showing that certain matrices associated to the instances are close to their expectations in appropriately defined norms.

*[I was delighted to receive the following guest post by Chris Brzuska about a meeting that took place last week during Eurocrypt in Tel Aviv. This piece will also appear in Omer Reingold’s blog. Let me take this opportunity for a couple of shoutouts. Next week it’s going to be two years since Italy, last among Western European countries, instituted same-sex civil unions (yay!) and the parties that opposed them now have an absolute majority after the last elections (boo!). The Berkeley EECS department has an LGBT+ graduate student organization called QiCSE that organizes a very visible breakfast meeting during the visit days for prospective grad students and regular meetings during the school year – as much as I value Berkeley exceptionalism, think about creating something like this in your own school. It would be great if there was an LGBT+ meeting at STOC this year; I am not going to STOC this year, but maybe someone else can take the lead. And now, on to Chris’s beautiful essay. Congratulations, Chris! – Luca]*

I gender-transitioned two years ago, and Eurocrypt 2018 in Tel-Aviv is the first major conference I attend since then. I am a bit nervous. How much time does it take for 400 people to update my name and pronouns to use “Chris” and he/him? Two years feels like an eternity to me, but surely, some people will not have heard about my gender-transition. I will need to come out to some people.

Coming-out is very empowering, but after two years and uncountable coming-outs, I really wish that everyone knows that I am trans and gay.

A gay friend of mine remarks that when being bisexual/lesbian/gay, coming out is really never over, and one needs to come out again and again, to each new person. And really, he says, there is rarely a good time to bring it up.

“How come you didn’t know I am lesbian/gay?”, I heard from several friends, in shock, worried I might have wrongly assumed they are heterosexual.

How many LGBTQIA people are in our communities? I know some LGBTQIA people in the community, but how many more are there, and how can I find them?

This simple question leads to something which would become more important to me than I expected initially.

In the rump session, I give a coming-out talk, combined with an announcement for an LGBTQIA cryptographers meeting during the rump session break (https://eurocrypt.2018.rump.cr.yp.to/4f756d069387ee90de62454a828a3b9b.pdf).

Giving this talk in itself was very nice. I enjoyed sharing my happiness with the community, see my happiness reflected in other people’s eyes. I enjoyed the many positive comments I received during the hours and days that followed, and the recognition of daring to be visible.

During the break, I am excited and nervous. How many people will come to the meeting? And who? More than 10 people come, most of whom I knew without knowing they are LGBTQIA. We walk into the room, one by one, each with light in our eyes. We came out to each other, all of us, in that moment. It’s intimate, moving, exciting. Coming out remains deeply personal. It can be daunting, even in a warm, progressive environment such as our research community and even to an LGBTQIA subgroup.

After the rump session, we go to the gay-lesbian bar Shpagat in Tel-Aviv, in happy excitement. We are the last customers that night. The next day, during the breaks, we often find ourselves with a majority of LGBTQIA people in a conversation, we sit next to each other during talks. Something important happened.

In light of our increased visibility (to each other and to the community at large), there were more opportunities for coming outs the next days (or so was my impression, although I am only conscious of 2 explicit cases…). It was very liberating for me to share many of the following conference moments with LGBTQIA cryptographers who would add additional views to a heterosexual, cissexual perspective, and who would help me explain the sensitive issue of coming out to other caring members of our research community.

The research community is my permanent country of residence, my frame of reference, the source of almost all my long-term friendships – and enfin, in this country, there live quite a few LGBTQIA people, and the research community encourages us and shares our happiness.

We are going to organize more LGBTQIA meetings alongside cryptography-related conferences. I hope there will be more such meetings inside and outside of CS. And we look forward to seeing the number of LGBTQIA researchers (that we are aware of) grow.

If you are an LGBTQIA researcher who wants to get in touch with us more discreetly than at a public meeting (to talk to one of us, e.g., in the beginning of your PhD etc.), you can send an email to queercrypt@gmail.com. You can also use that email address to join our mailing list (for event announcements) and/or our WhatsApp group (include your phone number if you want to join the WhatsApp group). While the group centers around cryptography-related events, the group is not limited to researchers in cryptography.

Happy New Year!

So, today I was browsing Facebook, and when I saw a post containing an incredibly blatant arithmetic mistake (which none of the several comments seemed to notice) I spent the rest of the morning looking up where it came from.

The goal of the post was to make the wrong claim that people have been paying more than enough money into social security (through payroll taxes) to support the current level of benefits. Indeed, since the beginning, social security has been paying individuals more than they put in, and now that population and salaries have stopped growing, social security is also paying out to retired people more than it gets from working people, so that the “trust fund” (whether one believes it is a real thing or an accounting fiction) will run out in the 2030s unless some change is made.

This is a complicated matter, but the post included a sentence to the effect that $4,500 a year, with an interest of 1% per year “compounded monthly”, would add up to $1.3 million after 40 years. This is not even in the right order of magnitude (it adds up to about $220k) and it should be obvious without making the calculation. Who would write such a thing, and why?

My first stop was a July 2012 post on snopes, which commented on a very similar viral email. Snopes points out various mistakes (including the rate of social security payroll taxes), but the calculation in the snopes email, while based on wrong assumptions, has correct arithmetic: it says that $4,500 a year, with a 5% interest, become about $890k after 49 years.

So how did the viral email with the wrong assumptions and correct arithmetic morph into the Facebook post with the same wrong assumptions but also the wrong arithmetic?

I don’t know, but here is an August 2012 post on, you can’t make this stuff up, Accuracy in Media, which wikipedia describes as a “media watchdog.”

The post is attributed to Herbert London, who has a PhD from Columbia, is a member of the Council on Foreign Relations and used to be the president of a conservative think-tank. Currently, he has an affiliation with King’s College in New York. London’s post has the sentence I saw in the Facebook post:

(…) an employer’s contribution of $375 per month at a modest one percent rate compounded over a 40 year work experience the total would be $1.3 million.

The rest of the post is almost identical to the July 2012 message reported by Snopes.

Where did Dr. London get his numbers? Maybe he compounded this hypothetical saving as 1% *per month*? No, because that would give more than $4 million. One does get about $1.3 million if one saves $375 a month for *thirty* years with a return of 1% per month, though.
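The figures discussed in this post can be checked with the standard future-value formula for a stream of equal deposits. The sketch below assumes each deposit is made at the end of its period (an assumption made here, not something the viral posts specify), and the function name `future_value` is a label chosen for illustration.

```python
def future_value(deposit, rate, periods):
    """Value after `periods` equal deposits, each made at the end of a period,
    earning interest `rate` per period."""
    if rate == 0:
        return deposit * periods
    return deposit * ((1 + rate) ** periods - 1) / rate


scenarios = {
    "$375/month, 1% per year compounded monthly, 40 years": future_value(375, 0.01 / 12, 40 * 12),
    "$375/month, 1% per month, 40 years": future_value(375, 0.01, 40 * 12),
    "$375/month, 1% per month, 30 years": future_value(375, 0.01, 30 * 12),
    "$4,500/year, 5% per year, 49 years": future_value(4500, 0.05, 49),
}

for label, value in scenarios.items():
    print(f"{label}: ${value:,.0f}")
```

Under these assumptions the four scenarios come out to roughly $221k, $4.4 million, $1.3 million and $893k respectively, in line with the numbers quoted in the text.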

Perhaps a more interesting question is why this “fake math” is coming back after five years. In 2012, Paul Ryan put forward a plan to “privatize” Social Security, and such a plan is now being revived. The only way to sell such a plan is to convince people that if they saved in a private account the amount of payroll taxes that “goes into” Social Security, they would get better benefits. This may be factually wrong, but that’s hardly the point.

A calculation by a Berkeley physics graduate student (source) finds that a student who works as a TA for both semesters and the summer, is paid at “step 1” of the UC Berkeley salary scale, and is a California resident, currently pays $2,229 in federal income tax, which would become $3,641 under the proposed tax plan, a 61% increase. The situation for EECS students is a bit different: they are paid at a higher scale, which puts them in a higher bracket, and they are often on an F1 visa, which means that they pay the much-higher non-resident tuition, so they would be a lot worse off (on the other hand, they usually TA at most one semester per year). The same calculation for MIT students shows a 240% tax increase. A different calculation (sorry, no link available) shows a 144% increase for a Berkeley EECS student on an F1 visa.

This is one of the tax increases that go to fund the abolition of the estate tax for estates worth more than $10.9 million, a reduction in corporate tax rates, a reduction in high-income tax rates, and other benefits for multi-millionaires.

There is also a vox explainer, and articles in inside higher ed and the chronicle of higher education with more information.

If you are a US Citizen, and if you think that graduate students should not pay for the estate tax of eight-figure estates, you should let your representative know. Usually calling, and asking to speak with the staffer responsible for tax policy, is much better than emailing or sending physical mail. You can find the phone numbers of your representatives here.

If you have any pull in ACM, this is the kind of matter on which they might want to make a factual statement about the consequences for US computer science education, as they did at the time of the travel ban.

Scribed by Neng Huang

*In which we use the SDP relaxation of the infinity-to-one norm and the Grothendieck inequality to give an approximate reconstruction of the stochastic block model.*

**1. A Brief Review of the Model**

First, let’s briefly review the model. We have a random graph $G = (V, E)$ on $n$ vertices with an unknown partition of the vertices into two equal parts $V_1$ and $V_2$. Edges across the partition are generated independently with probability $q$, and edges inside the partition are generated independently with probability $p$. To abbreviate the notation, we let $a = pn/2$, which is the average internal degree, and $b = qn/2$, which is the average external degree. Intuitively, the closer $a$ and $b$ are, the more difficult it is to reconstruct the partition. We assume $a > b$, although there are also similar results in the complementary model where $b$ is larger than $a$. We also assume so that the graph is not almost empty.

We will prove the following two results, the first of which will be proved using the Grothendieck inequality.

- For every , there exists a constant such that if , then we can reconstruct the partition up to fewer than misclassified vertices.
- There exists a constant such that if , then we can do exact reconstruction.

We note that the first result is essentially tight in the sense that for every , there also exists a constant such that if , then it will be impossible to reconstruct the partition even if an fraction of misclassified vertices is allowed. Also, the constant will go to infinity as goes to 0, so if we want more and more accuracy, needs to be a bigger and bigger constant times . When the constant becomes , we will get an exact reconstruction as stated in the second result.

**2. The Algorithm**

Our algorithm will be based on semi-definite programming. Intuitively, the problem of reconstructing the partition is essentially the same as the min-bisection problem, which is to find a balanced cut with the fewest edges. This is because the balanced cut with the fewest expected edges is exactly our hidden cut. Unfortunately, the min-bisection problem is NP-hard, so we will use semi-definite programming. The min-bisection problem can be stated as the following program:

\begin{equation*}
\begin{aligned}
& \text{minimize} & & \sum_{(u, v) \in E} \frac{1}{4}(x_u - x_v)^2 \\
& \text{subject to} & & x_v^2 = 1, \quad \forall v \in V \\
& & & \sum_{v \in V}x_v = 0.
\end{aligned}
\end{equation*}

Its semi-definite programming relaxation will be
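The relaxation is presumably the standard vector relaxation, in which each scalar $x_v$ becomes a unit vector $\mathbf{x}_v$ and the balance constraint becomes the requirement that the vectors sum to zero. A sketch under that assumption, labeled SDP1 to match the name used in the analysis:

```latex
\begin{equation*}
\begin{aligned}
& \text{minimize} & & \sum_{(u, v) \in E} \frac{1}{4}\|\mathbf{x}_u - \mathbf{x}_v\|^2 \\
& \text{subject to} & & \|\mathbf{x}_v\|^2 = 1, \quad \forall v \in V \\
& & & \Big\|\sum_{v \in V}\mathbf{x}_v\Big\|^2 = 0.
\end{aligned}
\qquad \text{(SDP1)}
\end{equation*}
```

Any feasible solution of the min-bisection program is a one-dimensional feasible solution of this program with the same cost, which is what makes it a relaxation.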

Our algorithm will be as follows.

- Solve the semi-definite program above.
- Let be the optimal solution and such that .
- Find , which is the eigenvector corresponding to the largest eigenvalue of .
- Let , .
- Output as our partition.
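The rounding steps of the algorithm can be sketched as follows, assuming the SDP has already been solved by some solver (not shown) and that its solution is available as the Gram matrix of the optimal vectors. The names `top_eigenvector` and `round_to_partition` are hypothetical, and power iteration is swapped in here for whatever eigenvector routine one prefers; it is adequate because a Gram matrix is positive semidefinite.

```python
import math
import random


def top_eigenvector(matrix, iterations=200, seed=0):
    """Approximate a unit eigenvector for the largest eigenvalue of a
    symmetric positive semidefinite matrix by power iteration."""
    n = len(matrix)
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]
    for _ in range(iterations):
        w = [sum(row[j] * z[j] for j in range(n)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        z = [x / norm for x in w]
    return z


def round_to_partition(gram):
    """Split the vertices according to the sign of the top eigenvector."""
    z = top_eigenvector(gram)
    side = {i for i, zi in enumerate(z) if zi > 0}
    return side, set(range(len(gram))) - side
```

On an ideal rank-one Gram matrix built from a vector of plus and minus ones, the two returned sets are exactly the two sides of the cut (up to swapping them).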

Ideally, we want half of the $\mathbf{x}_v^\ast$’s pointing in one direction, and the other half pointing in the opposite direction. In this ideal case we will have

Then will be a rank-one matrix and , which is the indicator vector of the hidden cut, will be its eigenvector with eigenvalue . The remaining eigenvalues of will all be zero. So finding the eigenvector of the largest eigenvalue of will reveal the hidden cut. In reality, if , then our solution will be almost the same as that in the ideal case, so the cut we get will be almost the same as the hidden cut. Furthermore, if , then the unique optimal solution of the SDP will be the combinatorial solution of the min-bisection problem, that is, in the vector language, the one-dimensional solution. (“A miracle”, said Luca.)

**3. Analysis of the Algorithm**

First, we rearrange the SDP to make it slightly simpler. We have the following SDP:

We note that SDP1 and SDP2 have the same optimal solution, because the cost function of SDP1 is

The first term is a constant and the second is the cost function of SDP2 with a factor of -1/4.
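The identity behind this remark can be verified directly. The derivation below is a sketch under two assumptions: that the cost of SDP1 is $\sum_{(u,v) \in E} \frac{1}{4}\|\mathbf{x}_u - \mathbf{x}_v\|^2$ over unit vectors, and that the cost of SDP2 is $\sum_{u,v} A_{uv}\langle \mathbf{x}_u, \mathbf{x}_v\rangle$ summed over ordered pairs, where $A$ is the adjacency matrix (so each edge is counted twice).

```latex
\begin{align*}
\sum_{(u,v) \in E} \frac{1}{4}\|\mathbf{x}_u - \mathbf{x}_v\|^2
&= \sum_{(u,v) \in E} \frac{1}{4}\left(\|\mathbf{x}_u\|^2 + \|\mathbf{x}_v\|^2 - 2\langle \mathbf{x}_u, \mathbf{x}_v\rangle\right) \\
&= \frac{|E|}{2} - \frac{1}{2}\sum_{(u,v) \in E} \langle \mathbf{x}_u, \mathbf{x}_v\rangle \\
&= \frac{|E|}{2} - \frac{1}{4}\sum_{u,v} A_{uv}\langle \mathbf{x}_u, \mathbf{x}_v\rangle
\end{align*}
```

Minimizing the left-hand side is therefore the same as maximizing the SDP2 cost over the same feasible set.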

Now, consider the cost of SDP2 of where

The expected cost will be

Since each edge is chosen independently, with high probability our cost will be at least , which implies that the optimal solution of SDP2 will be at least . Let be the optimal solution of the SDP, then we have

\begin{align}
n(a-b) - O(n) & \leq cost(\mathbf{x}_1^\ast, \ldots, \mathbf{x}_n^\ast)\nonumber \\
& = \sum_{u,v}A_{uv}\langle\mathbf{x}_u^\ast, \mathbf{x}_v^\ast\rangle\nonumber \\
& = \sum_{u,v}\left(A_{uv} - \frac{a+b}{n}\right)\langle\mathbf{x}_u^\ast, \mathbf{x}_v^\ast\rangle
\end{align}

In the last equality we used the fact that
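The fact in question is presumably the balance constraint of the SDP. Assuming that constraint says the vectors sum to zero, the subtracted term vanishes:

```latex
\begin{equation*}
\sum_{u,v} \frac{a+b}{n}\langle \mathbf{x}_u^\ast, \mathbf{x}_v^\ast\rangle
= \frac{a+b}{n}\Big\langle \sum_{u} \mathbf{x}_u^\ast, \sum_{v} \mathbf{x}_v^\ast\Big\rangle
= \frac{a+b}{n}\Big\|\sum_{v} \mathbf{x}_v^\ast\Big\|^2
= 0
\end{equation*}
```

So replacing $A_{uv}$ by $A_{uv} - \frac{a+b}{n}$ leaves the cost of any feasible solution unchanged.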

When we used the spectral method last week, we said that the largest eigenvalue of is large, where is the average degree. This is because the hidden cut will give us a vector with large Rayleigh quotient. But has a relatively small spectral norm, so everything should come from , which when simplified will be 1 for entries representing vertices on the same side and -1 for entries representing vertices on different sides. We will redo this argument with SDP norm in place of spectral norm and every step appropriately adjusted.

Recall that the SDP norm of a matrix is defined to be

Let , then by Grothendieck inequality we have

We proved in the previous lecture that with high probability, so we know that the SDP norm with high probability as well. By definition, this means

Subtracting 3 from 3, we obtain

where is the all-one matrix and . Plugging 5 into 4, we get

which can be simplified to

For simplicity, in the following analysis the term will be called . Notice that is a matrix with 1 for nodes from the same side of the cut and -1 for nodes from different sides of the cut, and is an inner product of two unit vectors. If is very close to zero, then the sum will be very close to . This means that should be 1 for almost every pair of , which shows that is actually very close to . Now, we will make this argument robust. To achieve this, we introduce the Frobenius norm of a matrix.

Definition 1 (Frobenius norm) Let $M$ be a matrix. The Frobenius norm of $M$ is

\begin{equation*} \|M\|_F = \sqrt{\sum_{i,j} M_{ij}^2} \end{equation*}

The following fact is a good exercise.

Fact 2 Let $M$ be a matrix. Then

\begin{equation*} \|M\| \leq \|M\|_F \end{equation*}

where $\|\cdot\|$ denotes the spectral norm.

To see how close are and , we calculate the Frobenius norm of , which will be

This gives us a bound on the spectral norm of , namely

Let be the unit eigenvector of corresponding to its largest eigenvalue, then by the Davis-Kahan theorem we have\footnote{When we apply the Davis-Kahan theorem, what we get is actually an upper bound on . We have assumed here that the bound holds for , but the exact same proof will also work in the other case.}

For any , if is a large enough constant then we will have . Now we have the following standard argument:

The last inequality is because every with will contribute at least 1 in the sum . This shows that our algorithm will misclassify at most vertices.


I was very saddened to hear that Corrado Böhm died today at age 94.

Böhm was one of the founding fathers of Italian computer science. His dissertation, from 1951, was one of the first (maybe the first? I don’t know the history of these ideas very well) examples of a programming language with a compiler written in the language itself. In the 1950s and 1960s he worked at the CNR (an Italian national research institution with its own technical staff), in the IAC (Institute for the Applications of Computing) directed by mathematician Mauro Picone. IAC was the second place in Italy to acquire a computer. In 1970 he moved to the University of Turin, where he was the founding chairman of the computer science department. In 1972 he moved to the Sapienza University of Rome, in the Math department, and in 1989 he was one of the founders of the Computer Science department at Sapienza. He remained at Sapienza until his retirement.

Böhm became internationally known for a 1966 result, joint with Giuseppe Jacopini, in which he showed, roughly speaking, that programs written in a language that includes goto statements (formalized as flow-charts) could be mapped to equivalent programs that don’t. The point of the paper was that the translation was “structural” and the translated program retained much of the structure and the logic of the original program, meaning that programmers could give up goto statements without having to fundamentally change the way they think.

Dijkstra’s famous “Go To Statement Considered Harmful” 1968 letter to CACM had two references, one of which was the Jacopini-Böhm theorem.

Böhm was responsible for important foundational work on lambda calculus, typed functional languages, and the theory of programming languages at large.

He was a remarkable mentor, many of whose students and collaborators (including a notable number of women) became prominent in the Italian community of theory of programming languages, and Italian academia in general.

In the photo above is Böhm with Simona Ronchi, Betti Venneri and Mariangiola Dezani, who all became prominent Italian professors.

You may also recognize the man on the right as a recent recipient of the Turing Award. Silvio Micali went to Sapienza to study math as an undergrad, and he worked with Böhm, who encouraged Silvio to pursue his PhD abroad.

I studied Computer Science at Sapienza, starting the first year that the major was introduced in 1989. I remember that when I first met Böhm he reminded me of Doc Brown from *Back to the Future*: a tall man with crazy white hair, speaking of wild ideas with incomprehensible technical terms, but with unstoppable enthusiasm.

One year, I tried attending a small elective class that he was teaching. My, probably imprecise, recollection of the first lecture is as follows.

He said that one vertex is a binary tree, and that if you connect two binary trees to a new root you also get a binary tree, then he asked us, how would you prove statements on binary trees by induction? The class stopped until we would say something. After some consultation among us, one of the smart kids proposed “by induction on the number of vertices?” Yes, said Böhm, that would work, but isn’t there a better way? He wanted us to come up by ourselves with the insight that, since binary trees have a recursive definition, one can do induction on the structure of the definition.

In subsequent lectures, we looked (without being told) at how to construct purely functional data structures. I dropped the class after about a month.

(Photo credits: corradobohm.it)

*In which we go over a more powerful (but difficult to compute) alternative to the spectral norm, and discuss how to approximate it.*

Today we’ll discuss a solution to the issue of high-degree vertices distorting spectral norms, which will prepare us for next lecture’s discussion on community detection in the stochastic block model using SDP. We’ll discuss a new kind of norm, the *infinity-to-one* norm, and find an efficient way to approximate it using SDP.

**1. Fun with Norms**

In the past few lectures, we’ve been heavily relying on the spectral norm,

which is efficiently computable and turns out to be pretty handy in a lot of cases.

Unfortunately, high-degree vertices have a disproportionately large influence on the spectral norm’s value, limiting its usefulness in graphs where such outliers exist. Often (as we did in lecture 9), people will try to modify the input so that there are no vertices of overly high degree or add some regularizing term to ameliorate this issue. Unfortunately, this can lead to less useful results – in lecture 9, for instance, we derived a bound that required that all high-degree vertices be excised from the input graph first.

In this lecture, we’ll attack this problem in a different way by introducing a different norm, the *infinity-to-one* norm, defined for an $n \times n$ matrix $M$ as follows:

$$ \|M\|_{\infty \rightarrow 1} = \max_{x, y \in \{-1,1\}^n} x^T M y $$

It can be shown that

$$ \|M\|_{\infty \rightarrow 1} \leq n \cdot \|M\| $$

So the spectral norm always gives us a bound for the infinity-to-one norm. However, the infinity-to-one norm can come in even more handy than the spectral norm (if we can actually calculate it):

Theorem 1 Let $P$ be a symmetric matrix with entries in $[0,1]$. Pick a random graph $G$ such that $\{i,j\}$ is in $E$ w.p. $P_{i,j}$. Then whp:

For context, recall that we proved a similar bound in Lecture 9 for the spectral norm, but we required that all nodes with degree greater than twice the average degree be removed first. The infinity-to-one norm allows us to bypass that restriction entirely.

*Proof:* Fix . Examine the expression

We want to make sure that this probability exponentially decreases w.r.t. .

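Recall that Bernstein’s inequality states that given independent random variables $X_1, \ldots, X_N$, absolutely bounded by $K$, with expectation $0$, we have $\Pr\left[\sum_k X_k > t\right] \leq \exp\left(-\frac{t^2/2}{\sum_k \mathbb{E}[X_k^2] + Kt/3}\right)$.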

is either or and therefore is bounded by . Since , we can have the from Bernstein’s inequality take on value . Combining this with the fact that

Bernstein’s inequality gives us

provided that , as desired.

So this norm allows us to easily sidestep the issue of high-degree vertices. The problem is that it’s NP-hard to compute.
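To make the definition concrete on tiny inputs, here is a minimal brute-force sketch. It assumes the infinity-to-one norm is the maximum of $x^T M y$ over sign vectors $x, y \in \{-1,1\}^n$, and it compares the result with $n$ times the spectral norm. The function names are illustrative, the spectral norm is only estimated (by power iteration), and the enumeration takes exponential time, which is consistent with the hardness remark above.

```python
from itertools import product
import math

def inf_to_one_norm(M):
    """Brute-force max of x^T M y over x, y in {-1, 1}^n (exponential time)."""
    n = len(M)
    best = -math.inf
    for y in product((-1, 1), repeat=n):
        # For a fixed y the best x_i is the sign of (M y)_i, giving ||M y||_1.
        My = [sum(M[i][j] * y[j] for j in range(n)) for i in range(n)]
        best = max(best, sum(abs(v) for v in My))
    return best

def spectral_norm(M, iters=500):
    """Estimate the largest singular value by power iteration on M^T M."""
    n = len(M)
    v = [1.0 + i for i in range(n)]  # arbitrary deterministic start vector
    for _ in range(iters):
        Mv = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        w = [sum(M[i][j] * Mv[i] for i in range(n)) for j in range(n)]
        norm = math.sqrt(sum(t * t for t in w))
        v = [t / norm for t in w]
    Mv = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
    return math.sqrt(sum(t * t for t in Mv))

# Small symmetric example matrix (chosen for illustration only).
M = [[0, 1, 1],
     [1, 0, -1],
     [1, -1, 0]]
a = inf_to_one_norm(M)
b = spectral_norm(M)
```

For this particular matrix the eigenvalues are $-2, 1, 1$, so the spectral norm is $2$ and the brute-force value $6$ meets the bound $n \cdot \|M\|$ exactly.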

**2. Grothendieck’s Inequality**

However, it turns out that we can *approximate* the infinity-to-one norm to within a constant factor using SDP:

Theorem 2 (Grothendieck’s Inequality) There exists some $c$ (turns out to be around $1.7$, but we won’t worry about its exact value here) such that for all $M$,

$$ \max_{\|x_i\| = \|y_j\| = 1} \sum_{i,j} M_{i,j} \langle x_i, y_j \rangle \leq c \cdot \|M\|_{\infty \rightarrow 1} $$

So instead of dealing with the combinatorially huge problem of optimizing over all $\{-1,1\}$ vectors, we can just solve an SDP instead to get a good approximation. For convenience, let’s denote the quantity on the left of the above expression as $\|M\|_{SDP}$.

Let’s start with a warmup lemma. The proof techniques in this lemma – specifically, the trick of replacing a vector from a continuous distribution with a *random* vector from some discrete distribution, and then taking the expectation to relate the two quantities, will come in handy later on.

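Lemma 3 $\max_{x, y \in [-1,1]^n} x^T M y = \max_{x, y \in \{-1,1\}^n} x^T M y$. In other words, maximizing over the discrete space of $\{-1,1\}$ vectors and maximizing over the continuous space of vectors in $[-1,1]^n$ gives the same result.

*Proof:* The continuous maximum is at least the discrete one, since it’s just a relaxation. For the other direction: given some continuous vectors $x, y \in [-1,1]^n$, we can find random discrete vectors $\tilde x, \tilde y \in \{-1,1\}^n$ such that their expectations are equal to $x, y$. We can do that by having $\tilde x_i$ take on value $1$ w.p. $\frac{1+x_i}{2}$ and $-1$ w.p. $\frac{1-x_i}{2}$, independently for each $i$, and likewise for $\tilde y$.

That means that:

$$ \max_{x', y' \in \{-1,1\}^n} x'^T M y' \geq \mathbb{E}\left[\tilde x^T M \tilde y\right] = x^T M y $$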

which gives us the desired result.
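The expectation argument can be checked exactly on a small instance. The sketch below assumes the rounding rule is "round each coordinate independently to $\pm 1$ so that its mean equals the original coordinate"; the matrix, the vectors, and the function names are made up for illustration. Using `Fraction` keeps the arithmetic exact.

```python
from fractions import Fraction
from itertools import product

def bilinear(M, x, y):
    """Compute x^T M y."""
    return sum(M[i][j] * x[i] * y[j] for i in range(len(x)) for j in range(len(y)))

def expected_rounded_value(M, x, y):
    """Exact E[x~^T M y~] when every coordinate of x and y is independently
    rounded to +1 or -1 with mean equal to the original coordinate."""
    z = list(x) + list(y)
    n = len(x)
    total = Fraction(0)
    for signs in product((-1, 1), repeat=len(z)):
        prob = Fraction(1)
        for s, v in zip(signs, z):
            prob *= (1 + s * v) / 2  # P[+1] = (1+v)/2, P[-1] = (1-v)/2
        total += prob * bilinear(M, signs[:n], signs[n:])
    return total

# Illustrative data: a 2x2 matrix and two points of [-1, 1]^2.
M = [[2, -1], [3, 1]]
x = [Fraction(1, 2), Fraction(-1, 3)]
y = [Fraction(1, 4), Fraction(1)]

value = bilinear(M, x, y)
expectation = expected_rounded_value(M, x, y)
best_discrete = max(bilinear(M, s[:2], s[2:]) for s in product((-1, 1), repeat=4))
```

Since the expectation equals the continuous value, at least one sign pattern must do at least as well, which is why `best_discrete` is never smaller than `value`.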

Fact 4 $\|\cdot\|_{SDP}$ is a norm.

*Proof:* **Multiplicative scaling:** obvious.

**Nonnegativity (zero iff $M = 0$):** It is obvious that the SDP norm is zero if $M$ is zero.

Now suppose $M$ is nonzero.

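Notice that we can replace the constraints in the SDP norm requiring that $\|x_i\| = 1$, $\|y_j\| = 1$ with $\|x_i\| \leq 1$, $\|y_j\| \leq 1$. Why? We’ll use the same trick as we did in the proof of Lemma 3:

Obviously maximizing over $\|x_i\| \leq 1$, $\|y_j\| \leq 1$ will give us at least as good a result as maximizing over $\|x_i\| = 1$, $\|y_j\| = 1$, since it’s a relaxation, so it suffices to show that if we can obtain some value for $\sum_{i,j} M_{i,j} \langle x_i, y_j \rangle$ using $\|x_i\| \leq 1$, $\|y_j\| \leq 1$, we can do at least as well using $\|x_i\| = 1$, $\|y_j\| = 1$.

Let’s suppose we have some vectors $x_i, y_j$ with length at most $1$. Now let’s replace them with independent *random* vectors $\tilde x_i, \tilde y_j$ of length exactly $1$ whose expectations are $x_i, y_j$ respectively (just scale the $x_i, y_j$ up to unit length, and have $\tilde x_i, \tilde y_j$ be either the scaled value or its negative with probability calibrated appropriately so their expectations work out to be $x_i, y_j$). Then we can just say:

$$ \sum_{i,j} M_{i,j} \langle x_i, y_j \rangle = \mathbb{E} \sum_{i,j} M_{i,j} \langle \tilde x_i, \tilde y_j \rangle $$

So $\tilde x_i, \tilde y_j$ must take on some values (of length exactly $1$) that make $\sum_{i,j} M_{i,j} \langle \tilde x_i, \tilde y_j \rangle \geq \sum_{i,j} M_{i,j} \langle x_i, y_j \rangle$, as desired.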

Since we assumed $M$ is nonzero, let $i$, $j$ be such that $M_{i,j} \neq 0$. If we set $x_i = \pm y_j$ to be some arbitrary vector of unit length – using a plus if $M_{i,j}$ is positive and a minus if it’s negative – and all other vectors to zero (which we can do without affecting the value of the max by the fact we just proved), we can immediately see that

$$ \|M\|_{SDP} \geq |M_{i,j}| > 0 $$

giving a positive lower bound for the maximum, as desired.

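**Triangle inequality:** Just look at the $x_i$ and $y_j$ that attain the maximum for the sum $M + N$ of two matrices $M$ and $N$, and observe that

$$ \|M + N\|_{SDP} = \sum_{i,j} M_{i,j} \langle x_i, y_j \rangle + \sum_{i,j} N_{i,j} \langle x_i, y_j \rangle \leq \|M\|_{SDP} + \|N\|_{SDP} $$

since we can always match the quantity on the left-hand side by choosing the same $x_i$ and $y_j$ for both terms on the right-hand side.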

**2.1. Proof of Grothendieck’s inequality**

Now let’s prove Grothendieck’s inequality:

*Proof:* Observe that, by Lemma 3, maximizing over the choice of $x, y \in \{-1,1\}^n$ in the infinity-to-one norm is equivalent to maximizing over the choice of $x, y \in [-1,1]^n$, so we can rewrite our proof goal as

Let be of length and be the optimal choices used in , i.e. the maximizers of . It suffices to prove that there exist vectors (for some fixed constant , since we can just scale the result by changing ) to plug into the right-hand side of the above expression such that .

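Pick a random vector $g$ of the same dimension as the $x_i$ and $y_j$, with each coordinate being drawn independently from the normal distribution with mean $0$ and variance $1$, and let $s_i = \langle g, x_i \rangle$ and $t_j = \langle g, y_j \rangle$. Then

$$ \mathbb{E}[s_i t_j] = \mathbb{E}\left[x_i^T g g^T y_j\right] = x_i^T \, \mathbb{E}\left[g g^T\right] y_j $$

But $\mathbb{E}[g g^T]$ is just the identity, since the diagonal elements are just the expectation of the square of the Gaussian (which is $1$, its variance) and the off-diagonals are the expectation of the product of two independent Gaussians, which is zero (since the expectation of each individual Gaussian is zero).

So

$$ \mathbb{E} \sum_{i,j} M_{i,j} s_i t_j = \sum_{i,j} M_{i,j} \langle x_i, y_j \rangle $$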

At this point we’ve got something that looks a lot like what we want: if the expectation of is equal to , then maximizing over them is definitely going to give us a quantity greater than or equal to . Unfortunately, there’s an issue here: since Gaussian random variables are unbounded, and are unbounded; on the other hand, what we’re trying to do is maximize over vectors whose elements are bounded by .

Our approach will be to “clip” the and to a finite range, and then bound the error introduced by the clipping process. Formally, fix a constant and pick a Gaussian random vector . Now define a ‘clipped’ inner product:

and likewise for . For convenience, let’s define the *truncation error* as follows, to represent how far the clipped value differs from the actual value.
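As a small illustration of the clipping step, the sketch below assumes that "clipping" means projecting a real number onto the interval $[-B, B]$; the lecture’s exact definition may differ (for example, it could zero out large values instead). The constant `B` and the sample values are hypothetical. Either way, the truncation error vanishes whenever the value is at most `B` in absolute value, and dividing the clipped value by `B` lands it in $[-1, 1]$.

```python
def clip(z, B):
    """Project a real number z onto the interval [-B, B]."""
    return max(-B, min(B, z))

def truncation_error(z, B):
    """What clipping removes; zero whenever |z| <= B."""
    return z - clip(z, B)

B = 2.0
samples = [-3.5, -2.0, -0.4, 0.0, 1.9, 2.6]  # hypothetical Gaussian projections
clipped = [clip(z, B) for z in samples]
errors = [truncation_error(z, B) for z in samples]
scaled = [c / B for c in clipped]  # rescaled values lie in [-1, 1]
```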

So we can rewrite as:

and similarly we can define:

So now we have

Clearly is just the SDP norm , since and were the ‘original’ and .

What we will show now is that the remaining three terms are bounded by constant multiples of the SDP norm, so the sum of all four terms is within a constant factor of it. The analysis is the same for all three terms so, for brevity, we’ll just look at the last one:

For convenience, let’s define as:

Now, it’s convenient to think of the and as vectors of infinite dimension indexed by input (we’ll bring the dimensionality down to a finite value at the end of the proof). Let’s define the following inner product and norm in this space:

Now, since the Gaussian distribution is rotation-invariant (we can rotate a Gaussian random vector without changing its distribution), the squared norms of the and the are all the same (since all the , have length $1$, and dotting with them can be thought of as a rotation). That means that the above norm takes on the same value for all , , so all we need to do is figure out a constant bound on it.

Fortunately, this value is pretty easy to bound. Notice that the function is zero if the dot product’s absolute value is smaller than . If we have ,

for sufficiently large , which is a constant.

So if the squared norms are bounded above by , which means we can substitute this and the norm bound into the above expression to get:

Now, armed with this, we can conclude (with similar results for the other two error terms)

which tells us that , is within a constant bound of the SDP norm as desired.

There’s only one slightly fishy bit in the proof we used above, though, and that’s the treatment of functions of infinite-dimensional vectors indexed by a Gaussian vector. Let’s conclude by constructing a (finite-dimensional) solution to the SDP from the functions:

**Claim:** If , then .

**Proof:**

as desired.

So the matrix comprises a nice finite-dimensional solution to the SDP, and we’re done with the proof.

Also, noticing that we have four terms in the expansion of , each one of which is a feasible value for the SDP, we can figure out how much we deviate from the actual optimum, i.e. . Since each term can’t exceed the value of the SDP optimum at all, is separated from by a factor of four, giving us a bound on the constant in the inequality.

To recap, notice that there were two key “tricks” here:

1) Assuming that we were rounding an optimal solution to our SDP (i.e. starting with as optimizers). We don’t get any bounds otherwise!

2) Treating the rounding error itself as a feasible solution of the SDP.

This proof was communicated to us by James Lee.

Scribed by Chinmay Nirkhe

*In which we explore the Stochastic Block Model.*

**1. The problem**

The *Stochastic Block Model* is a generic model for graphs generated by some parameters. The simplest model, and the one we will consider today, is the $G_{n,p,q}$ problem.

Definition 1 ($G_{n,p,q}$ graph distribution) The $G_{n,p,q}$ distribution is a distribution on graphs of $n$ vertices where $V$ is partitioned into two subsets of equal size: $V = V_1 \sqcup V_2$. Then for each pair $\{i,j\}$ of vertices in the same subset, $\Pr[\{i,j\} \in E] = p$, and otherwise $\Pr[\{i,j\} \in E] = q$.

We will only consider the regime under which $p > q$. If we want to find the partition $V_1, V_2$, it is intuitive to look at the problem of finding the minimum balanced cut. The cut $(V_1, V_2)$ has expected size $q n^2 / 4$ and any other balanced cut will have greater expected size.

Our intuition should be that as $q \rightarrow p$, the problem only gets harder. And for fixed ratio $p/q$, as $p$ and $q$ grow, the problem only gets easier. This can be stated rigorously as follows: if we can solve the problem for $p, q$ then we can also solve it for $cp, cq$ where $c > 1$, by keeping each edge independently with probability $1/c$ and reducing to the case we can solve.
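The reduction in the last sentence can be sketched in a few lines. This assumes the denser instance has edge probabilities $c \cdot p$ and $c \cdot q$ for some $c \geq 1$ (the symbol names are assumptions), so that keeping each edge independently with probability $1/c$ yields an instance with probabilities $p$ and $q$. The function name and the edge list are illustrative.

```python
import random

def subsample_edges(edges, c, rng):
    """Keep each edge independently with probability 1/c (requires c >= 1)."""
    keep_prob = 1.0 / c
    return [e for e in edges if rng.random() < keep_prob]

rng = random.Random(0)  # seeded only to make the sketch reproducible
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]
thinned = subsample_edges(edges, c=2.0, rng=rng)    # each edge survives w.p. 1/2
unchanged = subsample_edges(edges, c=1.0, rng=rng)  # c = 1 keeps every edge
```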

Recall that for the $k$-planted clique problem, we found the eigenvector $x$ corresponding to the largest eigenvalue of $A - \frac{1}{2} J$. We then defined $S$ as the $k$ vertices $i$ with the largest values of $|x_i|$ and cleaned up a little to get our guess for the planted clique.

In the Stochastic Block Model we are going to follow a similar approach, but we are instead going to find the largest eigenvalue of $A - \frac{p+q}{2} J$. Note this is intuitive as the average degree of the graph is about $\frac{p+q}{2} n$. The idea is simple: compute the eigenvector $x$ corresponding to the largest eigenvalue and define

As we proceed to the analysis of this procedure, we fix $V_1, V_2$. Prior to fixing, the average adjacency matrix was $\frac{p+q}{2} J$. (The diagonal should be zeroes, but this is close enough.) Upon fixing $V_1, V_2$, the average adjacency matrix looks different. For ease of notation, if we write a bold constant $\mathbf{c}$ for a matrix, we mean the matrix $cJ$. It will be clear from context.

$$ \mathbb{E}[A] = \begin{pmatrix} \mathbf{p} & \mathbf{q} \\ \mathbf{q} & \mathbf{p} \end{pmatrix} $$

Here we have broken up $\mathbb{E}[A]$ into blocks according to the partition $V_1, V_2$.

Theorem 2 If then with high probability, .

*Proof:* Define the graph $G_1$ as the union of a $G_{n/2,p}$ graph on $V_1$ and a $G_{n/2,p}$ graph on $V_2$. Define the graph $G_2$ as a $G_{n,q}$ graph. Note that the graph $G$ is distributed according to picking a $G_1$ and a $G_2$ graph and adding the partition-crossing edges of $G_2$ to $G_1$. Let $A_1$ and $A_2$ be the respective adjacency matrices and define the following submatrices:

Then the adjacency matrix is defined by

Similarly, we can generate a decomposition for :

Then using triangle inequality we can bound by bounding the difference in the various terms.

The last line follows as the submatrices are adjacency matrices of graphs and we can apply the results we proved in that regime for .

But the difficulty is that we don’t know as . If we knew , then we would know the partition. What we can compute is . (The rest of this proof actually doesn’t even rely on knowing or . We can estimate by calculating the average vertex degree.) We can rewrite as

Call the matrix on the right . It is clearly rank-one as it has decomposition where . Therefore

Then is close (in operator norm) to the rank 1 matrix . Then their largest eigenvalues are close. But since has only one non-zero eigenvalue , the eigenvector corresponding to the largest eigenvalue of will be close to the ideal partition, as describes the ideal partition. This can be formalized with the Davis-Kahan Theorem.

Theorem 3 (Davis-Kahan) Given matrices with where has eigenvalues and corresponding eigenvectors and has eigenvalues and corresponding eigenvectors , then

Equivalently,

The Davis-Kahan Theorem with , and states that

where , the eigenvector associated with the largest eigenvalue of and , the expected degrees of the two parts of the graph. Choose between for the one closer to . Then

Recall that . If and disagree in sign, then this contributes at least to the value of . Equivalently, is at least the number of misclassified vertices. It is simple to see from here that if then we can bound the number of misclassified vertices by . This completes the proof that the proposed algorithm does well in calculating the partition of the Stochastic Block Model.
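The algorithm analyzed above can be sketched as follows, under several assumptions: $p$ and $q$ are taken as known, the shifted matrix is $A - \frac{p+q}{2} J$ with $J$ the all-ones matrix, and the top eigenvector is found by a plain power iteration (swapped in here for a proper eigensolver so that the example needs only the standard library). The function names are illustrative, and the instance is a noiseless toy case ($p = 1$, $q = 0$), so it shows only the mechanics and not the high-probability guarantee.

```python
import math

def top_eigenvector(M, iters=300):
    """Power iteration on M + shift*I. The shift is a Gershgorin-style bound that
    makes all eigenvalues nonnegative, so we converge to the largest eigenvalue of M."""
    n = len(M)
    shift = max(sum(abs(v) for v in row) for row in M)
    x = [1.0 + i for i in range(n)]  # arbitrary deterministic start vector
    for _ in range(iters):
        y = [sum(M[i][j] * x[j] for j in range(n)) + shift * x[i] for i in range(n)]
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    return x

def spectral_bisection(A, p, q):
    """Split vertices by the sign of the top eigenvector of A - ((p+q)/2) J."""
    n = len(A)
    c = (p + q) / 2
    M = [[A[i][j] - c for j in range(n)] for i in range(n)]
    x = top_eigenvector(M)
    side = {i for i in range(n) if x[i] > 0}
    return side, set(range(n)) - side

# Noiseless toy instance (p = 1, q = 0): two disjoint 4-cliques on 8 vertices.
n = 8
A = [[1 if i != j and (i < 4) == (j < 4) else 0 for j in range(n)] for i in range(n)]
parts = spectral_bisection(A, p=1.0, q=0.0)
```

On this instance the shifted matrix equals $\frac{1}{2} \chi \chi^T - I$, where $\chi$ is the $\pm 1$ indicator of the two cliques, so the sign pattern of the top eigenvector recovers the planted partition exactly.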
