An Alternative to the Seddighin-Hajiaghayi Ranking Methodology

[Update 10/24/14: there was a bug in the code I wrote last night; apologies to the colleagues at Rutgers!]

[Update 10/24/14: a reaction to the authoritative study of MIT and the University of Maryland. Also, coincidentally, today Scott Adams comes down against reputation-based rankings]

Saeed Seddighin and MohammadTaghi Hajiaghayi have proposed a ranking methodology for theory groups based on the following desiderata: (1) the ranking should be objective, and based only on quantitative information and (2) the ranking should be transparent, and the methodology openly revealed.

Inspired by their work, I propose an alternative methodology that meets both criteria, but has some additional advantages, including having an easier implementation. Based on the same Brown University dataset, I count, for each theory group, the total number of letters in the name of each faculty member.
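
For concreteness, here is a minimal sketch of the letter-counting computation. The school names and faculty names in `example` are made-up placeholders, not entries from the Brown dataset, and the tie-breaking rule (alphabetical by school) is my own assumption.

```python
def letter_count(name: str) -> int:
    """Count the alphabetic characters in a name (spaces, hyphens, periods ignored)."""
    return sum(1 for ch in name if ch.isalpha())


def rank_groups(faculty_by_school: dict[str, list[str]]) -> list[tuple[int, int, str]]:
    """Return (rank, total letters, school), highest total first."""
    totals = {
        school: sum(letter_count(name) for name in names)
        for school, names in faculty_by_school.items()
    }
    # Assumption: ties are broken alphabetically by school name.
    ordered = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, total, school) for rank, (school, total) in enumerate(ordered, start=1)]


# Hypothetical toy data, for illustration only.
example = {
    "School A": ["Alice Example", "Bob Placeholder"],
    "School B": ["Carol Sample"],
}

for rank, total, school in rank_groups(example):
    print(f"{rank} ( {total} ) {school}")
```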

Here are the results (apologies for the poor formatting):

1 ( 201 ) Massachusetts Institute of Technology
2 ( 179 ) Georgia Institute of Technology
3 ( 146 ) Rutgers – State University of New Jersey – New Brunswick
4 ( 142 ) University of Illinois at Urbana-Champaign
5 ( 141 ) Princeton University
6 ( 139 ) Duke University
7 ( 128 ) Carnegie Mellon University
8 ( 126 ) University of Texas – Austin
9 ( 115 ) University of Maryland – College Park
10 ( 114 ) Texas A&M University
11 ( 111 ) Northwestern University
12 ( 110 ) Stanford University
13 ( 108 ) Columbia University
14 ( 106 ) University of Wisconsin – Madison
15 ( 105 ) University of Massachusetts – Amherst
16 ( 105 ) University of California – San Diego
17 ( 98 ) University of California – Irvine
18 ( 94 ) New York University
19 ( 94 ) State University of New York – Stony Brook
20 ( 93 ) University of Chicago
21 ( 91 ) Harvard University
22 ( 91 ) Cornell University
23 ( 87 ) University of Southern California
24 ( 87 ) University of Michigan
25 ( 85 ) University of Pennsylvania
26 ( 84 ) University of California – Los Angeles
27 ( 81 ) University of California – Berkeley
28 ( 78 ) Dartmouth College
29 ( 76 ) Purdue University
30 ( 71 ) California Institute of Technology
31 ( 67 ) Ohio State University
32 ( 63 ) Brown University
33 ( 61 ) Yale University
34 ( 54 ) University of Rochester
35 ( 53 ) University of California – Santa Barbara
36 ( 53 ) Johns Hopkins University
37 ( 52 ) University of Minnesota – Twin Cities
38 ( 49 ) Virginia Polytechnic Institute and State University
39 ( 48 ) North Carolina State University
40 ( 47 ) University of Florida
41 ( 45 ) Rensselaer Polytechnic Institute
42 ( 44 ) University of Washington
43 ( 44 ) University of California – Davis
44 ( 44 ) Pennsylvania State University
45 ( 40 ) University of Colorado Boulder
46 ( 38 ) University of Utah
47 ( 36 ) University of North Carolina – Chapel Hill
48 ( 33 ) Boston University
49 ( 31 ) University of Arizona
50 ( 30 ) Rice University
51 ( 14 ) University of Virginia
52 ( 12 ) Arizona State University
53 ( 12 ) University of Pittsburgh

I should acknowledge a couple of limitations of this methodology: (1) the Brown dataset is not current, but I believe that the results would not be substantially different even with current data, (2) it might be reasonable to only count the letters in the last name, or to weigh the letters in the last name by 1 and the letters in the first name by 1/2. If there is sufficient interest, I will post rankings according to these other methodologies.

Lies, Damn Lies, and Predictions

Oh man, not another election! Why do we have to choose our leaders? Isn’t that what we have the Supreme Court for?
— Homer Simpson

Nate Silver is now putting Barack Obama’s chance of reelection at around 85%, and he has been on the receiving end of considerable criticism from supporters of Mitt Romney. Some have criticized his statistical analysis by pointing out that he has a soft voice and he is not fat (wait, what? read for yourself – presumably the point is that Silver is gay and that gay people cannot be trusted with such manly pursuits as statistics), but the main point seems to be: if Romney wins the election then Silver and his models are completely discredited. (E.g. here.) This is like someone saying that a die has approximately an 83% probability of not turning up a 2, and others saying, if I roll a die and it turns up a 2, this whole “probability” thing that you speak of is discredited.

But still, when someone offers predictions in terms of probability, rather than simply stating that a certain outcome is more likely, how can we evaluate the quality of such predictions?

In the following let us assume that we have a sequence of binary events, and that each event i has a probability p_i of occurring as a 1 and 1-p_i of occurring as 0. A predictor gives out predicted probabilities q_i, and then events E_i happen. Now what? How would we score the predictions? Equivalently, how would we fairly compensate the predictor?

A simple way to “score” the prediction is to say that for each event we have a “penalty” that is |E_i - q_i|, or a score that is 1 - |E_i - q_i|. For example, the prediction that the correct event happens with 100% probability gets a score of 1, but the prediction that the correct event happens with 85% probability gets a score of .85.

Unfortunately this scoring system is not “truthful,” that is, it does not encourage the predictor to tell us the true probabilities. For example, suppose that a predictor has computed the probability of an event as 85% and is very confident in the accuracy of the model. Then, if he publishes the accurate prediction, he is going to get a score of .85 with probability .85 and a score of .15 with probability .15, for an expected score of .85 × .85 + .15 × .15 = .745. So he is worse off than if he had published the prediction of the event happening with probability 100%, in which case the expected score is .85. In general, the scheme makes it always advantageous to round the probability to 0% or 100%.
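
The arithmetic can be checked with a short sketch. The function name `expected_linear_score` is mine; it computes the expected value of the score 1 - |E - q| when the event occurs with true probability p and the published prediction is q.

```python
def expected_linear_score(true_p: float, published_q: float) -> float:
    """Expected value of 1 - |E - q| when E = 1 with probability true_p."""
    score_if_event_happens = published_q          # E = 1, so 1 - |1 - q| = q
    score_if_event_fails = 1.0 - published_q      # E = 0, so 1 - |0 - q| = 1 - q
    return true_p * score_if_event_happens + (1.0 - true_p) * score_if_event_fails


honest = expected_linear_score(0.85, 0.85)    # about 0.745
rounded = expected_linear_score(0.85, 1.00)   # 0.85

print(f"honest report of 85%: expected score {honest:.3f}")
print(f"rounded up to 100%:   expected score {rounded:.3f}")
```

Since the expected score is linear in q, it is maximized at an endpoint: q = 1 whenever p > 1/2 and q = 0 whenever p < 1/2.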

Is there a truthful scoring system? I am not sure what the answer is.

If one is scoring multiple predictions of independent events, one can look at all the cases in which the prediction was, say, in the range of 80% to 90%, and see if indeed the event happened, say, a fraction between 75% and 95% of the time, and so on.
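
Here is a rough sketch of this bucketing check. The function name, the choice of ten equal-width buckets, and the sample numbers are all assumptions made for illustration.

```python
from collections import defaultdict


def calibration_table(predictions, outcomes, num_buckets=10):
    """Group predictions into equal-width buckets and report, for each
    non-empty bucket index, (number of predictions, observed frequency of 1s)."""
    buckets = defaultdict(list)
    for q, event in zip(predictions, outcomes):
        # A prediction of exactly 1.0 goes into the last bucket.
        index = min(int(q * num_buckets), num_buckets - 1)
        buckets[index].append(event)
    return {
        index: (len(events), sum(events) / len(events))
        for index, events in sorted(buckets.items())
    }


# Hypothetical predictions and outcomes.
predictions = [0.85, 0.82, 0.88, 0.81, 0.60]
outcomes = [1, 1, 0, 1, 0]

for index, (count, frequency) in calibration_table(predictions, outcomes).items():
    low, high = index * 10, (index + 1) * 10
    print(f"predicted {low}%-{high}%: {count} events, happened {frequency:.0%} of the time")
```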

One disadvantage of this approach is that it seems to require a discretization of the probabilities, which seems like an arbitrary choice and one that could affect the final score quite substantially. Is there a more elegant way to score multiple independent events without resorting to discretization? Can it be proved to be truthful?

Another observation is that such an approach is still not entirely truthful if it is applied to events that happen sequentially. Indeed, suppose that we have a series of, say, 10 events for which we predicted a 60% probability of a 1, and the event 1 happened 7 out of 10 times. Now we have to make a prediction of a new event, for which our model predicts a 10% probability. We may then want to publish a 60% prediction, because this will help even out the “bucket” of 60% predictions.

I don’t think that there is any way around the previous problem, though it seems clear that it would affect only a small fraction of the predictions. (The complexity theorists among the readers may remember similar ideas being used in a paper of Feigenbaum and Fortnow.)

Surely the task of scoring predictions must have been studied in countless papers, and the answers to the above questions must be well known, although I am not sure what the right keywords are to search for such work. In computer science, there are a lot of interesting results about using expert advice, but they are all concerned with how you score your own way of picking which expert to trust rather than the experts themselves. (This means that the predictions of the experts are not affected by the scoring system, unlike the setting discussed in this post.)

Please contribute ideas and references in the comments.

The New York Times on False Positives and False Negatives

From this New York Times article:

Researchers found the home test accurate 99.98 percent of the time for people who do not have the virus. By comparison, they found it to be accurate 92 percent of the time in detecting people who do. […]

So, while only about one person in 5,000 would get a false negative test, about one person in 12 could get a false positive.
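
A quick check of the arithmetic, under my reading of the first quoted paragraph: 99.98% accuracy on people without the virus means a false positive rate of 0.02%, and 92% accuracy on people with the virus means a false negative rate of 8%. The variable names are mine.

```python
from fractions import Fraction

# Accuracy figures as quoted in the article.
accuracy_without_virus = Fraction("0.9998")   # correct negative results
accuracy_with_virus = Fraction("0.92")        # correct positive results

# A false positive can only happen to someone who does NOT have the virus;
# a false negative can only happen to someone who DOES.
false_positive_rate = 1 - accuracy_without_virus
false_negative_rate = 1 - accuracy_with_virus

print(f"false positives: one in {float(1 / false_positive_rate):g} people without the virus")
print(f"false negatives: one in {float(1 / false_negative_rate):g} people with the virus")
```

Under this reading, one in 5,000 is the false positive figure and about one in 12 is the false negative figure, the reverse of how the quoted sentence assigns them.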

Lies, Damn Lies, and Spaceflight Safety

From an interview with Ed Mango, head of NASA’s commercial crew program, in which he discusses safety requirements for commercial entities who want to subcontract flights to the ISS from NASA.

Chaikin: And the probability of “loss of crew” has to be better than 1 in 1000?

Mango: Yes and no. What we’ve done is we’ve separated those into what you need for ascent and what you need for entry. For ascent it’s 1 in 500, and independently for entry it’s 1 in 500. We don’t want industry … to [interpret the 1-in-1,000 requirement] to say, “We’ve got a great ascent; we don’t need as much descent protection.” In reality we’ve got to protect the life of the crew all the time.

Now [the probability for] the mission itself is 1 in 270. That is an overall number. That’s loss of crew for the entire mission profile, including ascent, on-orbit, and entry. The thing that drives the 1 in 270 is really micrometeorites and orbital debris … whatever things that are in space that you can collide with. So that’s what drops that number down, because you’ve got to look at the 210 days, the fact that your heat shield or something might be exposed to whatever that debris is for that period of time. NASA looks at Loss of Vehicle the same as Loss of Crew. If the vehicle is damaged and it may not be detected prior to de-orbit, then you have loss of crew.

What does “yes” mean in the “yes and no” answer? Also, with a 1/500 probability of accident at takeoff and an independent 1/500 probability of accident at landing, we are already at a 1/250.2 probability of accident, so how do we get to 1/270 after adding accidents in mid-flight?
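
The 1/250.2 figure comes from combining the two independent phases, as in the following sketch (the function name is mine, and independence of the phases is taken from the interview's wording):

```python
from fractions import Fraction


def combined_failure(*phase_probabilities):
    """Probability that at least one of several independent phases fails."""
    survival = Fraction(1)
    for p in phase_probabilities:
        survival *= 1 - p
    return 1 - survival


ascent = Fraction(1, 500)
entry = Fraction(1, 500)
ascent_and_entry = combined_failure(ascent, entry)

print(f"ascent + entry: 1 in {float(1 / ascent_and_entry):.2f}")
print("already worse than 1 in 270:", ascent_and_entry > Fraction(1, 270))
```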

In not entirely unrelated news, a member of the board of Florida’s 3rd district took, and failed, the Florida Comprehensive Assessment Test (FCAT), a standardized test, as documented in two posts on the Washington Post blog.

Relevant quote:

“I won’t beat around the bush. The math section had 60 questions. I knew the answers to none of them, but managed to guess ten out of the 60 correctly. On the reading test, I got 62% . In our system, that’s a ‘D,’ and would get me a mandatory assignment to a double block of reading instruction.

“It seems to me something is seriously wrong. I have a bachelor of science degree, two masters degrees, and 15 credit hours toward a doctorate. I help oversee an organization with 22,000 employees and a $3 billion operations and capital budget, and am able to make sense of complex data related to those responsibilities….

Here is a sample of the math portion of the 10th grade FCAT, the most advanced one. Sample question: “An electrician charges a $45 fee to make a house call plus an hourly rate for labor. If the electrician works at one house for 3 hours and charges $145.50 for the job, what is the electrician’s hourly rate?” You can use a calculator.
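
For the record, the sample question reduces to one subtraction and one division. A sketch, with variable names of my own choosing:

```python
# Figures from the sample question.
house_call_fee = 45.00
hours_worked = 3
total_charge = 145.50

# total_charge = house_call_fee + hourly_rate * hours_worked
hourly_rate = (total_charge - house_call_fee) / hours_worked

print(f"hourly rate: ${hourly_rate:.2f}")
```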

Lies, Damn Lies, and the Royal Society Report

A few days ago, the Royal Society released a report on “Global Scientific Collaboration in the 21st Century.” Usually, when a document has “in the 21st Century” in the title, it can only go downhill from there. (I once had to review a paper that started with “As we enter the second decade of the 21st Century…” and, hard as it is to believe, it did go further downhill from there.) But the Royal Society is one of the most hallowed of scientific institutions, so one might have still hoped for the best.

The report was widely quoted in the press as predicting that China would overtake the United States in scientific output by 2013.

Indeed, in Section 1.6 (pages 42-43), the report uses data provided by Elsevier to estimate the number of scientific papers produced in various countries. We’ll skip the objection that the number of papers is a worthless measure of scientific output and go to figure 1.6 in the report, reproduced below.

The figure plots the percentage of scientific papers coming out of various countries, and then proceeds to do a linear extrapolation of the percentages to create a projection for the future.

While such an approach shows China overtaking the US in 2013, it also shows, more ominously, China publishing 110% of all scientific papers by 2100. (The report concedes that linear extrapolation might not make a lot of sense, yet the picture is there.)
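
To see how a straight line through percentages runs off the chart, here is a toy sketch. The two data points are invented for illustration and are not the Elsevier figures used in the report; the function name is also mine.

```python
def extrapolate_linear(x0, y0, x1, y1, x):
    """Value at x of the straight line through (x0, y0) and (x1, y1)."""
    slope = (y1 - y0) / (x1 - x0)
    return y0 + slope * (x - x0)


# Hypothetical share of world papers, in percent (NOT data from the report).
year_a, share_a = 2000, 5.0
year_b, share_b = 2008, 13.0

for year in (2013, 2050, 2100):
    share = extrapolate_linear(year_a, share_a, year_b, share_b, year)
    print(f"{year}: {share:.0f}% of all papers")
```

With these made-up numbers the line gains one percentage point per year, so it crosses 100% before the end of the century; nothing in a linear fit knows that a share cannot exceed 100%.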

Lies, Damn Lies, and National Review (Part III)

Deroy Murdock writes about the snow that has been falling in North America (not Vancouver, though!) and Europe in the past three months:

Forty-nine of these 50 United States simultaneously laughed at “global warming.” Absent Hawaii, every state in the Union had measurable snow on Saturday, February 13. An average eight inches covered 68.1 percent of the continental U.S., well above January’s more typical 51.2 percent. Mobile, Ala., saw its first snow in twelve years. Shreveport, La., got 5.4 inches. Dallas residents coped with 12.5 inches of snow — a one-day record.

Florida’s unusually cold coastal waters (in the 40s, in some places) have killed a record 280 manatees through February 12. As CNN reports, manatees like it above 68 degrees. Some now warm themselves in the 78-degree discharge of a Tampa Electric Company power plant.
February 12’s storm followed a Nor’easter that shuttered Washington, D.C.’s federal offices for four days. Through February 14, this winter’s 55.9 inches of snow smothered Washington’s previous record, 54.4 inches, from winter 1898–1899.

How about last month?

“In Europe, snowstorms and sub-zero temperatures severely disrupted air, rail, and road transport,” the U.S. National Ice Center’s Sean R. Helfrich concluded January 12. “The snow events impacted hundreds of millions of people world-wide, with a number of weather-related deaths reported.”

Meanwhile, December’s snow cover was North America’s greatest, and the Northern Hemisphere’s second greatest, in the 44 years measured.

Warmists correctly retort that three frigid months establish no pattern.

If Deroy Murdock had talked to a “warmist,” however, he might have found out that, far from being “frigid,” last month was the hottest January on record according to global satellite measurements (the record goes back 32 years), and the fourth warmest since 1880 according to surface measurements.

Lies, Damn Lies, and the Number of Sexual Partners

A few days ago, Gina Kolata reported in the New York Times on the paradox of studies on sexual behavior consistently reporting (heterosexual) men having more sexual partners than women, with a recent US study reporting men having a median number of 7 partners and women a median number of 4. Contrary to what’s stated in the paper, this is not mathematically impossible (key word: median). It is however quite implausible, requiring a relatively small number of women to account for a large fraction of all men’s partners.
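
Here is a toy construction showing that medians of 7 and 4 are mathematically possible. The population (100 men, 100 women, 20 of whom have 19 partners each) is entirely made up; the code builds an explicit assignment of partners so that every partnership is counted once on each side.

```python
from statistics import median

num_men = 100
# Hypothetical population: 20 women with 19 partners each, 80 women with 4 each.
women_counts = [19] * 20 + [4] * 80

# Assign each woman's partners to consecutive men, cycling through the men.
# Since no woman has more than num_men partners, her partners are all distinct.
men_counts = [0] * num_men
next_man = 0
for count in women_counts:
    for _ in range(count):
        men_counts[next_man % num_men] += 1
        next_man += 1

print("total partnerships:", sum(men_counts), "=", sum(women_counts))
print("median for men:", median(men_counts))
print("median for women:", median(women_counts))
share = sum(c for c in women_counts if c == 19) / sum(women_counts)
print(f"share of partnerships involving the 20 women: {share:.0%}")
```

In this example a fifth of the women account for more than half of all partnerships, which is the kind of skew the reported medians would require.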

An answer to this paradox can be found in Truth and consequences: using the bogus pipeline to examine sex differences in self-reported sexuality, by Michele Alexander and Terry Fisher, Journal of Sex Research 40(1), February 2003.

In their study, a sample of men and women are each divided into three groups and asked to fill out a survey on sexual behavior. People in one group filled out the survey alone in a room with an open door, a researcher sitting outside, and after being told the study was not anonymous; people in a second group filled out the survey in a room with a closed door and an explicit assurance of anonymity; people in a third group filled out the survey attached to what they believed to be a working “lie detector.”

In the first group, women reported on average 2.6 partners, men 3.7. In the second group, it was women 3.4 and men 4.2. In the third group, it was women 4.4 and men 4.0.

(The study looks at several other quantities, and some of them have even wider variance in the three settings.)

So, not surprisingly given the sexual double standards in our culture, men and women lie about their sexual behavior (men overstate, women understate), and do less so in an anonymous setting or when the lie is likely to be discovered.

Here is the reporting of the first group put to music:

[Update 8/18/07: so many people must have emailed her about the median versus average issue in the article that Gina Kolata wrote a clarification. Strangely, she does not explain, for the rest of the readers, what the difference is and why it is possible, if unlikely, to have very different medians for men and women. The claim in the clarification, by the way, is still wrong: those 9.4% of women with 15 or more partners could be accounting for all the missing sex.]

Lies, damn lies, and a Dutch trial

In CS70, the Berkeley freshman/sophomore class on discrete mathematics and probability for computer scientists, we conclude the section on probability with a class on how to lie with statistics. The idea is not to teach the students how to lie, but rather how not to be lied to. The lecture focuses on the correlation versus causation fallacy and on Simpson’s paradox.

My favorite way of explaining the correlation versus causation fallacy is to note that there is a high correlation between being sick and having visited a health care professional in the recent past. Hence we should prevent people from seeing doctors in order to make people healthier. Some HMOs in the US are already following this approach.

Today, a post in a New York Times science blog tells the story of a gross misuse of statistics in a Dutch trial that has now become a high-profile case. In the Dutch case two other, and common, fallacies have come up. One is, roughly speaking, neglecting to take a union bound. This is the fallacy of saying ‘I just saw the license plate California 3TDA614, what are the chances of that!’ The other is the computation of probabilities by making unwarranted independence assumptions.
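
The first fallacy can be illustrated numerically. In this sketch the one-in-a-million probability and the number of opportunities are invented, and the opportunities are treated as independent purely to keep the formula simple (which is, of course, exactly the kind of assumption the second fallacy warns about).

```python
def prob_at_least_one(p: float, n: int) -> float:
    """Probability that an event of probability p occurs at least once
    in n independent opportunities."""
    return 1.0 - (1.0 - p) ** n


# Hypothetical numbers: a "one in a million" coincidence,
# with a million separate opportunities for it to occur somewhere.
p = 1e-6
n = 1_000_000

print(f"a specific, pre-specified coincidence: {p:.6f}")
print(f"some such coincidence, somewhere:      {prob_at_least_one(p, n):.3f}")
print(f"union bound (n * p):                   {min(1.0, n * p):.3f}")
```

The surprise at a particular license plate comes from quoting the first number when the relevant one is the second.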

Feynman has written eloquently about both, but I don’t have the references at hand. In particular, when he wrote on his Space Shuttle investigation committee work, he remarked that official documents had given exceedingly low probabilities of a major accident (of the order of one millionth per flight or less), even though past events have shown this probability to be more of the order of 1%. The low number was obtained by summing the probabilities of various scenarios, and the probability of each scenario was obtained by multiplying estimates for the probabilities that the various things that had to go wrong for that scenario to occur would indeed go wrong.

Christos Papadimitriou has the most delightful story on this fallacy. He mentioned in a lecture the Faloutsos-Faloutsos-Faloutsos paper on power law distributions in the Internet graph. One student remarked, wow, what are the chances of all the authors of a paper being called Faloutsos!