Oh man, not another election! Why do we have to choose our leaders? Isn’t that what we have the Supreme Court for?
— Homer Simpson
Nate Silver is now putting Barack Obama’s chance of reelection at around 85%, and he has been on the receiving end of considerable criticism from supporters of Mitt Romney. Some have criticized his statistical analysis by pointing out that he has a soft voice and he is not fat (wait, what? read for yourself – presumably the point is that Silver is gay and that gay people cannot be trusted with such manly pursuits as statistics), but the main point seems to be: if Romney wins the election then Silver and his models are completely discredited. (E.g. here.) This is like someone saying that a die has approximately an 83% probability of not turning up a 2, and others saying, if I roll a die and it turns up a 2, this whole “probability” thing that you speak of is discredited.
But still, when someone offers predictions in terms of probability, rather than simply stating that a certain outcome is more likely, how can we evaluate the quality of such predictions?
In the following let us assume that we have a sequence of binary events, and that each event $i$ has a probability $p_i$ of occurring as a $1$ and $1-p_i$ of occurring as $0$. A predictor gives out predicted probabilities $q_i$, and then events $E_i$ happen. Now what? How would we score the predictions? Equivalently, how would we fairly compensate the predictor?
A simple way to “score” the prediction is to say that for each event we have a “penalty” that is $|E_i - q_i|$, or a score that is $1 - |E_i - q_i|$. For example, the prediction that the correct event happens with 100% probability gets a score of 1, but the prediction that the correct event happens with 85% probability gets a score of .85.
Unfortunately this scoring system is not “truthful,” that is, it does not encourage the predictor to tell us the true probabilities. For example, suppose that a predictor has computed the probability of an event as 85% and is very confident in the accuracy of the model. Then, if he publishes the accurate prediction he is going to get a score of .85 with probability .85 and a score of .15 with probability .15, for an expected score of .85 · .85 + .15 · .15 = .745. So he is worse off than if he had published the prediction of the event happening with probability 100%, in which case the expected score is .85. In general, the scheme makes it always advantageous to round the probability to 0% or 100%.
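To make the arithmetic concrete, here is a minimal Python sketch. The names `expected_linear_score` and `best_published_probability` are my own illustrative choices, and the grid search is just a crude way of looking for the best published value. It assumes the linear score discussed above: a published probability q earns q if the event happens and 1 - q if it does not.

```python
def expected_linear_score(p, q):
    """Expected score when the true probability is p and the published one is q."""
    # With probability p the event happens and the score is q;
    # with probability 1 - p it does not, and the score is 1 - q.
    return p * q + (1 - p) * (1 - q)


def best_published_probability(p, steps=100):
    """Hypothetical helper: grid search for the q that maximizes the expected score."""
    grid = [k / steps for k in range(steps + 1)]
    return max(grid, key=lambda q: expected_linear_score(p, q))


if __name__ == "__main__":
    p = 0.85
    print(expected_linear_score(p, 0.85))  # honest prediction, about 0.745
    print(expected_linear_score(p, 1.0))   # rounded to 100%, about 0.85
    print(best_published_probability(p))   # the grid search picks 1.0
```

Since the expected score is linear in q, the maximum always sits at an endpoint, which is the rounding effect described above.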
Is there a truthful scoring system? I am not sure what the answer is.
If one is scoring multiple predictions of independent events, one can look at all the cases in which the prediction was, say, in the range of 80% to 90%, and see if indeed the event happened, say, a fraction between 75% and 95% of the times, and so on.
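A rough sketch of this bucketing idea in Python follows. The function name `calibration_table`, the choice of ten equal-width buckets, and the sample data in the usage section are all assumptions made for illustration, not an established scoring rule.

```python
from collections import defaultdict


def calibration_table(predictions, outcomes, n_buckets=10):
    """Group predictions into equal-width buckets and report how often the event happened.

    Returns a list of (low, high, count, observed_frequency) tuples,
    one for each non-empty bucket, in increasing order of low.
    """
    buckets = defaultdict(list)
    for q, happened in zip(predictions, outcomes):
        # A prediction of exactly 1.0 goes into the last bucket.
        idx = min(int(q * n_buckets), n_buckets - 1)
        buckets[idx].append(happened)
    table = []
    for idx in sorted(buckets):
        events = buckets[idx]
        low, high = idx / n_buckets, (idx + 1) / n_buckets
        table.append((low, high, len(events), sum(events) / len(events)))
    return table


if __name__ == "__main__":
    # Made-up data: four predictions of 85% and two predictions of 15%.
    predictions = [0.85, 0.85, 0.85, 0.85, 0.15, 0.15]
    outcomes = [1, 1, 1, 0, 0, 0]
    for row in calibration_table(predictions, outcomes):
        print(row)
```

A well-calibrated predictor would show an observed frequency inside, or close to, each bucket's range, up to the noise one expects from a small number of events.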
One disadvantage of this approach is that it seems to require a discretization of the probabilities, which seems like an arbitrary choice and one that could affect the final score quite substantially. Is there a more elegant way to score multiple independent events without resorting to discretization? Can it be proved to be truthful?
Another observation is that such an approach is still not entirely truthful if it is applied to events that happen sequentially. Indeed, suppose that we have a series of, say, 10 events for which we predicted a 60% probability of a 1, and the event 1 happened 7 out of 10 times. Now we have to make a prediction of a new event, for which our model predicts a 10% probability. We may then want to publish a 60% prediction, because this will help even out the “bucket” of 60% predictions.
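The incentive in this example can be checked with a few lines of Python. The helper `expected_bucket_frequency` is a hypothetical name; it simply computes the expected fraction of 1s in a bucket after one more event, with a given true probability, is filed under it.

```python
def expected_bucket_frequency(hits, total, true_p):
    """Expected fraction of 1s in a bucket after adding one event with true probability true_p."""
    return (hits + true_p) / (total + 1)


if __name__ == "__main__":
    # The "60%" bucket currently holds 10 events, 7 of which came out as 1.
    current = 7 / 10
    # Filing the new event, whose true probability is 10%, under the 60% label:
    gamed = expected_bucket_frequency(7, 10, 0.10)
    print(current, gamed)
```

The bucket moves, in expectation, from 0.7 to roughly 0.645, which is closer to the advertised 60%, so the misreported prediction makes the bucket look better calibrated.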
I don’t think that there is any way around the previous problem, though it seems clear that it would affect only a small fraction of the predictions. (The complexity theorists among the readers may remember similar ideas being used in a paper of Feigenbaum and Fortnow.)
Surely the task of scoring predictions must have been studied in countless papers, and the answers to the above questions must be well known, although I am not sure what the right keywords are to search for such work. In computer science, there are a lot of interesting results about using expert advice, but they are all concerned with how you score your own way of picking which expert to trust rather than the experts themselves. (This means that the predictions of the experts are not affected by the scoring system, unlike the setting discussed in this post.)
Please contribute ideas and references in the comments.
From this New York Times article:
Researchers found the home test accurate 99.98 percent of the time for people who do not have the virus. By comparison, they found it to be accurate 92 percent of the time in detecting people who do. [...]
So, while only about one person in 5,000 would get a false negative test, about one person in 12 could get a false positive.
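As a quick check of the quoted figures, here is a small Python sketch (the variable names are mine, and exact fractions are used only to avoid floating-point noise). An accuracy of 99.98 percent among people who do not have the virus means that about one such person in 5,000 gets a wrong result, which is a false positive; an accuracy of 92 percent among people who do have it means that about one in 12 gets a wrong result, which is a false negative. The two labels in the last quoted sentence appear to be swapped.

```python
from fractions import Fraction

# Accuracy for people who do not have the virus, as quoted.
specificity = Fraction("0.9998")
# Accuracy for people who do have the virus, as quoted.
sensitivity = Fraction("0.92")

false_positive_rate = 1 - specificity  # wrong result for someone without the virus
false_negative_rate = 1 - sensitivity  # wrong result for someone with the virus

print(1 / false_positive_rate)         # 5000, i.e. one false positive in 5,000
print(float(1 / false_negative_rate))  # 12.5, i.e. about one false negative in 12
```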
From the New York Times coverage of the Iowa caucuses:
It was the closest race in the history of the Iowa caucuses. In 1980, George Bush beat Ronald Reagan by two percentage points; only a tenth of a percentage point separated Mr. Romney from Mr. Santorum on Tuesday.
They were separated by 8 votes out of more than 120,000 voters.
From an interview with Ed Mango, head of NASA’s commercial crew program, in which he discusses safety requirements for commercial entities that want to subcontract flights to the ISS from NASA:
Chaikin: And the probability of “loss of crew” has to be better than 1 in 1000?
Mango: Yes and no. What we’ve done is we’ve separated those into what you need for ascent and what you need for entry. For ascent it’s 1 in 500, and independently for entry it’s 1 in 500. We don’t want industry … to [interpret the 1-in-1,000 requirement] to say, “We’ve got a great ascent; we don’t need as much descent protection.” In reality we’ve got to protect the life of the crew all the time.
Now [the probability for] the mission itself is 1 in 270. That is an overall number. That’s loss of crew for the entire mission profile, including ascent, on-orbit, and entry. The thing that drives the 1 in 270 is really micrometeorites and orbital debris … whatever things that are in space that you can collide with. So that’s what drops that number down, because you’ve got to look at the 210 days, the fact that your heat shield or something might be exposed to whatever that debris is for that period of time. NASA looks at Loss of Vehicle the same as Loss of Crew. If the vehicle is damaged and it may not be detected prior to de-orbit, then you have loss of crew.
What does “yes” mean in the “yes and no” answer? Also, with a 1/500 probability of accident at takeoff and an independent 1/500 probability of accident at landing, we are already at a 1/250.2 probability of accident, so how do we get to 1/270 after adding accidents in mid-flight?
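The arithmetic behind that objection, as a short Python sketch (the variable names are mine; the two 1-in-500 figures and the independence assumption come from the interview):

```python
from fractions import Fraction

p_ascent = Fraction(1, 500)  # loss of crew during ascent
p_entry = Fraction(1, 500)   # loss of crew during entry, assumed independent

# Probability that at least one of the two phases goes wrong.
p_either = 1 - (1 - p_ascent) * (1 - p_entry)

print(p_either)             # 999/250000
print(float(1 / p_either))  # roughly 250.25, so about 1 in 250
```

Ascent and entry alone already give a probability a bit higher than 1 in 270, before any on-orbit risk is added.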
In not entirely unrelated news, a member of the board of Florida’s 3rd district took, and failed, the Florida Comprehensive Assessment Test (FCAT), a standardized test, as documented in two posts on the Washington Post blog.
“I won’t beat around the bush. The math section had 60 questions. I knew the answers to none of them, but managed to guess ten out of the 60 correctly. On the reading test, I got 62%. In our system, that’s a ‘D,’ and would get me a mandatory assignment to a double block of reading instruction.
“It seems to me something is seriously wrong. I have a bachelor of science degree, two masters degrees, and 15 credit hours toward a doctorate. I help oversee an organization with 22,000 employees and a $3 billion operations and capital budget, and am able to make sense of complex data related to those responsibilities….
Here is a sample of the math portion of the 10th grade FCAT, the most advanced one. Sample question: “An electrician charges a $45 fee to make a house call plus an hourly rate for labor. If the electrician works at one house for 3 hours and charges $145.50 for the job, what is the electrician’s hourly rate?” You can use a calculator.
A few days ago, the Royal Society released a report on “Global Scientific Collaboration in the 21st Century.” Usually, when a document has “in the 21st Century” in the title, it can only go downhill from there. (I once had to review a paper that started with “As we enter the second decade of the 21st Century…” and, hard as it is to believe, it did go further downhill from there.) But the Royal Society is one of the most hallowed of scientific institutions, so one might have still hoped for the best.
The report was widely quoted in the press as predicting that China would overtake the United States in scientific output by 2013.
Indeed, in Section 1.6 (pages 42-43), the report uses data provided by Elsevier to estimate the number of scientific papers produced in various countries. We’ll skip the objection that the number of papers is a worthless measure of scientific output and go to figure 1.6 in the report, reproduced below.
The figure plots the percentage of scientific papers coming out of various countries, and then proceeds to do a linear extrapolation of the percentages to create a projection for the future.
While such an approach shows China overtaking the US in 2013, it also shows, more ominously, China publishing 110% of all scientific papers by 2100. (The report concedes that linear extrapolation might not make a lot of sense, yet the picture is there.)
Deroy Murdock writes about the snow that has been falling in North America (not Vancouver, though!) and Europe in the past three months:
Forty-nine of these 50 United States simultaneously laughed at “global warming.” Absent Hawaii, every state in the Union had measurable snow on Saturday, February 13. An average eight inches covered 68.1 percent of the continental U.S., well above January’s more typical 51.2 percent. Mobile, Ala., saw its first snow in twelve years. Shreveport, La., got 5.4 inches. Dallas residents coped with 12.5 inches of snow — a one-day record.
Florida’s unusually cold coastal waters (in the 40s, in some places) have killed a record 280 manatees through February 12. As CNN reports, manatees like it above 68 degrees. Some now warm themselves in the 78-degree discharge of a Tampa Electric Company power plant.
February 12’s storm followed a Nor’easter that shuttered Washington, D.C.’s federal offices for four days. Through February 14, this winter’s 55.9 inches of snow smothered Washington’s previous record, 54.4 inches, from winter 1898–1899.
How about last month?
“In Europe, snowstorms and sub-zero temperatures severely disrupted air, rail, and road transport,” the U.S. National Ice Center’s Sean R. Helfrich concluded January 12. “The snow events impacted hundreds of millions of people world-wide, with a number of weather-related deaths reported.”
Meanwhile, December’s snow cover was North America’s greatest, and the Northern Hemisphere’s second greatest, in the 44 years measured.
Warmists correctly retort that three frigid months establish no pattern.
If Deroy Murdock had talked to a “warmist,” however, he might have found out that, far from being “frigid,” last month was the hottest January on record according to global satellite measurements (the record goes back 32 years), and the fourth warmest since 1880 according to surface measurements.
“[that life expectancy in Canada is higher than in the USA] is to be expected, Peter, because we have 10 times as many people as you do. That translates to 10 times as many accidents, crimes, down the line.” Bill O’Reilly
A few days ago, Gina Kolata reported in the New York Times on the paradox of studies on sexual behavior consistently reporting (heterosexual) men having more sexual partners than women, with a recent US study reporting men having a median number of 7 partners and women a median number of 4. Contrary to what’s stated in the paper, this is not mathematically impossible (key word: median). It is however quite implausible, requiring a relatively small number of women to account for a large fraction of all men’s partners.
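Here is a toy illustration in Python of why the averages must agree while the medians need not. The population is entirely made up (5 men, 5 women, 13 partnerships) and is not meant to resemble the survey data.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical toy population: men and women are numbered 0..4,
# and each pair (man, woman) is one partnership.
partnerships = {(m, w) for m in range(5) for w in (0, 1)}  # two women are partners of every man
partnerships |= {(0, 2), (0, 3), (0, 4)}                   # the other three women have one partner each

men_counts = Counter(m for m, _ in partnerships)
women_counts = Counter(w for _, w in partnerships)

men = sorted(men_counts[m] for m in range(5))        # [2, 2, 2, 2, 5]
women = sorted(women_counts[w] for w in range(5))    # [1, 1, 1, 5, 5]

print(mean(men), mean(women))      # both 2.6: every partnership is counted once on each side
print(median(men), median(women))  # 2 and 1: the medians differ
```

In this example two of the five women account for 10 of the 13 partnerships, which is the kind of concentration needed to pull the medians apart.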
An answer to this paradox can be found in Truth and consequences: using the bogus pipeline to examine sex differences in self-reported sexuality, by Michele Alexander and Terry Fisher, Journal of Sex Research 40(1), February 2003.
In their study, a sample of men and women are each divided into three groups and asked to fill out a survey on sexual behavior. People in one group filled out the survey alone in a room with an open door, a researcher sitting outside, and after being told the study was not anonymous; people in a second group filled out the survey in a room with a closed door and an explicit assurance of anonymity; people in a third group filled out the survey attached to what they believed to be a working “lie detector.”
In the first group, women reported on average 2.6 partners, men 3.7. In the second group, it was women 3.4 and men 4.2. In the third group, it was women 4.4 and men 4.0.
(The study looks at several other quantities, and some of them have even wider variance in the three settings.)
So, not surprisingly given the sexual double standards in our culture, men and women lie about their sexual behavior (men overstate, women understate), and do less so in an anonymous setting or when the lie is likely to be discovered.
Here is the reporting of the first group put to music:
[Update 8/18/07: so many people must have emailed her about the median versus average issue in the article that Gina Kolata wrote a clarification. Strangely, she does not explain, for the rest of the readers, what the difference is and why it is possible, if unlikely, to have very different medians for men and women. The claim in the clarification, by the way, is still wrong: those 9.4% of women with 15 or more partners could be accounting for all the missing sex.]
In CS70, the Berkeley freshman/sophomore class on discrete mathematics and probability for computer scientists, we conclude the section on probability with a class on how to lie with statistics. The idea is not to teach the students how to lie, but rather how not to be lied to. The lecture focuses on the correlation versus causation fallacy and on Simpson’s paradox.
My favorite way of explaining the correlation versus causation fallacy is to note that there is a high correlation between being sick and having visited a health care professional in the recent past. Hence we should prevent people from seeing doctors in order to make people healthier. Some HMOs in the US are already following this approach.
Today, a post in a New York Times science blog tells the story of a gross misuse of statistics in a Dutch trial that has now become a high-profile case. In the Dutch case two other, and common, fallacies have come up. One is, roughly speaking, neglecting to take a union bound. This is the fallacy of saying ‘I just saw the license plate California 3TDA614, what are the chances of that!’ The other is the computation of probabilities by making unwarranted independence assumptions.
Feynman has written eloquently about both, but I don’t have the references at hand. In particular, when he wrote on his Space Shuttle investigation committee work, he remarked that official documents had given exceedingly low probabilities of a major accident (of the order of one millionth per flight or less), even though past events have shown this probability to be more of the order of 1%. The low number was obtained by summing the probabilities of various scenarios, and the probability of each scenario was obtained by multiplying estimates for the probabilities that the various things that had to go wrong for that scenario to occur would indeed go wrong.
Christos Papadimitriou has the most delightful story on this fallacy. He mentioned in a lecture the Faloutsos-Faloutsos-Faloutsos paper on power law distributions in the Internet graph. One student remarked, wow, what are the chances of all the authors of a paper being called Faloutsos!
Some time ago, the New York Times reported on census data that shows that only a minority of American women are married and living with their husband. Thomas Sowell writes in National Review to complain about the way the Times misleads with statistics. He repeats points made earlier, in the same magazine, by Jennifer Morse. (Namely, that the claim depends on the definition of “woman” and of “living with.”)
But this is part of a pattern, Mr. Sowell writes, because,
Innumerable sources have quoted a statistic that half of all marriages end in divorce — another conclusion based on creative manipulation of words, rather than on hard facts.
The statistic is partly based on the fact that, in recent years, there have been about half as many divorces as marriages in any given year. It is of course not quite correct to project that half of the marriages are going to end in divorce: if the number of people getting married increases with time then, all other things being equal, the ratio of divorces to marriages in a given year underestimates the true fraction of marriages ending in divorce. Conversely, if the number of marriages goes down with time, one has an overestimate. I would suppose, however, that demographers take such trends into account in their models.
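The direction of the bias can be seen in a deliberately oversimplified Python model. Everything here is an assumption made for illustration: marriages per year grow (or shrink) at a constant rate, and every marriage that ends in divorce does so after exactly the same number of years.

```python
def divorce_to_marriage_ratio(true_fraction, growth, delay_years, year=50):
    """Ratio of divorces to marriages in a single year, in a toy model.

    true_fraction: fraction of all marriages that eventually end in divorce
    growth:        yearly growth rate of the number of marriages (negative if shrinking)
    delay_years:   years between a marriage and its divorce (the same for all, by assumption)
    """
    def marriages(t):
        return 1000.0 * (1 + growth) ** t

    # This year's divorces come from the marriages of delay_years ago.
    divorces = true_fraction * marriages(year - delay_years)
    return divorces / marriages(year)


if __name__ == "__main__":
    for growth in (0.0, 0.02, -0.02):
        print(growth, divorce_to_marriage_ratio(0.5, growth, delay_years=10))
```

With no growth the yearly ratio equals the true fraction; with growing numbers of marriages it falls below it, and with shrinking numbers it rises above it.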
Sowell’s objection is, of course, considerably more creative:
The fact that there may be half as many divorces in a given year as there are marriages in that year does not mean that half of all marriages end in divorce.
It is completely misleading to compare all the divorces in one year — from marriages begun years and even decades earlier — with the number of marriages begun in that one year.