Wednesday, October 31, 2007


Hmm. I may have a bit of a block on this. Let's just try to bull ahead then, and see what happens.

Over on Mark Thoma's Economist's View blog, there were a couple of discussions about a, well, let's call it a "raging debate," albeit one in fairly slow motion. The backstory papers are here:

McCloskey and Ziliak, "The Standard Error of Regressions," Journal of Economic Literature 1996.

Ziliak and McCloskey, "Size Matters: The Standard Error of Regressions in the American Economic Review," Journal of Socio-Economics 2004.

Hoover and Siegler, "Sound and Fury: McCloskey and Significance Testing in Economics," Journal of Economic Methodology, 2008.

McCloskey and Ziliak, "Signifying Nothing: Reply to Hoover and Siegler."

These papers were pulled from an entry on "Significance Testing in Economics" by Andrew Gelman, and there followed two discussions at Economist's View:

"Tests of Statistical Significance in Economics" and later, a response by one of the main players (McCloskey), followed by my arguing with a poster named notsneaky. That led to my essay, "The Authority of Science."

Okay, you are allowed to say, "Yeesh."

So let me boil down some of this. McCloskey published a book in 1985, entitled The Rhetoric of Economics, in which she argued that the term "Statistical Significance" occupied a pernicious position in economics, and some other sciences. The 1996 paper by McCloskey and Ziliak (M&Z) continued this argument, and the 2004 paper documented a quantitative method for illustrating the misuse of statistics that derived from what was, basically, an error in rhetoric: connecting the word "significant" to certain sorts of statistical tests. The forthcoming (to be published in 2008, the link is to a draft) paper by Hoover and Siegler (H&S) finally rises to the bait, and presents a no-holds-barred critique of M&Z. Then M&Z reply, etc.

Any of my readers who managed to slog through my criticisms of the use of the word "rent" (See "Playing the Rent" and subsequent essays) in economics (as in "rent-seeking behavior"), will understand that I start off on the side of the rhetoricians. When a technical subject uses a word in a special, technical sense that is substantially different from its common language use, there is trouble to be had. "Significant" carries the meaning of "important," or "substantial" around with it, but something that is "statistically significant" is simply something that is statistically different from the "null hypothesis" at some level of probability. Often, that level of probability is arbitrarily set to a value like 95%, or two standard deviations, two sigma, which for a normal distribution is about 95.4% if both tails count, or about 98% if only one tail does.
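For anyone who wants to check the two-sigma figure rather than take it on faith, here is a minimal sketch using only Python's standard library. The function name normal_coverage is mine, made up for illustration, and the calculation assumes an exactly normal distribution, which is itself an assumption that real data rarely honors.

```python
import math

def normal_coverage(k):
    """Probability that a normal variate falls within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

# Mass inside the band from -2 sigma to +2 sigma (both tails excluded).
two_sided = normal_coverage(2.0)

# Mass below +2 sigma (only the upper tail excluded).
one_sided = 0.5 * (1.0 + two_sided)

print(f"within +/- 2 sigma: {two_sided:.4f}")
print(f"below +2 sigma:     {one_sided:.4f}")
```

The two numbers differ only in whether one tail or both are treated as "surprising," which is one more choice that tends to get made by convention rather than by thinking about the question at hand.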

(I'll note here that in statistical sampling, one usually uses something like the t-distribution, which only turns into the normal distribution when the number of samples is infinite, so it adds additional uncertainty for the size of the sample. The t-distribution also assumes that the underlying distribution being sampled is normal, which is rarely a good assumption at the levels of reliability that are being demanded, so the assumption train has run off the rails pretty early on).
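A rough simulation shows how much the sample size matters here. This is only a sketch: the function names, the sample sizes of 5 and 100, the trial count, and the seed are all my own choices for illustration, and the draws are taken from an exactly normal population, which is the friendliest possible case. It counts how often a t statistic exceeds the normal distribution's 1.96 cutoff when the null hypothesis is in fact true.

```python
import math
import random
import statistics

def t_statistic(sample, mu=0.0):
    """One-sample t statistic for the hypothesis that the population mean is mu."""
    n = len(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    return (statistics.fmean(sample) - mu) / standard_error

def exceed_rate(n, cutoff=1.96, trials=10000, seed=0):
    """Fraction of null-hypothesis samples of size n whose |t| exceeds the cutoff."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if abs(t_statistic(sample)) > cutoff:
            hits += 1
    return hits / trials

small = exceed_rate(5)    # tiny samples: heavy tails
large = exceed_rate(100)  # larger samples: close to the normal's 5%

print(f"n=5:   {small:.3f} of null samples exceed 1.96")
print(f"n=100: {large:.3f} of null samples exceed 1.96")
```

With small samples the nominal "95%" cutoff is crossed far more often than one time in twenty, which is the extra uncertainty the t-distribution is there to account for.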

But some differences make no difference. Given precise enough measurements, one can certainly establish that one purchased pound of ground beef is actually one and one thousandth of a pound, but no one who purchased it would feel that they were getting a better deal than if they'd gotten a package that was one thousandth of a pound light. We just don't care about that small a difference; some of the beef is going to stick to the package.
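To put numbers on the ground beef, here is a small sketch of a z-test, with every figure invented for illustration: 100 packages, a mean weight of 1.001 pounds, and a known scale-plus-packing standard deviation of 0.002 pounds. The helper names are hypothetical, and treating the standard deviation as known is a simplifying assumption.

```python
import math

def z_for_mean(observed_mean, null_mean, sd, n):
    """z statistic for a sample mean, assuming the standard deviation sd is known."""
    return (observed_mean - null_mean) / (sd / math.sqrt(n))

def two_sided_p(z):
    """Two-sided p value for a z statistic under the normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical numbers: 100 packages averaging 1.001 lb against a labeled 1.000 lb.
z = z_for_mean(observed_mean=1.001, null_mean=1.000, sd=0.002, n=100)
p = two_sided_p(z)
relative_difference = (1.001 - 1.000) / 1.000

print(f"z = {z:.2f}, p = {p:.2e}")
print(f"difference is {relative_difference:.1%} of a pound")
```

The p value comes out far below any conventional threshold, and the difference is still a tenth of a percent of a pound. The statistic measures how sure we are that the difference exists, not whether anyone should care.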

I saw something written recently that referred to something as "statistically reliable," and on the face of it, that would be a much better phrase than "statistically significant," and I will use it hereafter, except when writing about the misused phrase, which I will put in quotes.

So, okay, "statistically significant" is not necessarily "significant." Furthermore, everyone agrees that this is so. But one disagreement is whether or not everyone acts as if this were so. And that is where M&Z's second criticism comes in: that many economics journals (plus some other sciences) simply reject any paper that does not show results at greater than 95% reliability, i.e. the results must be "statistically significant." M&Z say outright that the level of reliability should adapt to the actual importance of the question at hand.

The flip side of this is that, in presenting their work, authors sometimes use "statistically significant" as if it really meant "significant" or "important," rather than just reliable.

Alternately, one can simply report the reliability statistic, the so-called "p value," which is a measure of how likely it is that sampling error alone would produce a result at least as large as the one observed. I have, for example, published results with p values of 10%, meaning that pure coincidence would turn up such a result about one time in 10. I've seen some other results whose reliability was much lower (that is, whose p values were higher), and those are usually given in the spirit of "there might be something here worth knowing, so maybe someone should do some further work."

In fact, reporting results at lower reliability, or using error bars at the single sigma level, is fairly standard practice in some sciences, like physics, chemistry, geology, and so forth. Engineers usually present things that way as well. On the other hand, the vague use of "significant" that M&Z criticize is often found in social sciences other than economics, e.g. psychology and sociology, as well as some of the biological sciences, including, especially, medicine.

It's in medicine where all this begins to get a tad creepy. In one of their papers, M&Z refer to a study (of small doses of aspirin on cardiovascular diseases like heart attack and stroke) as having been cancelled, for ethical reasons, before the results reached "statistical significance." "Ha!" exclaim H&S (I am paraphrasing for dramatic effect). "You didn't read the study, merely a comment on it from elsewhere! In fact, when the study was terminated, the aspirin was found to be beneficial to myocardial infarction (both lethal and non-lethal) at the level of p=0.00001, well past the level of statistical significance! It was only stroke deaths and total mortality that had not reached the level of p=0.05!"

Well, that would surely score points in a high school debate, but let's unpack that argument a bit. M&Z say that the phrase "statistically significant" is used as a filter for results, and what do H&S do? They concentrate on the results that were found to be statistically reliable at a high level. How about the stroke deaths? What was the p value? H&S do not even mention it.

(As an aside, I will note that the very concept of a p value of 0.00001 is pretty ridiculous. Here we have an example of the concept of statistical reliability swamping actual reliability. The probability of any distribution perfectly meeting the underlying statistical assumptions of the t-distribution is indistinguishable from zero, and the likelihood of some other confounding factor intervening at a level of more than once per hundred thousand is nigh onto one).

Furthermore, H&S use a little example involving an accidental coincidence of jellybeans seeming to cure migraines to show why one must use "statistical significance." Then, when discussing the aspirin study, they invoke the jellybean example. On the face of it, this looks like they are equating migraines with heart attacks and strokes, again, completely ignoring the context in which samples are taken, in order to focus on the statistics. In many ways, it looks like H&S provide more in the way of confirming examples of M&Z's hypothesis than good arguments against it.

Also consider what H&S are saying about the aspirin study, that there was a period of time when members of the control group were dying, when the statistical reliability of the medication had been demonstrated, but the study had yet to be terminated. Possibly the study did not have an ongoing analysis, and depended upon certain predetermined analysis and decision points. But how would such points be selected? By estimating how long it would take for the numbers to be "statistically significant?"

Some studies definitely do use ongoing statistical analyses. Are there really studies where a medication has been shown to be life-saving, to a statistical reliability of 90%, where patients are still dying while the analysts are waiting for the numbers to exceed 95%? How about cases where medications are found to have lethal side effects, but remain on the market until the evidence exceeds "statistical significance?"

The blood runs a little cold at that, doesn't it?


black dog barking said...

Yeesh. My ignorance of the technology of statistics is stunning. It does help me to know how the "t" and "p" things are used to transform # into idea, I feel informed.

From my vantage it looks like an anecdotally significant portion of We craves binary resolution -- is or ain't: choose. Under that banner "statistical significance" provides a way to mechanically reduce analog reality to a digital label.

I've learned to tread carefully around decimals like 0.00001, sniff before stepping. If my arithmetic is correct GPS coordinates at that level label individual square meters on the earth's surface.

Finally, this for "The Duck as Spirit Guide" -- Groucho in that clip triggered a strong associative link to Frank Zappa. Coincidence? Zappa tribute?

(Captcha code: dukwzi. Duck was I?)

James Killus said...

Oh Zappa. Sometime maybe I can write about Zappa, but there's so much to say, and so much of it defies analysis.

Sven DiMilo said...

Oy, the magical .05. It seems a shame that this conversation has to go on independently in every separate science and wanna-be-science. In ecology, especially wildlife ecology, there is a long tradition of decrying the use of hypothesis testing with arbitrary "significance" criteria (some pertinent references below for the curious), but the approach is still used all the time. I use a lot of ANOVA and ANCOVA in my work, and I've taken to just reporting the P values that SYSTAT spits out and letting readers draw their own conclusions of reliability.

Yoccoz, N.G. 1991. Use, overuse, and misuse of significance tests in evolutionary biology and ecology. Bull. Ecol. Soc. Am. 72: 106-111

Johnson, D. H. 1999. The insignificance of statistical significance testing. Journal of Wildlife Management 63:763-772.

Anderson, D. R., K. P. Burnham and W. L. Thompson. 2000. Null hypothesis testing: problems, prevalence, and an alternative. Journal of Wildlife Management 64:912-923.

Guthery, F. S., J. J. Lusk, and M. J. Peterson. 2001. The fall of the null hypothesis: liabilities and opportunities. Journal of Wildlife Management 65:379-384.

Robinson, D. H., and H. Wainer. 2002. On the past and future of null hypothesis significance testing. Journal of Wildlife Management 66:263-271.

James Killus said...

Let's not leave out the "grand data group grope," where you take everything you've got and regress it against everything you've got. Then the naif who does this thinks that the 20 "significant" correlations, out of the 400 that he's run, actually means something.

Of course sometimes it's 2100 out of 40,000, and the slightly less naive fellow thinks that there must be a hundred real results there. And maybe there are. Needle, meet haystack.
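The arithmetic behind those numbers is short enough to sketch. The function names are made up for illustration, and the calculation assumes every test is run at the conventional 0.05 level and that, in the worst case, none of the underlying relationships is real.

```python
def expected_false_positives(num_tests, alpha=0.05):
    """Expected count of 'significant' results if every null hypothesis is true."""
    return num_tests * alpha

def surplus_over_chance(num_significant, num_tests, alpha=0.05):
    """How many 'significant' results exceed what chance alone would deliver."""
    return num_significant - expected_false_positives(num_tests, alpha)

small_grope = surplus_over_chance(20, 400)      # the naif's 20 out of 400
large_grope = surplus_over_chance(2100, 40000)  # the less naive 2100 out of 40,000

print(f"400 tests:    {small_grope:.0f} more than chance would give")
print(f"40,000 tests: {large_grope:.0f} more than chance would give")
```

Even when the surplus is real, the calculation says nothing about which of the 2100 are the real ones, which is the haystack part.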

As you may guess, I'm of the "just give the damn p value" school of thought.
