Testing Without Theory

May 30, 2012

From time to time I run across a finding in the medical literature along the lines of “Coffee Causes Bladder Cancer.” Or was it “Coffee Prevents Bladder Cancer”? Oops, maybe it wasn’t coffee at all. Maybe it was broccoli. Or cashew nuts.

I rarely report these results at my blog, with the possible exception of vitamin studies, and I even regret reporting on those.

Many of these studies have the same basic problem: They involve testing without theory. Give me one group of people who drink coffee, another group that abstains and, say, several hundred health and demographic variables and I can almost guarantee you that coffee drinking (or not drinking) will correlate with something. It will probably correlate with 4 or 5 things.
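
To see how easily this happens, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the 1,000 people, the 300 variables, the coin-flip coffee habit and the function name are all hypothetical, and every variable is pure noise that has nothing to do with coffee. It simply counts how many noise variables still clear the usual 5% significance bar.

```python
import random
import statistics


def count_spurious_links(n_people=1000, n_variables=300, seed=0):
    """Count pure-noise variables that look 'significantly' related to coffee drinking."""
    rng = random.Random(seed)
    # Randomly label each person as a coffee drinker (True) or an abstainer (False).
    drinks_coffee = [rng.random() < 0.5 for _ in range(n_people)]
    hits = 0
    for _ in range(n_variables):
        # Each variable is random noise, generated independently of coffee drinking.
        values = [rng.gauss(0.0, 1.0) for _ in range(n_people)]
        drinkers = [v for v, d in zip(values, drinks_coffee) if d]
        abstainers = [v for v, d in zip(values, drinks_coffee) if not d]
        # Two-sample z statistic for the difference in group means.
        diff = statistics.fmean(drinkers) - statistics.fmean(abstainers)
        std_err = (statistics.variance(drinkers) / len(drinkers)
                   + statistics.variance(abstainers) / len(abstainers)) ** 0.5
        if abs(diff / std_err) > 1.96:  # roughly the two-sided 5% threshold
            hits += 1
    return hits


if __name__ == "__main__":
    print("Noise variables 'linked' to coffee:", count_spurious_links())
```

With 300 unrelated variables, about 15 "links" are expected on average, although the exact count depends on the random seed.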

The literature on spurious correlation has a number of entertaining examples of this. In one study (described here), the prices of a select list of NYSE stocks rose 87 percent of the time when the temperature reading fell at a weather station on Adak Island, Alaska. The authors note that with 3,315 stocks to choose from, chance alone ensured that some would be correlated with the temperature measurements.

As for statistical significance, remember what testing at the 95% confidence level means. It means that even when no real relationship exists, about 5% of tests will turn up an apparent one by random chance alone. If you have thousands of researchers mining thousands of data sets, they are almost guaranteed to find many spurious relationships and, unfortunately, they will get them published in peer-reviewed journals as scholarly papers. The results will then appear in daily newspapers (what editor can resist a finding that coffee causes or prevents any malady?), and the public will be sorely misled.
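
The arithmetic behind that claim is short. The sketch below assumes independent tests, each run at a 5% threshold, with no real relationship behind any of them. Those are simplifying assumptions of mine, not a description of any particular study, and the function names are made up for illustration.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Average number of 'significant' results when no real relationship exists."""
    return n_tests * alpha


def prob_at_least_one_false_positive(n_tests, alpha=0.05):
    """Chance that at least one of n independent null tests comes up 'significant'."""
    return 1.0 - (1.0 - alpha) ** n_tests


if __name__ == "__main__":
    for n in (1, 14, 100, 1000):
        print(n, expected_false_positives(n), prob_at_least_one_false_positive(n))
```

Under these assumptions, it takes only 14 tests for a false positive to become more likely than not.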

What brings all this to mind is a Wall Street Journal article about two studies published in 2010 that examined whether the oral bisphosphonates commonly prescribed for osteoporosis increase the risk of esophageal and gastric cancer. They came to opposite conclusions despite using the same database.

Which conclusion was correct? Who can say? As a researcher from the National Institute of Statistical Sciences put it, “There is enough wrong with both papers that we can’t be sure.”

The Journal article is focused on the difference between “randomly controlled clinical trials” and “observational studies,” which analyze previously gathered data. The author, Gautam Naik, regards the former as the “gold standard” for testing and apparently considers the latter technique suspect. Since observational studies are easier to do and less expensive, there are more of them. In fact, over the past decade there were 263,557 such studies reported in 11,600 peer reviewed journals, worldwide. Naik explains the problem as follows:

[O]bservational studies in general can be replicated only 20% of the time, versus 80% for large, well-designed randomly controlled trials, says Dr. Ioannidis. Dr. Young, meanwhile, pegs the replication rate for observational data at an even lower 5% to 10%.

Whatever the figure, it suggests that a lot more of these studies are getting published. Those papers can often trigger pointless follow-on research and affect real-world practices.

But hold on. Randomly controlled trials can be replicated only 80% of the time? So doctors who rely on the “gold standard” in treating their patients will be wrong one out of every five times?

In reality, they might be wrong more often than that. Groups from pharmaceutical companies and biotech venture capital firms have reported difficulty reproducing “foundational” research from academic labs. All of these groups have an interest in assessing the quality of academic reports before they invest millions of dollars in trying to translate seemingly promising research into something physicians can use to benefit patients. According to Bruce Booth of Atlas Venture, “the unspoken rule is that at least 50% of the studies published even in top tier academic journals…can’t be repeated with the same conclusions by an industrial lab. In particular, key animal models often don’t reproduce.”

Support for Booth’s assertion comes from C. Glenn Begley of Amgen and Lee Ellis, an M.D. Anderson Cancer Center researcher. They published a March 2012 paper in Nature calling for higher standards in preclinical cancer research. Of 53 papers chosen as “landmark” studies, only 6 had results that were reproducible by Amgen researchers. The authors note that some of the irreproducible preclinical papers had spawned entire fields of literature with hundreds of papers expanding on elements of the original observation. Worse, some even triggered a “series of clinical trials.” In 2011, researchers at Bayer published broadly similar findings.

Since my background is economics, let me say for the record that almost none of these studies would ever be accepted for publication in an economics journal. The reason? They almost all involve testing without theory. Economics journals usually don’t publish results showing random “links” between variables, even when the relationship is statistically significant. Instead, authors are usually required to have a defensible theory about why a relationship might be expected to exist, and to derive testable implications from that theory. If the theory survives one test, it will generally be subjected to more tests. If it fails several empirical tests, it will generally be discarded.

Here, for example, is a simple theory of cancer (which may be right or wrong). Cancer susceptibility begins with genes. If you have a parent or grandparent who experienced a certain type of cancer, you are more likely to get the same cancer. But maybe your risk of a specific cancer is also heightened if you have a family history of some other type of cancer. Environment and your behavior with respect to that environment also matter. More education and more income enhance your ability to avoid cancer risks. So the more educated you are and the higher your income, the lower your risk.

There. That’s a theory with some plausibility. I believe it’s probably consistent with a lot of evidence. Now let’s take up the question of coffee drinking. It’s not enough to find a difference in cancer incidence between the drinkers and the nondrinkers. My theory requires me to also adjust for family history, education, income, etc.

Anyone who has ever done regression analysis knows that adding or dropping a variable can cause the coefficient on some other variable to change sign (to go from positive to negative, for example) or to go from “significant” to “not significant,” or vice versa.
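
A small made-up example shows the sign change. In the sketch below, all numbers are invented for illustration: income raises coffee drinking and lowers a hypothetical risk score, while coffee itself is given a small positive effect (0.5) on that score. None of this is real data or a claim about coffee. Leaving income out of the regression makes coffee look protective, and putting it back in recovers the positive coefficient.

```python
import random


def cov(a, b):
    """Sample covariance of two equal-length lists (cov(a, a) is the variance)."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)


def slope_simple(x, y):
    """Coefficient on x in a least-squares regression of y on x alone."""
    return cov(x, y) / cov(x, x)


def slope_controlling(x, z, y):
    """Coefficient on x in a least-squares regression of y on both x and z."""
    sxx, szz, sxz = cov(x, x), cov(z, z), cov(x, z)
    sxy, szy = cov(x, y), cov(z, y)
    return (sxy * szz - szy * sxz) / (sxx * szz - sxz ** 2)


def simulate(n=5000, seed=1):
    """Return (coffee coefficient without income, coffee coefficient with income)."""
    rng = random.Random(seed)
    # Hypothetical data: income drives coffee drinking and lowers the risk score.
    income = [rng.gauss(0.0, 1.0) for _ in range(n)]
    coffee = [i + rng.gauss(0.0, 1.0) for i in income]
    risk = [0.5 * c - 2.0 * i + rng.gauss(0.0, 1.0) for c, i in zip(coffee, income)]
    return slope_simple(coffee, risk), slope_controlling(coffee, income, risk)


if __name__ == "__main__":
    without_income, with_income = simulate()
    print("Coffee coefficient, income omitted: ", without_income)
    print("Coffee coefficient, income included:", with_income)
```

In this constructed case the first coefficient comes out negative and the second positive. Which of the two a paper reports depends entirely on which variables the researcher chose to include, and that is the choice a theory is supposed to discipline.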

Even when you are testing with a plausible theory you can find spurious correlations. But testing without theory almost guarantees it.