Anyone who has taken an introductory psychology, social science, or statistics class has heard the old adage, “correlation does not imply causation.” The rule warns that just because two trends appear to fluctuate in tandem, that shared variation is not enough to prove that one meaningfully drives the other.
While this sounds nice enough on paper, it is easy to forget when a provocative headline like “Does This Ad Make Me Fat?” tricks us into believing a scenario we wish to be true. It would be ideal if we could blame America’s obesity problem on a common annoyance surrounding us every day: excessive advertising.
However, this causality assumption, that any link between obesity and advertising exists because more obesogenic advertising causes higher rates of obesity, remains unjustified. In fact, as the article discusses, the reverse scenario could be equally true. If vendors of obesogenic food products believe obese people are more likely than non-obese people to buy their products, they would naturally, and logically, place more ads in areas where obese people already live.
The fact that the causal conclusion may coincide with a humane and moral belief, for example, that it is wrong to tempt people who overeat by showing them ads for obesogenic food, does not make it a valid assumption. As with any logical fallacy, however, identifying that the reasoning behind an argument is flawed does not imply that the resulting conclusion is false.
Correlation does not imply causation, but it sure can provide a hint of the underlying behavior. Blindly asserting that correlation does not imply causation as a conversation-ending retort can be dangerous, even Orwellian. For instance, is zero correlation not evidence against causation?
In the world of Orwell’s 1984,
“To the end of suppressing any unorthodoxy, the [ruling] Party inculcates self-deceptive habits of mind to the inner and outer members, thus crimestop (“preventive stupidity”) halts thinking at the threshold of politically-dangerous thought.”
Correlation can define a rigid frame for the causality question, pointing us toward the workings of reality worth investigating, where we might even obtain new layers of knowledge. It helps us go from seeing things to understanding them, and it should not be dismissed with an overused statistical cliché.
With this understanding established, let us examine a few pointed examples of the fallacy. There are many hilarious situations in which two events coincide, or happen in succession, yet neither is the direct cause of the other.
To illustrate how easy it is to fool yourself with false correlations between independent data, we reproduced and updated an example by Professor J.H. McDonald of the University of Delaware from his Handbook of Biological Statistics. We looked at the correlation between the winter population of moose on Isle Royale and the number of strikeouts thrown by Major League Baseball teams the following season, using ten years of data, 2002–2011. We ran the test separately for each baseball team, so there were 30 statistical tests.
The null hypothesis, naturally, is that these two variables are not correlated. We are fairly confident that the null hypothesis is true, since we cannot think of anything that would affect both moose abundance in the winter and strikeouts the following summer. With 30 baseball teams, then, we would expect the P-value to fall below 0.05 by chance for about 5% of the teams (0.05 × 30 = 1.5, or around one or two).
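This expectation is easy to check by simulation. Below is a minimal sketch, using invented random data rather than the real moose or strikeout numbers, and a permutation test in plain NumPy standing in for the parametric P-value we used. It correlates one random "moose" series against 30 independent random "strikeout" series and counts how many tests come up significant at 0.05:

```python
import numpy as np

rng = np.random.default_rng(0)

def perm_pvalue(x, y, n_perm=2000):
    """Two-sided permutation p-value for the Pearson correlation of x and y."""
    r_obs = np.corrcoef(x, y)[0, 1]
    perm_rs = np.array([np.corrcoef(x, rng.permutation(y))[0, 1]
                        for _ in range(n_perm)])
    return float(np.mean(np.abs(perm_rs) >= abs(r_obs)))

n_teams, n_years, alpha = 30, 10, 0.05
moose = rng.normal(size=n_years)  # stand-in for the winter moose counts
# one independent noise series per team, so the null hypothesis is true by construction
pvals = [perm_pvalue(moose, rng.normal(size=n_years)) for _ in range(n_teams)]
false_positives = sum(p < alpha for p in pvals)

print(f"expected by chance: {alpha * n_teams:.1f} of {n_teams} teams")
print(f"flagged in this simulation: {false_positives}")
```

Even though every "team" here is pure noise, a run will typically flag a team or two, which is exactly the one-or-two expectation above.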
Here is a link to the Isle Royale data, and here is a link to the baseball data we used.
Surprisingly, the P-value falls below 0.05 for 6 teams, with a 7th right on the borderline. This means that if you were silly enough to test the correlation between moose numbers and strikeouts for your favorite team, you would have almost a 25% chance (7 of 30 teams) of convincing yourself there was a relationship between the two. Some of the correlations actually look pretty good!
For example, strikeout numbers by the Arizona Diamondbacks and moose population numbers have an R^2 of 0.72 and a P-value of 0.0018:
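For readers who want to reproduce numbers like these, the R^2 reported above is simply the square of the Pearson correlation coefficient r. A minimal sketch with NumPy, using invented ten-year series rather than the actual Isle Royale or Diamondbacks data:

```python
import numpy as np

# Hypothetical ten-year series, values invented for illustration only;
# these are NOT the real moose counts or Diamondbacks strikeout totals.
moose = np.array([900, 1100, 750, 540, 450, 385, 450, 650, 530, 515], float)
strikeouts = np.array([1303, 1291, 1153, 1038, 1088, 1111, 1161, 1198, 1250, 1216], float)

r = np.corrcoef(moose, strikeouts)[0, 1]  # Pearson correlation coefficient
r_squared = r ** 2                        # the R^2 reported in the text is r squared
print(f"r = {r:.3f}, R^2 = {r_squared:.3f}")
```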
Perhaps not coincidentally, many websites have cropped up recently to pay homage to these misleading correlations as the Internet feeds its obsession. Our personal favorite is the aptly named Spurious Correlations by Tyler Vigen, or more accurately, by Tyler Vigen’s software. He wrote code that scans public data sets for correlations and pumps out often hilarious graphs in response. Here are a couple of our favorites that he discovered:
Who knew that the number of people who drowned by falling into a swimming pool correlates with the number of films Nicolas Cage appeared in that year?
And who could anticipate that the number of letters in the winning word of the Scripps National Spelling Bee correlates with the number of people killed by venomous spiders that year?
The answer is nobody, which is precisely the point, because these crazy factoids are only related by chance… right? Maybe just to be safe, Nicolas Cage should stop appearing in movies, Scripps should only offer short words for the final round, and MLB teams should encourage a high moose population on Isle Royale. You know, to save lives and all.
What do you think? Is there value in unearthing such a myriad of correlations, and should we stop blindly repeating the famous statistical mantra that correlation does not imply causation? What steps should we take to improve initial hypotheses about a data set?
As always, our code can be found on GitHub.
The SaberSmart Team