The traps of statistical significance

Feb 15, 2018

“Were the results statistically significant?” Whether we’re looking at polling numbers, outcomes for a curriculum for elementary school students, or marketing data, this is almost always the first question asked. In management and economics journals, statistical significance is often the only measure of whether someone has found something meaningful in the data. When applied correctly, statistical significance is undoubtedly a useful tool—but as I’ll explain, it also has many shortcomings.

People misunderstand what statistical significance really is

We constantly hear people evaluating studies on the basis of statistical significance, believing it to be proof of a study’s legitimacy. In fact, statistical significance is merely a test of how well a sample stands in for the larger population it was drawn from. It says that a result at least as extreme as the one observed in the sample would be very unlikely to arise from chance variation in sampling alone.

It’s important to understand that significance testing doesn’t tell us what’s true; at best, it gives us grounds to reject what is probably false. A statistically significant result tells us that the observed data would be unlikely if the “null hypothesis” (most often formulated as the opposite of the hypothesis being tested, such as “there is no effect”) were true, so we reject it. But it doesn’t tell us whether the proposed hypothesis is true.
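To make that logic concrete, here is a minimal Python sketch of a significance test for a coin we suspect is unfair. The scenario (100 flips, 60 heads) and the helper name `two_sided_p_value` are illustrative assumptions, not data from any real study. The null hypothesis is that the coin is fair, and the p-value is the probability of seeing a result at least as lopsided as ours if that were true.

```python
from math import comb


def two_sided_p_value(heads: int, flips: int) -> float:
    """Exact two-sided p-value under the null hypothesis of a fair coin.

    Hypothetical helper for illustration: it adds up the probability of
    every outcome at least as far from an even split as the one observed.
    """
    # Fold the observation into the upper tail (60 heads and 40 heads are
    # equally far from 50/50).
    extreme = max(heads, flips - heads)
    # Under a fair coin every sequence of flips is equally likely, so the tail
    # probability is a count of sequences divided by 2**flips.
    upper_tail = sum(comb(flips, k) for k in range(extreme, flips + 1)) / 2 ** flips
    # Double it to cover the mirror-image lower tail, capping at 1.
    return min(1.0, 2 * upper_tail)


p = two_sided_p_value(heads=60, flips=100)
print(f"p-value for 60 heads in 100 flips: {p:.4f}")
```

A small p-value says only that the data would be surprising if the coin were fair. It says nothing about how unfair the coin is, or about whether any particular alternative explanation is correct.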

Even experts confuse real-world significance with statistical significance

Stephen T. Ziliak and Deirdre N. McCloskey’s book, The Cult of Statistical Significance, points out that confusion between real-world significance and statistical significance is pervasive even among experts. After examining all of the articles published in the 1980s and 1990s in The American Economic Review (AER), one of the most highly regarded economics journals, they concluded that at least 70% of authors conflated real-world significance with statistical significance.

The standard threshold for statistical significance was arbitrarily decided on

The most commonly used threshold for significance was chosen arbitrarily by the genius statistician and consummate pitchman, Ronald A. Fisher. He decided that we needed an “objective” measure of significance and settled on a significance level of .05, meaning that a result at least as extreme as the one observed would turn up by chance alone 5% of the time or less. However, sometimes we don’t need that degree of certainty when we’re studying a problem. Indeed, if a significance test does not reach that level, it does not render the results meaningless; it just means we have less confidence that the correlation we find is not due to chance. It’s foolish to abandon studies simply because they don’t reach the .05 level of significance.
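A short Python sketch shows how sharp and arbitrary the cutoff is in practice. It reuses an invented coin-flip scenario (100 flips of a coin that is fair under the null hypothesis); the names `two_sided_p_value` and `verdict` are hypothetical helpers, not part of any library. Three nearly identical results are compared against the same .05 threshold.

```python
from math import comb

ALPHA = 0.05  # the conventional threshold discussed above


def two_sided_p_value(heads: int, flips: int) -> float:
    """Exact two-sided p-value under the null hypothesis of a fair coin."""
    extreme = max(heads, flips - heads)
    upper_tail = sum(comb(flips, k) for k in range(extreme, flips + 1)) / 2 ** flips
    return min(1.0, 2 * upper_tail)


def verdict(heads: int, flips: int, alpha: float = ALPHA) -> str:
    """Apply the all-or-nothing rule: at or below alpha counts as significant."""
    return "significant" if two_sided_p_value(heads, flips) <= alpha else "not significant"


# One extra head out of 100 is enough to move a result across the line.
for heads in (59, 60, 61):
    p = two_sided_p_value(heads, 100)
    print(f"{heads} heads: p = {p:.4f} -> {verdict(heads, 100)}")
```

The evidence against a fair coin barely changes from one row to the next, but the label flips. Choosing a different threshold, such as .01 or .10, would move the line without changing anything about the data.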

Missing real-world significance

Forming conclusions based on statistical significance alone can be harmful, because even large data sets can miss real-world significance. Imagine a large randomized drug trial. If the treatment group is large enough, a few extra deaths in that group won’t rise to the level of statistical significance. Consequently, the drug company can declare that its drug is safe, even though those deaths plainly matter in the real world.
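The drug-trial scenario can be sketched in Python. The trial numbers below (5 deaths among 10,000 treated patients versus 1 among 10,000 controls) are invented purely for illustration, and `fisher_exact_two_sided` is a hypothetical helper written for this sketch. It implements Fisher’s exact test, one common choice for comparing rare events between two groups; a real trial analysis would likely use an established statistics package and a pre-specified method.

```python
from math import comb


def fisher_exact_two_sided(deaths_treated: int, n_treated: int,
                           deaths_control: int, n_control: int) -> float:
    """Two-sided Fisher's exact test for a difference in death rates.

    Null hypothesis: the drug has no effect, so the deaths are spread across
    the two groups as if assigned at random.
    """
    total = n_treated + n_control
    total_deaths = deaths_treated + deaths_control
    denominator = comb(total, n_treated)

    def prob(k: int) -> float:
        # Hypergeometric probability that exactly k of the deaths fall in
        # the treatment group.
        return comb(total_deaths, k) * comb(total - total_deaths, n_treated - k) / denominator

    lowest = max(0, total_deaths - n_control)
    highest = min(total_deaths, n_treated)
    observed = prob(deaths_treated)
    # Sum every split of the deaths that is no more likely than the observed
    # one (small tolerance guards against floating-point ties).
    p_value = sum(p for p in (prob(k) for k in range(lowest, highest + 1))
                  if p <= observed * (1 + 1e-9))
    return min(1.0, p_value)


# Invented numbers: five times as many deaths in the treatment group.
p = fisher_exact_two_sided(deaths_treated=5, n_treated=10_000,
                           deaths_control=1, n_control=10_000)
print(f"p-value: {p:.3f}")
```

In this made-up trial the death rate in the treatment group is five times that of the control group, yet the result lands well above the .05 threshold. “Not statistically significant” here reflects how rare the event is; it is not evidence that the drug is safe.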

Consider another example. The 1994 Crime Bill (often known as the “three strikes” law) was devised to deter crime. The research that supported the policies in the bill relied almost solely on statistical significance as a measure of whether the bill would lead to a drop in crime.

We now know that this bill was flawed in many ways, not least because there is only one group that all researchers can agree did not go on to commit more crimes: the portion of the studied population who were still incarcerated. Put another way, in light of the already dropping crime rates in 1994, the law reduced crime by almost exactly the amount that would have been committed by the people who were still in prison. The law was intended to deter people from committing crime in the first place, but in practice the harsher penalties and mandatory minimums were a deterrent only insofar as they kept people who were incarcerated from committing crimes. An utter failure.

We still need statistical significance

Despite the misuse of the term, statistical significance is still a valuable tool. As statistics become more pervasive in public discourse, we need to know how confident we can be that a sample from a large dataset is representative of the whole. The tricky part is that almost anyone, no matter how much training they have had in statistics, is vulnerable to two big errors in thinking: (1) taking high correlation to imply causation, and (2) confusing statistical and real-world significance.