Statistical significance

Statistical significance refers to the observation of data that would be presumed unlikely if the null hypothesis were true. It is declared when the probability of observing data at least as "extreme" as the data actually obtained, under the assumption that the null hypothesis is true (the p-value), is below some arbitrary value, represented by a lower-case alpha (α). A p-value below alpha suggests that the data are not consistent with a true null hypothesis. Conventionally, the null hypothesis is then rejected, statistical significance declared, and parties commenced.

The word "significant", in this sense, does not mean "large" or "important" as it does in the everyday use of the word. It just means an effect is observed to be larger than what chance alone would likely cause. Statistically significant effects can, in fact, be very small. Typically, larger sample sizes are required to demonstrate significance of smaller effects.

Basics of statistical significance

 * 1) Start with data collected especially for the question at hand
 * 2) Clearly formulate a null and alternative hypothesis
 * 3) Establish an alpha level appropriate for the study
 * 4) Compute a test statistic and its corresponding probability under the null hypothesis (or use a computer to do it)
 * 5) Compare the p-value to the alpha level
 * 6) Reject the null if the p-value is lower than alpha
 * 7) Don't reject the null if the p-value is higher than alpha
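
The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the coin-flip data are made up, the one-sided exact binomial test is one choice among many, and the helper name `binomial_p_value_upper` is hypothetical.

```python
from math import comb

def binomial_p_value_upper(successes: int, trials: int, p_null: float = 0.5) -> float:
    """P(X >= successes) for X ~ Binomial(trials, p_null): a one-sided p-value."""
    return sum(
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# Steps 1-3: (made-up) data, hypotheses, and an alpha fixed in advance.
heads, flips = 16, 20   # hypothetical observations
alpha = 0.05            # H0: P(heads) = 0.5, H1: P(heads) > 0.5

# Step 4: probability, under H0, of a result at least this extreme.
p_value = binomial_p_value_upper(heads, flips)

# Steps 5-7: compare the p-value to alpha and decide.
reject_null = p_value < alpha
print(f"p = {p_value:.4f}, reject H0: {reject_null}")
```

With these invented numbers the p-value is 6196/1048576, roughly 0.006, so the null hypothesis of a fair coin would be rejected at α=0.05.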

In proper hypothesis testing, alpha is determined prior to collecting data. The choice of alpha should be based on thoughtful contemplation of the risks of drawing the wrong conclusion, but it is commonly set at either 0.05 or 0.01. There is a trade-off between the significance level and statistical power (the probability that the null hypothesis is rejected given that it's false). A low alpha value means that a rejection of the null hypothesis is less likely to be in error, but it also decreases the chance of having such a rejection. Increasing the sample size can increase power without raising alpha.
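
The trade-off can be made concrete with a rough sketch. It assumes a one-sided, one-sample z-test with known unit variance and a true effect of 0.3 standard deviations; the function name and all the numbers are illustrative, not taken from any particular study.

```python
from statistics import NormalDist

def power_one_sided_z(effect_size: float, n: int, alpha: float) -> float:
    """Power of a one-sided one-sample z-test with known unit variance.

    effect_size is the true mean shift in standard-deviation units.
    """
    std_normal = NormalDist()
    critical_z = std_normal.inv_cdf(1 - alpha)   # rejection threshold under H0
    shift = effect_size * n ** 0.5               # mean of the z statistic under H1
    return 1 - std_normal.cdf(critical_z - shift)

# Lower alpha costs power at a fixed n; a larger n buys the power back.
for alpha in (0.05, 0.01):
    for n in (25, 100):
        print(f"alpha={alpha}, n={n}, power={power_one_sided_z(0.3, n, alpha):.3f}")
```

Under these assumptions, dropping alpha from 0.05 to 0.01 lowers the power at any given sample size, while going from 25 to 100 subjects raises it at either alpha.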

In more common statistical approaches (i.e. "frequentist"), statistical significance emerges from the results of hypothesis testing. An alternative hypothesis (that there is an effect) is favoured, and a null hypothesis (that there is not an effect) is rejected, if the experimental evidence shows a significant difference from the null hypothesis. If a significant difference is not present, the null hypothesis is not rejected.

To be clear, statistical significance testing does not prove either hypothesis. Rejecting the null just suggests that the evidence disfavors the null enough for us to jump into the arms of the welcoming alternative hypothesis. Not rejecting the null hypothesis says that either the null hypothesis is probably true or there is not enough evidence to reject it; it does not prove the null hypothesis. Statistical significance is just a way to make a statement about the strength of the evidence.

Alpha value versus p-value
Hypothesis testing consists of formulating a null hypothesis and alternative hypothesis, choosing an alpha value, determining the rejection region, collecting data, calculating a statistic, and evaluating whether the statistic falls in the rejection region. There are four possible results of a hypothesis test: the null hypothesis is true and retained, the null hypothesis is false and rejected, the null hypothesis is true and rejected, and the null hypothesis is false and retained. If the null hypothesis is true, but is rejected, that is a Type I error. If the null hypothesis is false, but retained, that is a Type II error. The probability of a Type I error when the null hypothesis is true is, by definition, equal to the alpha value. The probability of a Type II error generally cannot be calculated, as the alternative hypothesis does not incorporate a known distribution.

If the possible results of an experiment can be ordered from "most likely" (given the null hypothesis) to "least likely", then the actual results can be assigned a value equal to the probability of those results, plus the probability of all "less likely" results. This probability is known as the "p-value". If the p-value is less than the alpha value, the null hypothesis is rejected.

The significance level of the test is determined by the alpha value, which is unaffected by the test results. The only effect the p-value has is that it is either less than the alpha value, and the null hypothesis is rejected, or it is more than the alpha value, and the null hypothesis is retained. A result does not become "more" statistically significant if the p-value is "a lot smaller" than the alpha value, as opposed to being simply "slightly smaller".
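
The "ordered by likelihood" definition of the p-value can be sketched directly for a discrete experiment. The example below assumes a binomial model (12 coin flips, null hypothesis of a fair coin) and made-up data; `p_value_by_likelihood` is a hypothetical helper written for this illustration.

```python
from math import comb

def p_value_by_likelihood(observed: int, trials: int, p_null: float) -> float:
    """Sum the null probabilities of every outcome no more likely than the observed one."""
    pmf = [
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(trials + 1)
    ]
    threshold = pmf[observed] * (1 + 1e-9)   # small tolerance for floating-point ties
    return sum(prob for prob in pmf if prob <= threshold)

alpha = 0.05
p_value = p_value_by_likelihood(observed=2, trials=12, p_null=0.5)

# The decision is binary: only the comparison with alpha matters.
decision = "reject H0" if p_value < alpha else "retain H0"
print(f"p = {p_value:.4f} -> {decision}")
```

Here the outcomes that are no more likely than 2 heads are 0, 1, 2, 10, 11 and 12 heads, giving a p-value of 158/4096 (about 0.039). The decision would be the same "reject" had the p-value been 0.0001.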

Abuse
One abuse of statistics occurs when journalists or certain agenda pushers ignore the concept of significance entirely, leading to false information being given out to people. In 2005, a report commissioned by the UK government concluded that there had been "no significant increase in drug use in UK schools". Not content with the conclusion that "things aren't that bad, actually", a few newspapers jumped on the report and decided to draw their own conclusions. In their frankly amateurish search for something to data-mine after the fact, they noticed that cocaine use in schools went from 1% to 2%. These figures had been rounded off for the summary, though: the actual values were 1.4% and 1.9%, an increase of roughly 36% rather than 100%. They had their smoking gun; despite what the government concluded, cocaine use had doubled, cocaine was flooding the playground, and the government was covering it up. However, the government's conclusion was more accurate, because it took into account significance, clustering, and the fact that the use of many different drugs had been polled. If you test many variables, the probability of one of them showing a clear trend by chance increases, and so tests for significance have to be adjusted appropriately. Once the actual maths was done, the increase was nowhere near significant; it was essentially produced by accident, the random chance that the sample happened to fall on a cluster of individuals using drugs who weren't representative of the whole population.

Problems with statistical significance
The alpha value is usually set at 0.05 or less. This means that, when the null hypothesis is actually true, there is at most a five percent chance of rejecting it by chance alone. There is nothing fundamentally magic about an alpha level of 0.05, yet after many generations of using it in analysis, it seems to have taken on a certain magical value for many sciences. If a statistical test comes back with p=0.04, results are called significant, and if p=0.06, they are called non-significant.

With this standard alpha level, about 1 in 20 results should come back significant when there really is no effect. This does happen frequently, so it is wrong to assume a good p-value means you're completely certain; it's still all about probability. In individual experiments that run many statistical tests, this is a problem; if you run 40 tests where no effect exists, about 2 of them will show an effect that is not really there. This is the multiple comparisons problem: the probability of at least one false positive across the whole set of tests (the family-wise error rate) grows with the number of tests. It is difficult to control for, but some corrections can be used. While it is easy to see this problem in a single set of experiments in a single paper, the same phenomenon emerges if a bunch of single experiments are published in multiple papers. With the thousands of experiments run every day all over the world, a very large number of them will show a statistical significance when there really is no effect at all. Publication bias in journals exaggerates this problem because journals rarely publish experiments that show only a non-effect (i.e., "failed" experiments), and are much more likely to publish papers that show an effect. So one winds up with a massive uncontrolled bias in the published papers towards showing statistical significance where there really is none.
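
The arithmetic behind the "40 tests" example can be checked with a short sketch. It assumes that all 40 null hypotheses are true and that the tests are independent, which real studies rarely guarantee; the function names are illustrative.

```python
def expected_false_positives(num_tests: int, alpha: float) -> float:
    """Average number of significant results when every null hypothesis is true."""
    return num_tests * alpha

def family_wise_error_rate(num_tests: int, alpha: float) -> float:
    """Probability of at least one false positive among independent tests of true nulls."""
    return 1 - (1 - alpha) ** num_tests

print(expected_false_positives(40, 0.05))
print(round(family_wise_error_rate(40, 0.05), 3))
```

Under these assumptions the expected number of false positives is 2, and the chance of getting at least one is about 87%.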

A related problem is the common misunderstanding of what that 0.05 is the probability of. It is often thought to be the probability of a significant result being a fluke, but in fact it is the probability, assuming that all effects are flukes (the null hypothesis is true), of obtaining a significant result anyway. Having obtained a significant result, the probability that it is a fluke after all often remains much higher than the alpha level. An intuition for this can be obtained by noting that if all the null hypotheses tested are true, then 5% of all results will be statistically significant but 100% of statistically significant results will be flukes, by construction.
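
A small sketch shows how far the fluke probability can sit above alpha. The inputs are assumptions chosen purely for illustration: the share of tested hypotheses where a real effect exists (`prior_true_effect`) and the power of the tests are rarely known in practice.

```python
def fluke_probability(prior_true_effect: float, power: float, alpha: float) -> float:
    """Fraction of statistically significant results that are false positives."""
    false_positives = (1 - prior_true_effect) * alpha   # true nulls rejected anyway
    true_positives = prior_true_effect * power          # real effects detected
    return false_positives / (false_positives + true_positives)

# If only 10% of tested effects are real and power is 80%:
print(round(fluke_probability(0.10, 0.80, 0.05), 2))

# The limiting case from the text: every null hypothesis is true.
print(fluke_probability(0.0, 0.80, 0.05))
```

With these made-up inputs, 36% of significant results are flukes even though alpha is 0.05, and when no tested effect is real the figure is 100%.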

Abuse from pseudoscience
This is one reason why picking out a single test in a single paper to make a point is meaningless. It is a common tactic in pseudoscience to search through thousands of papers to find the one significant result that supports the desired point. Real science must be accompanied by the preponderance of evidence, and experimental results need to be replicated repeatedly and reliably before they should be incorporated in the body of accepted knowledge. This is why scientific consensus is important, and why quacks and cranks who go against this consensus do not gain points by finding a single example in a paper that might support their claims.

The problems above are due mostly to the use of frequentist approaches to statistical analysis. There is a growing movement of scientists who are encouraging the use of Bayesian-based statistics. Bayesian approaches are not subject to the same sort of systematic error propagation issues as frequentist approaches. However, they are subject to their own unique sets of issues (although Bayesians will deny that those issues are unique to their approach).

P-value fishing or "p-hacking"
P-value fishing (a.k.a. fishing expedition), more commonly known as "p-hacking", is a pejorative term for a statistical sleight of hand often abused by cranks and those with an agenda to push. There are two common ways to get a statistically significant result that doesn't mean much at all. The first is, in studies with a large number of variables, to run comparisons of all the variables and hope that something comes out significant. Proper methodology dictates that the experimenter choose beforehand which variables are to be compared and apply post-hoc corrections to any further comparisons. In other words, just comparing as many variables as possible will eventually turn up a significant result, though it's likely to be statistical noise. The post-hoc correction either reduces the post-hoc test's alpha level or increases its p-value so that the family-wise error rate (e.g. 0.05, or 1 out of 20) is maintained.
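
As a sketch of what such a correction looks like, the code below adjusts a set of p-values upward using two standard procedures, Bonferroni and Holm. The text does not name a specific correction, so these are swapped in as common examples; the four raw p-values are invented.

```python
def bonferroni_adjust(p_values: list[float]) -> list[float]:
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def holm_adjust(p_values: list[float]) -> list[float]:
    """Holm step-down adjustment: less conservative than Bonferroni, same guarantee."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # indices, smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)         # keep adjusted values monotone
        adjusted[idx] = running_max
    return adjusted

raw = [0.004, 0.03, 0.02, 0.30]   # hypothetical p-values from four comparisons
alpha = 0.05
for name, values in (("raw", raw),
                     ("bonferroni", bonferroni_adjust(raw)),
                     ("holm", holm_adjust(raw))):
    print(name, [round(v, 3) for v in values], sum(v < alpha for v in values))
```

Three of the four invented raw p-values fall below 0.05, but only one survives either correction.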

The second trick is to fish for p-values by cranking up the number of subjects until significance is achieved. Normally, it's good to have more subjects; however, the data should be interpreted in light of that. What often happens with a large subject pool is that even a slight difference in means will become significant, even though the effect size is close to nothing. This is why it's important to look at the effect size in addition to the p-value. Further, if the number of subjects is increased because a first analysis yielded a non-significant p-value, this must be reflected in the analysis, similarly to when multiple analyses are conducted at the same time.
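
The effect of sample size on a negligible difference can be sketched as follows, assuming a two-sided one-sample z-test with a known standard deviation of 1 and a made-up observed difference of 0.02 standard deviations.

```python
from statistics import NormalDist

def two_sided_z_p_value(mean_difference: float, sd: float, n: int) -> float:
    """Two-sided p-value for an observed mean difference, known standard deviation."""
    z = mean_difference / (sd / n ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))

tiny_effect = 0.02   # hypothetical difference, in standard-deviation units
for n in (100, 10_000, 40_000):
    print(n, two_sided_z_p_value(tiny_effect, 1.0, n))
```

The same trivial difference is nowhere near significant with 100 subjects and comfortably "significant" with 40,000, while the effect size has not changed at all.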

A 2011 paper by Joseph P. Simmons et al. showed that, despite psychological researchers' outward commitment to a low rate of false positives through their endorsement of α=0.05, researchers are often able to manipulate the results by choosing when to stop data collection, which variables to measure, and which statistical methods to use. Such practices, when unchecked by journal policies and article reviewers, are likely to have contributed to the replication crisis in psychology.
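
The "choosing when to stop" problem can be illustrated with a small simulation. This is not the procedure from the paper, just a sketch under assumed settings: data are drawn from a standard normal distribution (so the null hypothesis is true), a two-sided z-test is run after every 10 subjects, and collection stops at the first significant result or at 100 subjects.

```python
import random
from statistics import NormalDist

def significant(sample: list[float], alpha: float) -> bool:
    """Two-sided z-test of mean 0 with known unit variance."""
    z = sum(sample) / len(sample) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z))) < alpha

def false_positive_rate(peek: bool, num_experiments: int = 2000, start_n: int = 10,
                        max_n: int = 100, step: int = 10, alpha: float = 0.05,
                        seed: int = 0) -> float:
    """Share of null experiments declared significant, with or without peeking."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_experiments):
        sample = [rng.gauss(0, 1) for _ in range(start_n)]
        while True:
            at_end = len(sample) >= max_n
            # A peeking researcher tests at every look; an honest one only at the end.
            if (peek or at_end) and significant(sample, alpha):
                hits += 1
                break
            if at_end:
                break
            sample.extend(rng.gauss(0, 1) for _ in range(step))
    return hits / num_experiments

fixed_rate = false_positive_rate(peek=False)
peeking_rate = false_positive_rate(peek=True)
print(f"fixed n: {fixed_rate:.3f}, stop when significant: {peeking_rate:.3f}")
```

In this setup the fixed-sample analysis stays near the nominal 5%, while stopping at the first significant look pushes the false positive rate to several times that, even though every individual test uses α=0.05.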

Proposed solutions to the problems
One approach has been to argue that statistics needs to lose its magical status in science as some sort of analogy to a proof, and instead be seen as an argument or a measure of the strength of evidence. P-values are just one piece of the broader picture and should be weighed against other types of evidence. P-values can be reported directly, allowing people to integrate them with other evidence in making their conclusions. If other evidence is weak, maybe a p-value of 0.05 is not convincing, or maybe if all the other evidence is strong, a p-value of 0.1 is good enough. However, this is problematic, as dealing directly with p-values opens up the possibility of a large variety of statistical fallacies, such as multiplying the p-values of two studies to get the "combined" p-value.
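
To see why multiplying p-values is a fallacy, compare the naive product with a legitimate combination rule. The sketch below uses Fisher's method, a standard technique that is not mentioned in the text and that assumes independent studies with p-values in (0, 1]; the two input p-values are invented.

```python
from math import exp, factorial, log

def fisher_combined_p_value(p_values: list[float]) -> float:
    """Fisher's method: -2 * sum(ln p) is chi-squared with 2k degrees of freedom under H0."""
    k = len(p_values)
    half_stat = -sum(log(p) for p in p_values)   # the test statistic divided by 2
    # Closed-form chi-squared survival function, valid for even degrees of freedom 2k.
    return exp(-half_stat) * sum(half_stat**i / factorial(i) for i in range(k))

studies = [0.5, 0.5]   # two hypothetical, thoroughly unimpressive studies
naive_product = studies[0] * studies[1]
combined = fisher_combined_p_value(studies)
print(f"naive product: {naive_product:.3f}, Fisher's method: {combined:.3f}")
```

Two studies with p=0.5 multiply to 0.25, which looks like progress, while the properly combined p-value is about 0.60. Multiplication makes any pile of studies look more convincing simply because every p-value is at most 1.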

Focusing on confidence intervals instead of p-values is thought to provide a more flexible and less arbitrary method to weigh evidence, although this is arguably based on a misconception of what confidence intervals can or cannot say. While some 95% confidence intervals can be interpreted as rejecting, at an alpha of 0.05, the null hypothesis of any value outside the interval, in general the confidence interval itself should not be interpreted as representing all the values in the plausible range, or used to decide whether an estimate of a difference is precise enough to be worth relying on, unless it is known to have been produced by a confidence procedure that has been verified to provide such properties. That it happens to be a confidence interval is irrelevant.
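
One case where the interval-as-test reading does hold is the textbook z-interval for a mean with known standard deviation. The sketch below assumes exactly that setting, with made-up summary numbers, and should not be read as a general property of confidence intervals.

```python
from statistics import NormalDist

def z_confidence_interval(sample_mean: float, sd: float, n: int,
                          confidence: float = 0.95) -> tuple[float, float]:
    """Confidence interval for a mean when the standard deviation is known."""
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z_crit * sd / n ** 0.5
    return sample_mean - half_width, sample_mean + half_width

def two_sided_p_value(sample_mean: float, null_mean: float, sd: float, n: int) -> float:
    """Two-sided z-test p-value for the hypothesis that the true mean is null_mean."""
    z = (sample_mean - null_mean) / (sd / n ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))

mean, sd, n = 1.2, 2.0, 25   # hypothetical summary statistics
low, high = z_confidence_interval(mean, sd, n)
for null_mean in (0.0, 1.0):
    inside = low <= null_mean <= high
    p = two_sided_p_value(mean, null_mean, sd, n)
    print(f"H0 mean={null_mean}: inside 95% CI={inside}, p={p:.4f}")
```

For this procedure a null value lies outside the 95% interval exactly when its two-sided p-value is below 0.05; that equivalence is a property of this particular construction.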