Understanding p-Values

We see p-values reported all the time in manuscripts.  We know p<0.05 is somehow magic, and we know that it has something to do with probability and we think that’s why ‘p’ is used.  But what are they really, and what’s so magic about p<0.05?

Simply stated, a p-value is the probability of observing data at least as extreme as what was actually observed, given some assumption.

For statistical tests on a single sample, like those that test for normality, the p-value reports the probability that the observed sample (or one that is more extreme) would be seen, assuming the sample came from a population with the distribution being tested for (in this case, normal).
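
As a rough illustration (mine, not the author's), here is a minimal sketch in Python, assuming scipy and numpy are available and using the Shapiro–Wilk test as one example of a normality test; the data are invented:

```python
# Minimal sketch of a one-sample normality test (assumed: scipy/numpy available).
# The sample below is made up purely for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=10.0, scale=2.0, size=30)  # hypothetical measurements

# Shapiro-Wilk tests the null hypothesis that the sample came from a normal population.
statistic, p_value = stats.shapiro(sample)
print(f"Shapiro-Wilk p = {p_value:.3f}")
# A large p-value means a sample this "non-normal" (or more so) would be unsurprising
# if the population really were normal, so we have no evidence against normality.
```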

For statistical tests comparing two or more samples, the p-value reports the probability of observing the difference between the samples (or a difference even more extreme), assuming they were all actually drawn from the same population.
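
Again as an illustrative sketch (not from the original post), a two-sample comparison with an independent t-test in scipy, on simulated groups; Welch's version is used here, but that choice is mine:

```python
# Minimal sketch of a two-sample comparison (assumed: scipy/numpy available).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(loc=10.0, scale=2.0, size=25)  # hypothetical control group
group_b = rng.normal(loc=11.0, scale=2.0, size=25)  # hypothetical treatment group

# Welch's t-test: the null hypothesis is that both groups share the same population mean.
statistic, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"t = {statistic:.2f}, p = {p_value:.3f}")
# The p-value is the probability of a difference at least this large
# if the two groups really were drawn from the same population.
```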

Why p<0.05?

When we say that we will accept ‘p<0.05’ as statistically significant, we are saying that we will accept a risk of 5% or less of rejecting the null hypothesis (usually, that there is no difference between the sampling distributions) when it is in fact true.  More plainly, we’re accepting a 5% chance of concluding there is a difference when there isn’t one.  This is what we refer to as the alpha risk or Type I error risk.  (One way to keep Type I/alpha and Type II/beta straight: I and alpha come first, and concluding there is a difference when there isn’t one is generally “worse” than concluding there is no difference when in fact there is.)
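
One way to see that 5% Type I error rate in action is a small simulation (my sketch, not part of the original post): draw both samples from the same population many times and count how often p falls below 0.05.

```python
# Sketch: estimate the Type I error rate by repeatedly testing two samples
# that really do come from the same population (assumed: scipy/numpy available).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_trials = 10_000
false_positives = 0

for _ in range(n_trials):
    a = rng.normal(loc=0.0, scale=1.0, size=20)
    b = rng.normal(loc=0.0, scale=1.0, size=20)  # same population as a
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        false_positives += 1  # we "found" a difference that isn't there

print(f"Observed Type I error rate: {false_positives / n_trials:.3f}")  # roughly 0.05
```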

So what’s magic about 0.05, or 5%?

Well, in the end, nothing.  We could just as easily have accepted 0.10 or 0.025 or 0.01 as our ‘statistically significant’ cutoff, and in some cases authors do just that, with appropriate justification of the risk trade-offs.  But we’re stuck with it now, and there have been many discussions of the implications of this universally accepted cutoff and the effect it has had on the reproducibility of significant results.

This is also the reason I personally don’t have any problem with authors describing a ‘trend’ when p is roughly 0.06–0.1.  Nothing magic happens between p=0.049 and p=0.051; those results are far more comparable than p=0.049 and p=0.01, even though p=0.051 is not considered statistically significant and the other two are.  It is also why many journals are now pushing to have confidence intervals reported in addition to p-values, as these provide additional information about the certainty of the conclusions being made.
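
For completeness, a sketch of reporting a 95% confidence interval alongside the p-value (again my own illustration; the interval for the difference in means is computed from the t distribution under a pooled-variance assumption, matching scipy's default t-test):

```python
# Sketch: a 95% confidence interval for a difference in means, reported alongside p
# (assumed: scipy/numpy available; pooled-variance formula for simplicity).
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
a = rng.normal(loc=10.0, scale=2.0, size=25)
b = rng.normal(loc=11.0, scale=2.0, size=25)

diff = b.mean() - a.mean()
df = len(a) + len(b) - 2
pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / df
se = np.sqrt(pooled_var * (1 / len(a) + 1 / len(b)))

t_crit = stats.t.ppf(0.975, df)            # two-sided 95% critical value
ci = (diff - t_crit * se, diff + t_crit * se)
_, p = stats.ttest_ind(a, b)               # pooled-variance t-test

print(f"difference = {diff:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f}), p = {p:.3f}")
# The interval conveys both the size of the effect and the precision of the estimate,
# which a bare p-value does not.
```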
