Correlations

When we want to evaluate the relationship between two variables, looking at how one changes as the other changes, we use correlations.  Correlations can be computed for most data types, though the appropriate test depends on the types involved.  In general the correlation result will have two values: 1) the p-value, which shows the statistical significance of the result; and 2) the correlation coefficient, which shows the strength of the association.  The coefficient's magnitude is evaluated on a 0-to-1 scale, with 0 representing no association, roughly 0.2-0.4 a weak association, 0.4-0.6 a moderate association, and above 0.7 a very strong association.  (Many tests range from -1 to 1; a negative coefficient indicates that as one variable increases the other decreases.)  A chart of common correlation tests is shown below with some notes to help distinguish them.

Additional notes about some of the tests are shown below the chart.

A final point of caution with all tests of correlation: beware of outliers (extreme values).  Outliers can very strongly influence the result of a correlation test, overwhelming the rest of the data.  Tests that are robust to outliers are noted.
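To make the outlier warning concrete, here is a small sketch (assuming SciPy is available): a single extreme point swings Pearson's r wildly, while the rank-based Spearman's rho degrades much more gracefully.

```python
# Illustration: one extreme point can flip Pearson's r,
# while rank-based Spearman's rho is far more robust.
from scipy import stats

x = list(range(10))
y = list(range(10))              # perfect linear relationship

r_clean, _ = stats.pearsonr(x, y)
rho_clean, _ = stats.spearmanr(x, y)
print(r_clean, rho_clean)        # both 1.0

# Add a single outlier far off the trend
x_out = x + [100]
y_out = y + [-100]

r_out, _ = stats.pearsonr(x_out, y_out)
rho_out, _ = stats.spearmanr(x_out, y_out)
# Pearson swings strongly negative; Spearman only drops to 0.5
print(round(r_out, 2), round(rho_out, 2))
```

One point out of eleven is enough to reverse the sign of Pearson's r here, while Spearman's rho still reports a positive (if weakened) association.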

| Correlation Test | Scalar Data | Ordinal Data | Nominal Data | Detects non-linear? | Zero is meaningful? | Range | Notes |
|---|---|---|---|---|---|---|---|
| Pearson's | ✓ | | | No | * | -1 to 1 | Linear only |
| Distance Correlation | ✓ | | | Yes | Yes | 0 to 1 | See below |
| Spearman's Rho | ✓ | ✓ | | Monotonic | | -1 to 1 | Avoid with integer data; handles ties poorly |
| Kendall's Tau-b | ✓ | ✓ | | Monotonic | | -1 to 1 | Takes ties into account |
| Kendall's Tau-c | ✓ | ✓ | | Monotonic | | -1 to 1 | Preferred for rectangular tables; probably better than Rho |
| Goodman & Kruskal's Gamma | | ✓ | | Monotonic | | -1 to 1 | Good with many tied ranks |
| Yule's Q | | | ✓ | | | -1 to 1 | For 2×2 tables |
| Somers' D | | ✓ | | Monotonic | | -1 to 1 | X → Y ** |
| Contingency Coefficient | | | ✓ | | | 0 to 1 | |
| Phi Coefficient | | | ✓ | | | -1 to 1 | Use with 2-level variables |
| Cramer's V | | | ✓ | | | 0 to 1 | Use with 3+ level variables |
| Lambda | | | ✓ | | | 0 to 1 | X → Y ** |
| Uncertainty Coefficient | | | ✓ | | | 0 to 1 | X → Y ** |

* Only if X and Y are both independently and jointly normally distributed does a Pearson's of zero indicate no correlation.
** An asymmetric test, used when attempting to predict the second variable based on the value of the first.

For reporting correlation results, in addition to the p-value, report the correlation coefficient (the value between -1 and 1), which shows the strength and direction of the relationship.

Pearson Correlation Coefficient – Parametric; sensitive only to linear relationships, and therefore only for continuous data.  If two variables are uncorrelated then Pearson's is 0, but a Pearson's of 0 does not mean there is no relationship (unless X and Y are both independently and jointly normal).  Sensitive to outliers.  Distance correlation is broadly superior and should probably be used instead.
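A minimal sketch of the "zero does not mean no relationship" caveat, using SciPy's `pearsonr` (assumed available): a perfect quadratic dependence produces a Pearson's r of 0.

```python
from scipy import stats

x = [-3, -2, -1, 0, 1, 2, 3]
y = [v * v for v in x]        # perfect, but non-linear, dependence

r, p = stats.pearsonr(x, y)
print(r)                      # 0.0 — Pearson misses the quadratic relationship
```

Here y is completely determined by x, yet Pearson's r is exactly zero because the relationship is symmetric rather than linear.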

Distance Correlation – Designed to address the deficiency of the Pearson Correlation in not detecting relationships that are non-linear.  Its most significant property is that a distance correlation of 0 does mean that the random variables are statistically independent.
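Third-party packages (e.g. `dcor`) implement this; as an illustration, here is a minimal NumPy sketch of the sample distance correlation (pairwise distance matrices, double-centered, then a normalized distance covariance), applied to the same quadratic data that Pearson's r scores as 0.

```python
import numpy as np

def distance_correlation(x, y):
    """Sample distance correlation: 0 only under empirical independence."""
    x = np.asarray(x, dtype=float)[:, None]
    y = np.asarray(y, dtype=float)[:, None]
    a = np.abs(x - x.T)                      # pairwise distance matrices
    b = np.abs(y - y.T)
    # Double-center each matrix (subtract row/column means, add grand mean)
    A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()
    dcov2 = (A * B).mean()                   # squared distance covariance
    dvar_x = (A * A).mean()
    dvar_y = (B * B).mean()
    if dvar_x * dvar_y == 0:
        return 0.0
    return float(np.sqrt(dcov2 / np.sqrt(dvar_x * dvar_y)))

x = np.array([-3.0, -2, -1, 0, 1, 2, 3])
y = x ** 2                                   # Pearson's r is 0 on this data
print(distance_correlation(x, y))            # clearly greater than 0
```

Unlike Pearson's r, the distance correlation is strictly positive here, flagging the (non-linear) dependence between x and y.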

Spearman’s Rho – Non-parametric, rank-based.  Can detect monotonic non-linear relationships such as curves; really an attempt to extend Pearson’s to non-linear (monotonic) relationships.  Confidence intervals for Rho may be less reliable and less interpretable than those for Kendall’s Tau.
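A short sketch of Spearman's advantage on monotonic non-linear data, using SciPy (assumed available): on an exponential curve the ranks line up perfectly, so rho is 1 even though Pearson's r is well below 1.

```python
import math
from scipy import stats

x = list(range(1, 11))
y = [math.exp(v) for v in x]   # monotonic but strongly non-linear

rho, p = stats.spearmanr(x, y)
r, _ = stats.pearsonr(x, y)
print(rho)                     # 1.0 — the ranks agree perfectly
print(r)                       # noticeably below 1
```

Because Spearman's rho only uses ranks, any strictly increasing relationship scores a perfect 1.0, linear or not.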

Kendall’s Tau – Non-parametric, rank-based.  Can detect monotonic non-linear relationships.  May have better statistical properties than Spearman’s, as it was designed specifically as a statistical test for non-parametric correlation data.  Tau-a is used for cross-tabs of ordinal variables; Tau-b makes adjustments for ties; Tau-c is better than Tau-b for rectangular tables.  If one variable is dependent and the other independent, use Somers’ D instead.
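The Tau-b/Tau-c variants and the asymmetric Somers' D are all available in SciPy; a sketch assuming a recent version (`kendalltau`'s `variant` parameter and `somersd` were added in SciPy 1.7):

```python
from scipy import stats

# Ordinal ratings with ties in both variables
x = [1, 1, 2, 2, 3, 3, 4, 4]
y = [1, 2, 2, 3, 3, 3, 4, 4]

tau_b, p_b = stats.kendalltau(x, y, variant='b')  # adjusts for ties
tau_c, p_c = stats.kendalltau(x, y, variant='c')  # for rectangular tables
print(tau_b, tau_c)

# When y is treated as dependent on x, Somers' D is the asymmetric analogue
res = stats.somersd(x, y)                         # res.statistic is D(y|x)
print(res.statistic)
```

All three statistics agree on the direction of the association; they differ in how tied pairs and table shape are handled.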

Goodman & Kruskal’s Gamma – Used for two ordinal variables, especially with many tied ranks.  Can be used with just two categories, but Yule’s Q is probably better there.  Again assumes a monotonic relationship between the variables.
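Gamma has a simple definition: (C − D) / (C + D), where C and D are the counts of concordant and discordant pairs, with tied pairs dropped entirely (which is why it suits data with many ties). A minimal pure-Python sketch (the function name is ours, not from any library):

```python
from itertools import combinations

def goodman_kruskal_gamma(x, y):
    """Gamma = (C - D) / (C + D), ignoring tied pairs entirely."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx * dy > 0:
            concordant += 1          # both variables move the same way
        elif dx * dy < 0:
            discordant += 1          # they move in opposite ways
        # pairs tied on either variable (dx == 0 or dy == 0) are skipped
    if concordant + discordant == 0:
        return 0.0
    return (concordant - discordant) / (concordant + discordant)

# Ordinal survey responses with many ties
x = [1, 1, 2, 2, 3, 3]
y = [1, 2, 1, 3, 2, 3]
print(goodman_kruskal_gamma(x, y))   # positive: a moderate association
```

Because ties are excluded from both the numerator and denominator, gamma tends to run higher than Tau-b on the same heavily tied data.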