Chapter 10 Correlation


10.1 Pearson Correlation Coefficient

Correlation indicates the strength and direction of a relationship between two variables (Schober, Boer, and Schwarte 2018). Its values range from -1 to 1, where a value of -1 indicates a perfect negative linear relationship (as one variable increases, the other decreases proportionally), 1 indicates a perfect positive linear relationship (as one variable increases, the other increases proportionally), and 0 indicates no linear relationship between the two variables (Swinscow 1997).

The extreme values of -1 and 1 indicate perfect linear relationships, for which all of the data points fall on a single straight line. In practice it is rare to see a correlation this strong. Correlation is measured through Pearson’s correlation coefficient, which is represented by the Greek letter \(\rho\) for a population parameter or the lowercase letter \(r\) for a sample statistic.

Correlation can be used to summarize the relationship between two variables and can also be used to make predictions about one variable based on the values of the other. Hypothesis tests and confidence intervals can be used to address the statistical significance of the results (Schober, Boer, and Schwarte 2018).

Height and weight are an example of two variables which are correlated. As height increases weight also tends to increase, which means we can reasonably predict that taller people will, on average, weigh more than shorter people. Correlation is a quantitative assessment of both the direction and the strength of this tendency for variables to change together.

Values of \(r\) with a magnitude between 0 and 0.3 are typically considered weak, between 0.3 and 0.5 moderate, and greater than 0.5 strong. There are differences of opinion on this, as the scale is somewhat subjective: different fields of study will deem relationships strong where others would consider them moderate or even weak. As a rule of thumb, however, these guidelines serve well.

10.1.1 Examples

The figure below shows a perfect negative correlation, where r = -1. It’s clear from the scatter plot that as x increases the values of y decrease.

The figure below illustrates what two variables with no correlation might look like when plotted in a scatter plot. In this case, r = 0. The data has no discernible pattern through which x and y can be related to one another.
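These patterns are easy to reproduce numerically. The short Python sketch below (the data and variable names are illustrative assumptions, not taken from the figures) constructs a perfectly negatively correlated pair and an uncorrelated pair and computes r for each:

import numpy as np

rng = np.random.default_rng(42)  # seeded for reproducibility

# Perfect negative correlation: y is an exact decreasing line in x
x = np.linspace(0, 10, 50)
y_neg = -2 * x + 5
print(np.corrcoef(x, y_neg)[0, 1])  # exactly -1.0

# No correlation: y is random noise unrelated to x
y_none = rng.normal(size=x.size)
print(np.corrcoef(x, y_none)[0, 1])  # close to 0

Note that np.corrcoef returns the full correlation matrix; the off-diagonal entry [0, 1] is the value of r for the pair.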

10.2 Identifying Correlation

The simplest way to determine whether two variables are correlated is to plot them together. Scatterplots are particularly useful for checking whether there might be some relationship between pairs of continuous data.

The scatterplot below shows some simulated height and weight data with a line of best fit. At a glance it’s clear that as height increases weight increases. It’s not a perfect relationship, however: looking at various heights you can see there are clusters of weight values associated with each, and there are taller people who weigh less than some of their shorter peers, and shorter people who weigh more than their taller peers.

Pearson’s correlation takes all of the data points and represents them with a single summary statistic. In this case the output indicates a correlation of 0.8.
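The original height and weight data is not reproduced here, but the following sketch simulates comparable data (the sample size, means and noise level are assumptions chosen to give r close to 0.8) and computes the summary statistic:

import numpy as np

rng = np.random.default_rng(1)

n = 80
height = rng.normal(170, 10, n)  # heights in cm
# weight rises linearly with height plus noise, so the relationship is imperfect
weight = 70 + 0.9 * (height - 170) + rng.normal(0, 7, n)

print(np.corrcoef(height, weight)[0, 1])  # roughly 0.8 for these settings

# slope and intercept of the line of best fit drawn through the scatter plot
slope, intercept = np.polyfit(height, weight, 1)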

10.3 Calculating the Pearson Correlation Coefficient

Calculating the correlation coefficient by hand is tedious, and the best way to obtain it is to use R, Python, Excel, SPSS or some equivalent software. In Excel, calculating the value of r is as simple as typing:

=PEARSON(array1, array2)

This function takes two arguments, each of which should be an array containing the values of one of the two variables for which the Pearson correlation coefficient is to be calculated. Conventionally the first array holds the independent variable and the second the dependent variable, although because correlation is symmetric, swapping the two arrays gives the same result.
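An equivalent one-liner exists in Python via SciPy (shown here as a suggested alternative rather than part of the original example); conveniently, it also returns the p-value for the hypothesis test described later in this section:

from scipy.stats import pearsonr

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r, p_value = pearsonr(x, y)  # r and the two-tailed p-value
print(r, p_value)            # pearsonr(y, x) would give the same r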

The calculations performed behind the scenes are described below, but it isn’t necessary to calculate r or its associated test statistics by hand, as the process is long and prone to error. The details are provided for context, to help you understand what the software is doing.

The formula for calculating the Pearson correlation coefficient is (Swinscow 1997):

\[r = \frac{n \sum{xy}-\sum{x} \sum{y}}{\sqrt{[n \sum{x^2}-(\sum{x})^2][n \sum{y^2}-(\sum{y})^2]}}, \]

where \(r\) is the Pearson correlation coefficient, \(n\) is the total number of observations and \(x\) and \(y\) are the recorded data values (in the case of the example above these would be height and weight measurements). This is equivalent to writing:

\[r=\frac{\sum{(x-\bar{x})(y-\bar{y})}}{\sqrt{\sum{(x-\bar{x})^2}\sum{(y-\bar{y})^2}}}.\] Notice that the denominator in this formula is similar to the formula for variance in that it takes the deviations of \(x\) from its mean, squares them and sums the results.

Essentially the denominator in the formula for r is proportional to the product of the two variables’ standard deviations (standard deviation being the square root of variance), and the numerator is proportional to their covariance; the constants of proportionality cancel. Intuitively it makes sense that spread should matter when trying to determine whether two variables are related: when two variables are perfectly correlated they form a straight line in a scatter plot, but when they’re poorly correlated they appear more like a cloud of points.
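To make the equivalence of the two formulas concrete, the sketch below (a minimal illustration with made-up data) implements both and checks them against NumPy’s built-in routine:

import numpy as np

def pearson_raw(x, y):
    # first form: built from raw sums of x, y, xy, x^2 and y^2
    n = len(x)
    num = n * np.sum(x * y) - np.sum(x) * np.sum(y)
    den = np.sqrt((n * np.sum(x**2) - np.sum(x)**2)
                  * (n * np.sum(y**2) - np.sum(y)**2))
    return num / den

def pearson_centred(x, y):
    # second form: built from deviations about the means
    dx, dy = x - x.mean(), y - y.mean()
    return np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

rng = np.random.default_rng(0)
x = rng.normal(size=30)
y = 0.5 * x + rng.normal(size=30)

# the two forms agree with each other and with NumPy
print(pearson_raw(x, y), pearson_centred(x, y), np.corrcoef(x, y)[0, 1])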

The r value for the height and weight data presented above is approximately 0.8.

The Pearson correlation coefficient can also be used to test whether a relationship between two variables is statistically significant. In this hypothesis test the null hypothesis is that the Pearson correlation of the population is equal to zero, and the alternative hypothesis is that it is not.

To test the hypotheses it is necessary to calculate a t-value (Swinscow 1997):

\[t=\frac{r}{\sqrt{\frac{1-r^2}{n-2}}}.\]

For the data above this gives:

\[t=\frac{0.8}{\sqrt{\frac{1-0.8^2}{80-2}}},\] \[t=\frac{0.8}{\sqrt{\frac{0.36}{78}}}\approx 11.8.\]

The next step is to find the critical value of t, which can be looked up in a t-table. To do so you need the degrees of freedom (df = n - 2), the significance level (0.05 by convention) and whether the test is one-tailed or two-tailed (two-tailed is most frequently applied).

If the calculated t value is greater than the critical value, the relationship is statistically significant and the null hypothesis is rejected. If, however, the calculated t value is less than the critical value, the relationship is not statistically significant and the null hypothesis cannot be rejected.

From the t-table the critical value of t in this case is 1.990, and the calculated value of 11.8 is much greater than the critical value, so the null hypothesis is rejected.
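The whole test can be reproduced in a few lines; here is a sketch using SciPy (the library choice is an assumption, the numbers are those of the worked example above):

from scipy import stats

r, n = 0.8, 80
t = r / ((1 - r**2) / (n - 2)) ** 0.5
print(round(t, 1))  # 11.8

# two-tailed critical value at the 0.05 level with n - 2 degrees of freedom
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 2)
print(round(t_crit, 3))  # about 1.99, agreeing with the t-table value

print(abs(t) > t_crit)  # True, so the null hypothesis is rejected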

10.4 When to Use the Pearson Correlation Coefficient

The Pearson correlation coefficient is one of several correlation coefficients to choose from. It is a good choice when both variables being examined are quantitative, when the variables are normally distributed, when the data is free from outliers and when the relationship is linear (that is, the relationship between the two variables can be described by a straight line).

Spearman’s rank correlation coefficient may be a better choice when the variables are ordinal, when they aren’t normally distributed, when the data contains outliers or when the relationship between the variables is non-linear but monotonic.
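The contrast is easy to demonstrate. In the sketch below (simulated data, with an exponential relationship chosen purely for illustration) Pearson’s r understates a relationship that Spearman’s coefficient captures almost perfectly:

import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(3)
x = rng.uniform(0, 5, 100)
y = np.exp(x) + rng.normal(0, 1, 100)  # monotonic but strongly non-linear

r_pearson, _ = pearsonr(x, y)    # well below 1: the relationship is not linear
r_spearman, _ = spearmanr(x, y)  # close to 1: the relationship is monotonic
print(r_pearson, r_spearman)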

These techniques are also closely related to linear regression, although they all serve distinct purposes. A linear regression allows us to fit a line through the data points in a scatter plot and to estimate the values of the dependent variable from the independent variable, but it does not provide any information on how strongly the variables are related. Correlation does not fit the data or allow estimation, but it does describe the strength of the relationship (Schober, Boer, and Schwarte 2018). These two analyses often go hand in hand.

10.5 Fitting Data

It is also worth noting that the plots presented above used “linear regression” to establish a line of best fit. Correlation is often used hand in hand with linear regression.

10.5.1 Linear Relationships

Correlation quantifies the strength of a linear relationship between variables whilst linear regression expresses the relationship in the form of an equation which we can use to predict the values of our dependent variable and establish a line of best fit.
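SciPy’s linregress function illustrates how the two analyses sit side by side, returning the fitted equation and the correlation coefficient in one call (the data here is simulated in the same spirit as the height and weight example):

import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(7)
height = rng.normal(170, 10, 80)
weight = 70 + 0.9 * (height - 170) + rng.normal(0, 7, 80)

fit = linregress(height, weight)
# slope and intercept define the line of best fit; rvalue is Pearson's r
print(f"weight = {fit.slope:.2f} * height + {fit.intercept:.2f}, r = {fit.rvalue:.2f}")

predicted = fit.slope * 180 + fit.intercept  # predicted weight at height 180 cm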

When the relationship is not linear we may need to use more complex methods of fitting.

10.5.2 Non-Linear Relationships

A non-linear regression can be used when the relationship between the dependent variable and the independent variable is non-linear, that is, when a straight line will not adequately describe the relationship between the variables. Non-linear regression fits a curve, or a function that best describes the relationship between the variables, to the data. The plot below shows a fit achieved with non-linear regression and the associated confidence bands. The data is simulated for a fictional political party and represents the percentage change in their share of the vote according to various fictional polls held between 2000 and 2005.
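A fit of this kind might be produced as follows; the quadratic model, the simulated poll values and the noise level are all assumptions for illustration, since the actual function behind the figure is not given:

import numpy as np
from scipy.optimize import curve_fit

def model(t, a, b, c):
    # an assumed quadratic trend in vote share over time
    return a * t**2 + b * t + c

rng = np.random.default_rng(5)
t = np.linspace(0, 5, 40)  # years since 2000, centred to keep the fit stable
votes = 0.5 * (t - 2.5)**2 - 1 + rng.normal(0, 0.5, t.size)

params, cov = curve_fit(model, t, votes)
fitted = model(t, *params)  # the smooth curve drawn through the data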

The 95% confidence bands were calculated using a technique called “bootstrapping”, where the original data set is resampled with replacement to create a large number of new data sets, allowing us to estimate the uncertainty in the fitted curve and calculate confidence intervals. This is not the only way to calculate confidence intervals, and there are pros and cons to the different methods.
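A minimal bootstrap for the confidence bands might look like this (continuing the assumed quadratic model from the previous sketch):

import numpy as np
from scipy.optimize import curve_fit

def model(t, a, b, c):
    return a * t**2 + b * t + c  # same assumed quadratic as above

rng = np.random.default_rng(5)
t = np.linspace(0, 5, 40)
votes = 0.5 * (t - 2.5)**2 - 1 + rng.normal(0, 0.5, t.size)

# resample the data with replacement, refit, and collect the fitted curves
curves = []
for _ in range(1000):
    idx = rng.integers(0, t.size, t.size)
    params, _ = curve_fit(model, t[idx], votes[idx])
    curves.append(model(t, *params))

# the 2.5th and 97.5th percentiles of the curves form the 95% confidence band
lower, upper = np.percentile(curves, [2.5, 97.5], axis=0)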

LOWESS stands for Locally Weighted Scatterplot Smoothing and is another method used to create a smooth curve through a set of data points. It is particularly useful when there is no clear relationship between variables or when the relationship is too complex to be described by a simple linear or non-linear model.

LOWESS works by fitting a regression line to a subset of the data points around each point, giving the greatest weight to the points nearest to it. A new local regression is fitted at each point along the data, and the final smooth curve is formed from the fitted values of all these local regressions.
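In Python a LOWESS curve can be obtained from the statsmodels package; the data below is simulated and the smoothing fraction is an arbitrary choice:

import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(2)
x = np.linspace(0, 5, 60)
y = np.sin(x) + rng.normal(0, 0.3, x.size)

# frac sets the share of points used in each local regression:
# smaller values follow the data more closely, larger values smooth more
smoothed = lowess(y, x, frac=0.3)  # returns sorted (x, fitted y) pairs
x_smooth, y_smooth = smoothed[:, 0], smoothed[:, 1]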

The plot below illustrates a LOWESS fit for simulated data representing the percentage change in a fictional party’s share of the vote according to fictional polls held between 2000 and 2005.

The faint blue lines show the outcome of the resampling that occurs when bootstrapping.

10.6 Correlation vs. Causation

The expression “correlation does not imply causation” is fairly well known, and it is a warning against interpreting a strong correlation to mean that change in one variable directly causes the change in another. This is rarely the case. Ice cream sales and drownings are positively correlated, but it is unlikely that ice cream sales are somehow driving people to drown or vice versa. Instead, it is temperature that causes the change in the other two variables: higher temperatures cause higher ice cream sales and drive a greater number of swimmers into the sea and swimming pools (where they are at greater risk of drowning).

References

Schober, Patrick, Christa Boer, and Lothar A. Schwarte. 2018. “Correlation Coefficients: Appropriate Use and Interpretation.” Anesthesia & Analgesia 126 (5): 1763–68. https://doi.org/10.1213/ANE.0000000000002864.
Swinscow, T.D.V. 1997. “Correlation and Regression.” In Statistics at Square One, 9th ed. BMJ Publishing Group. https://www.bmj.com/about-bmj/resources-readers/publications/statistics-square-one/11-correlation-and-regression.