Show Me the Stats: Correlation Coefficients
So. Correlations. They’re fairly simple creatures: they measure the relationship between variables, without making any assumptions or assertions about cause and effect, dependence, or direction. Correlation coefficients range from -1 to 1, with -1 meaning a perfect negative correlation, 1 being a perfect positive correlation, and 0 meaning no correlation at all.
The most commonly used measure of correlation is the Pearson coefficient, usually denoted by r. The equation can be shuffled around quite a bit, but in one common form it looks like this:

$$ r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n \, s_X \, s_Y} $$
In English, this means:
- For each data point (X,Y) find the distance between X and the X mean, and the distance between Y and the Y mean. Multiply these together to get a crossproduct.
- Do this for all of the data points. Sum the results. This is your numerator.
- Separately, multiply the standard deviation of the x variable, the standard deviation of the y variable, and the total number of data points. This is your denominator.
- Divide the numerator by the denominator, as you do.
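The four steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a definitive implementation: the function name `pearson_r` is my own, and it assumes the standard deviations are population SDs (dividing by n), which is what makes multiplying by n in the denominator work out.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation, following the four steps in the list above."""
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("xs and ys must be non-empty and the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Steps 1 and 2: crossproduct of the two distances from the mean,
    # summed over every data point. This is the numerator.
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    # Step 3: population SD of each variable (assumption: divide by n),
    # multiplied together and by the number of data points.
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    denominator = sd_x * sd_y * n
    if denominator == 0:
        raise ValueError("r is undefined when a variable never varies")

    # Step 4: divide.
    return numerator / denominator
```

Note the guard on the denominator: if either variable is constant, its SD is zero and the coefficient is undefined.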
How does this tell us the relationship between two variables? To understand, we can use the concept of a regression line, although I won’t explore how to calculate one until a later post. Briefly, a regression line or line of best fit is a line drawn across your data such that the data points, collectively, sit as close to the line as possible. Another way to think of it is that it’s the line drawn by the linear equation which best predicts one variable from the other. The correlation coefficient is a way of measuring how closely the data fit that ideal line.

A fake example to demonstrate, and then a calculation using political data, are below the cut.
Let’s calculate a perfect positive correlation. Suppose our favorite pizza shop has an unstated but impeccably followed policy of providing customers with three napkins for every slice of pizza they buy. You and I, curious but sadly too shy to ask, try to figure out the relationship of napkins to pizza. We visit five times and get the following five observations:
| Pizzas | Napkins |
| 1 | 3 |
| 2 | 6 |
| 3 | 9 |
| 4 | 12 |
| 5 | 15 |
The mean number of pizzas is 3 (sum of 15 divided by n of 5) and the mean number of napkins is 9 (sum of 45 divided by n of 5). So, following the steps above, we find the distance from the mean for X and Y of each data point. Then we multiply the x and y distances together to get a crossproduct:
| X (pizzas) | Y (napkins) | X – Mean X | Y – Mean Y | Crossproduct |
| 1 | 3 | -2 | -6 | 12 |
| 2 | 6 | -1 | -3 | 3 |
| 3 | 9 | 0 | 0 | 0 |
| 4 | 12 | 1 | 3 | 3 |
| 5 | 15 | 2 | 6 | 12 |
Our first data point (1,3) has distances of -2 and -6 respectively. Multiplied, we get a value of 12. Values for the other four observations are 3, 0, 3, and 12. Summing them, we get 30, our numerator. Calculating the standard deviations (which we went over last time) gets us an SD of √2 for pizzas and √18 (or 3√2) for napkins. We multiply these together to get 6 (as √2 x 3√2 = 3 x 2 = 6), then multiply by our n of 5 to get 30, our denominator. 30 divided by 30 is 1: a perfect positive correlation.
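If you’d rather let a computer do the multiplying, the pizza walkthrough can be reproduced with Python’s standard library. This is just a sketch of the same arithmetic; the variable names are mine, and `pstdev` is the population standard deviation (dividing by n), which I am assuming matches the SD used here.

```python
from statistics import mean, pstdev

pizzas = [1, 2, 3, 4, 5]
napkins = [3, 6, 9, 12, 15]
n = len(pizzas)

mean_x = mean(pizzas)   # 3
mean_y = mean(napkins)  # 9

# Distance from the mean for X and Y, multiplied together per data point.
crossproducts = [(x - mean_x) * (y - mean_y) for x, y in zip(pizzas, napkins)]
numerator = sum(crossproducts)

# Population SDs: sqrt(2) for pizzas, sqrt(18) for napkins.
sd_x = pstdev(pizzas)
sd_y = pstdev(napkins)
denominator = sd_x * sd_y * n

r = numerator / denominator

print(crossproducts)          # [12, 3, 0, 3, 12]
print(numerator)              # 30
print(round(denominator, 6))  # 30.0
print(round(r, 6))            # 1.0
```

The rounding in the last two lines is only there because square roots are stored as floating-point numbers, so the raw values can be off by a hair.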
Wikipedia – the source of all my images for this post – has a lovely little visual demonstration of what correlations can look like:

I’ve chosen to look for a correlation between the personal wealth of US Senators, and how much they get in contributions from the financial industry. Obviously there are a lot of factors that influence who people donate to – their espoused beliefs, their votes, their position on key committees, how vulnerable they are in elections and how expensive it is to run campaigns in their state. But there may also be a relationship between how much money a senator has personally, and how comfy they feel cozying up to the financial industry.
To do this demo, I took data from Open Secrets on net wealth and contributions and put it together in a spreadsheet (which you can access, if you’re so inclined.)
Starting off simply, we can calculate that the mean donation received from the financial industry is $196,615 (nearly double the mean donation to House members, which we calculated last time). Mean wealth is $13,783,639. (This number blew my mind a little, so I checked the spreadsheet: only 18 Senators have net wealth above this mean, but 8 of them have $50 million or more, with John Kerry, the wealthiest senator, having $231 million to his name. The median wealth is an ever so slightly more reasonable $2,685,511. Their wealth added all together is $1.3 billion.)
Back on track. So: to calculate the correlation we start by getting the distance between each value and the variable mean, and then multiplying them to get crossproducts. The sum of those crossproducts is our numerator, 25,322,938,435,740. Then to get the denominator, we calculate the SD for wealth and for donations by getting each value’s distance from the mean, squaring it, summing the squares, dividing by n (96) and taking the square root. We take the SD of each variable (wealth = $34,250,831, donations = $318,008), multiply them together, and multiply by n to get the denominator. Divide the former by the latter and you get our correlation coefficient: 0.024.
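As a quick sanity check on that last division, here is the same arithmetic in Python, using only the totals reported above (the variable names are mine; the spreadsheet data itself is not reproduced here):

```python
# Figures as reported in the text above.
sum_crossproducts = 25_322_938_435_740
sd_wealth = 34_250_831
sd_donations = 318_008
n = 96

# Denominator: both standard deviations multiplied together, times n.
denominator = sd_wealth * sd_donations * n

r = sum_crossproducts / denominator
print(round(r, 3))  # 0.024
```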
Interpretation of correlation coefficients is, it turns out, fairly arbitrary. However, our result is pretty obviously not a strong correlation. +/- 0.1 to 0.3 is considered a small correlation, +/- 0.3 to 0.5 is considered medium, and +/- 0.5 to 1 is considered strong. Anything less than 0.1 is nothing. So… we’ve found nothing.
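Those rule-of-thumb bands are easy to write down as a small helper. This is only a sketch: `describe_strength` is a name I made up, and since the bands above share their endpoints, I am assuming each boundary value (0.1, 0.3, 0.5) belongs to the stronger band.

```python
def describe_strength(r):
    """Label a correlation coefficient using the rough bands above."""
    if not -1 <= r <= 1:
        raise ValueError("a correlation coefficient must be between -1 and 1")
    size = abs(r)  # the sign gives direction, not strength
    if size < 0.1:
        return "none"
    if size < 0.3:
        return "small"
    if size < 0.5:
        return "medium"
    return "strong"


print(describe_strength(0.024))  # none
```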
There are a number of tests that can be done to interpret correlation coefficients and quantify their significance. However, inferential statistics are another topic for another day. (And also, it will be more fun to play around with inference and significance if I can find a stronger correlation next time.) So that’s all for now, and next time we’ll talk about significance testing.