My previous post about p-values has reminded me that I’ve been meaning to brush up on my statistics. I’m hoping to do this by writing a series of tutorials for this blog, and to keep things interesting for my less statistically-inclined readers, the datasets I’ll be using will be campaign finance and lobbying information from Open Secrets and Sunlight Labs. So come, join me as I learn about quantitative analysis and also the pervasive corruption of our government…
The data in this post are from OpenSecrets.org and represent the total contributions from commercial banks to individual House members in the current (2011-2012) election cycle.
If you only know a little bit of stats, you probably know descriptive statistics. Unlike most statistics, they don’t require induction or inference – that is, they’re not about using the data you have to make guesses about the data you don’t. They are, as their name makes clear, about describing your data, about summarizing it in clear, rigorous terms.
So in this data set, we have 440 values that sum to $36,907,300.00. How is that money distributed over the individual representatives? We can look at every single value, from progressive lion and financial reform advocate John Conyers's meager $500, to Republican Speaker of the House John Boehner's impressive $1,232,637. Or we can try to graph our values in a histogram:
Here we can see that the data is not normally distributed – that is, it’s not your standard symmetrical bell curve. The data is skewed, with almost half of the values falling into the left-most category (contributions < 50k). Skewness is a straightforward concept, although I find the names for skewness to be kind of confusing. A distribution like ours, with most values bunched on the left and a long tail stretching to the right, is called a right-skewed or positively-skewed distribution – the name follows the tail, not the bulk. The opposite – most values on the right, with a tail stretching left – is a left-skewed or negatively-skewed distribution.
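A quick sanity check for skew: in a right-skewed distribution the long tail drags the mean above the median. Here’s a minimal sketch in Python, using a small made-up sample (the real data set has 440 values):

```python
from statistics import mean, median

# Hypothetical right-skewed sample: many small values, one long right tail
data = [500, 2000, 8000, 10000, 33250, 46750, 96600, 1232637]

# The tail pulls the mean well above the median (compare the post's
# $83,880 mean against its $46,750 median)
print(mean(data) > median(data))
```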
Averages (or: Measures of central tendency)
Everyone reading this likely knows the technical difference between the mean, median and mode. The mean is the sum of your values, divided by the count. Of the measures of central tendency, it is by far the most vulnerable to outliers. Conversely, the median (the middle of your values) and the mode (the most common of your values) tend to be more robust, though still imperfect, reflections of the average.
To get the mean, we sum the values and divide by 440, getting the result: $83,880.23
To get the median, we get the midpoint between the 220th and 221st values: $46,750
To get the mode, you simply find the most common value. In this case we have relatively few repetitions, with no value occurring more than three times. We find ourselves with seven different modes: $2000, $8000, $9000, $10000, $19750, $33250 and $54600.
(All but one of the modes, interestingly, are in the lower half of the distribution, and you have to get down to the 170th highest value before you find any repeats at all. This makes sense given the nature of our data, which are right-skewed.)
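All three measures of central tendency are one-liners in Python’s standard library. A sketch with a small hypothetical sample (not the real 440-value data set):

```python
from statistics import mean, median, multimode

# Hypothetical contribution amounts, for illustration only
contributions = [500, 2000, 2000, 8000, 46750, 54600, 96600, 1232637]

print(mean(contributions))       # sum of values divided by the count
print(median(contributions))     # midpoint of the sorted values (averages
                                 # the two middle values for an even count)
print(multimode(contributions))  # all most-common values, so ties like our
                                 # seven modes come back as a list
```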
Variation (or: Measures of dispersion)
This is where things start to get interesting! Measures of variation, or dispersion, tell us how much diversity there is in our data. Even if we know that the “average” Representative receives about $80k in contributions, we can’t tell if the commercial banks are giving around that amount to everyone, or if they’re more discriminating than that. So we turn to measures of variation.
The easiest measure to understand is range, which you can calculate by subtracting your least value from your greatest. Our range is $1,232,137. To mitigate the effects of outliers, people will frequently report the interquartile range – the difference between the 25th and 75th percentiles. Since we have 440 data points, we would calculate the IQR by subtracting the 110th data point (Cliff Stearns (R-FL) – $25,498.00) from the 330th data point (Darrell Issa (R-CA) – $96,600.00). Our IQR is $71,102.00.
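Both ranges are simple index arithmetic on the sorted data. A sketch, seeded with the handful of real values quoted above so the two ranges come out to the post’s numbers:

```python
# Values quoted in the post: Conyers, Stearns, the median, Issa, Boehner
data = sorted([500, 25498, 46750, 96600, 1232637])

# Crude range: greatest value minus least value
crude_range = data[-1] - data[0]

# A simple IQR convention: the values a quarter and three quarters of the
# way through the sorted list (the post uses the 110th and 330th of 440
# points the same way)
n = len(data)
q1 = data[n // 4]
q3 = data[(3 * n) // 4]
iqr = q3 - q1
```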
More complicated, but usually much more informative, is the standard deviation. To figure out the standard deviation, you must first calculate the mean. Then, take each data point and find the difference between it and the mean. For instance, our first data point, John Conyers, is $83,380.23 away from the mean. The next data point, Lynn Woolsey, is $82,130.23 away from the mean. Chris Van Hollen – my parents’ Representative, and former head of the DCCC – is much closer to the mean: $1,602.77 above.
These differences are then squared. This results in data points above and below the mean being given equal weight – a representative getting $100 more than the mean ($100) and a representative getting $100 less than the mean (-$100) both have the same squared difference, as a negative times a negative is a positive. The squared differences are summed, and divided by the total number of data points (in our case, 440). This gives you the variance. The standard deviation is the square root of the variance.
For our data set, the variance is $12,890,520,522.33 and the standard deviation is $113,536.43. This is, you might have noticed, huge – nearly half again the size of the mean. This is not surprising, given what our distribution looks like. The heavy right-skew gives us a large number of low values on the left and, even more disruptively, a long tail of high values on the right. These cancel each other out and give a reasonable mean. But the variance squares each difference, so there’s no canceling – every departure from the mean counts against us, and we end up with a sky-high standard deviation.
One alternative to the standard deviation, useful for non-normal distributions like ours, is the average absolute deviation. The average absolute deviation comes in three forms – mean, median, and mode. With the mean absolute deviation, you’re calculating difference from the mean. With the median absolute deviation, you’re calculating difference from the median. Although the mean absolute deviation is more common, in our case the median absolute deviation may be more useful, as the median is less influenced by outliers than the mean.
Average absolute deviation is calculated similarly to the standard deviation, with only one change. You determine the difference between each data point and the median (or mean, or mode). You skip the step of squaring the differences (although you *do* drop the minus signs). You sum the differences and divide by the total count. For our sample, the median absolute deviation is $59,050.59 – still a high number, but much more reasonable (and at least less than the mean!)
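Following the post’s recipe exactly – absolute differences from the median, summed and divided by the count – this is even shorter than the standard deviation. Again the sample data are hypothetical:

```python
from statistics import median

def avg_abs_deviation(values, center):
    """Average distance from a chosen center, per the post's recipe:
    drop the minus signs, sum, divide by the count. No squaring."""
    return sum(abs(x - center) for x in values) / len(values)

# Hypothetical sample built from values quoted in the post
data = [500, 1750, 46750, 96600, 1232637]

# Deviation around the median; pass mean(data) instead for the
# mean absolute deviation
mad = avg_abs_deviation(data, median(data))
```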
I’m not sure about the pros and cons of using standard deviation vs mean absolute deviation. This article claims that standard deviations are ideal only for normal distributions, which can be quite rare in experimental science, and that they are over-taught and over-used. While they are more accurate and efficient in normal distributions, they are susceptible to skewness and outliers, whereas the median absolute deviation is more robust.
What we’ve learned
Distribution – frequency of values – right-skewed
Mean – sum of values divided by count – $83,880.23
Median – midpoint value – $46,750
Mode – most common value – $2000, $8000, $9000, $10000, $19750, $33250 and $54600
Crude Range – least value subtracted from greatest value – $1,232,137
Interquartile Range – 25th percentile value subtracted from 75th percentile value – $71,102.00
Variance – sum of squared differences from the mean, divided by count – $12,890,520,522.33
Standard Deviation – square root of the variance – $113,536.43
Median Absolute Deviation – sum of absolute differences from the median, divided by count – $59,050.59
Correlations! Possibly also regressions!
Finally, have a graphic. I made it just for this series of posts, and I can’t decide if it is too silly, or not silly enough…