Monday, December 26, 2016

u/zonination's "Got ticked off about skittles posts, so I decided to make a proper analysis for /r/dataisbeautiful [OC]"

The subreddit r/dataisbeautiful was inundated by folks creating color distributions for bags of candy. And because 1) it is Reddit and 2) stats nerds take joy in silly things, candy graphing got out of hand. See below:

And because it is Reddit, and, to be fair, because a single bag is statistically unreliable, other posters would claim that this data WASN'T beautiful because it was a small sample size and didn't generalize. One bag of Skittles, they claimed, didn't tell you a lot about the underlying population of Skittles.

Until Redditor zonination came along, bought 35 enormous bags of Skittles, and went to work meticulously documenting the color distribution in each bag. And he used R. And he created multiple data visualizations. See below. Here is the Reddit post, and here is his Imgur gallery with visualizations and a narrative describing his findings. (Y'all, I know Reddit has a bad reputation at times, but the discussion in this posting is hilarious if you are a stats nerd. Check it out.)

He explained his data with a heat map...
And a stacked bar chart that really illustrates outlier bags 15 and 16. Imagine if you mistakenly tried to generalize from one of those bags?
And he presents the increasingly popular violin plot, as well as the perpetual favorite, a bar chart with error bars.

AND...he shared his data and R code with the world.

How to use in class:
-Discuss proper sample sizes required in order to generalize to a population. I think rogue bags 15 and 16 are especially effective at demonstrating sampling error.
-Your students understand the concept of Skittles. Therefore, they will be able to understand the nuances of these different kinds of data visualizations.
-Buy your students some Skittles and replicate.
-Data and code available to play around with.
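
If you want a quick classroom demo of the sample size point before anyone opens a real bag, here is a minimal simulation sketch. To be clear, this is not zonination's R code or his data. It is written in Python, and it assumes (hypothetically) that all five colors are equally likely and that every bag holds 60 candies; only the count of 35 bags comes from the post. It compares how far the worst single bag strays from an even split with how far all 35 bags pooled together stray.

```python
import random
from collections import Counter

# Hypothetical setup for illustration only, not taken from zonination's data.
COLORS = ["red", "orange", "yellow", "green", "purple"]
BAG_SIZE = 60   # assumed number of candies per bag
N_BAGS = 35     # number of bags mentioned in the post


def draw_bag(rng, size=BAG_SIZE):
    """Simulate one bag: each candy's color is drawn uniformly at random."""
    return Counter(rng.choice(COLORS) for _ in range(size))


def proportions(counts):
    """Convert color counts into proportions of the bag (or pooled) total."""
    total = sum(counts.values())
    return {color: counts[color] / total for color in COLORS}


def max_deviation(props, expected=1 / len(COLORS)):
    """Largest absolute gap between any color's proportion and an even split."""
    return max(abs(p - expected) for p in props.values())


rng = random.Random(2016)  # fixed seed so the class sees the same bags each run
bags = [draw_bag(rng) for _ in range(N_BAGS)]
pooled = sum(bags, Counter())  # add up the color counts across all bags

worst_single = max(max_deviation(proportions(bag)) for bag in bags)
pooled_dev = max_deviation(proportions(pooled))

print(f"Worst single-bag deviation from an even split: {worst_single:.3f}")
print(f"Deviation after pooling all {N_BAGS} bags: {pooled_dev:.3f}")
```

Because the bags are all the same size, the pooled deviation can never exceed the worst single-bag deviation, which is the point: any one bag can look like a rogue, while the pooled sample settles down. Try changing BAG_SIZE or N_BAGS and have students predict what happens before rerunning it.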