Posts

Showing posts with the label reddit

Leo DiCaprio Romantic Age Gap Data: UPDATE

Does anyone else teach correlation and regression together at the end of the semester? Here is a treat for you: updated data on Leonardo DiCaprio, his age, and his romantic partner's age when they started dating. A few years ago, there was a dust-up when a clever Redditor, u/TrustLittleBrother, realized that DiCaprio had never dated anyone over 25. I blogged about this when it happened, but the old data was from 2022. Inspired by this sleuthing, I created a wee data set, including up-to-date information on his current relationship with Vittoria Ceretti, so your students can suss out the patterns that exist in this data.

r/DataIsUgly

I have found plenty of class inspiration on Reddit. Various subs have provided a new way to explain mode and median and great, intuitive data to teach correlation. However, much as a reverse-coded item on a scale gets at the opposite of what you are asking about, r/DataIsUgly is rife with examples of how NOT to present data, which makes it useful for teaching how to create good data visualizations. Very recently, I shared this example from r/DataIsUgly to illustrate why NOT to truncate the Y axis. And...this sub is filled with people like us. People who love to proofread and notice data crimes. How to use it in class? Can your students figure out why these data visualizations are...less than optimal? Can they fix them? They could be a fun prompt for extra credit points or a discussion board.

Correlation example: Taco Bell and mortality by state...don't run for the border!

Many thanks to my colleague, Andrew Caswell, for sharing this Reddit post with me: https://www.reddit.com/r/dataisbeautiful/comments/s75sm7/oc_us_life_expectancy_vs_of_taco_bell_locations/ So, this alone is an excellent example of correlation and the third variable problem. But...more delightfully, the Redditor who created this graph also shared where he found this data (https://www.nicerx.com/fast-food-capitals/, https://worldpopulationreview.com/state-rankings/life-expectancy-by-state). BETTER STILL: I downloaded and organized all of the fast-food data and mortality data and put it in one spreadsheet for you all. Do All The Correlations! Teach your students about Bonferroni corrections! Figure out the fast-food restaurant that correlates the most strongly with mortality! PS: Did you know that there is an option to download data from a website in Excel? The fast-food data was presented in an embedded, scrolly table, and that Excel option made it easy-peasy to do...
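If you do run All The Correlations, a Bonferroni correction simply divides your alpha by the number of tests. Here is a minimal Python sketch of that idea. The chain names and p-values are invented for illustration and do not come from the spreadsheet.

```python
def bonferroni(p_values, alpha=0.05):
    """Return the per-test alpha and a flag for each test that survives it."""
    m = len(p_values)                 # number of correlations tested
    per_test_alpha = alpha / m        # Bonferroni: split alpha across all tests
    significant = {name: p < per_test_alpha for name, p in p_values.items()}
    return per_test_alpha, significant

# Hypothetical p-values for five fast-food chains (made up, not real results)
example_p = {
    "chain_a": 0.004,
    "chain_b": 0.030,
    "chain_c": 0.200,
    "chain_d": 0.012,
    "chain_e": 0.600,
}

per_test_alpha, flags = bonferroni(example_p)
print(per_test_alpha)
print(flags)
```

Students can compare how many correlations look "significant" at .05 with how many survive the corrected threshold.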

Daves know more Daves: An independent t-test example from Reddit

This is a beautiful story from Reddit, with a very kind Redditor, Higgnenbottoms/Quoc Tran, who shared his data with all of us, so we can use this as an example of a) independent t-tests, b) violin plots, AND c) R. So, user u/quoctran98 wanted to know if Daves knew more Daves than non-Daves do. HA! He started by collecting data from r/samplesize. Do you all know about that subreddit, where you can post a survey and see who responds? You're welcome. Anyway, Quoc analyzed his data AND created a violin plot to illustrate his data. He shared it at r/dataisbeautiful, which is another excellent stats subreddit. AND...here is the kicker...I contacted Quoc, and he shared his data (so your students can run their t-tests) AND his R code. I cleaned up his data a bit to provide the same results as his graph (he had someone report that they knew 69 Daves. I mean, he collected the data from Reddit users.).
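For anyone who wants to show the arithmetic behind an independent t-test before handing students the real file, here is a small Python sketch of the classic pooled-variance (Student's) version. The two lists are invented stand-ins, not Quoc's data, and this is not his R code.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(group_a, group_b):
    """Student's independent-samples t statistic and its degrees of freedom."""
    na, nb = len(group_a), len(group_b)
    df = na + nb - 2
    # Pooled variance: weighted average of the two sample variances
    sp2 = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / df
    se = sqrt(sp2 * (1 / na + 1 / nb))   # standard error of the mean difference
    t = (mean(group_a) - mean(group_b)) / se
    return t, df

# Hypothetical "number of Daves known" (made-up values for illustration)
daves = [5, 7, 6, 8, 4]
non_daves = [3, 4, 2, 5, 3, 4]

t, df = pooled_t(daves, non_daves)
print(round(t, 3), df)
```

Compare the resulting t against a critical value from a t table with the reported degrees of freedom, or have students check it against their stats package.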

Reddit's data_irl subreddit

You guys, there is a new subreddit just for sharing silly stats memes. It is called r/data_irl/. The origin story is pretty amusing. I have blogged about the subreddit r/dataisbeautiful previously. The point of this sub is to share useful and interesting data visualizations. The sub has a hard and fast rule about only posting original content or well-cited, serious content. It is a great sub. But it leaves something to be desired. That something is my deep desire to see stats jokes and memes. On April Fool's Day this year, they got rid of their strict posting rules for a day and the dataisbeautiful crowd provided lots of hilarious stats jokes, like two I posted on Twitter. The response was so strong, because there are so many people who love stats memes, that a new sub was started, data_irl, JUST TO SHARE SILLY STATS GRAPHICS. It feels like coming home to my people.

I've tracked all my son's first words since birth [OC]

Reddit user jonjiv conducted a case study in human language development. He carefully monitored his son's speaking ability, and here is what he found: https://imgur.com/gallery/KwZ6C#qLwsn9S (go to this link for a clearer picture of the chart!). How to use in class: 1) Good for Developmental Psychology. Look at that naming explosion! 2) Good to demonstrate how nerdy data collection can happen in our own lives. 3) Within versus between subject design. Instead of sampling separate 10, 11, 12, etc. month old children, we have real-time data collected from one child. AND this isn't retrospective data, either. 4) Jonjiv even briefly describes his "research methodology" in the original post. The word had to be used in a contextually appropriate manner AND observed by both him and his wife (inter-rater reliability!). He also stored his data in a Google sheet for convenience and ease of tracking via cell phone.

u/zonination's "Got ticked off about skittles posts, so I decided to make a proper analysis for /r/dataisbeautiful [OC]"

The subreddit r/dataisbeautiful was inundated by folks creating color distributions for bags of candy. And because 1) it is Reddit and 2) stats nerds take joy in silly things, candy graphing got out of hand. See below: https://www.reddit.com/r/dataisbeautiful/comments/5bojxl/oc_the_data_suggests_that_certain_colors_are_not/ https://www.reddit.com/r/dataisbeautiful/comments/5bmo3a/color_distribution_of_one_more_partysized_bag_of/ https://www.reddit.com/r/dataisbeautiful/comments/5cmemr/a_pie_chart_of_mm_colors_from_a_single_500g_bag_oc/ And because it is Reddit, and because, to be fair, the data were statistically unreliable, other posters would claim that this data WASN'T beautiful because it was a small sample size and didn't generalize. One bag of Skittles, they claimed, didn't tell you a lot about the underlying population of Skittles. Until Redditor zonination came along, bought 35 enormous bags of Skittles, and meticulously documented the color distribution in each ...

Matt, Rali & Rhonda's Statistical Test Flowchart.

Take a look at this interactive, statistical decision-making flow chart. I think that almost every statistics text includes a flow chart, but the interactive piece of this, and its ability to immediately provide the reader with information on the appropriate analysis AND software assistance, is something your students can't get from paper versions of the same. The flow chart is based on Andy Field's work. I discovered this tool via Reddit. I'm including that Reddit thread because the person who created the thread (commentor4) states that they also created the flow chart. So, you are led through a series of questions (read this from the bottom up). After you provide the necessary information, the page provides you with a quick definition of the test you should conduct as well as links to instruction using popular statistical packages.

u/dat data's "Why medians > averages [OC] "

Unsettling. But I bet your students won't forget this example of why the mean isn't always the best measure of central tendency. While the reddit user labeled this as an example of the median's superiority, you could also use this as an example of when the mode is useful. As statisticians, we often fall back on the mode when we have categories and the median when we have outliers, but sometimes either median or mode can be useful when decimal points don't make a lot of sense. Here is the image and commentary from reddit. And this is an IG posting about the data from the same user, Mona Chalabi from fivethirtyeight. I included the Instagram because Chalabi expands a bit more upon the original data she used. https://www.instagram.com/p/BIVKJrcgW51/

"Guess the Correlation" game

Found this gem, "Guess the Correlation", via the subreddit r/statistics. The redditor who posted this resource (ow241) appears to be the creator of the website. Essentially, you view different scatter plots and try to guess r. Points are awarded or taken away based on how close you are to true r. The game tallies your average amount of error as well. It is way more addictive than it sounds. I think that accuracy increases with time and experience. True r for this one was .49. I guessed .43, which isn't so bad. I think this is a good way for statistics instructors to procrastinate. I think it is also a good way to help your students build a more intuitive ability to read scatter plots and predict the strength of linear relationships.
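If you want students to check their guesses by hand, here is a short Python sketch that computes Pearson's r for a scatter plot and the average absolute error across guesses. The points are invented, and the error function is only my guess at the kind of tally the game keeps, not the site's actual scoring code.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation from sums of squares and cross-products."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def mean_abs_error(guesses, truths):
    """Average distance between guessed and true r values."""
    return sum(abs(g - t) for g, t in zip(guesses, truths)) / len(guesses)

# Hypothetical scatter plot points (made up for illustration)
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]
r = pearson_r(xs, ys)
print(round(r, 2))

# One round from the post: guessed .43 when true r was .49
print(round(mean_abs_error([0.43], [0.49]), 2))
```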

u/faux_pseudo's "Distribution of particles by size from a Cracker Jack box"

I love my fellow Reddit data geeks over at r/dataisbeautiful. Redditor faux_pseudo created a frequency chart of the deliciousness found in a box of Cracker Jacks. I think it would be funny to ask students to discuss why this graph is misleading (since the units are of different sizes and the popcorn is divided into three columns). You could also discuss why a relative frequency chart might provide a better description. Finally, you could also replicate this in class with Cracker Jacks (one box is an insufficient n-size, after all) or try it using individual servings of Trail Mix or Chex Mix in order to recreate this with a smaller, more manageable sample size. Also, as always, Reddit delivers in the Comments section:
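For the relative frequency idea, here is a small Python sketch that first merges the split popcorn columns into one category and then converts counts to proportions. The category names and counts are made up for illustration and are not faux_pseudo's numbers.

```python
def merge_categories(counts, mapping):
    """Combine raw categories into broader ones (e.g., all popcorn sizes)."""
    merged = {}
    for name, count in counts.items():
        label = mapping.get(name, name)   # unmapped categories keep their name
        merged[label] = merged.get(label, 0) + count
    return merged

def relative_frequencies(counts):
    """Convert raw counts to proportions of the total."""
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}

# Hypothetical counts from one box (invented values)
raw_counts = {
    "popcorn (large)": 40,
    "popcorn (medium)": 50,
    "popcorn (small)": 30,
    "peanuts": 24,
    "unpopped kernels": 6,
}
popcorn_map = {
    "popcorn (large)": "popcorn",
    "popcorn (medium)": "popcorn",
    "popcorn (small)": "popcorn",
}

merged = merge_categories(raw_counts, popcorn_map)
proportions = relative_frequencies(merged)
print(merged)
print(proportions)
```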

An example of when the median is more useful than the mean. Also, Bill Gates.

From Reddit's Instagram...the comments section demonstrates some heart-warming statistical literacy.
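A quick way to demonstrate the point in class is to compute both statistics before and after one extreme value joins the sample. This Python sketch uses invented incomes and a round billion-dollar figure as a stand-in for Bill Gates, so none of the numbers are real.

```python
from statistics import mean, median

# Hypothetical annual incomes for five people in a room (made-up values)
incomes = [30_000, 35_000, 40_000, 45_000, 50_000]
mean_before, median_before = mean(incomes), median(incomes)

# Add one extreme outlier (an illustrative figure, not a real income)
with_outlier = incomes + [1_000_000_000]
mean_after, median_after = mean(with_outlier), median(with_outlier)

print("before:", mean_before, median_before)
print("after: ", mean_after, median_after)
```

The mean jumps by orders of magnitude while the median barely moves, which is the whole argument for the median when outliers are present.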

u/rustid's "What type of Reese's has the most peanut butter?"

Rustid, a redditor, performed a research study in order to determine the proportions of peanut butter contained in different types of Reese's Peanut Butter candies. For your perusal, here is the original reddit thread (careful about sharing this with students, there is a lot of talk about how the scales Rustid used are popular with drug dealers), photo documentation via Imgur, and a Buzzfeed article about the experiment. Rustid documented the process by which he carefully extracted and measured the peanut butter content of nine different varieties of Reese's peanut butter and chocolate candies. See below for an illustration of how he extracted the peanut butter with an X-Acto knife and used electronic scales for measurements. http://imgur.com/a/wN6PH#SUhYBPx Below is a graph of the various proportions of peanut butter contained within each version of the Reese's Peanut Butter Cup. http://imgur.com/a/wN6PH#SUhYBPx This example...

Reddit for Statistics Class

I love reddit. I really love the sub-reddit r/dataisbeautiful. Various redditors contribute interesting graphs and charts from all over the interwebz. I leave you to figure out how to use these data visualizations in class. If nothing else, they are highly interesting examples of a wide variety of different graphing techniques applicable to different sorts of data sets. In addition to interesting data visualizations, there are usually good discussions (yes, good discussion on the internet!) among redditors about what is driving the presented findings. Another facet of these posts is the sources of the data. There are many examples using archival data, like this chart that used social media to estimate sports franchise popularity. Users also share interesting data from more traditional sources, like APA data on the rates of Masters/Doctorates awarded over time and user rating data generated by IMDB ( here, look at the gender/age bias in ratings of the movie Fifty Shades of Gr...

minimaxir's "Distribution of Yelp ratings for businesses, by business category"

Yelp distribution visualization, posted by redditor minimaxir. This data distribution example comes from the subreddit r/dataisbeautiful (more on what a reddit is here). This specific posting (started by minimaxir) was prompted by several histograms illustrating customer ratings for various Yelp (customer review website) business categories as well as the lively reddit discussion in which users attempt to explain why different categories of services have such different distribution shapes and means. At a basic level, you can use this data to illustrate skew, histograms, and normal distribution. As a more advanced critical thinking activity, you could challenge your students to think of reasons that some data, like auto repair, is skewed. From a psychometric or industrial/organizational psychology perspective, you could describe how customers use rating scales and whether or not people really understand what average is when providing customer feedba...