Thursday, July 11, 2019

Interactive NYC commuting data illustrates the sampling distribution of the mean, median

Josh Katz and Kevin Quealy put together a cool interactive website to help users better understand their NYC commute. With the creation of this website, they are also helping statistics instructors illustrate a number of basic statistics lessons.

To use the website, select two stations...


The website returns a bee swarm plot, where each dot represents one day's commuting time over a 16-month sample. 


So, it is handy for NYC commuters, but also for statistics instructors. How to use it in class:

1. Conceptual demonstration of the sampling distribution of the sample mean. To be clear, each dot represents a single day's commute, not the mean of a sample. However, I think this still does a good job of showing how much commute time varies from day to day. The commute can vary wildly depending on the day the data point was collected, but every data point is accurate.

2. Variability. Here, students can see the variability in commuting time. I think this example is especially useful because everyone can relate to having an unexpectedly short or long commute, and the bee swarm plot does a great job of visualizing it.

3. Distribution shapes. You can ask your students to search through different commutes and look for bee swarm plots that are normally distributed, right-skewed, left-skewed, or uniform.

4. Introduce your students to the bee swarm plot. Seriously, guys, stop teaching your students about stem and leaf plots and start teaching your students to interpret increasingly popular charts and graphs that do a great job of representing ALL data points.

5. Percentiles. For each route, you receive "Bad Day" and "Good Day" times. The authors defined the "Bad Days" as: 

1/20, or 5 percent of the time. Sounds like the 95th percentile to me. It would also be worth noting that the writers went with the easier-to-understand 1/20 instead of listing a percentile, probably because they wrote this piece for a popular source.


6. Confidence intervals in action. Note how the time you need to budget for the commute grows as you decrease your tolerance for being late, demonstrating the tension in confidence intervals between accuracy (a better chance of being right) and precision (a narrower estimate).
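
If you want students to play with these ideas away from the website, here is a minimal Python sketch. It does not use the actual commuting data: the commute times are simulated, and the 30-minute baseline, the delay distribution, the 480 days, and the 20-day sample size are all assumptions I made up for illustration. It shows a "bad day" as the 95th percentile, how the time budget grows as the tolerance for lateness shrinks, and how means of samples vary much less than single days do.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical commute times in minutes (NOT the real data):
# a 30-minute baseline plus a right-skewed delay, one value per day.
commutes = [30 + random.expovariate(1 / 5) for _ in range(480)]


def percentile(values, p):
    """Nearest-rank percentile for a whole-number p between 1 and 100."""
    ordered = sorted(values)
    rank = (p * len(ordered) + 99) // 100  # integer ceiling of p% of n
    return ordered[rank - 1]


typical_day = statistics.median(commutes)
bad_day = percentile(commutes, 95)  # only about 1 day in 20 is slower

# The less lateness you tolerate, the more time you have to budget.
budgets = {tol: percentile(commutes, 100 - tol) for tol in (20, 10, 5, 1)}

# Sampling distribution of the mean: average many random 20-day samples.
sample_means = [
    statistics.mean(random.sample(commutes, 20)) for _ in range(1000)
]
spread_of_days = statistics.stdev(commutes)
spread_of_means = statistics.stdev(sample_means)

print(f"typical day: {typical_day:.1f} min, bad day: {bad_day:.1f} min")
for tol, minutes in budgets.items():
    print(f"late at most {tol}% of days: budget {minutes:.1f} min")
print(f"SD of single days: {spread_of_days:.2f}")
print(f"SD of 20-day means: {spread_of_means:.2f}")
```

Students can change the sample size from 20 to 5 or 100 and watch the spread of the sample means widen or shrink while the spread of individual days stays put.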



Saturday, June 29, 2019

Do Americans spend $18K/year on non-essentials?

This is a fine example of using misleading statistics to try to make an argument.

USA Today tweeted out this graphic, related to some data that was collected by some firm.


There are a number of research methodology issues here.

1) False Dichotomy: When we collect data, we need to make sure that our response options are clear and mutually exclusive. I think there are two types of muddled dichotomies with this data:

a) What is "essential"?

When my kids were younger, I had an online subscription order for diapers with Target. Those were absolutely essential, and I received special discounts tied to the subscription. But it was a subscription that originated online, so is it therefore non-essential?

And if you use a ride-sharing app to go to work, or to NOT drive drunk, that IS essential.

b) Many purchases fall into multiple categories. Did the survey creators "double-dip" so as to pad each mean and push the data towards its $18K conclusion?

Were participants clear that "drinks out with friends" and "eating out at restaurants" were two discrete categories? What if I impulsively buy a new curling iron online? Which category does this fall into? Personal grooming, impulse shopping, or online shopping?

2) Data from an established, well-known news source is not perfect data.

This data went viral. Lots of people were exposed to this data.

3) The data assumes that all Americans use all of the products in all of these categories.

Plenty of people don't belong to a gym or ever use a ride-sharing service.

4) Conflict of interest.

The original data was collected in order to make an argument in favor of buying life insurance. Specifically, they were arguing that individuals could afford life insurance if they budgeted better, which is indeed true. However, it is problematic to frame certain expenses as optional when they are not.

5) If a person didn't use one of these services, were their "zeroes" counted towards the mean?
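
To show students why the "zeroes" question matters, here is a small Python sketch. The numbers are entirely made up (ten hypothetical respondents, three of whom pay $50 a month for a gym); nothing here comes from the actual survey. It just compares the mean with and without the people who spend nothing.

```python
import statistics

# Hypothetical monthly gym spending for ten respondents (made-up numbers).
# Seven of the ten have no gym membership at all.
gym_spending = [50, 50, 50, 0, 0, 0, 0, 0, 0, 0]

# Mean across everyone, with the zeroes counted.
mean_all = statistics.mean(gym_spending)

# Mean across only the people who actually spend money on a gym.
mean_users_only = statistics.mean([x for x in gym_spending if x > 0])

print(f"mean with zeroes: ${mean_all}/month")
print(f"mean without zeroes: ${mean_users_only}/month")
```

If a survey reports the users-only mean for every category and then adds the categories together, the total describes a person who buys everything, which ties back to point 3.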

Monday, June 24, 2019

Pew Research's "Gender and Jobs in Online Image Searches"

You know how every few months, someone Tweets about stock photos that are generated when you Google "professor"? And those photos mainly depict white dudes? See below. Say "hi" to former President and former law school professor Obama, coming in at #10, several slots after "novelty kid professor in lab coat".


Well, Pew Research decided to quantify this perennial Tweet, and expand it far beyond academia. They used machine learning to search through over 10K images depicting 105 occupations and test whether or not the images showed gender bias.

How you can use this research in your RM class:

1. There are multiple ways to quantify and operationalize your variables. If you read through the report, you will learn that Pew both a) compared actual gender ratios to the gender ratios they found in the pictures and b) counted how long it took until a search result returned the picture of a woman for a given job.
Quantifying difference by the sheer number of images.
Quantifying the difference by counting how long it takes to find a picture of a woman doing the job.

2. Replication outside of America: This research didn't just look at America but at 18 different countries.


3. Machine learning in research. For more detail on how they trained the machine to identify gender, see their methodology page.



4. Pew used data from the federal government for this project. The Bureau of Labor Statistics provided all of the actual gender breakdown data for occupations.
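
For a quick classroom illustration of the two operationalizations in point 1, here is a toy Python sketch. This is not Pew's code or data: the list of ten search results and the 45 percent workforce share are hypothetical values I invented, and the real study relied on a trained model rather than hand labels.

```python
# Hypothetical, hand-labeled search results for one occupation, in rank order.
results = ["man", "man", "woman", "man", "man",
           "man", "woman", "man", "man", "man"]

# Hypothetical share of women actually working in that occupation.
actual_share_women = 0.45

# Operationalization a): share of women in the images vs. the real share.
image_share_women = results.count("woman") / len(results)
representation_gap = image_share_women - actual_share_women

# Operationalization b): how far down the results the first woman appears.
first_woman_rank = results.index("woman") + 1

print(f"share of women in images: {image_share_women:.0%}")
print(f"gap vs. actual share: {representation_gap:+.0%}")
print(f"first woman appears at result #{first_woman_rank}")
```

Both numbers measure "gender bias in search results," but they can disagree, which makes for a good discussion about choosing a measure.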

As always, Pew provides the full report.