
Showing posts with the label archival data

Harry Enten's "Has the snow finally stopped?"

This article and figure from Harry Enten (reporting for FiveThirtyEight) provide informative and horrifying data on the median last day of measurable snow in different American cities. (Personally, I find it horrifying because my median last day of measurable snow isn't until early April.) This article provides easy-to-understand examples of the median, percentiles, the interquartile range, and the use of archival data. Portland and Dallas can go suck an egg.
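Those summary statistics are easy to demo in class. Here's a minimal sketch using made-up last-snow dates expressed as day-of-year numbers (invented for illustration, not Enten's actual data):

```python
import statistics

# Hypothetical "last day of measurable snow" values, as day-of-year numbers
# (invented for illustration -- not FiveThirtyEight's data).
last_snow = [60, 68, 75, 80, 85, 90, 95, 100, 104, 110, 118]

median = statistics.median(last_snow)              # the 50th percentile
q1, q2, q3 = statistics.quantiles(last_snow, n=4)  # quartile cut points
iqr = q3 - q1                                      # interquartile range

print(f"median day-of-year: {median}")
print(f"IQR: {iqr} days (Q1={q1}, Q3={q3})")
```

Students can then map the day-of-year numbers back to calendar dates and argue about whose city is most horrifying.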

Chemi & Giorgi's "The Pay-for-Performance Myth"

UPDATE: The link listed below is currently not working. I've talked to Ariana Giorgi about this, and she is working to get her graph up and running again via Bloomberg. She was kind enough to provide me with alternate URLs to the interactive scatter plot as well as a link to the original text of the story. Ariana is doing a lot of interesting work with data visualizations; follow her on Twitter or hit up her website. _______________________________________________________________________________ This scatter plot (and accompanying news story from Bloomberg News) demonstrates what a non-existent linear relationship looks like. The data plots CEO pay on the x-axis and stock market return for that CEO's organization on the y-axis. I could see where this graph would also be useful in an I/O course in discussions of (wildly unfair) compensation, organizational justice, etc. http://www.bloomberg.com/bw/articles/2014-07-22/for-ceos-correlation...

Dina Fine Maron's "Tweets identify food poisoning outbreaks"

This Scientific American podcast by Dina Fine Maron describes how the Chicago Department of Public Health (CDPH) used Twitter data to shut down restaurants with health code violations. Essentially, the CDPH monitored tweets in Chicago, searching for the words "food poisoning". When such a tweet was identified, an official at CDPH messaged the Twitterer in question with a link to an official complaint form website. The results of this program? "During a 10-month stretch last year, staff members at the health agency responded to 270 tweets about “food poisoning.” Based on those tweets, 193 complaints were filed and 133 restaurants in the city were inspected. Twenty-one were closed down and another 33 were forced to fix health violations. That’s according to a study in the journal Morbidity and Mortality Weekly Report. [Jenine K. Harris et al, Health Department Use of Social Media to Identify Foodborne Illness — Chicago, Illinois, 2013–2014]" I think this is ...

Nell Greenfieldboyce's "Big Data peeks at your medical records to find drug problems"

NPR's Nell Greenfieldboyce (I know, I thought it would be hyphenated as well) reports on Mini-Sentinel, an effort by the government to detect adverse side effects associated with prescription drugs as quickly as possible. Specifically, instead of waiting for doctors to voluntarily report adverse effects, they are mining data from insurance companies in order to detect side effects and illnesses being experienced by people on prescription drugs. Topics covered by this story that may apply to your teaching: 1) big data, 2) big data solving health problems, 3) data and privacy issues, 4) conflict of interest, 5) an example of the federal government pouring lots of money into statistics to make the world a little safer, and 6) an example of data and statistics being used in not-explicitly-statsy fields and occupations.

Quoctrung Bui's "Who's in the office? The American workday in one graph"

Credit: Quoctrung Bui/NPR Bui, reporting for NPR, shares interactive graphs that demonstrate when people in different career fields are at the office. Via drop-down menus, you can compare the standard workdays of a variety of different fields (here, "Food Preparation and Serving" versus "All Jobs"). If you scoff at pretty visualizations and want to sink your teeth into the data yourself, may I suggest the original government report, the "American Time Use Survey," or a related publication by Kawaguchi, Lee, & Hamermesh, 2013. Demonstrates: bimodal data, data distribution, variability, work-life balance, different work shifts.

So I wrote a book: Shameless self-promotion 4

When I'm not busy thinking about statistics and research methods, I like to think about positive psychology. I like to think about it so much that I co-authored a positive psychology book with Rich Walker (Winston-Salem State University) and Cory Scherer (Penn State - Schuylkill). The book is called Pollyanna's Revenge and is published by Kendall-Hunt. The book makes the case that (contrary to many pop-psych reports) there are many good side effects to being a Pollyanna, and that our minds engage in all manner of non-conscious processes that help us maintain positive affect (with special attention paid to the role of the Fading Affect Bias and memory in maintaining good moods). As I am wont to do, I have started a blog and Twitter for the book. This week's posting, all about positive psychology data repositories (with plenty of downloadable data that can be used in the classroom, cha-ching), can be found at the Pollyanna's Revenge blog. Cross...

minimaxir's "Distribution of Yelp ratings for businesses, by business category"

Yelp distribution visualization, posted by redditor minimaxir This data distribution example comes from the subreddit r/dataisbeautiful (more on what a reddit is here). This specific posting (started by minimaxir) was prompted by several histograms illustrating customer ratings for various Yelp (customer review website) business categories, as well as the lively reddit discussion in which users attempt to explain why different categories of services have such different distribution shapes and means. At a basic level, you can use this data to illustrate skew, histograms, and the normal distribution. As a more advanced critical thinking activity, you could challenge your students to think of reasons that some data, like auto repair ratings, are skewed. From a psychometric or industrial/organizational psychology perspective, you could describe how customers use rating scales and whether or not people really understand what average is when providing customer feedba...

Patti Neighmond's "What is making us fat: Is it too much food or moving too little?"

This NPR story by Patti Neighmond is about determining the underlying cause of the U.S. obesity epidemic. As the name of the segment states, it seems to come down to food consumption and exercise, but which is the culprit? This is a good example for research methods because it describes methodology for examining both sides of this question. The methodology used also provides good examples of archival data usage.

Five Lab's Big Five Personality Predictor

Five Labs created an app to predict your score on the Big Five by analyzing your FB status updates. Five Labs' prediction via status update It might be fun to have students use this app to measure their Big Five and then compare those findings to the youarewhatyoulike.com app (which I previously discussed on this blog), which predicts your scores on the Big Five based on what you "Like" on FB. youarewhatyoulike.com's prediction via "Likes" As you can see, my "Likes" indicate that I am calm and relaxed, but I am a neurotic status updater (crap...I'm that guy!). By contrasting the two, you could discuss reliability, validity, how such results are affected by social desirability, etc. Furthermore, you could also have your students take the original scale and see how it stacks up to the two FB measures. Note: If you ask your students to do this, they will have to give these apps access to a bunch of their personal informat...

Nate Silver and Allison McCann's "How to Tell Someone’s Age When All You Know Is Her Name"

Nate Silver and Allison McCann (reporting for FiveThirtyEight) created graphs displaying baby name popularity over time. The data and graphs can be used to illustrate bimodality, variability, medians, interquartile range, and percentiles. For example, the pattern of popularity for the name Violet illustrates bimodality and shows why measures of central tendency are incomplete descriptors of data sets: "Other names have unusual distributions. What if you know a woman — or a girl — named Violet? The median living Violet is 47 years old. However, you’d be mistaken in assuming that a given Violet is middle-aged. Instead, a quarter of Violets are older than 78, while another quarter are younger than 4. Only about 4 percent of Violets are within five years of 47." Relatedly, bimodality (resulting from the current trend of giving classic, old-lady names to baby girls) can result in massive variability for some names... ...versus trendy baby names th...
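You can make the "median Violet" point concrete with a toy bimodal data set. A quick sketch with invented ages (a cluster of little girls plus a cluster of older women, not Silver & McCann's actual data):

```python
import statistics

# Invented bimodal "ages of living Violets": little girls plus older women
# (illustrative numbers only -- not Silver & McCann's actual data).
violets = [1, 2, 3, 4, 5, 6, 78, 80, 82, 84, 86, 88]

med = statistics.median(violets)
print(f"median age: {med}")  # looks middle-aged...

# ...yet not a single Violet is anywhere near the median:
within_5 = sum(1 for a in violets if abs(a - med) <= 5)
print(f"within 5 years of the median: {within_5} of {len(violets)}")
```

The median lands squarely in the valley between the two modes, which is exactly why it's an incomplete descriptor here.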

Priceonomic's Hipster Music Index

This tongue-in-cheek regression analysis found a way to predict the "Hipster Music Index" of a given artist by plotting the number of Facebook shares of said artist's Pitchfork magazine review on the y-axis and the Pitchfork review score on the x-axis. If an artist falls above the linear regression line, they aren't "hipster". If they fall below the line, they are. For example, Kanye West is a Pitchfork darling but also widely shared on FB, and thus demonstrates too much popular appeal to be a hipster darling (as opposed to Sun Kil Moon, who is beloved by Pitchfork but not overly shared on FB). As instructors, we typically talk about the regression line as an equation for prediction, but Priceonomics uses the line in a slightly different way in order to make predictions. Also, if you go to the source article, there are tables displaying the difference between the predicted Y-value (FB Likes) for a given artist versus the actual Y-value, which coul...
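If you want students to replicate the above/below-the-line trick themselves, here's a sketch with invented artist names, scores, and share counts (not Priceonomics' actual numbers):

```python
# Hand-rolled version of the "above/below the regression line" trick, with
# invented (review score, FB shares) data -- not Priceonomics' numbers.
artists = {"Artist A": (9.0, 9000), "Artist B": (8.5, 2000),
           "Artist C": (7.0, 4000), "Artist D": (6.0, 1000)}

xs = [score for score, _ in artists.values()]
ys = [shares for _, shares in artists.values()]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

# Ordinary least-squares slope and intercept
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
intercept = my - slope * mx

for name, (score, shares) in artists.items():
    residual = shares - (intercept + slope * score)  # actual minus predicted
    label = "too popular to be hipster" if residual > 0 else "hipster"
    print(f"{name}: residual {residual:+.0f} -> {label}")
```

The sign of each residual, rather than the predicted value itself, is what does the classifying, which is the "slightly different" use of the regression line.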

Tyler Vigen's Spurious Correlations

Tyler Vigen has created a long list of easy-to-paste-into-a-PowerPoint graphs that illustrate that correlation does not equal causation. For instance, while per capita consumption of cheese and the number of people who die by becoming tangled in their bed sheets may have a strong relationship (r = 0.947091), no one is saying that cheese consumption leads to bed sheet-related death. (Although you could pose The Third Variable question to your students for some of these relationships.) Property of Tyler Vigen, http://i.imgur.com/OfQYQW8.png Vigen has also provided a menu of frequently used variables (deaths by tripping, sunlight by state) to help you look for specific examples. This portion is interactive, as you and your students can generate your own graphs. Below, I generated a graph of marriage rates in Pennsylvania and consumption of high fructose corn syrup. Generated at http://www.tylervigen.com/

Kevin Wu's Graph TV

UPDATE: This website is not currently available. Kevin Wu's Graph TV uses individual episode ratings (archival data via IMDB) of TV shows, graphs each episode over the course of a series via scatter plot, and generates a regression line. This demonstrates fun with archival data as well as regression lines and scatter plots. You could also discuss sampling, in that these ratings were provided by IMDB users and, presumably, big fans of the shows (and whether or not this constitutes representative sampling). The saddest little purple dot is the episode Black Market. Truth!

Washington Post's "What your beer says about your politics"

Robinson & Feltus, 2014 There appears to be a connection between political affiliation, likelihood to vote, and preferred adult beverage. If you lean right and drink Cabernet Sauvignon, you are more likely to vote than someone who enjoys "any malt liquor" and leans left. This Washington Post story summarizes data analysis performed by the National Media Research Planning and Placement. NMRPP got their data from the market research firm Scarborough. There is also a video embedded in the Washington Post story that summarizes the main findings. I think this is a good example of illustrating data as well as mining pre-existing data sets for interesting trends. And beer.

The Atlantic's "Congratulations, Ohio! You Are the Sweariest State in the Union"

While it isn't hypothesis-driven research data, this data was collected to see which states are the sweariest. The data collection itself is interesting and a good, teachable example. First, the article describes previous research that looked at swearing by state (typically using publicly available data via Twitter or Facebook). Then, they describe the data collection used for the current research: "A new map, though, takes a more complicated approach. Instead of using text, it uses data gathered from ... phone calls. You know how, when you call a customer service rep for your ISP or your bank or what have you, you're informed that your call will be recorded? Marchex Institute, the data and research arm of the ad firm Marchex, got ahold of the data that resulted from some recordings, examining more than 600,000 phone calls from the past 12 months—calls placed by consumers to businesses across 30 different industries. It then used call mining technology to isola...

Burr Settles's "On “Geek” Versus “Nerd”"

Settles decided to investigate the difference between being a nerd and being a geek via a pointwise mutual information (PMI) analysis (using archival data from Twitter). Specifically, he measured the association/closeness between various hashtag descriptors (see below) and the words nerd and geek. Settles provides a nice description of his data collection and analysis on his blog. A good example of archival data use as well as PMI.
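The core computation is simple enough to show in class. A toy pointwise-mutual-information sketch with invented tweet counts (not Settles's corpus or results):

```python
import math

# Toy pointwise mutual information (PMI): how much more often a hashtag
# co-occurs with "geek" than chance predicts. All counts below are invented.
total_tweets = 100_000
count_geek = 5_000          # tweets containing "geek"
count_tag = 2_000           # tweets containing the hashtag
count_both = 800            # tweets containing both

p_geek = count_geek / total_tweets
p_tag = count_tag / total_tweets
p_both = count_both / total_tweets

# PMI > 0: the pair co-occurs more often than independence would predict
pmi = math.log2(p_both / (p_geek * p_tag))
print(f"PMI = {pmi:.2f}")
```

Computing PMI for the same hashtag against both "geek" and "nerd", then comparing, is essentially Settles's plotting trick.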

Joshua Katz's visualizations of American dialect data (edited 11/30)

I love American dialects. There might be a Starbucks in every city, but our regions are still uniquely identifiable by the way we talk. Joshua Katz (a graduate student in statistics at NCSU) created graphical representations of data from Cambridge that identified dialectal differences in how Americans speak. Here is a story about the maps, and here are the maps themselves. AND: You can even take the Dialect Similarity Quiz, which tells you (via map) what parts of the country tend to have language patterns like your own. I think this demonstrates that 1) graphs are interesting ways of conveying information, 2) data can be used to make predictions (of what portion of the U.S. you hail from), and 3) statisticians and social scientists gather interesting and varied data. Mmmmmmmmmmmmmmmmm...hoagies... Edited to add: The Atlantic has created a video that contains the audio of folks providing examples of their awesome accents whilst completing the original survey.

io9's "New statistics on lightning deaths in the U.S. reveal weird patterns"

According to this data from the National Weather Service, lightning is a big, man-hating jerk! From NWS/NOAA And Mighty Thor lives to be your weekend's buzzkill! Or not. Play "Spot the Third Variable" with your students.

University of Cambridge's Facebook Research

University of Cambridge's Psychometrics Centre has used statistics to make personality predictions based upon an individual's Facebook "likes". For instance, your likes can be used to create your Big Five personality trait profile. Your students can have their FB "likes" analyzed at YouAreWhatYouLike.com to determine their Big Five traits. After your students complete the FB version of the scale, you could have them complete a more traditional paper-and-pencil version of the inventory and discuss differences/similarities/concurrent validity between the two measures. Below, I've included a screen grab of my FB-derived Big Five rating from YouAreWhatYouLike.com. Note: Yes, that is how I score on more traditional versions of the same scale. Generated at YouAreWhatYouLike.com In addition to Big Five prediction, the researchers also used the "like" data to make predictions of other qualities, like sexual orientatio...

Shameless self-promotion

Here is a publication from Teaching of Psychology in which I outline not one, not two, not three, but FOUR free/cheap internet-based activities to be used in statistics/research methods classes. (If you have access to ToP publications, you can also get it here.)