Posts

Showing posts with the label data mining

NYT's "Is It Safer to Visit a Coffee Shop or a Gym?"

Katherine Baicker, Oeindrila Dube, Sendhil Mullainathan, Devin Pope, and Gus Wezerek created an interactive, data-driven piece for NYT. It provides a new perspective on how we should proceed with re-opening businesses during the COVID-19 pandemic. They argue that we must consider 1) how long people linger in different types of stores, 2) how often they visit these stores, 3) the square footage of the stores, and 4) the amount of human interaction/surface contact associated with how we shop at different stores. How to use this in class: 1) Show your students how data can inform real-life problems. Or crises, like how to safely re-open stores during COVID-19. 2) Show your students how data can be used in creative ways to solve problems. The present argument uses cellphone location data. 3) Show your students data viz in real life: Here, scatterplots that really improve the #scicomm potential of this piece. 4) Show your students the rese...

Incorporating Hamilton: An American Musical into your stats class.

While I was attending the Teaching Institute at APS, I attended Wind Goodfriend's talk about using case studies in the classroom. Which got me thinking about fun case studies for statistics. But not, like, the classic story about Guinness Brewery and the t-test. I want case studies that feature a regular person in a regular job who used their personal expertise to draw inferences from data and do something great. An example popped into my head while I was walking my dog and listening to the Hamilton soundtrack: Hercules Mulligan. [Image: Okieriete Onaodowan, portraying Hercules Mulligan in Hamilton] He was a spy for America during the American Revolution. He was a tailor and did a lot of work for British military officers. This gave him access to data that he shared through a spy network to infer the timing of British military operations. Here is a better summary, from the CIA: I like this example because he wasn't George Washington. And he wasn't Alexander Hamilton. He had t...

Daniel's "Where Slang Comes From"

I think that language is fascinating. Back when I taught developmental, I always liked to teach how babies learn to talk in sort of the same way all across the world. I like regional differences in American English (for example, swearing and regional colloquialisms). So, I really like this research that investigates the rise and fall of slang in America. And I think it could be used in a statistics class. How to use in class? 1. Funny list of descriptive statistics. 2. Research methodology for using Google searches to answer a question. A good opening for discussion of archival data, data mining, and creating inclusion criteria for research methodology. 3. Using graphs to illustrate trends across time. This feature is interactive. 4. Further interactive features demonstrating how heat maps can be used to demonstrate state-by-state popularity over time. Here, "dank memes" peaked in April 2016 in Montana. 5. The author eye-balled the data and came up ...

Ben Schmidt's Gendered Language in Teacher Reviews

'Tis the season for end-of-semester teaching evaluations. And Ben Schmidt has created an interactive tool that demonstrates gender differences in these evaluations. Enter a word, and Schmidt's tool returns how frequently the word is used in Rate My Professors evaluations, divided up by gender and academic discipline. Spoiler alert: Men get higher ratings for most positive attributes! ...while women get higher ratings for negative attributes. Out of class, you can use this example to feel sad, especially if you are a female professor and up for tenure. In class, this leads to obvious discussions about gender and perception and interpersonal judgments. You can also use it to discuss why the x- and y-axes were chosen. You can discuss the archival data analysis used to generate these charts. You can discuss data mining. You can discuss content analysis. You can also discuss between-group differences (gender) versus within-group differences (acade...

Shameless Self Promotion

Check out my recent publication in Teaching of Psychology. Whomp, whomp!

Davies' "Ted Cruz using firm that harvested data on millions of unwitting Facebook users"

So, this is a story of data mining and Mechanical Turk and data privacy and political campaigns. Lots of good stuff for class discussion about data privacy, applied use of data, etc. It won't exactly teach your students how to ANOVA, but it is a good and timely discussion piece. Short version of the story: Ted Cruz's campaign hired a consulting firm (Strategic Communications Laboratories, SCL) to gather information about potential voters. They did so by using Amazon's Mechanical Turk to recruit participants. Participants were asked to complete a survey that would give SCL access to their Facebook accounts. SCL would then download all visible user information from each participant. And then they would download the same information FROM ALL OF THEIR FRIENDS, who did not consent to be involved in the study. Some mTurk users claim this was a violation of Amazon's Terms of Service. This data was then used to create psychological profiles for campaigning purposes. Discussion pieces: ...

Aarti Shahani's "How will the next president protect our digital lives?"

I think that it is so, so important to introduce statistics students to the big picture of how data is used in their everyday lives. Even with all of the material that we are charged with covering in introduction to statistics, I think it is still important to touch on topics like Big Data and Data Mining in order to emphasize to our students how ubiquitous statistics are in our lives. In my honors section, I assign multiple readings (news stories, TED talks, NPR stories) prior to a day of discussion devoted to this topic. In my non-honors sections of statistics and my online sections, I've used electronic discussion boards to introduce the topic via news stories. I also have a manuscript in press that describes a way to introduce very basic data mining techniques in the Introduction to Statistics classroom. That's why I think this NPR news story is worth sharing. Shahani describes and provides data (from Pew) to argue that Americans are worried about the security of...

Izadi's "Tweets can better predict heart disease rates than income, smoking and diabetes, study finds"

Elahe Izadi, writing for the Washington Post, did a report on this article by Eichstaedt et al. (2015). The original research analyzed tweet content for hostility and noted the location of the tweet. Data analysis found a positive correlation between regions with lots of angry tweets and the likelihood of dying from a heart attack. The authors of the study note that the median age of Twitter users is below that of the general population in the United States. Additionally, they did not use a within-subject research design. Instead, they argue that patterns of hostility in tweets reflect the underlying hostility of a given region. An excellent example of data mining, health psychology, aggression, research design, etc. Also, another example of using Twitter, specifically, in order to engage in public health research (see this previous post detailing efforts to use Twitter to close down unsafe restaurants).

Facebook Data Science's "What are we most thankful for?"

Recently, a Facebook craze asked users to list three things they are thankful for, every day for five days. Data scientists Winter Mason, Funda Kivran-Swaine, Moira Burke, and Lada Adamic at Facebook have analyzed this data to better understand the patterns of gratitude publicly shared by Facebook users. The data analysts broke down data by most frequently listed gratitude topic: Most frequently "liked" gratitude posts: (lots of support for our friends in recovery, which is nice to see). Gender differences in gratitude...here is data for women. The wine gratitude finding for women was not present in the data for men. Ha. Idiosyncratic data by state. I would say that Pennsylvania's fondness for country music rings true for me. How to use in class: This example provides several interesting, easy to read graphs, and the graphs show how researchers can break down a single data set in a variety of interesting ways (by gender, by age, by state). Add...

Diane Fine Maron's "Tweets identify food poisoning outbreaks"

This Scientific American podcast by Diane Fine Maron describes how the Chicago Department of Public Health (CDPH) used Twitter data to shut down restaurants with health code violations. Essentially, the CDPH monitored tweets in Chicago, searching for the words "food poisoning". When such a tweet was identified, an official at CDPH messaged the Twitterer in question with a link to an official complaint form. The results of this program? "During a 10-month stretch last year, staff members at the health agency responded to 270 tweets about “food poisoning.” Based on those tweets, 193 complaints were filed and 133 restaurants in the city were inspected. Twenty-one were closed down and another 33 were forced to fix health violations. That’s according to a study in the journal Morbidity and Mortality Weekly Report. [Jenine K. Harris et al, Health Department Use of Social Media to Identify Foodborne Illness — Chicago, Illinois, 2013–2014]" I think this is ...
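If you want students to do something with the numbers in that quote, here is a minimal Python sketch that turns the reported counts into stage-to-stage proportions. The counts come straight from the quoted summary; the function name `step_rates` and the idea of treating the counts as a "funnel" are my own illustrative assumptions, and because the stages count different things (tweets, complaints, restaurants), the ratios are descriptive rather than true conversion rates.

```python
# Counts quoted in the Scientific American summary above.
funnel = [
    ("tweets responded to", 270),
    ("complaints filed", 193),
    ("restaurants inspected", 133),
]
closed = 21          # restaurants closed down
forced_to_fix = 33   # restaurants forced to fix violations


def step_rates(stages):
    """Return (label, count, count / previous stage's count) for each stage after the first."""
    rates = []
    for (_, prev_count), (label, count) in zip(stages, stages[1:]):
        rates.append((label, count, count / prev_count))
    return rates


rates = step_rates(funnel)
inspected = funnel[-1][1]
# Share of inspected restaurants that were closed or forced to fix violations.
action_rate = (closed + forced_to_fix) / inspected

for label, count, share in rates:
    print(f"{label}: {count} ({share:.1%} relative to previous stage)")
print(f"closed or forced to fix: {closed + forced_to_fix} ({action_rate:.1%} of inspections)")
```

A possible discussion prompt: which of these ratios would students be comfortable calling a "rate", and which denominators are questionable?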

Nell Greenfieldboyce's "Big Data peeks at your medical records to find drug problems"

NPR's Nell Greenfieldboyce (I know, I thought it would be hyphenated as well) reports on Mini-Sentinel, an effort by the government to detect adverse side effects associated with prescription drugs as quickly as possible. Specifically, instead of waiting for doctors to voluntarily report adverse effects, they are mining data from insurance companies in order to detect side effects and illnesses being experienced by people on prescription drugs. Topics covered by this story that may apply to your teaching: 1) Big data 2) Big data solving health problems 3) Data and privacy issues 4) Conflict of interest 5) An example of the federal government pouring lots of money into statistics to make the world a little safer 6) An example of data and statistics being used in not-explicitly-statsy fields and occupations

Jess Hartnett's presentation at the 2014 APS Teaching Institute

Hi! Here is my presentation from APS. I am posting it so that attendees and everyone else can have access to the links and examples I used. If you weren't there for the presentation, a warning: It is text-light, so there isn't much of a narrative to follow, but there are plenty of links and ideas and some soon-to-be-published research ideas to explore. Shoot me an email (hartnett004@gannon.edu) if you have any questions. ALSO: In the talk I reference the U.S. Supreme Court case Hall v. Florida (I also did a blog entry about this case). Update: The court decided in favor of Hall/seemed to understand standard error/made it a bit harder to carry out the death penalty, as discussed here by Slate. Woot woot!

Washington Post's "What your beer says about your politics"

Robinson & Feltus, 2014. There appears to be a connection between political affiliation, likelihood to vote, and preferred adult beverage. If you lean right and drink Cabernet Sauvignon, you are more likely to vote than someone who enjoys "any malt liquor" and leans left. This Washington Post story summarizes data analysis performed by National Media Research Planning and Placement. NMRPP got their data from market research firm Scarborough. There is also a video embedded in the Washington Post story that summarizes the main findings. I think this is a good example of illustrating data as well as data mining pre-existing data sets for interesting trends. And beer.

The Atlantic's "Congratulations, Ohio! You Are the Sweariest State in the Union"

While it isn't hypothesis-driven research, this data was collected to see which states are the sweariest. The data collection itself is interesting and a good, teachable example. First, the article describes previous research that looked at swearing by state (typically, using publicly available data via Twitter or Facebook). Then, they describe the data collection used for the current research: "A new map, though, takes a more complicated approach. Instead of using text, it uses data gathered from ... phone calls. You know how, when you call a customer service rep for your ISP or your bank or what have you, you're informed that your call will be recorded? Marchex Institute, the data and research arm of the ad firm Marchex, got ahold of the data that resulted from some recordings, examining more than 600,000 phone calls from the past 12 months—calls placed by consumers to businesses across 30 different industries. It then used call mining technology to isola...

Burr Settles's "On “Geek” Versus “Nerd”"

Settles decided to investigate the difference between being a nerd and being a geek via a pointwise mutual information (PMI) analysis, using archival data from Twitter. Specifically, he measured the association/closeness between various words and hashtag descriptors (see below) and the words nerd and geek. Settles provides a nice description of his data collection and analysis on his blog. A good example of archival data use as well as PMI.
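For students who want to see what this kind of association score looks like in practice, here is a minimal Python sketch of pointwise mutual information (PMI), the measure usually meant by this sort of analysis. It assumes the textbook definition, PMI(x, y) = log2(p(x, y) / (p(x) p(y))), estimated from raw co-occurrence counts. The function name `pmi` and every count below are hypothetical and invented for illustration; they are not Settles's data, and his actual pipeline may differ.

```python
import math


def pmi(joint_count, x_count, y_count, total):
    """Pointwise mutual information (in bits) estimated from raw counts.

    joint_count: documents containing both x and y
    x_count, y_count: documents containing x (resp. y)
    total: total number of documents
    """
    p_xy = joint_count / total
    p_x = x_count / total
    p_y = y_count / total
    return math.log2(p_xy / (p_x * p_y))


# Hypothetical counts, made up purely for illustration.
total_tweets = 1000   # tweets in the toy corpus
count_geek = 100      # tweets containing "geek"
count_gadget = 50     # tweets containing "#gadget"
count_both = 20       # tweets containing both

score = pmi(count_both, count_gadget, count_geek, total_tweets)
print(f"PMI(#gadget, geek) = {score:.2f} bits")
```

A positive score means the two terms show up together more often than you would expect if they were independent, a score near zero means roughly what independence predicts, and a negative score means they co-occur less often than expected. Computing the score for the same word against both "geek" and "nerd" is one way to place that word closer to one label than the other.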

io9's "New statistics on lightning deaths in the U.S. reveal weird patterns"

According to this data from the National Weather Service, lightning is a big, man-hating jerk! [Image from NWS/NOAA] And Mighty Thor lives to be your weekend's buzz kill! Or not. Play "Spot the Third Variable" with your students.

University of Cambridge's Facebook Research

University of Cambridge's Psychometrics Centre has used statistics to make personality predictions based upon an individual's Facebook "likes". For instance, your likes can be used to create your Big Five personality trait profile. Your students can have their FB "likes" analyzed at YouAreWhatYouLike.com to determine their Big Five traits. After your students complete the FB version of the scale, you could have your students complete a more traditional paper and pencil version of the inventory and discuss differences/similarities/concurrent validity between the two measures. Below, I've included a screen grab of my FB-derived Big Five rating from YouAreWhatYouLike.com. Note: Yes, that is how I score on more traditional versions of the same scale. [Image: generated at YouAreWhatYouLike.com] In addition to Big Five prediction, the researchers also used the "like" data to make predictions of other qualities, like sexual orientatio...

CNN's History of the Super Bowl: By the numbers

Seems appropriate. I like football, but I LOVE data. For a better look in case you don't have bionic eyes or a magnifying glass next to your screen, check out CNN for the original graphics. Given the trends within the point spread, it looks like the games are becoming more competitive over time. And the linebackers are getting scarier over time.