
Posts

Showing posts with the label twitter

Sampling bias example via NASA, Pew Research Center, and Twitter

Today's post is one small, to-the-point example of sampling bias. On May 27, 2020, my family and I were awaiting lift-off for the (subsequently scrubbed) NASA/SpaceX launch. To no one's surprise, I was following NASA on Twitter during the hoopla, and I noticed this Tweet: https://twitter.com/NASA/status/1265724481009594369 And I couldn't help but think: That is some sampling bias. Admittedly, the sample size is very impressive, at over 54K votes. But this poll went out to a bunch of people who love NASA so much that they follow it on Twitter. What does a less biased answer to this question look like? As always, Pew Research Center had my back. 58% of Americans responded that they definitely/probably weren't interested in traveling into space: https://www.pewresearch.org/fact-tank/2018/06/07/space-tourism-majority-of-americans-say-they-wouldnt-be-interested/ If you want to expand upon this example in class, you could ask your students to Google around for information on the ...
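To make the bias concrete for students, here is a minimal simulation sketch. Every number below is invented for illustration: I assume a hypothetical general population in which only ~42% would go to space (loosely echoing the Pew figure) and a hypothetical NASA-follower pool that is far more enthusiastic.

```python
import random

random.seed(42)

# Hypothetical population: assume only ~42% of Americans would want to
# travel to space (roughly the complement of Pew's 58% "no" figure).
population = [1] * 42_000 + [0] * 58_000  # 1 = "yes, I'd go"

# A biased sampling frame: NASA's Twitter followers, assumed here
# (purely for illustration) to be 90% "yes".
nasa_followers = [1] * 90_000 + [0] * 10_000

def poll(frame, n=54_000):
    """Simple random sample of n responses from a sampling frame."""
    return sum(random.sample(frame, n)) / n

print(f"Poll of general population: {poll(population):.1%} would go")
print(f"Poll of NASA followers:     {poll(nasa_followers):.1%} would go")
```

The point for students: a 54K-vote poll drawn from a biased frame is still biased. A big n does not fix a bad frame.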

Daily Cycles in Twitter Content: Psychometric Indicators

Here is a YouTube video that summarizes some research findings. The researchers looked at Tweets in order to study how our focus and emotions change with our sleep/wake cycles. The findings are interesting and not terribly surprising: folks are mellow and rational in the morning and contemplate their mortality at 2 AM. Make money, get paid. And THIS is why I go to bed by 9 PM. I don't need to think about death at 2:20 AM. How to use in class: 1) Archival data (via Tweet) to explore human emotion. 2) What are the shortcomings of this sampling method? To be sure, the data set is ENORMOUS, but how are Twitter users different from other people? Do your students think these findings would hold for people who work the night shift? 3) Go back to the original paper and look more closely at the findings: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0197002 4) This data represents one of the ways that researchers collect real-time information ...
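If you want students to see what this kind of archival analysis looks like mechanically, here is a toy sketch: hypothetical (hour, sentiment) pairs, invented purely for illustration, averaged by hour of posting.

```python
from collections import defaultdict

# Toy archival records: (hour_posted, sentiment_score). The scores are
# made up for illustration (higher = more positive/analytic).
tweets = [
    (8, 0.7), (9, 0.8), (9, 0.6),     # calm, rational mornings
    (14, 0.5), (15, 0.4),
    (2, -0.6), (2, -0.8), (3, -0.5),  # 2 AM existential dread
]

by_hour = defaultdict(list)
for hour, score in tweets:
    by_hour[hour].append(score)

# Average sentiment per hour of the day, sorted by hour
mean_by_hour = {h: sum(s) / len(s) for h, s in sorted(by_hour.items())}
for hour, mean in mean_by_hour.items():
    print(f"{hour:02d}:00  mean sentiment {mean:+.2f}")
```

At scale, the real studies do exactly this kind of aggregation over millions of tweets rather than eight.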

Hedonometer.org

The Hedonometer measures the overall happiness of Tweets on Twitter. It provides a simple, engaging example for Intro Stats since the data is graphed over time, color-coded by day of the week, and interactive. I think it could also be a much deeper example for a Research Methods class, as the "About" section of the website reads like a journal article's methods section, insofar as the Hedonometer creators describe their entire process for rating Tweets. This is what the basic chart looks like. You can drill into the data by picking a year or a day of the week to highlight. You can also use the sliding scale along the bottom to specify a time period. The website is kept very, very up to date, so it is also a topical resource. [Screenshot: Hedonometer data for the white supremacist attack in VA] In the page's "About" section, they address many methodological questions your students might raise about this tool. It is a good example for the process researchers go ...
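The core of the Hedonometer's approach, as described in its "About" section, is averaging word-level happiness ratings over the words in a tweet. Here is a toy sketch of that idea; the lexicon values below are made up for illustration (the real instrument uses a much larger, crowd-rated word list on a 1-9 scale).

```python
# Toy happiness lexicon: word -> rating on a 1-9 scale.
# These particular numbers are invented for illustration.
happiness = {"love": 8.4, "happy": 8.3, "rain": 5.0,
             "traffic": 3.2, "hate": 2.3}

def tweet_happiness(text):
    """Average the happiness ratings of the words we have scores for."""
    scores = [happiness[w] for w in text.lower().split() if w in happiness]
    return sum(scores) / len(scores) if scores else None

print(tweet_happiness("I love this happy day"))  # high score
print(tweet_happiness("I hate traffic"))         # low score
```

Averaging these tweet-level scores by day is what produces the time series on the site.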

Sonnad and Collin's "10,000 words ranked according to their Trumpiness"

I finally have an example of Spearman's rank correlation to share. This is a political example, looking at how Twitter language usage differs across US counties based upon the proportion of votes that Trump received. The example was created by Jack Grieve, a linguist who uses archival Twitter data to study how we speak. Previously, I blogged about his work analyzing which obscenities are used in different zip codes in the US. He created maps of his findings, color-coded by the z-score for the frequency of each word. So: a z-score example. Southerners really like to say "damn". On Twitter, at least. But on to the Spearman example. More recently, he conducted a similar analysis, this time looking for trends in word usage based on the proportion of votes Trump received in each US county. NOTE: The screenshots below don't do justice to the interactive graph. You can cursor over any dot to view the word as well as the cor...
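If you want students to compute Spearman's rho by hand (well, by Python), here is a self-contained sketch: rank both variables, then take the Pearson correlation of the ranks. The county-level numbers below are invented for illustration.

```python
def ranks(xs):
    """Rank values 1..n, averaging ranks for tied values."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank for a tied block
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho = Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical data: Trump vote share by county vs. relative frequency
# of some word in that county's tweets (all numbers invented).
vote_share = [0.30, 0.45, 0.52, 0.61, 0.70, 0.82]
word_freq  = [0.9,  1.1,  1.4,  1.3,  1.8,  2.1]

print(f"rho = {spearman(vote_share, word_freq):.3f}")
```

A nice classroom check: with no ties, this matches the textbook shortcut rho = 1 − 6Σd²/(n(n²−1)).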

Climate change deniers misrepresent data and get called out

Here is another example of how data visualizations can be accurate AND misleading. I Fucking Love Science broke down a brief Twitter war that started after National Review tweeted the following post in order to argue that global climate change isn't a thing. Note: The y-axis ranged from −10 to 110 degrees Fahrenheit. True, such a temperature range is experienced on planet Earth, but using such an axis distracts from the slow, scary march that is global climate change and does a poor job of illustrating how incremental changes in temperature map onto increased use of fossil fuels in an increasingly industrialized world. The Twitter-verse responded thusly:
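You can quantify the distortion for students with one line of arithmetic: ask what fraction of the y-axis the warming trend actually occupies. The warming figure below is approximate and used only for illustration.

```python
# The tweeted chart put roughly a 1.5 degree F rise on a y-axis
# spanning -10 to 110 degrees F. (The 1.5 value is approximate,
# for illustration only.)
rise = 1.5

wide_axis = 110 - (-10)  # 120-degree span, as in the tweeted chart
tight_axis = 5           # e.g., a 5-degree span, as climate charts use

print(f"Share of wide axis:  {rise / wide_axis:.1%}")   # trend looks flat
print(f"Share of tight axis: {rise / tight_axis:.1%}")  # trend is visible
```

Same data, same accuracy; the only thing that changed is how much of the picture the trend is allowed to fill.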

Dayna Evans "Do You Live in a "B@%$#" or a "F*%&" State? American Curses, Mapped"

Warning: This research and story include every paint-peeling obscenity in the book. Use caution when opening these links on your work computer, and think long and hard before providing these links to your students. However, the research I'm about to describe 1) illustrates z-scores and 2) investigated regional usage of safe-for-the-classroom words like darn, damn, and gosh. So a linguist, Dr. Jack Grieve, decided to use Twitter data to map out the use of different obscenities by county across the United States. Gawker picked up on this research and created a story about it. How can this be used in a statistics class? In order to quantify greater or lesser use of different obscenities, he created z-scores by county and illustrated the differences via a color-coding system: the more orange, the higher the z-score for a region (thus, greater usage), while blue indicates lesser usage. And there are three such maps (damn, darn, and gosh) that are safe for us...
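The z-score step is easy to reproduce with toy numbers. Here is a sketch with invented county-level frequencies for one word; the orange/blue line mimics the color-coding idea in the maps.

```python
from statistics import mean, pstdev

# Hypothetical tweets-per-million frequency of "darn" in five counties
# (all numbers invented for illustration).
freq = {"County A": 12.0, "County B": 30.0, "County C": 18.0,
        "County D": 25.0, "County E": 15.0}

mu = mean(freq.values())
sigma = pstdev(freq.values())  # population SD, since these ARE the counties

# z-score: how many SDs each county sits above/below the mean
z = {county: (f - mu) / sigma for county, f in freq.items()}
for county, score in z.items():
    shade = "orange (above average)" if score > 0 else "blue (below average)"
    print(f"{county}: z = {score:+.2f} -> {shade}")
```

Students can verify by hand that the z-scores sum to zero, which is a nice built-in sanity check.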

Izadi's "Tweets can better predict heart disease rates than income, smoking and diabetes, study finds"

Elahe Izadi, writing for the Washington Post, reported on this article by Eichstaedt et al. (2015). The original research analyzed tweet content for hostility and noted the location of each tweet. Data analysis found a positive correlation between regions with lots of angry tweets and the likelihood of dying from a heart attack. The authors of the study note that the median age of Twitter users is below that of the general population in the United States. Additionally, they did not use a within-subject research design. Instead, they argue that patterns of hostility in tweets reflect the underlying hostility of a given region. An excellent example of data mining, health psychology, aggression, research design, etc. Also, another example of using Twitter, specifically, in order to engage in public health research (see this previous post detailing efforts to use Twitter to close down unsafe restaurants).
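To let students replicate the flavor of the analysis, here is a correlation sketch with invented county-level numbers. This is only the correlation idea, not the study's full method.

```python
# Hypothetical county-level data (all numbers invented for illustration):
# rate of hostile tweets vs. heart-disease mortality per 100,000.
hostile_rate  = [0.8, 1.2, 1.5, 2.0, 2.4, 3.1]
chd_mortality = [140, 155, 160, 175, 190, 210]

def pearson(x, y):
    """Pearson's r: covariance scaled by the two standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

r = pearson(hostile_rate, chd_mortality)
print(f"r = {r:.3f}")
```

A good discussion prompt: even a strong positive r here is regional and correlational, not causal, and not a within-subject finding, just as the authors caution.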

Diane Fine Maron's "Tweets identify food poisoning outbreaks"

This Scientific American podcast by Diane Fine Maron describes how the Chicago Department of Public Health (CDPH) used Twitter data to shut down restaurants with health code violations. Essentially, the CDPH monitored Tweets in Chicago, searching for the words "food poisoning". When such a tweet was identified, an official at CDPH messaged the Twitterer in question with a link to an official complaint form website. The results of this program? "During a 10-month stretch last year, staff members at the health agency responded to 270 tweets about “food poisoning.” Based on those tweets, 193 complaints were filed and 133 restaurants in the city were inspected. Twenty-one were closed down and another 33 were forced to fix health violations. That’s according to a study in the journal  Morbidity and Mortality Weekly Report.  [Jenine K. Harris et al,  Health Department Use of Social Media to Identify Foodborne Illness — Chicago, Illinois, 2013–2014 ]" I think this is ...

Burr Settles's "On “Geek” Versus “Nerd”"

Settles decided to investigate the difference between being a nerd and being a geek via a pointwise mutual information (PMI) analysis (using archival data from Twitter). Specifically, he measured the association/closeness between various hashtag descriptors (see below) and the words nerd and geek. Settles provides a nice description of his data collection and analysis on his blog. A good example of archival data use as well as PMI.
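For students who want to see what PMI actually computes, here is a toy sketch. The co-occurrence counts and hashtags below are hypothetical, invented for illustration; Settles mined real counts from Twitter.

```python
from math import log2

# Toy co-occurrence counts between hashtags and the words "geek"/"nerd"
# (counts and hashtags invented for illustration).
counts = {
    ("#collecting", "geek"): 80, ("#collecting", "nerd"): 20,
    ("#education",  "geek"): 15, ("#education",  "nerd"): 85,
}
total = sum(counts.values())

def pmi(x, y):
    """Pointwise mutual information: log2 of p(x,y) / (p(x) * p(y))."""
    p_xy = counts[(x, y)] / total
    p_x = sum(v for (a, _), v in counts.items() if a == x) / total
    p_y = sum(v for (_, b), v in counts.items() if b == y) / total
    return log2(p_xy / (p_x * p_y))

print(f"PMI(#collecting, geek) = {pmi('#collecting', 'geek'):+.2f}")
print(f"PMI(#education, nerd)  = {pmi('#education', 'nerd'):+.2f}")
```

Positive PMI means the pair co-occurs more often than chance would predict; negative PMI means less often. Ranking hashtags by PMI with "geek" versus "nerd" is what produces the kind of contrast Settles plotted.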