This example starts with a chi-square but ends with a lesson on how even well-written prompts can result in hallucinations.
A research study counted how often ChatGPT made up citations for three different mental disorders (binge eating, body dysmorphic, and major depressive). The researchers used a chi-square test to determine whether the rate of made-up citations differed by disorder (it does). If ever there was an article that belonged on this blog, this is it. You can use it in your stats class as an example of chi-square and/or as a warning to students if you ask them to perform literature reviews for your class.

The original paper, "Influence of topic familiarity and prompt specificity on citation fabrication in mental health research using large language models: Experimental Study," was published in December 2025 and summarized by PsyPost shortly afterward.

What the researchers did:

What the researchers found:

How to use in class:

1. This is a good chi-square results section. They shared the test value and the p value, of course, but I like how they shared the varying rates of inaccuracy...
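If you want students to see the mechanics behind a chi-square like this one, here is a minimal sketch in plain Python of a test of independence on a 3 x 2 table (three topics by fabricated vs. real citations). The counts are hypothetical placeholders that I made up for illustration; they are not the paper's data. The function names `chi_square_independence` and `p_value_df2` are also my own. The p value shortcut relies on the fact that a chi-square distribution with exactly 2 degrees of freedom has the upper-tail probability exp(-x/2), so it only applies to tables with df = 2.

```python
import math


def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if topic and fabrication were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


def p_value_df2(statistic):
    """Upper-tail p value; valid ONLY when the test has 2 degrees of freedom."""
    return math.exp(-statistic / 2)


# HYPOTHETICAL counts for illustration only; NOT taken from the paper.
# Rows: three topics. Columns: [fabricated citations, real citations].
counts = [
    [30, 70],
    [25, 75],
    [10, 90],
]

statistic, dof = chi_square_independence(counts)
p = p_value_df2(statistic)  # a 3 x 2 table gives df = (3 - 1) * (2 - 1) = 2
rates = [row[0] / sum(row) for row in counts]

print(f"chi-square({dof}) = {statistic:.2f}, p = {p:.4f}")
print("fabrication rate per topic:", [f"{r:.0%}" for r in rates])
```

Printing the per-topic rates next to the test statistic mirrors what makes a results section easy to read: the test says the rates differ, and the rates show by how much. For tables with other degrees of freedom, a statistics library would be the safer route for the p value.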