This example starts with a chi-square test but ends with a lesson on how even well-written prompts can result in hallucinations.
A research study counted how often ChatGPT fabricated citations for three different categories of mental disorders (binge eating disorder, body dysmorphic disorder, and major depressive disorder). The researchers used a chi-square test to determine whether the rates of fabricated citations differed by disorder (they do).
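If you want to walk students through the mechanics of this kind of test, here is a minimal sketch in Python. The counts below are invented for illustration only; they are not the study's data, and the three-row layout simply mirrors the three disorder categories.

```python
# Hypothetical chi-square test of independence: do fabrication rates
# differ across disorder categories? Counts are made up for teaching.
from scipy.stats import chi2_contingency

# Rows: disorders; columns: [fabricated citations, genuine citations]
observed = [
    [20, 80],  # binge eating disorder (hypothetical counts)
    [35, 65],  # body dysmorphic disorder (hypothetical counts)
    [10, 90],  # major depressive disorder (hypothetical counts)
]

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.2f}, df = {dof}, p = {p:.4f}")
```

With a 3 x 2 table there are (3 - 1) x (2 - 1) = 2 degrees of freedom, which is a nice in-class check on the `dof` value the function returns.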
If ever there was an article that belonged on this blog, this is it. You can use it in your stats class as an example of chi-square and/or as a warning to students if you ask them to perform literature reviews for your class.
The original paper, "Influence of Topic Familiarity and Prompt Specificity on Citation Fabrication in Mental Health Research Using Large Language Models: Experimental Study," was published in December 2025 and summarized by PsyPost shortly after publication.
What the researchers found:
How to use in class:
1. This is a good chi-square results section. They shared the test statistic and the p-value, of course, but I like how they reported the varying rates of inaccuracy as both raw counts and percentages throughout. Chi-squares can be tricky to present in text (versus a table), and the authors did a good job here.
2. If you are talking to your students about the proper use of AI: These researchers shared their exact prompts in their supplemental material. This demonstrates a) proper, ethical citation of prompts when using AI in research, and b) that even well-written prompts still resulted in fabricated citations.