Caution Advised in LLM use for scientific studies

CronosTempi · May 14, 2025, 12:19pm

And newer model versions actually perform worse.

Generalization bias in large language model summarization of scientific research

https://royalsocietypublishing.org/doi/10.1098/rsos.241776

Popsci writeup:

Prominent chatbots routinely exaggerate science findings, study shows

“When summarizing scientific studies, large language models (LLMs) like ChatGPT and DeepSeek produce inaccurate conclusions in up to 73% of cases, according to a study by Uwe Peters (Utrecht University) and Benjamin Chin-Yee (Western University, Canada/University of Cambridge, UK). The researchers tested the most prominent LLMs and analyzed thousands of chatbot-generated science summaries, revealing that most models consistently produced broader conclusions than those in the summarized texts. Surprisingly, prompts for accuracy increased the problem and newer LLMs performed worse than older ones.

The study evaluated how accurately ten leading LLMs, including ChatGPT, DeepSeek, Claude, and LLaMA, summarize abstracts and full-length articles from top science and medical journals (e.g., Nature, Science, and The Lancet). Testing LLMs over one year, the researchers collected 4,900 LLM-generated summaries.

Six of ten models systematically exaggerated claims found in the original texts, often in subtle but impactful ways; for instance, changing cautious, past-tense claims like “The treatment was effective in this study” to a more sweeping, present-tense version like “The treatment is effective.” These changes can mislead readers into believing that findings apply much more broadly than they actually do.

Accuracy prompts backfired

Strikingly, when the models were explicitly prompted to avoid inaccuracies, they were nearly twice as likely to produce overgeneralized conclusions than when given a simple summary request.

“This effect is concerning,” Peters said. “Students, researchers, and policymakers may assume that if they ask ChatGPT to avoid inaccuracies, they’ll get a more reliable summary. Our findings prove the opposite.””
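
For anyone who wants to spot-check their own LLM summaries, the tense shift described in the writeup (“was effective in this study” turning into “is effective”) can be roughly screened for in a few lines of Python. This is only a crude sketch and not the method used in the study: the marker phrases, the regular expressions, and the function name `flags_generalization` are all my own assumptions for illustration, and a real check would need proper linguistic analysis.

```python
import re

# Hypothetical scope phrases chosen for illustration; not taken from the study.
SCOPE_MARKERS = re.compile(
    r"\b(in this study|in this trial|in our sample|among participants)\b", re.I
)
# A past-tense or present-tense copula followed by the word it introduces.
PAST_CLAIM = re.compile(r"\b(was|were)\s+(\w+)", re.I)
PRESENT_CLAIM = re.compile(r"\b(is|are)\s+(\w+)", re.I)


def flags_generalization(source: str, summary: str) -> bool:
    """Return True if the summary looks broader than the source.

    Two crude signals are used:
    1. A word that follows "was/were" in the source follows "is/are"
       in the summary (past-tense claim restated in the present tense).
    2. The source contains a scope phrase and the summary contains none.
    """
    past_words = {m.group(2).lower() for m in PAST_CLAIM.finditer(source)}
    present_words = {m.group(2).lower() for m in PRESENT_CLAIM.finditer(summary)}
    tense_shift = bool(past_words & present_words)

    scope_dropped = bool(SCOPE_MARKERS.search(source)) and not SCOPE_MARKERS.search(summary)
    return tense_shift or scope_dropped


if __name__ == "__main__":
    source = "The treatment was effective in this study."
    print(flags_generalization(source, "The treatment is effective."))
    print(flags_generalization(source, "The treatment was effective in this study."))
```

A check like this will miss most real cases and raise false alarms on others, so treat a flag as a prompt to reread the original abstract, not as a verdict.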