DL | Apr 18

Do Large Language Models know Which Published Articles have been Retracted?

arXiv:2604.16872 | 74.5 | h-index: 10

AI Analysis

For researchers and clinicians relying on LLMs for literature review, this paper highlights a critical blind spot where LLMs fail to detect retracted studies, potentially propagating invalid findings.

The study tested three open-weight LLMs on their ability to identify retracted articles. The models wrongly reported 82-88% of the retracted articles as not retracted, while false retraction claims for non-retracted articles were rare (0.16-0.23%).

Large Language Models (LLMs) can be helpful for literature search and summarisation, but retracted articles can confuse them. This article asks three open-weight (offline) LLMs whether 161 high-profile retracted articles had been retracted, performing a similar check for a benchmark multidisciplinary set of 34,070 non-retracted articles. Based on titles and abstracts, in over 80% of cases the LLMs claimed that a retracted article had not been retracted (GPT OSS 120B: 82%; Gemma 3 27B: 84%; DeepSeek R1 72B: 88%). The reasons given for a correct retraction declaration were often wrong, even if detailed. This confirms that LLMs have little ability to distinguish between valid and retracted studies, unless they are allowed to, and do, check online. For the benchmark test, there were only 55 false retraction claims from 34,070 non-retracted full-text articles, and 28 false claims when only the title and abstract were entered, suggesting that there is only a small chance that LLMs discount valid studies. When retractions are erroneously claimed, this does not seem to be due to mistakes in the article. Overall, the results give new reasons to be cautious about LLM claims about academic findings.
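
To put the benchmark counts in the abstract into proportion, the false retraction claim rate can be worked out directly. The sketch below is only an illustration: the function name `false_claim_rate` is hypothetical, and it assumes each count (55 and 28) is measured against a single pass over the 34,070 non-retracted articles. If the counts were pooled across all three models, the denominator would be larger and the rates lower.

```python
def false_claim_rate(false_claims: int, total_articles: int) -> float:
    """Return the percentage of non-retracted articles wrongly flagged as retracted."""
    if total_articles <= 0:
        raise ValueError("total_articles must be positive")
    return 100.0 * false_claims / total_articles


# Counts taken from the abstract; the single-pass denominator is an assumption.
BENCHMARK_SIZE = 34_070
full_text_rate = false_claim_rate(55, BENCHMARK_SIZE)
title_abstract_rate = false_claim_rate(28, BENCHMARK_SIZE)

print(f"Full text:          {full_text_rate:.2f}%")
print(f"Title and abstract: {title_abstract_rate:.2f}%")
```

Under that assumption, both rates are well below 1%, in contrast to the 82-88% of retracted articles that the models failed to flag.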
