LLMs Have Made Failure Worth Publishing
For the AI and scientific publishing communities, this paper highlights an overlooked problem, the systematic absence of negative results from the published literature, that could undermine the reliability of LLMs. The argument is conceptual and has not yet been empirically validated.
The paper argues that the systematic absence of negative results in scientific publishing degrades the utility of large language models (LLMs) as research tools, training data consumers, and peer reviewers, and proposes protocols to validate this claim.
Scientific publishing systematically filters out negative results. We argue that this long-standing asymmetry has become an urgent problem in the era of large language models, which inherit the positive bias of the literature they are trained on, face an impending shortage of high-quality training data, and are increasingly deployed as both research tools and peer reviewers. We analyze three ways in which LLMs have changed the value of failure data and contend that the systematic absence of such data degrades their utility as research tools, training data consumers, and peer reviewers alike. We outline experimental protocols to validate these claims and discuss the structural conditions under which a failure-inclusive publishing culture could emerge.