Awes, Laws, and Flaws From Today's LLM Research
This meta-analysis identifies flaws in LLM research practices, helping researchers and practitioners improve rigour and reproducibility in the field.
The paper critically examines the scientific methodology in contemporary large language model (LLM) research by assessing over 2,000 works from 2020-2024, finding trends like declines in ethics disclaimers and increases in claims of reasoning abilities without human evaluation.
We perform a critical examination of the scientific methodology behind contemporary large language model (LLM) research. To this end, we assess over 2,000 research works released between 2020 and 2024 based on criteria typical of what is considered good research (e.g., presence of statistical tests and reproducibility), and cross-validate the results against arguments that are at the centre of controversy (e.g., claims of emergent behaviour). We find multiple trends, such as declines in ethics disclaimers, a rise of LLMs as evaluators, and an increase in claims of LLM reasoning abilities without leveraging human evaluation. We note that conference checklists are effective at curtailing some of these issues, but balancing velocity and rigour in research cannot rely solely on these. We tie these findings to those of recent meta-reviews and extend recommendations on how to address what does, does not, and should work in LLM research.