CL AI HC · Jul 10, 2024

CiteME: Can Language Models Accurately Cite Scientific Claims?

Ori Press, Andreas Hochlehnert, Ameya Prabhu, Vishaal Udandarao, Ofir Press, Matthias Bethge

arXiv:2407.12861v2 · 17.346 citations · h-index: 21 · Has Code

Originality: Incremental advance

AI Analysis

The work addresses information overload for researchers by moving toward automated verification of claims, though it is incremental in that it builds on existing LM capabilities.

The paper tackles the verification and attribution of scientific claims by introducing CiteME, a benchmark that evaluates whether language models can identify the referenced paper from a text excerpt. Frontier LMs achieve only 4.2-18.5% accuracy, compared with 69.7% for humans. The authors narrow this gap with CiteAgent, which reaches 35.3% accuracy.

Thousands of new scientific papers are published each month. Such information overload complicates researcher efforts to stay current with the state-of-the-art as well as to verify and correctly attribute claims. We pose the following research question: Given a text excerpt referencing a paper, could an LM act as a research assistant to correctly identify the referenced paper? We advance efforts to answer this question by building a benchmark that evaluates the abilities of LMs in citation attribution. Our benchmark, CiteME, consists of text excerpts from recent machine learning papers, each referencing a single other paper. CiteME use reveals a large gap between frontier LMs and human performance, with LMs achieving only 4.2-18.5% accuracy and humans 69.7%. We close this gap by introducing CiteAgent, an autonomous system built on the GPT-4o LM that can also search and read papers, which achieves an accuracy of 35.3% on CiteME. Overall, CiteME serves as a challenging testbed for open-ended claim attribution, driving the research community towards a future where any claim made by an LM can be automatically verified and discarded if found to be incorrect.
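The accuracy figures in the abstract are fractions of excerpts for which the referenced paper was identified correctly. As a rough illustration only, the sketch below scores predictions by comparing normalized paper titles. The function names, the title-matching rule, and the toy data are assumptions made for this example; they are not taken from the CiteME code or its actual scoring procedure.

```python
import re
from typing import List, Optional


def normalize_title(title: str) -> str:
    """Lowercase a title and collapse punctuation and whitespace into single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def attribution_accuracy(predictions: List[Optional[str]], targets: List[str]) -> float:
    """Fraction of excerpts whose predicted title matches the target title.

    A prediction of None (the model gave no answer) counts as incorrect.
    Matching by normalized title is an assumption of this sketch.
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    correct = sum(
        1
        for predicted, target in zip(predictions, targets)
        if predicted is not None and normalize_title(predicted) == normalize_title(target)
    )
    return correct / len(targets)


# Toy data for illustration only; these are not CiteME excerpts or results.
toy_targets = [
    "Attention Is All You Need",
    "Deep Residual Learning for Image Recognition",
    "Adam: A Method for Stochastic Optimization",
]
toy_predictions = [
    "Attention is all you need!",  # matches after normalization
    None,                          # no answer given
    "Some Other Paper",            # wrong paper
]

accuracy = attribution_accuracy(toy_predictions, toy_targets)
print(f"accuracy = {accuracy:.1%}")
```

Exact title matching is the simplest possible rule; a real evaluation might instead compare paper identifiers or tolerate small title variations.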
