CL, AI · Jan 17, 2025

ArxEval: Evaluating Retrieval and Generation in Language Models for Scientific Literature

Aarush Sinha, Viraj Virk, Dipshikha Chakraborty, P. S. Sreeja

arXiv:2501.10483v2 · 2.7 · h-index: 1

Originality: Synthesis-oriented

AI Analysis

This work addresses factual inaccuracies in language models for scientific literature, a critical concern for domains like academia and education, but it is incremental because it focuses on evaluating hallucination rather than reducing it directly.

The authors address hallucination in language models that generate scientific information by introducing ArxEval, an evaluation pipeline built on arXiv with two tasks, Jumbled Titles and Mixed Titles, and they report comparative insights into the reliability of fifteen widely used models.

Language models (LMs) now play an increasingly large role in information generation and synthesis, so the representation of scientific knowledge in these systems needs to be highly accurate. A prime challenge is hallucination: generating plausible but false information, including invented citations and nonexistent research papers. Such inaccuracy is dangerous in any domain that requires a high level of factual correctness, such as academia and education. This work presents a pipeline for evaluating how frequently language models hallucinate when generating responses about scientific literature. We propose ArxEval, an evaluation pipeline with two tasks that use arXiv as a repository: Jumbled Titles and Mixed Titles. Our evaluation covers fifteen widely used language models and provides comparative insights into their reliability in handling scientific literature.
