CL | Oct 25, 2025

Confabulations from ACL Publications (CAP): A Dataset for Scientific Hallucination Detection

arXiv:2510.22395v1 | 4 citations | h-index: 4
Originality: Synthesis-oriented
AI Analysis

This paper addresses factual distortions in scientific NLP, a concern for both researchers and developers. Its contribution is incremental: it provides a new dataset rather than a novel detection method.

The authors address hallucinations in large language models (LLMs) for scientific text generation by introducing the CAP dataset, a multilingual resource of 900 questions and over 7000 answers from 16 models, each annotated for factuality errors and fluency issues.

We introduce the CAP (Confabulations from ACL Publications) dataset, a multilingual resource for studying hallucinations in large language models (LLMs) within scientific text generation. CAP focuses on the scientific domain, where hallucinations can distort factual knowledge, as they frequently do. In this domain, however, the presence of specialized terminology, statistical reasoning, and context-dependent interpretations further exacerbates these distortions, particularly given LLMs' lack of true comprehension, limited contextual understanding, and bias toward surface-level generalization. CAP operates in a cross-lingual setting covering five high-resource languages (English, French, Hindi, Italian, and Spanish) and four low-resource languages (Bengali, Gujarati, Malayalam, and Telugu). The dataset comprises 900 curated scientific questions and over 7000 LLM-generated answers from 16 publicly available models, provided as question-answer pairs along with token sequences and corresponding logits. Each instance is annotated with a binary label indicating the presence of a scientific hallucination, denoted as a factuality error, and a fluency label, capturing issues in the linguistic quality or naturalness of the text. CAP is publicly released to facilitate advanced research on hallucination detection, multilingual evaluation of LLMs, and the development of more reliable scientific NLP systems.
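To make the record layout described in the abstract concrete, here is a minimal Python sketch of how one annotated answer might be represented and how a per-language hallucination rate could be computed. Every field name, the one-score-per-token reading of "logits", and the string type of the fluency label are assumptions made for illustration; the released files may be organised differently. The toy records are invented and are not taken from CAP.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CapInstance:
    # All field names are hypothetical; the released files may use different ones.
    language: str                # e.g. "en", "te"
    model: str                   # which of the 16 models produced the answer
    question: str
    answer: str
    tokens: list[str]            # token sequence of the answer
    logits: list[float]          # assumed: one score per generated token
    has_factuality_error: bool   # binary scientific-hallucination label
    fluency_label: str           # assumed to be a short categorical tag


def hallucination_rate_by_language(instances):
    """Return the fraction of answers flagged with a factuality error, per language."""
    flagged = defaultdict(int)
    total = defaultdict(int)
    for inst in instances:
        total[inst.language] += 1
        flagged[inst.language] += int(inst.has_factuality_error)
    return {lang: flagged[lang] / total[lang] for lang in total}


# Toy records invented for illustration only.
toy = [
    CapInstance("en", "model-a", "q1", "a1", ["a1"], [0.1], False, "fluent"),
    CapInstance("en", "model-b", "q1", "a2", ["a2"], [0.2], True, "fluent"),
    CapInstance("te", "model-a", "q2", "a3", ["a3"], [0.3], True, "disfluent"),
]
rates = hallucination_rate_by_language(toy)
print(rates)
```

Grouping by language is one natural first cut given the high-resource versus low-resource split the abstract describes; grouping by the model field would work the same way.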

