FineDialFact: A benchmark for Fine-grained Dialogue Fact Verification
This work addresses the challenge of verifying factual accuracy in dialogue responses for NLP applications, but its contribution is incremental because it builds on existing datasets and methods.
The paper tackles the problem of hallucination detection in dialogue systems by introducing FineDialFact, a benchmark for fine-grained dialogue fact verification, and shows that Chain-of-Thought reasoning methods improve performance but achieve only a 0.75 F1-score on an open-domain dataset.
Large Language Models (LLMs) are known to produce hallucinations (factually incorrect or fabricated information), which pose significant challenges for many Natural Language Processing (NLP) applications, such as dialogue systems. As a result, detecting hallucinations has become a critical area of research. Current approaches to hallucination detection in dialogue systems primarily focus on verifying the factual consistency of generated responses. However, these responses often contain a mix of accurate, inaccurate, and unverifiable facts, which makes a single factual label per response overly simplistic and coarse-grained. In this paper, we introduce a benchmark, FineDialFact, for fine-grained dialogue fact verification, which involves verifying atomic facts extracted from dialogue responses. To support this, we construct a dataset based on publicly available dialogue datasets and evaluate it using various baseline methods. Experimental results demonstrate that methods incorporating Chain-of-Thought (CoT) reasoning can enhance performance in dialogue fact verification. Even so, the best F1-score achieved on HybriDialogue, an open-domain dialogue dataset, is only 0.75, indicating that the benchmark remains challenging for future research. Our dataset and code will be made publicly available on GitHub.
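
To make the fine-grained setting concrete, the following is a minimal sketch of what per-fact labeling and scoring might look like. It is not the paper's code: the label names, the `AtomicFact` class, the example sentences, and the choice of macro-averaged F1 are all assumptions made for illustration, since the abstract specifies neither the label set nor the averaging scheme.

```python
from dataclasses import dataclass

# Hypothetical label set, mirroring the abstract's
# "accurate, inaccurate, and unverifiable" facts.
LABELS = ("accurate", "inaccurate", "unverifiable")


@dataclass
class AtomicFact:
    """One atomic fact extracted from a dialogue response (illustrative type)."""
    text: str
    label: str  # one of LABELS


def macro_f1(gold, pred, labels=LABELS):
    """Unweighted mean of per-label F1 scores.

    Macro averaging is an assumption; the abstract only says "F1-score".
    """
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # A label with no gold and no predicted instances scores 0.0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Invented response, split into atomic facts by hand for illustration.
facts = [
    AtomicFact("Paris is the capital of France.", "accurate"),
    AtomicFact("Paris has a population of 50 million.", "inaccurate"),
    AtomicFact("Paris is the speaker's favorite city.", "unverifiable"),
]

gold = [fact.label for fact in facts]
pred = ["accurate", "inaccurate", "accurate"]  # hypothetical verifier output

score = macro_f1(gold, pred)
print(f"macro-F1 over atomic facts: {score:.3f}")
```

A single response-level label would have to collapse the three facts above into one verdict, whereas per-fact labels show exactly which claim the hypothetical verifier got wrong.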