CV AI CL, Oct 16, 2024

Controlled Automatic Task-Specific Synthetic Data Generation for Hallucination Detection

Yong Xie, Karan Aggarwal, Aitzaz Ahmad, Stephen Lau

arXiv:2410.12278v12.01 citations, h-index: 3

Originality: Incremental advance

AI Analysis

This paper addresses the challenge of limited labeled data for hallucination detection in natural language processing. The contribution is incremental, as it builds on existing synthetic data generation methods.

The paper tackles the problem of detecting hallucinations in text by automatically generating task-specific synthetic datasets, resulting in hallucination detectors that outperform in-context-learning-based detectors by 32% on three datasets.

We present a novel approach to automatically generate non-trivial task-specific synthetic datasets for hallucination detection. Our approach features a two-step generation-selection pipeline, using hallucination pattern guidance and language style alignment during generation. Hallucination pattern guidance leverages the most important task-specific hallucination patterns, while language style alignment aligns the style of the synthetic dataset with benchmark text. To obtain robust supervised detectors from synthetic datasets, we also adopt a data mixture strategy to improve performance robustness and generalization. Our results on three datasets show that our generated hallucination text is more closely aligned with non-hallucinated text than that of baselines, and that it trains hallucination detectors with better generalization. Our hallucination detectors trained on synthetic datasets outperform in-context-learning (ICL)-based detectors by a large margin of 32%. Our extensive experiments confirm the benefits of our approach with cross-task and cross-generator generalization. Our data-mixture-based training further improves the generalization and robustness of hallucination detection.
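The generation-selection pipeline and data mixture described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption for illustration, not the authors' implementation: `toy_generator` is a hypothetical stand-in for a pattern-guided LLM call, the pattern names are invented, and the token-overlap score is a placeholder for whatever selection criterion the paper uses (the abstract does not state it). Language style alignment is omitted.

```python
import random
from typing import Callable, List, Tuple

Example = Tuple[str, int]  # (text, label): 1 = hallucinated, 0 = faithful


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase tokens; a placeholder selection score."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def generate_then_select(
    reference: str,
    generate: Callable[[str, str], str],
    patterns: List[str],
    keep: int,
) -> List[Example]:
    """Step 1: produce one candidate per hallucination pattern.
    Step 2: keep the candidates that differ from the reference
    yet stay closest to it (assumed proxy for "non-trivial")."""
    candidates = [generate(reference, p) for p in patterns]
    candidates = [c for c in candidates if c != reference]  # drop no-op rewrites
    ranked = sorted(
        candidates, key=lambda c: token_overlap(c, reference), reverse=True
    )
    return [(c, 1) for c in ranked[:keep]]


def mix_datasets(datasets: List[List[Example]], seed: int = 0) -> List[Example]:
    """Data mixture: pool example sets from several sources and shuffle."""
    pooled = [ex for ds in datasets for ex in ds]
    random.Random(seed).shuffle(pooled)
    return pooled


def toy_generator(reference: str, pattern: str) -> str:
    """Hypothetical stand-in for an LLM call guided by a pattern name."""
    if pattern == "wrong_number":
        return reference.replace("three", "five")
    if pattern == "unrelated":
        return "The weather was pleasant all week."
    return reference  # any other pattern leaves the text unchanged


reference = "The detector was evaluated on three datasets."
selected = generate_then_select(
    reference, toy_generator, ["wrong_number", "unrelated", "noop"], keep=1
)
training_set = mix_datasets([selected, [(reference, 0)]])
```

In this toy run the subtly altered sentence is kept and the unrelated one is discarded, which mirrors the idea that useful synthetic hallucinations should stay close to faithful text. A real pipeline would replace both the generator and the score with model-based components.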
