Atomic Consistency Preference Optimization for Long-Form Question Answering
This work addresses factual unreliability in LLM long-form question answering, offering a novel self-supervised approach that removes the dependency on external models or knowledge bases.
The paper tackles factoid hallucinations in Large Language Models by proposing Atomic Consistency Preference Optimization (ACPO), a self-supervised method that improves factual accuracy without external supervision and outperforms a strong supervised alignment baseline by 1.95 points, averaged across Phi-3 and Llama3 on the LongFact and BioGen datasets.
Large Language Models (LLMs) often produce factoid hallucinations: plausible yet incorrect answers. A common mitigation strategy is model alignment, which improves factual accuracy by training on curated (factual, non-factual) pairs. However, this approach often relies on a stronger model (e.g., GPT-4) or an external knowledge base to assess factual correctness, and neither may always be accessible. To address this, we propose Atomic Consistency Preference Optimization (ACPO), a self-supervised preference-tuning method that enhances factual accuracy without external supervision. ACPO leverages atomic consistency signals (i.e., the agreement of individual facts across multiple stochastic responses) to identify high- and low-quality data pairs for model alignment. Despite being fully self-supervised, ACPO outperforms the strong supervised alignment baseline by 1.95 points, averaged across Phi-3 and Llama3 on the LongFact and BioGen datasets, demonstrating its effectiveness in improving factual reliability without relying on external models or knowledge bases.
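To make the atomic consistency signal concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes, purely for illustration, that each sentence counts as one atomic fact and that two facts agree only when their normalized text matches exactly; the abstract does not specify how facts are extracted or matched. The function names (`atomic_facts`, `consistency_score`, `build_preference_pair`) and the sample responses are hypothetical.

```python
import re
from typing import List, Set, Tuple


def atomic_facts(response: str) -> List[str]:
    """Toy splitter (assumption): one normalized sentence = one atomic fact."""
    sentences = re.split(r"[.!?]+", response)
    return [" ".join(s.lower().split()) for s in sentences if s.strip()]


def consistency_score(index: int, fact_sets: List[Set[str]]) -> float:
    """Mean fraction of the other samples that also state each fact."""
    own = fact_sets[index]
    others = [fs for i, fs in enumerate(fact_sets) if i != index]
    if not own or not others:
        return 0.0
    support = [sum(fact in fs for fs in others) / len(others) for fact in own]
    return sum(support) / len(support)


def build_preference_pair(responses: List[str]) -> Tuple[str, str]:
    """Return (chosen, rejected): the most and least self-consistent samples."""
    fact_sets = [set(atomic_facts(r)) for r in responses]
    scores = [consistency_score(i, fact_sets) for i in range(len(responses))]
    order = range(len(responses))
    chosen = responses[max(order, key=scores.__getitem__)]
    rejected = responses[min(order, key=scores.__getitem__)]
    return chosen, rejected


# Hypothetical stochastic samples for a single question.
samples = [
    "Marie Curie was born in Warsaw. She won two Nobel Prizes.",
    "Marie Curie was born in Warsaw. She won two Nobel Prizes. "
    "She discovered polonium.",
    "Marie Curie was born in Paris. She won three Nobel Prizes.",
]
chosen, rejected = build_preference_pair(samples)
```

In this toy run, the first sample is selected as the high-quality response because both of its facts recur in another sample, while the third is selected as the low-quality response because none of its facts are repeated elsewhere. The resulting (chosen, rejected) pair is the kind of input a preference-tuning step would consume, with no external model or knowledge base involved in the labeling.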