CL · Jun 18, 2024

On-Policy Self-Alignment with Fine-grained Knowledge Feedback for Hallucination Mitigation

arXiv:2406.12221v6 · 9 citations
Originality: Highly original
AI Analysis

This work addresses unreliable LLM outputs, a problem for users who need accurate information, through a new approach to alignment.

The paper tackles hallucination in large language models by introducing RLFH, an on-policy self-alignment method that uses fine-grained feedback to enable models to self-correct, achieving improved performance on benchmarks like HotpotQA, SQuADv2, and Biography.

Hallucination occurs when large language models exhibit behavior that deviates from the boundaries of their knowledge during response generation. To address this critical issue, previous learning-based methods attempt to finetune models but are limited by off-policy sampling and coarse-grained feedback. In this paper, we present Reinforcement Learning for Hallucination (RLFH), an on-policy self-alignment approach that enables LLMs to actively explore their knowledge boundaries and self-correct generation behavior through fine-grained feedback signals. RLFH introduces a self-assessment framework where the policy serves as its own judge. Through this framework, responses are automatically decomposed into atomic facts, and their truthfulness and informativeness are assessed against external knowledge sources. The resulting fine-grained, statement-level feedback is then converted into dense token-level reward signals. This enables online reinforcement learning to achieve precise and timely optimization without human intervention. Comprehensive evaluations on the HotpotQA, SQuADv2, and Biography benchmarks validate RLFH's effectiveness in hallucination mitigation.
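
The step from statement-level feedback to token-level rewards can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the reward mapping, so the `StatementFeedback` type, the `token_rewards` function, the reward values, and the choice to place each statement's reward on its final token are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StatementFeedback:
    """Hypothetical judgment for one atomic fact in a response."""
    start: int              # index of the statement's first token (inclusive)
    end: int                # index one past its last token (exclusive)
    truthful: bool          # verdict after checking an external knowledge source
    informativeness: float  # assumed score in [0, 1]


def token_rewards(
    num_tokens: int,
    feedback: List[StatementFeedback],
    true_reward: float = 1.0,     # assumed scale for truthful statements
    false_penalty: float = -1.0,  # assumed penalty for false statements
) -> List[float]:
    """Spread statement-level feedback over a per-token reward vector.

    Assumption: each statement's reward lands on the last token of its span,
    and all other tokens receive zero.
    """
    rewards = [0.0] * num_tokens
    for fb in feedback:
        if not (0 <= fb.start < fb.end <= num_tokens):
            raise ValueError(f"span ({fb.start}, {fb.end}) is out of range")
        if fb.truthful:
            score = true_reward * fb.informativeness
        else:
            score = false_penalty
        rewards[fb.end - 1] += score
    return rewards


# A 6-token response containing two statements of 3 tokens each.
example = token_rewards(
    6,
    [
        StatementFeedback(start=0, end=3, truthful=True, informativeness=0.5),
        StatementFeedback(start=3, end=6, truthful=False, informativeness=0.9),
    ],
)
```

A vector like `example` could then serve as the per-token reward in an online RL update. How truthfulness and informativeness are weighted against each other is a design choice this sketch does not settle.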

Code Implementations: 1 repo