LG | May 28

K-FinHallu: A Hallucination Detection Benchmark for Multi-Turn RAG in Korean Finance

arXiv:2605.29523 | 95.3 | h-index: 6 | Has Code
AI Analysis

This benchmark addresses the lack of evaluation tools for hallucination detection in multi-turn Korean financial RAG, a high-stakes domain with its own linguistic and regulatory challenges.

K-FinHallu is presented as the first benchmark for detecting hallucinations in multi-turn Korean financial RAG. Even the strongest models struggle with fine-grained financial diagnostics and justified abstention, though fine-tuning an 8B model yields performance competitive with frontier LLMs.

Large Language Models (LLMs) have advanced financial automation through Retrieval-Augmented Generation (RAG), yet hallucinations remain a critical barrier to deployment in high-stakes environments. Existing benchmarks focus on single-turn, English-centric tasks, leaving the multi-turn dynamics and linguistic-regulatory nuances of the Korean financial domain unaddressed. We introduce K-FinHallu, the first benchmark for hallucination detection in multi-turn Korean financial RAG. We construct multi-turn dialogues from authentic Korean financial documents and inject hallucinations under a proposed hierarchical taxonomy based on context answerability that explicitly accounts for justified abstention. Benchmarking frontier and open-source LLMs as hallucination detectors, we find that even the strongest models struggle with fine-grained financial diagnostics and refusal behavior. While fine-tuning an 8B model on our training split yields performance competitive with frontier LLMs, justified abstention remains the weakest axis across all evaluated models.
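As a rough illustration of how a detector could be scored per category under a two-level, answerability-based labeling, the sketch below computes per-label recall over dialogue turns and reports the weakest label. The label names, the `TurnJudgment` record, the helper functions, and the toy data are all hypothetical assumptions made for illustration; the abstract does not specify the taxonomy's categories or the benchmark's metrics.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical two-level labels: (context answerability, response verdict).
# These names are illustrative assumptions, not the K-FinHallu taxonomy.
FAITHFUL = ("answerable", "faithful")
HALLUCINATED = ("answerable", "hallucinated")
JUSTIFIED_ABSTENTION = ("unanswerable", "justified_abstention")
UNSUPPORTED_ANSWER = ("unanswerable", "unsupported_answer")


@dataclass(frozen=True)
class TurnJudgment:
    """One detector decision for a single turn of a multi-turn dialogue."""
    dialogue_id: str
    turn: int
    gold: tuple       # reference label for the turn
    predicted: tuple  # label assigned by the detector under test


def per_label_recall(judgments):
    """Return {gold label: fraction of its turns the detector labeled correctly}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for j in judgments:
        totals[j.gold] += 1
        if j.predicted == j.gold:
            hits[j.gold] += 1
    return {label: hits[label] / totals[label] for label in totals}


def weakest_label(recalls):
    """Return the label with the lowest recall."""
    return min(recalls, key=recalls.get)


# Toy data invented for this sketch; not drawn from the benchmark.
toy = [
    TurnJudgment("d1", 1, FAITHFUL, FAITHFUL),
    TurnJudgment("d1", 2, HALLUCINATED, HALLUCINATED),
    TurnJudgment("d1", 3, JUSTIFIED_ABSTENTION, UNSUPPORTED_ANSWER),
    TurnJudgment("d2", 1, JUSTIFIED_ABSTENTION, JUSTIFIED_ABSTENTION),
    TurnJudgment("d2", 2, UNSUPPORTED_ANSWER, UNSUPPORTED_ANSWER),
    TurnJudgment("d2", 3, HALLUCINATED, HALLUCINATED),
]
recalls = per_label_recall(toy)
```

Reporting recall per label rather than a single accuracy figure keeps a weak category, such as abstention on unanswerable contexts, from being masked by strong performance on the others.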

