CL CP | Mar 11

FinReflectKG -- HalluBench: GraphRAG Hallucination Benchmark for Financial Question Answering Systems

Mahesh Kumar, Bhaskarjit Sarmah, Stefano Pasquali

arXiv:2603.20252 | 9.6 | h-index: 5

AI Analysis

This work addresses the critical need for reliable AI in financial information systems, where hallucinations can lead to regulatory violations and flawed decisions. Its contribution is incremental, however, because it focuses on benchmarking existing detection methods.

The paper tackles hallucinations in knowledge graph-augmented financial QA systems by introducing a benchmark dataset with 755 annotated examples. LLM-based judges and embedding methods achieve F1 scores of 0.82-0.86 under clean conditions. Most methods degrade significantly when noisy triplets are introduced (MCC drops of 44-84%), while embedding methods show only 9% degradation.

As organizations increasingly integrate AI-powered question-answering systems into financial information systems for compliance, risk assessment, and decision support, ensuring the factual accuracy of AI-generated outputs becomes a critical engineering challenge. Current Knowledge Graph (KG)-augmented QA systems lack systematic mechanisms to detect hallucinations, that is, factually incorrect outputs that undermine reliability and user trust. We introduce FinBench-QA-Hallucination, a benchmark for evaluating hallucination detection methods in KG-augmented financial QA over SEC 10-K filings. The dataset contains 755 annotated examples from 300 pages, each labeled for groundedness using a conservative evidence-linkage protocol requiring support from both textual chunks and extracted relational triplets. We evaluate six detection approaches, including LLM judges, fine-tuned classifiers, Natural Language Inference (NLI) models, span detectors, and embedding-based methods, under two conditions: with and without KG triplets. Results show that LLM-based judges and embedding approaches achieve the highest performance (F1: 0.82-0.86) under clean conditions. However, most methods degrade significantly when noisy triplets are introduced, with Matthews Correlation Coefficient (MCC) dropping 44-84 percent, while embedding methods remain relatively robust with only 9 percent degradation. Statistical tests (Cochran's Q and McNemar) confirm significant performance differences (p < 0.001). Our findings highlight vulnerabilities in current KG-augmented systems and provide insights for building reliable financial information systems, where hallucinations can lead to regulatory violations and flawed decisions. The benchmark also offers a framework for integrating AI reliability evaluation into information system design across other high-stakes domains such as healthcare, legal, and government.
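
The abstract reports F1, MCC, relative MCC degradation, and McNemar tests. As a rough illustration of how such quantities are commonly computed for binary groundedness labels, here is a minimal Python sketch. The function names, the toy labels, and the choice of the exact (binomial) form of McNemar's test are assumptions made for illustration; they are not taken from the paper, whose evaluation code may differ.

```python
from math import comb, sqrt

def confusion(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0."""
    tp, _, fp, fn = confusion(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient; defined as 0.0 when undefined."""
    tp, tn, fp, fn = confusion(y_true, y_pred)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def relative_drop(clean, noisy):
    """Percentage degradation of a metric from the clean to the noisy condition."""
    return 100.0 * (clean - noisy) / clean

def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar p-value computed from the discordant pairs."""
    b = sum(1 for t, a, o in zip(y_true, pred_a, pred_b) if a == t and o != t)
    c = sum(1 for t, a, o in zip(y_true, pred_a, pred_b) if a != t and o == t)
    n = b + c
    if n == 0:
        return 1.0  # the two detectors never disagree in correctness
    k = min(b, c)
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, p)

# Toy labels (hypothetical, not benchmark data): 1 = grounded, 0 = hallucinated.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
pred_clean = [1, 1, 1, 0, 0, 0, 0, 1]  # detector output without noisy triplets
pred_noisy = [1, 1, 0, 0, 0, 0, 1, 1]  # same detector with noisy triplets

clean_mcc = mcc(y_true, pred_clean)
noisy_mcc = mcc(y_true, pred_noisy)
print("F1 (clean):", f1_score(y_true, pred_clean))
print("MCC clean / noisy:", clean_mcc, noisy_mcc)
print("MCC drop (%):", relative_drop(clean_mcc, noisy_mcc))
print("McNemar p:", mcnemar_exact(y_true, pred_clean, pred_noisy))
```

MCC uses all four cells of the confusion matrix, so it is less flattering than F1 when classes are imbalanced, which is one plausible reason to report degradation in MCC. McNemar's test compares two detectors on the same examples using only the cases where exactly one of them is correct; Cochran's Q extends the same idea to more than two detectors.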
