AI | May 2

Grounding Multi-Hop Reasoning in Structural Causal Models via Group Relative Policy Optimization

Yunhan Bu, Quan Zhang, Huaping Zhang, Guotong Geng, Chunxiao Gao, Askar Hamdulla, Juan Wang, Qiuchi Li, Baohua Zhang, Shuai Lei, Yunbo Cao, Zhunchen Luo

arXiv:2605.01482 | 95.8 | h-index: 2

AI Analysis

Improves reliability and interpretability of multi-hop fact verification for LLMs, addressing a key bottleneck in complex reasoning tasks.

When LLMs perform Multi-Hop Fact Verification, they often hallucinate and produce fractured logical chains. The authors propose grounding the reasoning in a Structural Causal Model and using GRPO to optimize reasoning depth, and they report state-of-the-art results on HoVer and EX-FEVER.

Multi-Hop Fact Verification (MHFV) requires complex reasoning across disparate pieces of evidence, which poses significant challenges for Large Language Models (LLMs): they often suffer from hallucinations and fractured logical chains. Existing methods improve transparency via Chain-of-Thought (CoT) but do not explicitly model the causal dependencies between evidence and claims. In this work, we introduce a novel framework that grounds reasoning in a Structural Causal Model (SCM), treating verification as a constructive causal inference process. We empirically identify an "inverted U-shaped" relationship between reasoning chain length and accuracy: accuracy rises as chains lengthen up to a point and then falls, revealing that excessive structural complexity degrades performance. To address this, we propose a Rule-based Reinforcement Learning strategy using Group Relative Policy Optimization (GRPO), which dynamically optimizes the trade-off between structural depth and conciseness. Extensive experiments on HoVer and EX-FEVER demonstrate that our SCM-GRPO framework significantly outperforms state-of-the-art baselines, offering a reliable and interpretable solution for complex fact verification.
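The abstract does not spell out the reward rules, so the following is only a minimal sketch of how a rule-based reward could be combined with GRPO-style group-relative advantages. The function names, the `target_steps` and `penalty` parameters, and the label strings are all assumptions made for illustration, not the authors' implementation. The one part taken from the general GRPO recipe is the standardization of rewards within a group of sampled responses for the same input.

```python
from statistics import mean, pstdev


def rule_reward(predicted_label, gold_label, num_steps, target_steps=3, penalty=0.1):
    """Hypothetical rule-based reward (an assumption, not the paper's rule).

    Gives 1.0 for a correct verdict and subtracts a penalty that grows with
    the distance between the chain length and an assumed target length, so
    chains that are too short or too long both score lower.
    """
    correct = 1.0 if predicted_label == gold_label else 0.0
    return correct - penalty * abs(num_steps - target_steps)


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples (GRPO-style).

    Each advantage is (reward - group mean) / (group std + eps), so a sample
    is scored relative to the other samples drawn for the same claim.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled reasoning chains for one claim: (predicted label, number of steps).
# The labels and step counts are made-up illustrative values.
samples = [
    ("SUPPORTED", 3),
    ("SUPPORTED", 7),
    ("NOT_SUPPORTED", 2),
    ("SUPPORTED", 1),
]
rewards = [rule_reward(label, "SUPPORTED", steps) for label, steps in samples]
advantages = group_relative_advantages(rewards)

for (label, steps), r, a in zip(samples, rewards, advantages):
    print(f"{label:14s} steps={steps} reward={r:+.2f} advantage={a:+.2f}")
```

In this toy group, the correct chain at the assumed target length receives the largest advantage and the incorrect chain the smallest, which is the kind of pressure toward accurate yet concise chains that the abstract describes. A real training loop would feed these advantages into a clipped policy-gradient objective, which is omitted here.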
