AI | Feb 17, 2025

STRIVE: Structured Reasoning for Self-Improvement in Claim Verification

Haisong Gong, Jing Li, Junfei Wu, Qiang Liu, Shu Wu, Liang Wang

arXiv:2502.11959v2 | 3 citations | h-index: 42 | Mach Intell Res

Originality: Incremental advance

AI Analysis

This work addresses the challenge of improving claim verification accuracy in AI systems. It is an incremental advance because it builds on existing self-improvement methods.

The paper tackles a weakness of self-improvement methods in claim verification, where low-quality reasoning chains that happen to match the binary label degrade performance. It proposes STRIVE, a structured reasoning approach that achieves a 31.4% performance gain over the base model and 20.7% over Chain of Thought on the HOVER datasets.

Claim verification is the task of determining whether a claim is supported or refuted by evidence. Self-improvement methods, where reasoning chains are generated and those leading to correct results are selected for training, have succeeded in tasks like mathematical problem solving. However, in claim verification, this approach struggles. Low-quality reasoning chains may falsely match binary truth labels, introducing faulty reasoning into the self-improvement process and ultimately degrading performance. To address this, we propose STRIVE: Structured Reasoning for Self-Improved Verification. Our method introduces a structured reasoning design with Claim Decomposition, Entity Analysis, and Evidence Grounding Verification. These components improve reasoning quality, reduce errors, and provide additional supervision signals for self-improvement. STRIVE begins with a warm-up phase, where the base model is fine-tuned on a small number of annotated examples to learn the structured reasoning design. It is then applied to generate reasoning chains for all training examples, selecting only those that are correct and structurally sound for subsequent self-improvement training. We demonstrate that STRIVE achieves significant improvements over baseline models, with a 31.4% performance gain over the base model and 20.7% over Chain of Thought on the HOVER datasets, highlighting its effectiveness.
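The selection step described in the abstract (keep only chains that are both correct and structurally sound) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Chain` type, the label strings, and the assumption that each component appears in the generated text as a literal section heading are all hypothetical, chosen only to show why a structural check filters out chains that match the binary label by luck.

```python
from dataclasses import dataclass

# Assumption: each structured component shows up as a literal heading in the
# generated chain. The abstract names the components but not their format.
REQUIRED_SECTIONS = (
    "Claim Decomposition",
    "Entity Analysis",
    "Evidence Grounding Verification",
)


@dataclass
class Chain:
    claim_id: str    # hypothetical identifier for the training example
    text: str        # generated reasoning chain
    predicted: str   # hypothetical label strings: "SUPPORTED" or "REFUTED"


def is_structurally_sound(chain: Chain) -> bool:
    """Return True if every required section appears, in the expected order."""
    pos = -1
    for name in REQUIRED_SECTIONS:
        idx = chain.text.find(name, pos + 1)
        if idx == -1:
            return False
        pos = idx
    return True


def select_for_training(chains, gold_labels):
    """Keep chains whose prediction matches the gold label AND whose structure is sound."""
    return [
        c for c in chains
        if c.predicted == gold_labels[c.claim_id] and is_structurally_sound(c)
    ]


structured = (
    "Claim Decomposition: ... "
    "Entity Analysis: ... "
    "Evidence Grounding Verification: ..."
)
chains = [
    Chain("c1", structured, "SUPPORTED"),                    # correct and structured
    Chain("c2", "It is probably false.", "REFUTED"),         # correct label, no structure
    Chain("c3", structured, "SUPPORTED"),                    # structured, wrong label
]
gold = {"c1": "SUPPORTED", "c2": "REFUTED", "c3": "REFUTED"}
selected = select_for_training(chains, gold)
print([c.claim_id for c in selected])
```

With a binary label, a chain like `c2` has a high chance of matching the gold label without sound reasoning, so a label-only filter would admit it. The structural check is one simple way to supply the additional supervision signal the abstract mentions.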

