CL | Aug 31, 2025

EviNote-RAG: Enhancing RAG Models via Answer-Supportive Evidence Notes

Yuqin Dai, Guoqing Wang, Yuan Wang, Kairan Dou, Kaichen Zhou, Zhanwei Zhang, Shuo Yang, Fei Tang, Jun Yin, Pengyu Zeng, Zhenzhe Ying, Can Yi

arXiv:2509.00877v3 | 13.98 citations | h-index: 8

Originality: Highly original

AI Analysis

This work targets reasoning accuracy and efficiency in RAG models for question answering. It contributes a novel method for a known bottleneck rather than a foundational advance.

The paper tackles two challenges in Retrieval-Augmented Generation (RAG) for open-domain question answering: low signal-to-noise ratio and error accumulation. It introduces EviNote-RAG, a framework that uses Supportive-Evidence Notes and an entailment-based reward to improve reasoning. The authors report state-of-the-art performance, with relative F1 gains of 20% on HotpotQA, 40% on Bamboogle, and 91% on 2Wiki.

Retrieval-Augmented Generation (RAG) has advanced open-domain question answering by incorporating external information into model reasoning. However, effectively leveraging external information to enhance reasoning presents the following challenges: (1) low signal-to-noise ratio, where answer-supportive external information is diluted by irrelevant material, and (2) error accumulation, which arises in multi-hop reasoning when incomplete or misleading information is incorporated. To address these challenges, we introduce EviNote-RAG, a framework that follows a retrieve-note-answer workflow. Instead of reasoning directly over raw external information, the model first produces Supportive-Evidence Notes (SENs), which concisely preserve answer-critical information and explicitly mark key and uncertainty information to improve accuracy. We further design an entailment-based Evidence Quality Reward (EQR) to ensure that SENs are logically sufficient to derive the final answer, thereby enhancing SENs' quality. Experiments on both in-domain and out-of-domain QA benchmarks show that EviNote-RAG achieves state-of-the-art performance, improving answer accuracy, training stability, robustness, and efficiency. In particular, it yields relative F1 gains of 20% on HotpotQA (+0.093), 40% on Bamboogle (+0.151), and 91% on 2Wiki (+0.256), benefiting from improvements in the reasoning process.
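The abstract reports each gain twice, as a relative percentage and as an absolute F1 difference. A minimal sketch of how the two relate, assuming the usual definition relative gain = absolute gain / baseline F1. The function name `implied_scores` is hypothetical, and because the percentages are rounded, the baselines it derives are back-of-envelope approximations, not figures taken from the paper.

```python
# (relative gain, absolute F1 gain) pairs as reported in the abstract.
reported = {
    "HotpotQA": (0.20, 0.093),
    "Bamboogle": (0.40, 0.151),
    "2Wiki": (0.91, 0.256),
}


def implied_scores(relative_gain, absolute_gain):
    """Return (baseline, improved) F1 implied by a relative and an absolute gain.

    Assumes relative_gain = absolute_gain / baseline, so the result is only
    as precise as the rounded percentage that was reported.
    """
    baseline = absolute_gain / relative_gain
    return baseline, baseline + absolute_gain


for name, (rel, abs_gain) in reported.items():
    baseline, improved = implied_scores(rel, abs_gain)
    print(f"{name}: implied baseline F1 ~ {baseline:.3f}, implied new F1 ~ {improved:.3f}")
```

Under this assumption, the large 91% relative gain on 2Wiki corresponds to a low implied baseline (about 0.28), which is why a +0.256 absolute change looks so large in relative terms.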

