CL Feb 28, 2025

Chain-of-Thought Matters: Improving Long-Context Language Models with Reasoning Path Supervision

Dawei Zhu, Xiyu Wei, Guangxiang Zhao, Wenhao Wu, Haosheng Zou, Junfeng Ran, Xun Wang, Lin Sun, Xiangzheng Zhang, Sujian Li

arXiv:2502.20790v1 20.924 citations h-index: 12 Has Code EMNLP

Originality: Incremental advance

AI Analysis

This paper addresses long-context reasoning, a problem relevant to AI researchers and practitioners. It offers a new framework with strong performance gains, though it builds incrementally on existing CoT methods.

The paper aims to improve long-context language models. It shows that Chain-of-Thought (CoT) prompting helps in most long-context scenarios and that its benefit grows with context length. It then proposes LongRePS, a process-supervised framework that improves over outcome-supervision baselines by +13.6/+3.8 points (LLaMA/Qwen) on MuSiQue and by +9.3/+8.1 points on average across diverse QA tasks.

Recent advances in Large Language Models (LLMs) have highlighted the challenge of handling long-context tasks, where models need to reason over extensive input contexts to aggregate target information. While Chain-of-Thought (CoT) prompting has shown promise for multi-step reasoning, its effectiveness for long-context scenarios remains underexplored. Through systematic investigation across diverse tasks, we demonstrate that CoT's benefits generalize across most long-context scenarios and amplify with increasing context length. Motivated by this critical observation, we propose LongRePS, a process-supervised framework that teaches models to generate high-quality reasoning paths for enhanced long-context performance. Our framework incorporates a self-sampling mechanism to bootstrap reasoning paths and a novel quality assessment protocol specifically designed for long-context scenarios. Experimental results on various long-context benchmarks demonstrate the effectiveness of our approach, achieving significant improvements over outcome supervision baselines on both in-domain tasks (+13.6/+3.8 points for LLaMA/Qwen on MuSiQue) and cross-domain generalization (+9.3/+8.1 points on average across diverse QA tasks). Our code, data and trained models are made public to facilitate future research.
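The abstract describes two ingredients, self-sampling of reasoning paths and a quality assessment step, without giving details. The sketch below is only one plausible reading of that idea, not the authors' implementation: it samples several candidate paths, keeps those whose final answer matches the reference, and returns the one with the highest quality score. Every name here (`select_reasoning_path`, `sample_fn`, `extract_answer`, `overlap_quality`) is hypothetical, and the word-overlap score is a placeholder rather than the paper's assessment protocol.

```python
import re
from typing import Callable, Optional


def extract_answer(path: str) -> str:
    """Hypothetical convention: the final answer follows the last 'Answer:' marker."""
    return path.rsplit("Answer:", 1)[-1].strip()


def overlap_quality(path: str, context: str) -> float:
    """Placeholder quality score: fraction of path words that also occur in the context.

    This is an assumption for illustration, not the paper's quality protocol.
    """
    path_words = re.findall(r"\w+", path.lower())
    context_words = set(re.findall(r"\w+", context.lower()))
    if not path_words:
        return 0.0
    return sum(word in context_words for word in path_words) / len(path_words)


def select_reasoning_path(
    question: str,
    context: str,
    gold_answer: str,
    sample_fn: Callable[[str, str], str],
    quality_fn: Callable[[str, str], float] = overlap_quality,
    num_samples: int = 4,
) -> Optional[str]:
    """Sample candidate paths, filter by answer correctness, pick the best-scoring one."""
    # Self-sampling: the model being trained (wrapped by sample_fn) proposes paths.
    candidates = [sample_fn(question, context) for _ in range(num_samples)]

    # Outcome check: keep only paths whose final answer matches the reference.
    target = gold_answer.strip().lower()
    correct = [p for p in candidates if extract_answer(p).lower() == target]
    if not correct:
        return None  # no usable path for this example

    # Process check: prefer the path that scores highest under quality_fn.
    return max(correct, key=lambda p: quality_fn(p, context))
```

In practice `sample_fn` would wrap a call to the language model with a nonzero sampling temperature, and the selected paths would serve as supervised fine-tuning targets. Examples for which the function returns `None` would simply be skipped.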
