CL · Feb 28, 2025

Chain-of-Thought Matters: Improving Long-Context Language Models with Reasoning Path Supervision

arXiv:2502.20790v1 · 24 citations · h-index: 12 · Has Code · EMNLP
Originality: Incremental advance
AI Analysis

This paper addresses long-context reasoning, a problem relevant to AI researchers and practitioners. It offers a novel framework with strong performance gains, though it builds incrementally on existing CoT methods.

The paper tackles the challenge of improving long-context language models. It shows that Chain-of-Thought prompting helps in most long-context scenarios and that its benefit grows with context length. It then proposes LongRePS, a process-supervised framework that improves over outcome supervision baselines by +13.6/+3.8 points for LLaMA/Qwen on MuSiQue and by +9.3/+8.1 points on average across diverse QA tasks.

Recent advances in Large Language Models (LLMs) have highlighted the challenge of handling long-context tasks, where models need to reason over extensive input contexts to aggregate target information. While Chain-of-Thought (CoT) prompting has shown promise for multi-step reasoning, its effectiveness for long-context scenarios remains underexplored. Through systematic investigation across diverse tasks, we demonstrate that CoT's benefits generalize across most long-context scenarios and amplify with increasing context length. Motivated by this critical observation, we propose LongRePS, a process-supervised framework that teaches models to generate high-quality reasoning paths for enhanced long-context performance. Our framework incorporates a self-sampling mechanism to bootstrap reasoning paths and a novel quality assessment protocol specifically designed for long-context scenarios. Experimental results on various long-context benchmarks demonstrate the effectiveness of our approach, achieving significant improvements over outcome supervision baselines on both in-domain tasks (+13.6/+3.8 points for LLaMA/Qwen on MuSiQue) and cross-domain generalization (+9.3/+8.1 points on average across diverse QA tasks). Our code, data and trained models are made public to facilitate future research.
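The abstract names two components, self-sampling of reasoning paths and a quality assessment step, without giving their details. The following is a minimal sketch of that general pattern, not the authors' implementation. The function names, the answer-match check, and the word-overlap grounding score are all hypothetical stand-ins chosen for illustration, and the sampler is a stub rather than a real model call.

```python
import itertools
from typing import Callable, List, Optional, Tuple

# A candidate is (reasoning_path, final_answer).
Candidate = Tuple[str, str]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace for a lenient comparison."""
    return " ".join(text.lower().split())


def grounding_score(path: str, context: str) -> float:
    """Fraction of the path's words that also occur in the context.

    Hypothetical proxy for path quality; the paper's protocol is not
    described in the abstract.
    """
    context_words = set(normalize(context).split())
    path_words = normalize(path).split()
    if not path_words:
        return 0.0
    return sum(w in context_words for w in path_words) / len(path_words)


def select_reasoning_path(
    sample_fn: Callable[[str, str], Candidate],
    context: str,
    question: str,
    gold_answer: str,
    num_samples: int = 8,
) -> Optional[Candidate]:
    """Sample paths from the model itself and keep the best one that is correct.

    Step 1 (assumed): discard paths whose final answer differs from the gold answer.
    Step 2 (assumed): among the rest, prefer the path best grounded in the context.
    Returns None when no sampled path reaches the gold answer.
    """
    candidates: List[Candidate] = [
        sample_fn(context, question) for _ in range(num_samples)
    ]
    correct = [c for c in candidates if normalize(c[1]) == normalize(gold_answer)]
    if not correct:
        return None
    return max(correct, key=lambda c: grounding_score(c[0], context))


def make_stub_sampler(candidates: List[Candidate]) -> Callable[[str, str], Candidate]:
    """Stand-in for a model call: cycles through fixed candidates."""
    it = itertools.cycle(candidates)

    def sample(context: str, question: str) -> Candidate:
        return next(it)

    return sample
```

In a training pipeline along these lines, the selected path and answer would become the supervised fine-tuning target for that example, whereas an outcome supervision baseline would train on the answer alone.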

