CL · Feb 9

Document Reconstruction Unlocks Scalable Long-Context RLVR

Yao Xiao, Lei Wang, Yue Deng, Guanzheng Chen, Ziqi Jin, Jung-jae Kim, Xiaoli Li, Roy Ka-wei Lee, Lidong Bing

arXiv:2602.08237v1 · 0.6 · h-index: 6

Originality: Incremental advance

AI Analysis

For researchers working on long-context reinforcement learning, this paper addresses scalability: it offers an incremental method that reduces reliance on expensive annotations.

The paper aims to improve the long-context capabilities of Large Language Models without costly human annotations or teacher supervision. It uses an unsupervised reinforcement learning approach in which models reconstruct documents that have missing paragraphs. It reports noticeable gains on the RULER benchmark and a reasonable improvement on LongBench v2, supported by ablation studies on reward design and data scaling.

Reinforcement Learning with Verifiable Rewards (RLVR) has become a prominent paradigm for enhancing the capabilities (e.g., long-context) of Large Language Models (LLMs). However, it often relies on gold-standard answers or explicit evaluation rubrics provided by powerful teacher models or human experts, which are costly and time-consuming to obtain. In this work, we investigate unsupervised approaches to enhance the long-context capabilities of LLMs, eliminating the need for heavy human annotation or supervision from teacher models. Specifically, we first replace a few paragraphs in a long document with special placeholders. LLMs are then trained through reinforcement learning to reconstruct the document by correctly identifying and sequencing the missing paragraphs from a set of candidate options. This training paradigm enables the model to capture global narrative coherence, significantly boosting long-context performance. We validate the effectiveness of our method on two widely used benchmarks, RULER and LongBench v2. The method yields noticeable gains on RULER and also achieves a reasonable improvement on LongBench v2, without any manually curated long-context QA data. Furthermore, we conduct extensive ablation studies to analyze the impact of reward design, data curation strategies, training schemes, and data scaling on model performance. We publicly release our code, data, and models.
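The reconstruction task described in the abstract can be illustrated with a minimal sketch. This is not the authors' released code: the placeholder format `<MISSING_k>`, the optional distractor candidates, the function names, and both reward variants are assumptions made for illustration. The abstract only states that paragraphs are replaced with placeholders, that the model selects and orders them from candidate options, and that reward design is ablated.

```python
import random
from dataclasses import dataclass


@dataclass
class ReconstructionTask:
    masked_document: str   # document text with placeholders for removed paragraphs
    candidates: list       # shuffled candidate paragraphs shown to the model
    answer: list           # answer[k] = candidate index that fills placeholder k+1


def build_task(paragraphs, num_masked, rng, distractors=()):
    """Mask `num_masked` paragraphs and build a shuffled candidate set.

    Hypothetical construction: the placeholder naming and the use of
    distractors are assumptions, not details taken from the paper.
    """
    positions = sorted(rng.sample(range(len(paragraphs)), num_masked))
    pool = [paragraphs[p] for p in positions] + list(distractors)

    # Shuffle candidates so their order carries no positional hint.
    order = list(range(len(pool)))
    rng.shuffle(order)
    candidates = [pool[i] for i in order]

    # Map each original pool index to its new index after shuffling.
    new_index = {orig: new for new, orig in enumerate(order)}
    answer = [new_index[k] for k in range(num_masked)]

    masked = list(paragraphs)
    for k, p in enumerate(positions):
        masked[p] = f"<MISSING_{k + 1}>"
    return ReconstructionTask("\n\n".join(masked), candidates, answer)


def exact_match_reward(predicted, answer):
    """Verifiable reward (assumed variant): 1.0 only for a fully correct sequence."""
    return 1.0 if list(predicted) == list(answer) else 0.0


def positionwise_reward(predicted, answer):
    """Softer assumed variant: fraction of placeholders filled correctly."""
    if len(predicted) != len(answer):
        return 0.0
    return sum(p == a for p, a in zip(predicted, answer)) / len(answer)
```

Because the answer key is derived from the document itself, the reward can be checked automatically without human labels or a teacher model, which is the property the abstract relies on. Which reward shape works best is an empirical question that the paper's ablations address; the two functions above are only plausible candidates.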
