LG · AI · Apr 4

Design Conditions for Intra-Group Learning of Sequence-Level Rewards: Token Gradient Cancellation

arXiv:2604.13088 · 19.9 · h-index: 2
Predicted impact top 22% in LG · last 90 days · Originality Incremental advance
AI Analysis

Provides a design principle for reinforcement learning fine-tuning of reasoning models, addressing training instability issues like entropy collapse.

The paper identifies a necessary condition for intra-group learning of sequence-level rewards to prevent reward-irrelevant drift, and proposes minimal transformations to restore gradient cancellation on weak-credit tokens, achieving stabilized training and improved sample efficiency.

Under sparse terminal rewards, intra-group comparison has become the dominant paradigm for fine-tuning reasoning models via reinforcement learning. However, prolonged training often leads to problems such as the accumulation of ineffective updates (a "learning tax"), drift in solution probabilities, and entropy collapse. From a token-level credit assignment perspective, this paper presents a necessary condition for algorithm design: to prevent reward-irrelevant drift, intra-group objectives must maintain gradient exchangeability across token updates, which enables gradient cancellation on weak-credit/high-frequency tokens. We show that two common mechanisms that disrupt exchangeability make "non-cancellation" the structural norm. Based on this, we propose minimal intra-group transformations that restore or approximate the cancellation structure in the shared token space. Experimental results show that these transformations stabilize training, improve sample efficiency, and enhance final performance, supporting the value of this design condition.
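The cancellation idea in the abstract can be illustrated with a toy calculation. The sketch below is not the paper's method: it assumes group-mean-centered advantages and a single token that contributes the same gradient direction in every response of the group, so only the sum of (advantage × per-token weight) matters. The abstract does not name the two mechanisms that break exchangeability; per-response length normalization is used here purely as a hypothetical example of a weight that differs across responses. All function names, rewards, and lengths are made up for illustration.

```python
from typing import List


def centered_advantages(rewards: List[float]) -> List[float]:
    """Group-mean-centered advantages; they sum to zero within the group."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def net_token_coefficient(advantages: List[float], weights: List[float]) -> float:
    """Net scalar multiplying the shared token's gradient direction.

    Toy assumption: the token contributes the same gradient direction in
    every response, so only sum(advantage * weight) determines the update.
    """
    return sum(a * w for a, w in zip(advantages, weights))


# Hypothetical group of four sampled responses.
rewards = [1.0, 0.0, 0.0, 1.0]   # sparse terminal rewards (illustrative)
lengths = [10, 40, 20, 80]       # response lengths in tokens (illustrative)

adv = centered_advantages(rewards)

# Identical per-token weight in every response: contributions cancel.
uniform = net_token_coefficient(adv, [1.0] * len(adv))

# Assumed example of a response-dependent weight (1 / length): the
# contributions no longer cancel, so the shared token receives a net
# update even though it carries no reward signal in this toy setup.
length_normalized = net_token_coefficient(adv, [1.0 / n for n in lengths])

print(f"uniform weights:           {uniform:+.5f}")
print(f"length-normalized weights: {length_normalized:+.5f}")
```

With identical weights the net coefficient is zero because the centered advantages sum to zero; with response-dependent weights it generally is not, which is one way to picture the "non-cancellation" the abstract describes.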

