LG · AI · CL · Feb 25

GradAlign: Gradient-Aligned Data Selection for LLM Reinforcement Learning

CMU
arXiv:2602.21492v13 citations · h-index: 34 · Has Code
Originality: Incremental advance
AI Analysis

This paper addresses inefficient RL training for LLMs caused by poor data selection. The proposed method is an incremental advance, but it improves over heuristic baselines.

The paper tackles the sensitivity of reinforcement learning (RL) for large language models (LLMs) to training data quality. It proposes GradAlign, a gradient-aligned data selection method that prioritizes problems whose gradients align with validation gradients, resulting in more stable training and improved performance across challenging data regimes.

Reinforcement learning (RL) has become a central post-training paradigm for large language models (LLMs), but its performance is highly sensitive to the quality of training problems. This sensitivity stems from the non-stationarity of RL: rollouts are generated by an evolving policy, and learning is shaped by exploration and reward feedback, unlike supervised fine-tuning (SFT) with fixed trajectories. As a result, prior work often relies on manual curation or simple heuristic filters (e.g., accuracy), which can admit incorrect or low-utility problems. We propose GradAlign, a gradient-aligned data selection method for LLM reinforcement learning that uses a small, trusted validation set to prioritize training problems whose policy gradients align with validation gradients, yielding an adaptive curriculum. We evaluate GradAlign across three challenging data regimes: unreliable reward signals, distribution imbalance, and a low-utility training corpus. GradAlign consistently outperforms existing baselines, yielding more stable training and improved final performance, which underscores the importance of directional gradient signals in navigating non-stationary policy optimization. We release our implementation at https://github.com/StigLidu/GradAlign
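The selection idea described above can be illustrated with a minimal sketch. This is not the released implementation: the abstract does not state the exact scoring rule, so cosine similarity against the mean validation gradient, the top-k cutoff, the function names (`select_aligned`, `cosine`, `mean_vector`), and the toy three-dimensional gradient vectors are all assumptions made for illustration.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    norm_u, norm_v = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot(u, v) / (norm_u * norm_v)


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def select_aligned(train_grads, val_grads, k):
    """Rank training problems by alignment with the validation gradient.

    train_grads: dict mapping a problem id to its (flattened) policy gradient.
    val_grads:   list of gradients from the small trusted validation set.
    Returns the k best-aligned problem ids and all scores.
    Assumption: alignment is measured by cosine similarity.
    """
    val_mean = mean_vector(val_grads)
    scores = {pid: cosine(g, val_mean) for pid, g in train_grads.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Toy gradients (hypothetical values, not taken from the paper).
train_grads = {
    "problem_a": [1.0, 0.5, 0.0],   # roughly aligned with validation
    "problem_b": [-1.0, 0.0, 0.2],  # points away, e.g. an unreliable reward
    "problem_c": [0.0, 0.0, 0.0],   # no learning signal
    "problem_d": [0.8, 0.9, 0.1],   # closely aligned with validation
}
val_grads = [[1.0, 1.0, 0.0], [0.8, 1.2, 0.0]]

selected, scores = select_aligned(train_grads, val_grads, k=2)
print(selected)
```

In an actual RL run the gradients would presumably be recomputed as the policy changes, so the ranking would shift over training; that repeated re-scoring is one way to read the "adaptive curriculum" in the abstract.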

