LG CL · Aug 28, 2025

Mirage or Method? How Model-Task Alignment Induces Divergent RL Conclusions

Haoze Wu, Cheng Wang, Wenshuo Zhao, Junxian He

arXiv:2508.21188v215.76 citations · h-index: 7

Originality: Incremental advance

AI Analysis

This work reconciles inconsistent findings in RL for large language models by showing that many reported phenomena depend on the model-task pairing and do not generalize beyond it.

The paper investigates why certain counterintuitive reinforcement learning (RL) claims, such as single-example training matching full-dataset performance, hold only when pretrained models already have strong model-task alignment, as measured by pass@k accuracy, and fail in more challenging settings where standard RL remains robust.

Recent advances in applying reinforcement learning (RL) to large language models (LLMs) have led to substantial progress. In particular, a series of remarkable yet often counterintuitive phenomena have been reported in LLMs, exhibiting patterns not typically observed in traditional RL settings. For example, notable claims include that a single training example can match the performance achieved with an entire dataset, that the reward signal does not need to be very accurate, and that training solely with negative samples can match or even surpass sophisticated reward-based methods. However, the precise conditions under which these observations hold - and, critically, when they fail - remain unclear. In this work, we identify a key factor that differentiates RL observations: whether the pretrained model already exhibits strong Model-Task Alignment, as measured by pass@k accuracy on the evaluated task. Through a systematic and comprehensive examination of a series of counterintuitive claims, supported by rigorous experimental validation across different model architectures and task domains, our findings show that while standard RL training remains consistently robust across settings, many of these counterintuitive results arise only when the model and task already exhibit strong model-task alignment. In contrast, these techniques fail to drive substantial learning in more challenging regimes, where standard RL methods remain effective.
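The abstract uses pass@k accuracy as its measure of Model-Task Alignment but does not say how it is computed. As a rough illustration only, the sketch below uses the commonly cited unbiased estimator, 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them are correct. Whether the paper uses this exact estimator is an assumption, and the function names `pass_at_k` and `mean_pass_at_k` are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n sampled answers, c of them correct.

    Assumption: the standard unbiased estimator 1 - C(n-c, k) / C(n, k);
    the abstract does not state which estimator the authors use.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k incorrect samples: every size-k subset contains a correct one.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains no correct sample,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs per problem."""
    if not results:
        raise ValueError("results must be non-empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

Under this reading, a model-task pair whose mean pass@k is already high before RL would count as strongly aligned, which is the regime where the abstract says the counterintuitive results appear.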
