LG · CL · May 24

Directional Alignment Mitigates Reward Hacking in Reinforcement Learning for Language Models

arXiv:2605.25189
AI Analysis

For practitioners training language models with RL, this work provides a method to mitigate reward hacking, a critical failure mode that undermines alignment with intended tasks.

The paper argues that reward hacking in RL for language models arises when optimization drifts away from a stable low-dimensional learning trajectory. It introduces trusted-direction projection, which constrains gradients to a clean reference subspace. In reward-hacking experiments on mathematical reasoning, this delays shortcut exploitation and better preserves task performance.

Reward hacking arises when a model improves a proxy reward by exploiting shortcuts rather than solving the intended task. We study this failure mode through the geometry of reinforcement learning updates in language models and argue that hacking emerges when optimization drifts away from a stable low-dimensional learning trajectory. We analyze this drift through dominant singular directions of parameter updates and show that reward-hacking runs exhibit substantially larger directional change than clean runs. Motivated by this observation, we introduce trusted-direction projection, which constrains gradients to remain within a clean reference subspace. Across reward-hacking experiments on mathematical reasoning, the proposed approach delays shortcut exploitation and better preserves task performance.
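
The projection idea in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not say how the clean reference subspace is built, so this sketch assumes it is spanned by the top-k right singular vectors of a matrix of flattened parameter updates from a clean run. The function names (`trusted_basis`, `project_gradient`, `directional_change`) and the parameter `k` are hypothetical and do not come from the paper.

```python
import numpy as np


def trusted_basis(clean_updates: np.ndarray, k: int) -> np.ndarray:
    """Orthonormal basis (d x k) for the dominant directions of clean updates.

    clean_updates has shape (n_steps, d): one flattened update per row.
    Assumption: the "clean reference subspace" is the span of the top-k
    right singular vectors of this matrix.
    """
    _, _, vt = np.linalg.svd(clean_updates, full_matrices=False)
    return vt[:k].T


def project_gradient(grad: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Orthogonally project a flattened gradient onto span(basis)."""
    return basis @ (basis.T @ grad)


def directional_change(u: np.ndarray, v: np.ndarray) -> float:
    """1 - |cosine similarity| between two nonzero direction vectors.

    0 means the directions coincide up to sign; 1 means they are orthogonal.
    This is one possible drift measure, chosen here for illustration.
    """
    cos = abs(float(u @ v)) / (np.linalg.norm(u) * np.linalg.norm(v))
    return 1.0 - cos
```

In this sketch, a gradient component that lies outside the reference subspace is simply discarded before the optimizer step. Whether the paper projects raw gradients, per-layer updates, or something else is not stated in the abstract.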

