SR-GRPO: Stable Rank as an Intrinsic Geometric Reward for Large Language Model Alignment
The work addresses the challenge of scalable alignment for LLMs by reducing reliance on scarce human annotations and on reward models that are vulnerable to reward hacking. The contribution is incremental, however, because it builds on existing geometric insights.
The paper tackles the problem of aligning large language models with human preferences without external supervision. It proposes stable rank, an intrinsic quality signal derived from model representations, which achieves 84.04% accuracy on RewardBench and, when used for Best-of-N sampling, improves task accuracy by an average of 11.3 percentage points over greedy decoding.
Aligning Large Language Models (LLMs) with human preferences typically relies on external supervision, which faces critical limitations: human annotations are scarce and subjective, reward models are vulnerable to reward hacking, and self-evaluation methods suffer from prompt sensitivity and biases. In this work, we propose stable rank, an intrinsic, annotation-free quality signal derived from model representations. Stable rank measures the effective dimensionality of hidden states by computing the ratio of total variance to dominant-direction variance, capturing quality through how information distributes across representation dimensions. Empirically, stable rank achieves 84.04% accuracy on RewardBench and improves task accuracy by an average of 11.3 percentage points over greedy decoding via Best-of-N sampling. Leveraging this insight, we introduce Stable Rank Group Relative Policy Optimization (SR-GRPO), which uses stable rank as a reward signal for reinforcement learning. Without external supervision, SR-GRPO improves Qwen2.5-1.5B-Instruct by 10% on STEM and 19% on mathematical reasoning, outperforming both learned reward models and self-evaluation baselines. Our findings demonstrate that quality signals can be extracted from internal model geometry, offering a path toward scalable alignment without external supervision.
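The abstract describes stable rank as the ratio of total variance to dominant-direction variance of the hidden states. A minimal sketch of that quantity follows, assuming it corresponds to the standard matrix stable rank, i.e. the squared Frobenius norm divided by the squared spectral norm of a (tokens x hidden_dim) matrix. The function names `stable_rank`, `best_of_n`, and `group_relative_advantages` are hypothetical, and the choice of layer, any centering or normalization of the hidden states, and the exact reward shaping used in SR-GRPO are not given in the text above, so none of them are modeled here. The advantage function shows the usual GRPO group normalization, on the assumption that SR-GRPO plugs stable rank in as the reward without changing that step.

```python
import numpy as np


def stable_rank(hidden_states: np.ndarray) -> float:
    """Stable rank of a (tokens x hidden_dim) matrix.

    Assumed definition: sum of squared singular values (total variance)
    divided by the largest squared singular value (dominant direction).
    """
    h = np.asarray(hidden_states, dtype=np.float64)
    # Singular values are returned in descending order.
    singular_values = np.linalg.svd(h, compute_uv=False)
    top = singular_values[0] ** 2
    if top == 0.0:
        # An all-zero matrix carries no information; report 0 by convention.
        return 0.0
    return float(np.sum(singular_values ** 2) / top)


def best_of_n(candidate_hidden_states):
    """Pick the candidate whose hidden states have the highest stable rank.

    `candidate_hidden_states` is a list of matrices, one per sampled response.
    Returns (index of the selected candidate, list of scores).
    """
    scores = [stable_rank(h) for h in candidate_hidden_states]
    return int(np.argmax(scores)), scores


def group_relative_advantages(rewards, eps: float = 1e-8) -> np.ndarray:
    """Standard GRPO-style normalization of rewards within one sampled group."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)
```

In this sketch a rank-one matrix scores 1 and a matrix whose singular values are all equal scores its full rank, which matches the reading of stable rank as an effective dimensionality. In practice the hidden states would come from a forward pass of the policy model over each sampled response.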