LG AI Mar 3

Why Does RLAIF Work At All?

arXiv:2603.03000v1
h-index: 1

Originality: Highly original

AI Analysis

This paper provides a foundational theoretical account of RLAIF, addressing a key problem in AI alignment for researchers and practitioners, though its contribution is incremental in that it formalizes existing intuitions.

The paper addresses the lack of a theoretical explanation for why Reinforcement Learning from AI Feedback (RLAIF) works in aligning language models. It proposes the latent value hypothesis: pretraining encodes human values as directions in representation space. Under this hypothesis, it shows that RLAIF improves alignment when constitutional prompts activate value-relevant directions better than default generation does.

Reinforcement Learning from AI Feedback (RLAIF) enables language models to improve by training on their own preference judgments, yet no theoretical account explains why this self-improvement seemingly works for value learning. We propose the latent value hypothesis: pretraining on internet-scale data encodes human values as directions in representation space, and constitutional prompts elicit these latent values into preference judgments. We formalize this intuition under a linear model where the constitution acts as a projection operator selecting value-relevant directions. Our analysis yields several results: RLAIF improves alignment when the constitution-activated direction correlates with true values better than the model's default generation direction, thus explaining the generation-judgment gap; the ceiling on RLAIF quality is determined by how well representations encode values, which scales with model capacity; and adversarial constitutions exist that can activate anti-social value directions encoded from harmful pretraining data. Our account unifies scattered empirical findings including the refusal direction, low-rank safety subspaces, and RLAIF scaling behavior.
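
The improvement condition stated in the abstract can be illustrated with a toy calculation. The sketch below is not the paper's formalization. It assumes a three-dimensional representation space, measures "correlation with true values" as cosine similarity, and models the constitution as an orthogonal projection onto a subspace given by an orthonormal basis. All vectors and the names `v_true`, `g_default`, `h`, `project`, and `rlaif_improves` are hypothetical and chosen only for illustration.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity, used here as a stand-in for 'correlation'."""
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))


def project(h, basis):
    """Orthogonally project h onto the span of `basis`.

    Assumes the basis vectors are orthonormal (an assumption of this sketch).
    """
    out = [0.0] * len(h)
    for b in basis:
        coeff = dot(h, b)
        out = [o + coeff * x for o, x in zip(out, b)]
    return out


def rlaif_improves(h, basis, g_default, v_true):
    """True if the constitution-activated direction aligns with v_true
    better than the default generation direction does."""
    return cosine(project(h, basis), v_true) > cosine(g_default, v_true)


# Hypothetical toy vectors, not taken from the paper.
v_true = [1.0, 0.0, 0.0]      # assumed "true value" direction
g_default = [0.6, 0.8, 0.0]   # assumed default generation direction
h = [0.9, 0.3, 0.3]           # assumed internal representation

# A "constitution" that selects the first and third coordinates.
constitution = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
# A "constitution" that selects only a value-irrelevant coordinate.
irrelevant = [[0.0, 1.0, 0.0]]

align_default = cosine(g_default, v_true)
align_constitution = cosine(project(h, constitution), v_true)

print("default alignment:     ", round(align_default, 3))
print("constitution alignment:", round(align_constitution, 3))
print("improves (value-relevant):  ", rlaif_improves(h, constitution, g_default, v_true))
print("improves (value-irrelevant):", rlaif_improves(h, irrelevant, g_default, v_true))
```

In this toy setting, the gap between `align_constitution` and `align_default` plays the role of the generation-judgment gap: self-judgment helps only when the selected subspace is better aligned with `v_true` than the default direction is.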
