ML LG · May 13

The Mechanism of Weak-to-Strong Generalization: Feature Elicitation from Latent Knowledge

arXiv:2605.12908 · 96.9

AI Analysis

Provides theoretical justification for weak-to-strong generalization, which has been proposed as an approach to aligning superhuman AI systems.

This paper proves that weak-to-strong generalization occurs in a feature-learning regime for two-layer neural networks: the strong model efficiently learns the target task while retaining general capabilities, whereas standard supervised fine-tuning causes catastrophic forgetting when off-target feature directions are correlated with the target's.

Weak-to-strong (W2S) generalization, in which a strong model is fine-tuned on outputs of a weaker, task-specialized model, has been proposed as an approach to aligning superhuman AI systems. Existing theoretical analyses either fix the student's representations or operate in restricted settings. Whether multi-step SGD can succeed in feature learning while preserving diverse pre-trained capabilities remains open. We study W2S in the setting of reward-model learning with two-layer neural networks. The strong model has pre-trained representations organized into low-dimensional subspaces $V_k$, and is fine-tuned under the supervision of a weak model specialized on task $κ$. We prove that the strong model efficiently learns task $κ$, eliciting its pre-trained knowledge while retaining general capabilities. This establishes W2S generalization in the feature-learning regime, in the sense that the strong model acquires the target feature direction through W2S training, rather than having it given a priori. Moreover, W2S preserves pre-trained off-target features, whereas standard supervised fine-tuning causes catastrophic forgetting when off-target feature directions are correlated with the target's. Numerical experiments on synthetic data confirm our theoretical results.
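The W2S training loop described in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's construction: the data distribution, the noisy linear weak teacher, the `tanh` activation, the squared loss, and every name and hyperparameter below (`w2s_finetune`, `w_weak`, `width`, `lr`, and so on) are assumptions chosen only to show a two-layer student being trained by multi-step minibatch SGD on a weaker model's outputs instead of ground-truth labels.

```python
import numpy as np


def two_layer(x, W, a):
    """Two-layer network a^T tanh(W x), applied row-wise to x."""
    return np.tanh(x @ W.T) @ a


def w2s_finetune(n=512, d=8, width=16, steps=300, lr=0.05, batch=32, seed=0):
    """Train a two-layer student on weak-teacher outputs with minibatch SGD.

    Returns the student's mean squared error against the weak labels
    before and after training. All settings are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)

    # Hypothetical target feature direction for the task.
    v = np.zeros(d)
    v[0] = 1.0

    # Weak teacher (assumption): a noisy copy of the target direction.
    w_weak = v + 0.3 * rng.standard_normal(d)

    x = rng.standard_normal((n, d))
    weak_labels = np.tanh(x @ w_weak)  # the student only ever sees these

    # Student parameters: first-layer weights W and second-layer weights a.
    W = rng.standard_normal((width, d)) / np.sqrt(d)
    a = rng.standard_normal(width) / np.sqrt(width)

    def mse():
        return float(np.mean((two_layer(x, W, a) - weak_labels) ** 2))

    loss_before = mse()
    for _ in range(steps):
        idx = rng.integers(0, n, size=batch)
        xb, yb = x[idx], weak_labels[idx]
        h = np.tanh(xb @ W.T)              # hidden activations, (batch, width)
        err = h @ a - yb                   # residuals, (batch,)
        # Gradients of 0.5 * mean(err^2) with respect to a and W.
        grad_a = h.T @ err / batch
        grad_W = ((err[:, None] * a[None, :]) * (1.0 - h ** 2)).T @ xb / batch
        # Both layers are updated, so the first-layer features move too.
        a -= lr * grad_a
        W -= lr * grad_W
    return loss_before, mse()


before, after = w2s_finetune()
print(f"MSE against weak labels: before={before:.3f}, after={after:.3f}")
```

Updating `W` as well as `a` is what distinguishes this from a fixed-representation analysis, where only the second layer would be trained. The sketch does not model the pre-trained subspaces $V_k$ or off-target features, so it says nothing about retention or forgetting.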
