ML LG May 13

The Mechanism of Weak-to-Strong Generalization: Feature Elicitation from Latent Knowledge

arXiv:2605.12908 96.9
AI Analysis

Provides theoretical justification for weak-to-strong generalization, a proposed approach to aligning superhuman AI systems.

This paper proves that weak-to-strong generalization occurs in the feature-learning regime for two-layer neural networks: the strong model efficiently learns the target task while retaining its general capabilities, whereas standard supervised fine-tuning causes catastrophic forgetting when off-target feature directions are correlated with the target's.

Weak-to-strong (W2S) generalization, in which a strong model is fine-tuned on outputs of a weaker, task-specialized model, has been proposed as an approach to aligning superhuman AI systems. Existing theoretical analyses either fix the student's representations or operate in restricted settings. Whether multi-step SGD can succeed in feature learning while preserving diverse pre-trained capabilities remains open. We study W2S in the setting of reward-model learning with two-layer neural networks. The strong model has pre-trained representations organized into low-dimensional subspaces $V_k$, and is fine-tuned under the supervision of a weak model specialized on task $\kappa$. We prove that the strong model efficiently learns task $\kappa$, eliciting its pre-trained knowledge while retaining general capabilities. This establishes W2S generalization in the feature-learning regime, in the sense that the strong model acquires the target feature direction through W2S training, rather than having it given a priori. Moreover, W2S preserves pre-trained off-target features, whereas standard supervised fine-tuning causes catastrophic forgetting when off-target feature directions are correlated with the target's. Numerical experiments on synthetic data confirm our theoretical results.
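
To make the setup in the abstract concrete, the following is a minimal NumPy sketch of W2S fine-tuning on synthetic data. It is a toy illustration under stated assumptions, not the paper's construction: the dimensions, the coordinate-block subspaces standing in for $V_k$, the ReLU activation, the squared loss, the fixed second layer, the noisy-direction weak model, and all function names are hypothetical choices made here.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 20, 32, 2000   # input dim, hidden width, sample count (hypothetical)
K, r = 4, 5              # K toy subspaces V_k, each spanning r coordinates (K * r == d)
kappa = 0                # index of the target task (hypothetical)

def relu(z):
    return np.maximum(z, 0.0)

def forward(W, a, X):
    """Two-layer network f(x) = a^T relu(W x), evaluated on the rows of X."""
    return relu(X @ W.T) @ a

def half_mse(W, a, X, y):
    return 0.5 * np.mean((forward(W, a, X) - y) ** 2)

# Toy "pre-trained" first layer: every neuron is a unit vector inside one V_k.
W0 = np.zeros((m, d))
for j in range(m):
    k = j % K
    v = rng.standard_normal(r)
    W0[j, k * r:(k + 1) * r] = v / np.linalg.norm(v)

# Fixed second layer; signs are balanced within each subspace.
a = np.where((np.arange(m) // K) % 2 == 0, 1.0, -1.0) / np.sqrt(m)

# Target feature direction u lies in V_kappa; the weak model only knows a noisy copy.
u = np.zeros(d)
u[kappa * r:(kappa + 1) * r] = rng.standard_normal(r)
u /= np.linalg.norm(u)
u_weak = u + 0.3 * rng.standard_normal(d) / np.sqrt(d)
u_weak /= np.linalg.norm(u_weak)

X = rng.standard_normal((n, d))
y_weak = relu(X @ u_weak)   # labels produced by the (assumed) weak model

def w2s_finetune(W, a, X, y, lr=0.02, steps=500, batch=64, seed=1):
    """Multi-step minibatch SGD on the squared loss against weak labels.
    Only the first-layer weights W are updated; a stays fixed."""
    sgd_rng = np.random.default_rng(seed)
    W = W.copy()
    for _ in range(steps):
        idx = sgd_rng.integers(0, X.shape[0], size=batch)
        Xb, yb = X[idx], y[idx]
        pre = Xb @ W.T                       # pre-activations, shape (batch, m)
        err = relu(pre) @ a - yb             # residuals, shape (batch,)
        # dL/dW_j = mean_i err_i * a_j * 1[pre_ij > 0] * x_i
        grad = ((err[:, None] * (pre > 0)) * a).T @ Xb / batch
        W -= lr * grad
    return W

W_ft = w2s_finetune(W0, a, X, y_weak)
loss_before = half_mse(W0, a, X, y_weak)
loss_after = half_mse(W_ft, a, X, y_weak)

# Diagnostic: how far the neurons initialized in each subspace moved.
drift = np.linalg.norm(W_ft - W0, axis=1)
for k in range(K):
    print(f"V_{k}: mean neuron drift = {drift[np.arange(m) % K == k].mean():.3f}")
print(f"loss on weak labels: {loss_before:.3f} -> {loss_after:.3f}")
```

The per-subspace drift printed at the end is only a rough proxy for the retention question the abstract raises. Comparing it against a run trained on ground-truth labels `relu(X @ u)` would be one way to probe the W2S versus supervised fine-tuning contrast in this toy, though nothing here reproduces the paper's guarantees.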
