CL AI

Jun 11, 2025

Unsupervised Elicitation of Language Models

Jiaxin Wen, Zachary Ankner, Arushi Somani, Peter Hase, Samuel Marks, Jacob Goldman-Wetzler, Linda Petrini, Henry Sleight, Collin Burns, He He, Shi Feng, Ethan Perez

Anthropic

arXiv:2506.10139v1

17.09 citations

h-index: 29

Originality: Highly original

AI Analysis

This paper addresses the problem of eliciting superhuman capabilities from language models, a concern for AI researchers and practitioners. It presents a novel approach rather than an incremental improvement.

The paper tackles the challenge of steering pretrained language models toward downstream tasks when their capabilities are superhuman and high-quality human supervision is therefore hard to obtain. It introduces an unsupervised algorithm, Internal Coherence Maximization (ICM), that fine-tunes models on their own generated labels without external supervision. On GSM8k-verification, TruthfulQA, and Alpaca reward modeling, the method matches training on golden supervision and outperforms training on crowdsourced human supervision. It also improves the training of frontier LMs: a reward model trained with ICM, and a Claude 3.5 Haiku-based assistant trained against it with reinforcement learning, both outperform their human-supervised counterparts.

To steer pretrained language models for downstream tasks, today's post-training paradigm relies on humans to specify desired behaviors. However, for models with superhuman capabilities, it is difficult or impossible to get high-quality human supervision. To address this challenge, we introduce a new unsupervised algorithm, Internal Coherence Maximization (ICM), to fine-tune pretrained language models on their own generated labels, without external supervision. On GSM8k-verification, TruthfulQA, and Alpaca reward modeling tasks, our method matches the performance of training on golden supervision and outperforms training on crowdsourced human supervision. On tasks where LMs' capabilities are strongly superhuman, our method can elicit those capabilities significantly better than training on human labels. Finally, we show that our method can improve the training of frontier LMs: we use our method to train an unsupervised reward model and use reinforcement learning to train a Claude 3.5 Haiku-based assistant. Both the reward model and the assistant outperform their human-supervised counterparts.
