AILGFeb 4, 2025

The Elicitation Game: Evaluating Capability Elicitation Techniques

arXiv:2502.02180v314 citationsh-index: 7ICML
Originality Incremental advance
AI Analysis

This work addresses the challenge of ensuring reliable capability evaluations for AI systems, which is crucial for regulation and deployment, though it is incremental in refining existing methods.

The paper tackles the problem of accurately evaluating AI system capabilities by assessing the effectiveness of various elicitation techniques, finding that prompting works for multiple-choice QA tasks while fine-tuning is necessary for code generation and improves evaluation trustworthiness.

Capability evaluations are required to understand and regulate AI systems that may be deployed or further developed. Therefore, it is important that evaluations provide an accurate estimation of an AI system's capabilities. However, in numerous cases, previously latent capabilities have been elicited from models, sometimes long after initial release. Accordingly, substantial efforts have been made to develop methods for eliciting latent capabilities from models. In this paper, we evaluate the effectiveness of capability elicitation techniques by intentionally training model organisms -- language models with hidden capabilities that are revealed by a password. We introduce a novel method for training model organisms, based on circuit-breaking, which is more robust to elicitation techniques than standard password-locked models. We focus on elicitation techniques based on prompting and activation steering, and compare these to fine-tuning methods. Prompting techniques can elicit the actual capability of both password-locked and circuit-broken model organisms in the MCQA setting, while steering fails to do so. For a code-generation task, only fine-tuning can elicit the hidden capabilities of our novel model organism. Additionally, our results suggest that combining techniques improves elicitation. Still, if possible, fine-tuning should be the method of choice to improve the trustworthiness of capability evaluations.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes