Adaptively profiling models with task elicitation
This work addresses a challenge facing AI researchers and practitioners: efficiently identifying and profiling model failure modes across domains such as forecasting and online harassment. It presents a new approach rather than an incremental improvement.
The paper tackles the problem that language model evaluations often fail to characterize consequential failure modes. It introduces task elicitation, a method that automatically builds new evaluations to profile model behavior. The method finds hundreds of natural-language tasks where frontier models exhibit systematic failures. For example, Sonnet 3.5 over-associates quantum computing and AGI, and o3-mini is prone to hallucination when fabrications are repeated in-context.
Language model evaluations often fail to characterize consequential failure modes, forcing experts to inspect outputs and build new benchmarks. We introduce task elicitation, a method that automatically builds new evaluations to profile model behavior. Task elicitation finds hundreds of natural-language tasks -- an order of magnitude more than prior work -- where frontier models exhibit systematic failures, in domains ranging from forecasting to online harassment. For example, we find that Sonnet 3.5 over-associates quantum computing and AGI and that o3-mini is prone to hallucination when fabrications are repeated in-context.