AI · CL · LG · Dec 5, 2024

MISR: Measuring Instrumental Self-Reasoning in Frontier Models

arXiv:2412.03904v1 · 1 citation · h-index: 1 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses the need for measuring instrumental self-reasoning in AI agents, which is crucial for assessing risks like deceptive alignment, but it is incremental as it builds on prior evaluations by extending them to agentic tasks.

The authors tackled the problem of evaluating instrumental self-reasoning ability in large language model agents, finding that this ability emerges only in the most capable frontier models and is highly context-dependent, with no model passing the most difficult evaluations.

We propose a suite of tasks to evaluate the instrumental self-reasoning ability of large language model (LLM) agents. Instrumental self-reasoning ability could improve adaptability and enable self-modification, but it could also pose significant risks, such as enabling deceptive alignment. Prior work has only evaluated self-reasoning in non-agentic settings or in limited domains. In this paper, we propose evaluations for instrumental self-reasoning ability in agentic tasks in a wide range of scenarios, including self-modification, knowledge seeking, and opaque self-reasoning. We evaluate agents built using state-of-the-art LLMs, including commercial and open source systems. We find that instrumental self-reasoning ability emerges only in the most capable frontier models and that it is highly context-dependent. No model passes the most difficult versions of our evaluations, hence our evaluation can be used to measure increases in instrumental self-reasoning ability in future models. We open-source our evaluations at https://github.com/kaifronsdal/Self-Reasoning-Evals.

Code Implementations: 1 repo
