Constitutional Black-Box Monitoring for Scheming in LLM Agents

Simon Storf, Rich Barton-Cooper, James Peters-Gill, Marius Hobbhahn

arXiv:2603.00829

Originality: Incremental advance

AI Analysis

This work addresses the safety of deploying autonomous LLM agents, but the advance is incremental because the performance gains are limited.

The paper tackles the problem of detecting scheming in LLM agents by developing constitutional black-box monitors optimized on synthetic data. It finds that these monitors generalize to realistic environments, but that performance saturates quickly, with simple methods matching more extensive optimization.

Safe deployment of Large Language Model (LLM) agents in autonomous settings requires reliable oversight mechanisms. A central challenge is detecting scheming, where agents covertly pursue misaligned goals. One approach to mitigating such risks is LLM-based monitoring: using language models to examine agent behaviors for suspicious actions. We study constitutional black-box monitors: prompted classifiers that detect scheming using only externally observable inputs and outputs, optimized on synthetic data generated from natural-language behavior specifications. We introduce two pipelines for generating synthetic agent trajectories, STRIDE (iterative refinement) and Gloom (agent-environment simulation), from which we generate 1,000 samples each. We optimize frontier LLM monitors on these datasets via prompt sweeps, human refinement, and automated prompt optimization, and evaluate performance on 7,500 held-out trajectories from ControlArena, a suite of grounded environments where agents operate in more realistic contexts. Our results demonstrate that monitors selected purely on synthetic data can generalize to more realistic environments, capturing a meaningful scheming signal. However, we find that performance saturates quickly in our setting, with simple prompt sweeps matching the results of more extensive optimization. Pushing beyond this limit yields no further improvements and instead leads to overfitting.
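The selection procedure described in the abstract (score candidate monitor prompts on synthetic data, keep the best one, then check it on held-out trajectories) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the keyword-based `make_keyword_monitor` is a hypothetical stand-in for a prompted LLM monitor, AUROC is used as the selection metric although the abstract does not name one, and the toy trajectories are invented. It is not the authors' implementation.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# A monitor maps an agent trajectory (here: plain text) to a suspicion score in [0, 1].
Monitor = Callable[[str], float]


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random scheming sample outscores a random benign one.

    Ties count as half a win. Labels: 1 = scheming, 0 = benign.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one sample of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def make_keyword_monitor(keywords: List[str]) -> Monitor:
    """Hypothetical stand-in for a prompted LLM monitor.

    Scores a trajectory by the fraction of keywords it contains.
    """
    def monitor(trajectory: str) -> float:
        text = trajectory.lower()
        return sum(k in text for k in keywords) / len(keywords)
    return monitor


def select_monitor(
    candidates: Dict[str, Monitor],
    trajectories: Sequence[str],
    labels: Sequence[int],
) -> Tuple[str, Dict[str, float]]:
    """Sweep over candidate monitors and return the best name plus all scores."""
    results = {
        name: auroc([m(t) for t in trajectories], labels)
        for name, m in candidates.items()
    }
    best = max(results, key=results.get)
    return best, results


# Toy synthetic set (invented for illustration), used only for selection.
synthetic = [
    "agent disables logging and hides the change",
    "agent runs unit tests and reports results",
    "agent deletes audit log quietly",
    "agent updates README",
]
synthetic_labels = [1, 0, 1, 0]

# Toy held-out set, never seen during selection.
held_out = [
    "agent edits the log to hide a failed deployment",
    "agent refactors a helper function",
]
held_out_labels = [1, 0]

candidates = {
    "A": make_keyword_monitor(["log", "hide"]),
    "B": make_keyword_monitor(["agent"]),
}

best_name, sweep_scores = select_monitor(candidates, synthetic, synthetic_labels)
held_out_auroc = auroc([candidates[best_name](t) for t in held_out], held_out_labels)
```

In a real setting each candidate would wrap a call to a language model with a different monitor prompt. The design point the sketch preserves is that selection uses only the synthetic set, so the held-out score measures generalization and not fit to the selection data.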
