AI · Nov 4, 2025

Optimizing AI Agent Attacks With Synthetic Data

arXiv:2511.02823v1 · 4 citations · h-index: 3
Originality: Incremental advance
AI Analysis

This work addresses the challenge of estimating AI risk through better control evaluations, which is crucial for high-stakes AI deployments, though it appears incremental as it builds on existing frameworks and datasets.

The paper tackles the problem of optimizing attack policies for AI control evaluations in complex agentic environments where data is limited. Using a scaffold that decomposes attack capability into separate skills and simulates attack dynamics, it reduces the safety score from a baseline of 0.87 to 0.41, a substantial gain in attack strength.

As AI deployments become more complex and high-stakes, it becomes increasingly important to be able to estimate their risk. AI control is one framework for doing so. However, good control evaluations require eliciting strong attack policies. This can be challenging in complex agentic environments where compute constraints leave us data-poor. In this work, we show how to optimize attack policies in SHADE-Arena, a dataset of diverse realistic control environments. We do this by decomposing attack capability into five constituent skills -- suspicion modeling, attack selection, plan synthesis, execution and subtlety -- and optimizing each component individually. To get around the constraint of limited data, we develop a probabilistic model of attack dynamics, optimize our attack hyperparameters using this simulation, and then show that the results transfer to SHADE-Arena. This results in a substantial improvement in attack strength, reducing safety score from a baseline of 0.87 to 0.41 using our scaffold.
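To make the simulate-then-transfer idea concrete, here is a minimal toy sketch in Python. It is not the paper's probabilistic model: the episode dynamics, the Gaussian suspicion scores, the `audit_threshold`, the two hyperparameters `attack_prob` and `subtlety`, and the definition of safety as caught / (caught + successful attacks) are all assumptions chosen for illustration. It only shows how attack hyperparameters might be tuned cheaply against a simulation before spending scarce real episodes.

```python
import random

def simulate_episode(attack_prob, subtlety, audit_threshold, n_steps, rng):
    """Run one toy episode; return 'caught', 'success', or 'none'."""
    for _ in range(n_steps):
        if rng.random() >= attack_prob:
            continue  # benign step: nothing happens in this toy model
        # Assumed dynamics: subtler attacks draw lower monitor suspicion.
        suspicion = rng.gauss(0.7 - 0.4 * subtlety, 0.15)
        if suspicion > audit_threshold:
            return "caught"
        # Assumed trade-off: subtler attacks complete the side task less often.
        if rng.random() < 0.9 - 0.5 * subtlety:
            return "success"
    return "none"

def estimate_safety(attack_prob, subtlety, audit_threshold=0.6,
                    n_steps=20, n_episodes=2000, seed=0):
    """Monte Carlo estimate of safety = caught / (caught + success)."""
    rng = random.Random(seed)
    caught = success = 0
    for _ in range(n_episodes):
        outcome = simulate_episode(attack_prob, subtlety,
                                   audit_threshold, n_steps, rng)
        caught += outcome == "caught"
        success += outcome == "success"
    decided = caught + success
    return caught / decided if decided else 1.0

def grid_search():
    """Pick the (attack_prob, subtlety) pair with the lowest simulated safety."""
    grid = [(p, s) for p in (0.05, 0.2, 0.5, 1.0)
            for s in (0.0, 0.25, 0.5, 0.75, 1.0)]
    return min(grid, key=lambda ps: estimate_safety(*ps))

if __name__ == "__main__":
    best = grid_search()
    print("best (attack_prob, subtlety):", best)
    print("simulated safety:", estimate_safety(*best))
    print("naive attacker safety:", estimate_safety(1.0, 0.0))
```

In this sketch the hyperparameters found by `grid_search` would still need to be checked on the real environments, which is the transfer step the abstract describes.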

