AI | NE | Jan 20

AgenticRed: Optimizing Agentic Systems for Automated Red-teaming

arXiv:2601.13518v1 | 2 citations | h-index: 2
Originality: Highly original
AI Analysis

This work addresses the need for scalable and unbiased AI safety evaluation, offering a novel automated approach that significantly outperforms existing methods.

The paper tackles the problem of automating red-teaming for AI safety by introducing AgenticRed, which designs and refines red-teaming systems without human intervention. It reports up to 96% attack success rate on Llama-2-7B (a 36% improvement) and 100% on GPT-3.5-Turbo.

While recent automated red-teaming methods show promise for systematically exposing model vulnerabilities, most existing approaches rely on human-specified workflows. This dependence on manually designed workflows suffers from human biases and makes exploring the broader design space expensive. We introduce AgenticRed, an automated pipeline that leverages LLMs' in-context learning to iteratively design and refine red-teaming systems without human intervention. Rather than optimizing attacker policies within predefined structures, AgenticRed treats red-teaming as a system design problem. Inspired by methods like Meta Agent Search, we develop a novel procedure for evolving agentic systems using evolutionary selection, and apply it to the problem of automatic red-teaming. Red-teaming systems designed by AgenticRed consistently outperform state-of-the-art approaches, achieving 96% attack success rate (ASR) on Llama-2-7B (36% improvement) and 98% on Llama-3-8B on HarmBench. Our approach exhibits strong transferability to proprietary models, achieving 100% ASR on GPT-3.5-Turbo and GPT-4o-mini, and 60% on Claude-Sonnet-3.5 (24% improvement). This work highlights automated system design as a powerful paradigm for AI safety evaluation that can keep pace with rapidly evolving models.
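The abstract does not spell out the evolutionary procedure, so the following is only a generic sketch of evolutionary selection over candidate designs, not AgenticRed's actual algorithm. The names `evolve`, `toy_propose`, and `toy_score` are hypothetical. In the setting described above, the proposal step would presumably be an LLM generating a revised system design and the fitness would be a measured attack success rate; here both are replaced by toy numeric stand-ins so that the loop runs on its own.

```python
import random
from typing import Callable, List, Tuple, TypeVar

D = TypeVar("D")  # an opaque "design"; a plain float in the toy example below


def evolve(
    seeds: List[D],
    propose: Callable[[D, random.Random], D],
    score: Callable[[D], float],
    generations: int = 5,
    offspring_per_gen: int = 4,
    survivors: int = 2,
    seed: int = 0,
) -> List[Tuple[float, D]]:
    """Generic evolutionary selection: keep an archive of scored designs,
    pick parents from the current top performers, and add their offspring."""
    rng = random.Random(seed)
    archive = [(score(d), d) for d in seeds]
    for _ in range(generations):
        # Select the best designs seen so far as parents.
        archive.sort(key=lambda pair: pair[0], reverse=True)
        parents = archive[:survivors]
        for _ in range(offspring_per_gen):
            _, parent = rng.choice(parents)
            child = propose(parent, rng)  # assumption: stands in for an LLM proposal
            archive.append((score(child), child))
    archive.sort(key=lambda pair: pair[0], reverse=True)
    return archive


# Toy stand-ins (assumptions for illustration only).
def toy_score(x: float) -> float:
    return -(x - 3.0) ** 2  # best possible design is x = 3


def toy_propose(parent: float, rng: random.Random) -> float:
    return parent + rng.gauss(0.0, 1.0)  # small random variation of the parent


ranked = evolve([0.0], toy_propose, toy_score)
best_score, best_design = ranked[0]
print(f"best design {best_design:.3f} with score {best_score:.3f}")
```

Because the archive keeps every scored design, the best score can never fall below that of the best seed; whether the paper's procedure retains an archive in this way is not stated in the abstract.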

