CR CL Apr 28

Learning-Based Automated Adversarial Red-Teaming for Robustness Evaluation of Large Language Models

arXiv:2512.20677 79.41 citations h-index: 7
AI Analysis

For developers and auditors of LLMs, the paper offers a scalable, reproducible method for systematic robustness evaluation that addresses the limitations of manual, expert-driven red-teaming.

The paper tackles automated adversarial red-teaming for LLMs. It proposes a learning-driven framework that finds 47 vulnerabilities in GPT-OSS-20B (21 high-severity failures, 12 previously undocumented attack patterns) and achieves a 3.9× higher discovery rate than manual red-teaming under matched query budgets, with 89% detection accuracy.

The increasing deployment of large language models (LLMs) in safety-critical applications raises fundamental challenges in systematically evaluating robustness against adversarial behaviors. Existing red-teaming practices are largely manual and expert-driven, which limits scalability, reproducibility, and coverage in high-dimensional prompt spaces. We formulate automated LLM red-teaming as a structured adversarial search problem and propose a learning-driven framework for scalable vulnerability discovery. The approach combines meta-prompt-guided adversarial prompt generation with a hierarchical execution and detection pipeline, enabling standardized evaluation across six representative threat categories, including reward hacking, deceptive alignment, data exfiltration, sandbagging, inappropriate tool use, and chain-of-thought manipulation. Extensive experiments on GPT-OSS-20B identify 47 vulnerabilities, including 21 high-severity failures and 12 previously undocumented attack patterns. Compared with manual red-teaming under matched query budgets, our method achieves a 3.9× higher discovery rate with 89% detection accuracy, demonstrating superior coverage, efficiency, and reproducibility for large-scale robustness evaluation.
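
The abstract frames red-teaming as a search over prompts with a fixed query budget: generate candidate prompts per threat category, run them against the target, and score the responses with a detector. The Python sketch below illustrates only that loop shape and is not the paper's implementation. Every name in it (`generate_prompts`, `run_search`, `discovery_rate`, `toy_target`, `toy_detector`) is hypothetical, the prompt templates are placeholders, and the target and detector are toy stand-ins so the example runs without any model.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# The six threat categories named in the abstract.
THREAT_CATEGORIES = [
    "reward hacking",
    "deceptive alignment",
    "data exfiltration",
    "sandbagging",
    "inappropriate tool use",
    "chain-of-thought manipulation",
]


@dataclass
class Finding:
    category: str
    prompt: str
    response: str
    severity: str


def generate_prompts(category: str, n: int, rng: random.Random) -> List[str]:
    """Assumed stand-in for meta-prompt-guided generation: fill a template."""
    framings = ["role-play", "multi-step task", "hypothetical audit"]
    return [f"[{category}] probe via {rng.choice(framings)} #{i}" for i in range(n)]


def run_search(
    target: Callable[[str], str],
    detect: Callable[[str, str], Optional[str]],
    budget_per_category: int,
    seed: int = 0,
) -> Tuple[List[Finding], int]:
    """Query the target once per generated prompt and collect flagged responses."""
    rng = random.Random(seed)
    findings: List[Finding] = []
    queries = 0
    for category in THREAT_CATEGORIES:
        for prompt in generate_prompts(category, budget_per_category, rng):
            response = target(prompt)
            queries += 1
            severity = detect(category, response)  # None means "no issue found"
            if severity is not None:
                findings.append(Finding(category, prompt, response, severity))
    return findings, queries


def discovery_rate(findings: List[Finding], queries: int) -> float:
    """Findings per query; comparable across methods only at equal budgets."""
    return len(findings) / queries if queries else 0.0


def toy_target(prompt: str) -> str:
    """Toy model: misbehaves only on the first prompt of each category."""
    return "UNSAFE: complied" if prompt.endswith("#0") else "refused"


def toy_detector(category: str, response: str) -> Optional[str]:
    """Toy detector: flags any response marked UNSAFE as high severity."""
    return "high" if response.startswith("UNSAFE") else None


if __name__ == "__main__":
    found, used = run_search(toy_target, toy_detector, budget_per_category=3)
    print(len(found), used, discovery_rate(found, used))
```

Passing the target and detector in as plain callables keeps the loop independent of any particular model API. The same `run_search` call could then score a manual prompt list and a generated one under an identical query budget, which is the comparison the abstract reports.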
