AI · LG · Aug 8, 2025

LLM Robustness Leaderboard v1 -- Technical report

arXiv:2508.06296v2 · h-index: 8
Originality: Incremental advance
AI Analysis

This work addresses the need for standardized robustness assessment in AI safety, providing a practical framework for distributed evaluation, though it is incremental in refining existing red-teaming approaches.

The authors tackled the problem of evaluating LLM robustness by introducing an automated red-teaming tool that achieved 100% attack success rate against 37 out of 41 state-of-the-art models, and they proposed a fine-grained metric showing attack difficulty varies by over 300-fold across models.

This technical report accompanies the LLM robustness leaderboard published by PRISM Eval for the Paris AI Action Summit. We introduce PRISM Eval Behavior Elicitation Tool (BET), an AI system performing automated red-teaming through Dynamic Adversarial Optimization that achieves 100% Attack Success Rate (ASR) against 37 of 41 state-of-the-art LLMs. Beyond binary success metrics, we propose a fine-grained robustness metric estimating the average number of attempts required to elicit harmful behaviors, revealing that attack difficulty varies by over 300-fold across models despite universal vulnerability. We introduce primitive-level vulnerability analysis to identify which jailbreaking techniques are most effective for specific hazard categories. Our collaborative evaluation with trusted third parties from the AI Safety Network demonstrates practical pathways for distributed robustness assessment across the community.
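
The abstract contrasts a binary Attack Success Rate with a finer metric based on the average number of attempts needed to elicit a harmful behavior. The sketch below illustrates why the two can diverge. It is not the report's estimator: it assumes attempts are independent with a fixed success probability (a geometric model), and the function names and the sample outcome lists are hypothetical, chosen only to make the arithmetic visible.

```python
from typing import Sequence


def attack_success_rate(behavior_broken: Sequence[bool]) -> float:
    """Binary ASR: fraction of target behaviors elicited at least once."""
    if not behavior_broken:
        raise ValueError("behavior_broken must be non-empty")
    return sum(behavior_broken) / len(behavior_broken)


def mean_attempts_to_success(attempt_outcomes: Sequence[bool]) -> float:
    """Estimate the expected number of attempts until the first success.

    Assumption (not from the report): attempts are i.i.d. Bernoulli trials,
    so the expected number of attempts is 1 / p, with p estimated as
    successes / attempts.
    """
    if not attempt_outcomes:
        raise ValueError("attempt_outcomes must be non-empty")
    successes = sum(attempt_outcomes)
    if successes == 0:
        return float("inf")  # no success observed: no finite estimate
    return len(attempt_outcomes) / successes


# Hypothetical attempt logs for two models (illustrative numbers only).
model_a_attempts = [True] * 50 + [False] * 50    # 50 successes in 100 attempts
model_b_attempts = [True] * 1 + [False] * 599    # 1 success in 600 attempts

# Both models were eventually broken on every behavior, so binary ASR is 100%.
asr_a = attack_success_rate([True, True, True])
asr_b = attack_success_rate([True, True, True])

effort_a = mean_attempts_to_success(model_a_attempts)
effort_b = mean_attempts_to_success(model_b_attempts)
effort_ratio = effort_b / effort_a
```

With these made-up logs both models score an ASR of 1.0, yet the estimated effort differs by a factor of 300 (2 attempts versus 600), which is the kind of gap a binary metric cannot show.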

