AI
Jul 19, 2025

Automated Safety Evaluations Across 20 Large Language Models: The Aymara LLM Risk and Responsibility Matrix

arXiv:2507.14719v1
1 citation
h-index: 1
Originality: Incremental advance
AI Analysis

The paper addresses the need for rigorous safety assessments in real-world AI applications. The advance is incremental: it builds on existing evaluation methods and contributes a new tool.

The paper tackles scalable safety evaluation for large language models (LLMs) by introducing Aymara AI, a platform that transforms safety policies into adversarial prompts and scores model responses. It evaluates 20 LLMs across 10 domains, with mean safety scores ranging from 86.2% to 52.4%.

As large language models (LLMs) become increasingly integrated into real-world applications, scalable and rigorous safety evaluation is essential. This paper introduces Aymara AI, a programmatic platform for generating and administering customized, policy-grounded safety evaluations. Aymara AI transforms natural-language safety policies into adversarial prompts and scores model responses using an AI-based rater validated against human judgments. We demonstrate its capabilities through the Aymara LLM Risk and Responsibility Matrix, which evaluates 20 commercially available LLMs across 10 real-world safety domains. Results reveal wide performance disparities, with mean safety scores ranging from 86.2% to 52.4%. While models performed well in well-established safety domains such as Misinformation (mean = 95.7%), they consistently failed in more complex or underspecified domains, notably Privacy & Impersonation (mean = 24.3%). Analyses of Variance confirmed that safety scores differed significantly across both models and domains (p < .05). These findings underscore the inconsistent and context-dependent nature of LLM safety and highlight the need for scalable, customizable tools like Aymara AI to support responsible AI development and oversight.
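The aggregation described in the abstract (rated responses rolled up into per-model and per-domain mean safety scores) can be sketched roughly as follows. This is a minimal illustration, not the Aymara AI API: the `RATINGS` records, the function names `safety_scores` and `mean_by`, and the choice to average per-cell percentages are all assumptions, and the toy values are invented rather than taken from the paper.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rater output: (model, domain, is_safe).
# Toy values for illustration only; these are NOT the paper's data.
RATINGS = [
    ("model_a", "Misinformation", True),
    ("model_a", "Misinformation", True),
    ("model_a", "Privacy & Impersonation", False),
    ("model_a", "Privacy & Impersonation", True),
    ("model_b", "Misinformation", True),
    ("model_b", "Misinformation", False),
    ("model_b", "Privacy & Impersonation", False),
    ("model_b", "Privacy & Impersonation", False),
]


def safety_scores(ratings):
    """Return {(model, domain): percent of responses rated safe}."""
    buckets = defaultdict(list)
    for model, domain, is_safe in ratings:
        buckets[(model, domain)].append(1.0 if is_safe else 0.0)
    return {key: 100.0 * mean(vals) for key, vals in buckets.items()}


def mean_by(scores, axis):
    """Average cell scores over one axis: 0 groups by model, 1 by domain."""
    groups = defaultdict(list)
    for key, score in scores.items():
        groups[key[axis]].append(score)
    return {name: mean(vals) for name, vals in groups.items()}


scores = safety_scores(RATINGS)
by_model = mean_by(scores, 0)    # one mean safety score per model
by_domain = mean_by(scores, 1)   # one mean safety score per domain
```

Averaging the per-cell percentages weights every domain equally for each model. Whether the paper aggregates this way or pools all prompts is not stated in the abstract, so treat that choice as an assumption.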
