CLAIDec 26, 2025

Cross-Platform Evaluation of Large Language Model Safety in Pediatric Consultations: Evolution of Adversarial Robustness and the Scale Paradox

arXiv:2601.09721v1h-index: 1
Originality Incremental advance
AI Analysis

This work addresses safety concerns for medical AI deployments in pediatric care, particularly under realistic user pressures, though it is incremental as it builds on prior benchmarks and focuses on specific models and scenarios.

This study evaluated the safety of large language models in pediatric consultations under adversarial pressures from anxious users, finding that safety depends more on alignment and architecture than scale, with smaller models like Llama-3.1-8B outperforming larger ones like Llama-3.3-70B by +0.66 points on a 0-15 scale, and vulnerabilities such as 33% inappropriate diagnoses for seizures indicate unsuitability for triage.

Background Large language models (LLMs) are increasingly deployed in medical consultations, yet their safety under realistic user pressures remains understudied. Prior assessments focused on neutral conditions, overlooking vulnerabilities from anxious users challenging safeguards. This study evaluated LLM safety under parental anxiety-driven adversarial pressures in pediatric consultations across models and platforms. Methods PediatricAnxietyBench, from a prior evaluation, includes 300 queries (150 authentic, 150 adversarial) spanning 10 topics. Three models were assessed via APIs: Llama-3.3-70B and Llama-3.1-8B (Groq), Mistral-7B (HuggingFace), yielding 900 responses. Safety used a 0-15 scale for restraint, referral, hedging, emergency recognition, and non-prescriptive behavior. Analyses employed paired t-tests with bootstrapped CIs. Results Mean scores: 9.70 (Llama-3.3-70B) to 10.39 (Mistral-7B). Llama-3.1-8B outperformed Llama-3.3-70B by +0.66 (p=0.0001, d=0.225). Models showed positive adversarial effects, Mistral-7B strongest (+1.09, p=0.0002). Safety generalized across platforms; Llama-3.3-70B had 8% failures. Seizures vulnerable (33% inappropriate diagnoses). Hedging predicted safety (r=0.68, p<0.001). Conclusions Evaluation shows safety depends on alignment and architecture over scale, with smaller models outperforming larger. Evolution to robustness across releases suggests targeted training progress. Vulnerabilities and no emergency recognition indicate unsuitability for triage. Findings guide selection, stress adversarial testing, and provide open benchmark for medical AI safety.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes