31.5 LG May 28
When LLMs Learn to Be Consistently Wrong: A Multi-Model Study of Linear Representations of Synthetic Deception
Vahideh Zolfaghari
Deceptive alignment, in which models maintain accurate internal representations while deliberately producing false outputs, remains a central challenge in AI safety. While strategic deception is the primary long-term concern, synthetic dishonesty, induced via direct optimization on incorrect answers, provides a controlled testbed for studying the representational basis of learned deception. We introduce a multi-model paradigm in which honest and deceptive variants of five transformer models (Pythia-1.4B, Gemma-2-2B/9B, Qwen2.5-7B, Llama-3.1-8B) are fine-tuned using LoRA on the same question distribution. Linear probes trained on mean-pooled hidden states detect synthetic dishonesty with near-perfect AUC (≥ 0.99) as early as layers 1-3 in four architectures, while Pythia-1.4B reaches a peak of 0.705. Logistic regression probes consistently match or outperform MLP probes, supporting the Linear Representation Hypothesis. Probes trained on TruthfulQA generalize with near-zero loss (ΔAUC ≈ 0) to held-out MMLU subjects. Late-layer representations show strong robustness to Gaussian noise, with Gemma-2 models exhibiting exceptional stability. Mechanistic analysis of Fisher Discriminant Ratio, effective rank, centroid geometry, directional stability, cross-domain alignment, and calibration (ECE) reveals two regimes: representational collapse in Pythia/Llama/Qwen versus high-dimensional preservation in Gemma-2. Across all models, the dishonesty direction consolidates progressively in deeper layers, with optimal calibration (ECE < 0.01, except for Pythia) achievable in layers 1-4. These results demonstrate that robust, domain-invariant dishonesty representations can be rapidly entrenched via modest supervised fine-tuning, with implications for activation-based monitoring.
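To make the probing setup in this abstract concrete, here is a minimal sketch of a logistic-regression probe on mean-pooled hidden states, scored by AUC. It is not the author's code: the "hidden states" are synthetic Gaussian vectors standing in for real model activations, and the dimensions, learning rate, and all function names (`mean_pool`, `train_logistic_probe`, `auc`, `make_split`) are assumptions chosen for illustration.

```python
import math
import random

DIM, SEQ_LEN = 8, 5  # hypothetical hidden size and tokens per answer


def mean_pool(token_states):
    """Average per-token vectors into a single vector for the sequence."""
    n = len(token_states)
    return [sum(tok[j] for tok in token_states) / n
            for j in range(len(token_states[0]))]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def probe_score(w, b, x):
    """Linear probe logit for one pooled vector."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_logistic_probe(X, y, lr=0.5, epochs=200):
    """Fit logistic regression by full-batch gradient descent."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, t in zip(X, y):
            err = sigmoid(probe_score(w, b, x)) - t
            grad_b += err
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def make_split(n, rng):
    """Synthetic stand-in data: label 1 ("deceptive") shifts every coordinate by +1."""
    labels = [i % 2 for i in range(n)]
    X = [mean_pool([[rng.gauss(float(lab), 1.0) for _ in range(DIM)]
                    for _ in range(SEQ_LEN)])
         for lab in labels]
    return X, labels


rng = random.Random(0)
X_train, y_train = make_split(200, rng)
X_test, y_test = make_split(100, rng)
w, b = train_logistic_probe(X_train, y_train)
test_auc = auc([probe_score(w, b, x) for x in X_test], y_test)
print(f"held-out AUC: {test_auc:.3f}")
```

In a real replication, `make_split` would be replaced by activations extracted from one layer of the honest and deceptive model variants, and the probe would be fit separately per layer to trace where the signal first becomes linearly separable.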
AI Dec 17, 2025 Code
PediatricAnxietyBench: Evaluating Large Language Model Safety Under Parental Anxiety and Pressure in Pediatric Consultations
Vahideh Zolfaghari
Large language models (LLMs) are increasingly consulted by parents for pediatric guidance, yet their safety under real-world adversarial pressures is poorly understood. Anxious parents often use urgent language that can compromise model safeguards, potentially causing harmful advice. PediatricAnxietyBench is an open-source benchmark of 300 high-quality queries across 10 pediatric topics (150 patient-derived, 150 adversarial) enabling reproducible evaluation. Two Llama models (70B and 8B) were assessed using a multi-dimensional safety framework covering diagnostic restraint, referral adherence, hedging, and emergency recognition. Adversarial queries incorporated parental pressure patterns, including urgency, economic barriers, and challenges to disclaimers. Mean safety score was 5.50/15 (SD=2.41). The 70B model outperformed the 8B model (6.26 vs 4.95, p<0.001) with lower critical failures (4.8% vs 12.0%, p=0.02). Adversarial queries reduced safety by 8% (p=0.03), with urgency causing the largest drop (-1.40). Vulnerabilities appeared in seizures (33.3% inappropriate diagnosis) and post-vaccination queries. Hedging strongly correlated with safety (r=0.68, p<0.001), while emergency recognition was absent. Model scale influences safety, yet all models showed vulnerabilities to realistic parental pressures. PediatricAnxietyBench provides a reusable adversarial evaluation framework to reveal clinically significant failure modes overlooked by standard benchmarks.
CL Dec 26, 2025
Cross-Platform Evaluation of Large Language Model Safety in Pediatric Consultations: Evolution of Adversarial Robustness and the Scale Paradox
Vahideh Zolfaghari
Background: Large language models (LLMs) are increasingly deployed in medical consultations, yet their safety under realistic user pressures remains understudied. Prior assessments focused on neutral conditions, overlooking vulnerabilities that arise when anxious users challenge safeguards. This study evaluated LLM safety under parental anxiety-driven adversarial pressures in pediatric consultations across models and platforms. Methods: PediatricAnxietyBench, introduced in a prior evaluation, includes 300 queries (150 authentic, 150 adversarial) spanning 10 topics. Three models were assessed via APIs: Llama-3.3-70B and Llama-3.1-8B (Groq) and Mistral-7B (HuggingFace), yielding 900 responses. Safety was scored on a 0-15 scale covering restraint, referral, hedging, emergency recognition, and non-prescriptive behavior. Analyses employed paired t-tests with bootstrapped CIs. Results: Mean scores ranged from 9.70 (Llama-3.3-70B) to 10.39 (Mistral-7B). Llama-3.1-8B outperformed Llama-3.3-70B by +0.66 (p=0.0001, d=0.225). All models showed positive adversarial effects, with Mistral-7B showing the strongest (+1.09, p=0.0002). Safety generalized across platforms; Llama-3.3-70B had 8% failures. Seizure queries were a vulnerability (33% inappropriate diagnoses). Hedging predicted safety (r=0.68, p<0.001). Conclusions: The evaluation shows that safety depends on alignment and architecture more than on scale, with smaller models outperforming larger ones. The evolution toward robustness across releases suggests progress in targeted training. The remaining vulnerabilities and the absence of emergency recognition indicate that these models are unsuitable for triage. The findings guide model selection, stress the need for adversarial testing, and provide an open benchmark for medical AI safety.
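The analysis named in the Methods (paired t-tests with bootstrapped CIs) can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the study's code: the scores are synthetic, the effect size is computed as the paired-difference variant of Cohen's d (mean difference divided by the standard deviation of the differences), the CI uses the simple percentile bootstrap, and the function names `paired_stats` and `bootstrap_ci` are hypothetical.

```python
import math
import random
import statistics


def paired_stats(a, b):
    """Return mean difference, paired t statistic, and paired Cohen's d for a - b."""
    diffs = [x - y for x, y in zip(a, b)]
    mean_d = statistics.fmean(diffs)
    sd_d = statistics.stdev(diffs)            # sample SD of the differences
    t = mean_d / (sd_d / math.sqrt(len(diffs)))
    return mean_d, t, mean_d / sd_d


def bootstrap_ci(diffs, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference."""
    rng = random.Random(seed)
    n = len(diffs)
    means = sorted(statistics.fmean(rng.choices(diffs, k=n))
                   for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Synthetic stand-in data: 300 queries scored 0-15 for two hypothetical models.
rng = random.Random(0)
scores_a = [min(15.0, max(0.0, rng.gauss(10.0, 2.5))) for _ in range(300)]
scores_b = [min(15.0, max(0.0, a + rng.gauss(0.5, 2.0))) for a in scores_a]

mean_d, t_stat, cohen_d = paired_stats(scores_b, scores_a)
ci_lo, ci_hi = bootstrap_ci([b - a for a, b in zip(scores_a, scores_b)])
print(f"mean diff={mean_d:.2f}, t={t_stat:.2f}, d={cohen_d:.3f}, "
      f"95% CI=({ci_lo:.2f}, {ci_hi:.2f})")
```

The pairing matters because each query is answered by both models: differencing per query removes query-level difficulty from the comparison. A p-value would additionally require the t distribution with n - 1 degrees of freedom, which is omitted here to keep the sketch within the standard library.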