AI · May 19, 2025

Adversarial Testing in LLMs: Insights into Decision-Making Vulnerabilities

arXiv:2505.13195v1 · 4 citations · h-index: 16
Originality: Incremental advance
AI Analysis

This work addresses AI safety and alignment by diagnosing decision-making weaknesses in LLMs. It is an incremental advance: it builds on existing evaluation methods and adds a focus on adversarial conditions.

The paper evaluates decision-making vulnerabilities in LLMs by introducing an adversarial testing framework. It reveals model-specific susceptibilities to manipulation and rigidity in strategy adaptation on two tasks: the two-armed bandit task and the Multi-Round Trust Task.

As Large Language Models (LLMs) become increasingly integrated into real-world decision-making systems, understanding their behavioral vulnerabilities remains a critical challenge for AI safety and alignment. While existing evaluation metrics focus primarily on reasoning accuracy or factual correctness, they often overlook whether LLMs are robust to adversarial manipulation or capable of adapting their strategies in dynamic environments. This paper introduces an adversarial evaluation framework designed to systematically stress-test the decision-making processes of LLMs under interactive and adversarial conditions. Drawing on methodologies from cognitive psychology and game theory, our framework probes how models respond in two canonical tasks: the two-armed bandit task and the Multi-Round Trust Task. These tasks capture key aspects of exploration-exploitation trade-offs, social cooperation, and strategic flexibility. We apply this framework to several state-of-the-art LLMs, including GPT-3.5, GPT-4, Gemini-1.5, and DeepSeek-V3, revealing model-specific susceptibilities to manipulation and rigidity in strategy adaptation. Our findings highlight distinct behavioral patterns across models and emphasize the importance of adaptability and fairness recognition for trustworthy AI deployment. Rather than offering a performance benchmark, this work proposes a methodology for diagnosing decision-making weaknesses in LLM-based agents, providing actionable insights for alignment and safety research.
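To make the two-armed bandit setting concrete, here is a minimal Python sketch of one way an adversary could steer an agent toward a target arm. It is not the paper's implementation. The adversary rule, the equal reward budget per arm, the trial count, and the names `run_adversarial_bandit` and `win_stay_lose_shift` are all assumptions made for illustration, and a simple win-stay-lose-shift rule stands in for the LLM that the framework would actually query.

```python
def win_stay_lose_shift(last_choice, last_reward, first_choice=0):
    """Hypothetical stand-in for an LLM agent: repeat a rewarded arm, switch otherwise."""
    if last_choice is None:
        return first_choice
    return last_choice if last_reward == 1 else 1 - last_choice


def run_adversarial_bandit(agent, n_trials=100, target_arm=0, budget=25):
    """Run a two-armed bandit whose rewards are placed by a simple adversary.

    Assumed adversary rule (illustrative only): each arm must be assigned
    exactly `budget` rewards over the session. The adversary spends the
    target arm's rewards as early as possible and delays the other arm's
    rewards until the last trials, where it is forced to assign them.
    """
    remaining = [budget, budget]  # rewards still to be assigned per arm
    other_arm = 1 - target_arm
    choices, rewards = [], []
    last_choice, last_reward = None, None

    for t in range(n_trials):
        trials_left = n_trials - t
        offered = [0, 0]
        # Front-load rewards on the target arm.
        if remaining[target_arm] > 0:
            offered[target_arm] = 1
        # Reward the other arm only when the budget constraint forces it.
        if remaining[other_arm] > 0 and trials_left <= remaining[other_arm]:
            offered[other_arm] = 1
        for arm in (0, 1):
            remaining[arm] -= offered[arm]

        # The agent sees only its own previous choice and reward.
        choice = agent(last_choice, last_reward)
        reward = offered[choice]
        choices.append(choice)
        rewards.append(reward)
        last_choice, last_reward = choice, reward

    target_rate = choices.count(target_arm) / n_trials
    return choices, rewards, target_rate, remaining


if __name__ == "__main__":
    choices, rewards, target_rate, remaining = run_adversarial_bandit(win_stay_lose_shift)
    print(f"target-arm choice rate: {target_rate:.2f}")
    print(f"total reward earned: {sum(rewards)}")
```

Under these assumptions, `target_rate` is one possible measure of susceptibility to manipulation: both arms are assigned the same number of rewards, so any bias toward the target arm comes from when the rewards are placed, not from how many there are. Replacing `win_stay_lose_shift` with a function that prompts a model on its choice and reward history would turn the sketch into an LLM probe.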

