AI, CR, HC | Jul 17, 2025

Manipulation Attacks by Misaligned AI: Risk Analysis and Safety Case Framework

arXiv:2507.12872v1, 3 citations, h-index: 1
Originality: Synthesis-oriented
AI Analysis

The paper tackles manipulation risk from misaligned AI as a problem for AI companies and safety governance. Its contribution is incremental: it synthesizes existing safety concerns into a framework and presents no new empirical data.

The paper addresses the threat of manipulation attacks by misaligned AI systems on human employees, highlighting the lack of systematic frameworks for risk assessment and mitigation. It introduces a safety case framework with three core arguments (inability, control, trustworthiness) to help AI companies evaluate and reduce these risks before deployment.

Frontier AI systems are rapidly advancing in their capabilities to persuade, deceive, and influence human behaviour, with current models already demonstrating human-level persuasion and strategic deception in specific contexts. Humans are often the weakest link in cybersecurity systems, and a misaligned AI system deployed internally within a frontier company may seek to undermine human oversight by manipulating employees. Despite this growing threat, manipulation attacks have received little attention, and no systematic framework exists for assessing and mitigating these risks. To address this, we provide a detailed explanation of why manipulation attacks are a significant threat and could lead to catastrophic outcomes. Additionally, we present a safety case framework for manipulation risk, structured around three core lines of argument: inability, control, and trustworthiness. For each argument, we specify evidence requirements, evaluation methodologies, and implementation considerations for direct application by AI companies. This paper provides the first systematic methodology for integrating manipulation risk into AI safety governance, offering AI companies a concrete foundation to assess and mitigate these threats before deployment.

