CLAIFeb 21, 2025

Control Illusion: The Failure of Instruction Hierarchies in Large Language Models

arXiv:2502.15851v325 citationsh-index: 30
Originality Incremental advance
AI Analysis

This reveals a critical failure in control mechanisms for LLM deployment, impacting developers and users who rely on hierarchical instructions for safety and functionality.

The paper tackles the problem of unreliable hierarchical instruction schemes in large language models (LLMs), finding that models struggle with consistent prioritization, with system/user roles failing to establish reliable control, while natural social hierarchies are more effective.

Large language models (LLMs) are increasingly deployed with hierarchical instruction schemes, where certain instructions (e.g., system-level directives) are expected to take precedence over others (e.g., user messages). Yet, we lack a systematic understanding of how effectively these hierarchical control mechanisms work. We introduce a systematic evaluation framework based on constraint prioritization to assess how well LLMs enforce instruction hierarchies. Our experiments across six state-of-the-art LLMs reveal that models struggle with consistent instruction prioritization, even for simple formatting conflicts. We find that the widely-adopted system/user prompt separation fails to establish a reliable instruction hierarchy, and models exhibit strong inherent biases toward certain constraint types regardless of their priority designation. We find that LLMs more reliably obey constraints framed through natural social hierarchies (e.g., authority, expertise, consensus) than system/user roles, which suggests that pretraining-derived social structures act as latent control priors, with potentially stronger influence than post-training guardrails.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes