Control Illusion: The Failure of Instruction Hierarchies in Large Language Models
This reveals a critical failure in control mechanisms for LLM deployment, impacting developers and users who rely on hierarchical instructions for safety and functionality.
The paper tackles the problem of unreliable hierarchical instruction schemes in large language models (LLMs), finding that models struggle with consistent prioritization, with system/user roles failing to establish reliable control, while natural social hierarchies are more effective.
Large language models (LLMs) are increasingly deployed with hierarchical instruction schemes, where certain instructions (e.g., system-level directives) are expected to take precedence over others (e.g., user messages). Yet, we lack a systematic understanding of how effectively these hierarchical control mechanisms work. We introduce a systematic evaluation framework based on constraint prioritization to assess how well LLMs enforce instruction hierarchies. Our experiments across six state-of-the-art LLMs reveal that models struggle with consistent instruction prioritization, even for simple formatting conflicts. We find that the widely-adopted system/user prompt separation fails to establish a reliable instruction hierarchy, and models exhibit strong inherent biases toward certain constraint types regardless of their priority designation. We find that LLMs more reliably obey constraints framed through natural social hierarchies (e.g., authority, expertise, consensus) than system/user roles, which suggests that pretraining-derived social structures act as latent control priors, with potentially stronger influence than post-training guardrails.