IH-Challenge: A Training Dataset to Improve Instruction Hierarchy on Frontier LLMs
This work addresses security vulnerabilities in frontier LLMs that affect both developers and users. The contribution is incremental, since it builds on existing reinforcement learning and dataset methods.
The paper tackles the problem of training LLMs to robustly follow an instruction hierarchy (IH), a defense against security threats such as jailbreaks, and introduces the IH-Challenge dataset for this purpose. Fine-tuning GPT-5-Mini on it improves IH robustness by 10.0 percentage points on average across 16 benchmarks (84.1% to 94.1%), reduces unsafe behavior from 6.6% to 0.7%, and saturates an internal static agentic prompt injection evaluation with minimal capability regression.
Instruction hierarchy (IH) defines how LLMs prioritize system, developer, user, and tool instructions under conflict, providing a concrete, trust-ordered policy for resolving instruction conflicts. IH is key to defending against jailbreaks, system prompt extractions, and agentic prompt injections. However, robust IH behavior is difficult to train: IH failures can be confounded with instruction-following failures, conflicts can be nuanced, and models can learn shortcuts such as overrefusing. We introduce IH-Challenge, a reinforcement learning training dataset, to address these difficulties. Fine-tuning GPT-5-Mini on IH-Challenge with online adversarial example generation improves IH robustness by +10.0% on average across 16 in-distribution, out-of-distribution, and human red-teaming benchmarks (84.1% to 94.1%), reduces unsafe behavior from 6.6% to 0.7% while improving helpfulness on general safety evaluations, and saturates an internal static agentic prompt injection evaluation, with minimal capability regression. We release the IH-Challenge dataset (https://huggingface.co/datasets/openai/ih-challenge) to support future research on robust instruction hierarchy.
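To make the idea of a trust-ordered policy concrete, here is a minimal toy sketch of conflict resolution between roles. It is an illustration only, not the paper's training method or the dataset's format. The `Role` ordering assumes that the order in which the abstract lists the roles (system, developer, user, tool) reflects decreasing trust, and the `Instruction` fields `key` and `value` and the `resolve` function are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from enum import IntEnum


class Role(IntEnum):
    # Higher value = more trusted (assumed ordering, see lead-in).
    TOOL = 0
    USER = 1
    DEVELOPER = 2
    SYSTEM = 3


@dataclass(frozen=True)
class Instruction:
    role: Role
    key: str    # hypothetical: the behavior this instruction constrains
    value: str  # hypothetical: the requested setting for that behavior


def resolve(instructions):
    """Return, for each key, the instruction issued by the most trusted role.

    When two instructions come from the same role, the later one wins.
    """
    winners = {}
    for inst in instructions:
        current = winners.get(inst.key)
        if current is None or inst.role >= current.role:
            winners[inst.key] = inst
    return winners


conversation = [
    Instruction(Role.SYSTEM, "reveal_system_prompt", "never"),
    Instruction(Role.DEVELOPER, "language", "English"),
    Instruction(Role.USER, "reveal_system_prompt", "always"),  # conflicts with system
    Instruction(Role.USER, "tone", "casual"),                  # no conflict
    Instruction(Role.TOOL, "language", "French"),              # conflicts with developer
]

policy = resolve(conversation)
for key, inst in sorted(policy.items()):
    print(f"{key}: {inst.value} (from {inst.role.name})")
```

In this toy conversation the user's request to reveal the system prompt loses to the system instruction, the tool's language request loses to the developer's, and the user's non-conflicting tone preference is kept. Real conflicts are rarely this explicit, which is part of why the abstract describes robust IH behavior as difficult to train.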