CALYREX: Cross-Attention LaYeR EXtended Transformers for System Prompt Anchoring
CALYREX offers a structural defense against prompt injection and instruction erosion, improving instruction following and safety over standard transformers.
CALYREX introduces cross-attention between the input and the system prompt to structurally isolate instructions, achieving +7.4% on IFEval and +16.3% on multi-turn instruction adherence at 8B scale, while reducing many-shot jailbreaking attack success rate by 13%.
Modern large language models (LLMs) rely on system prompts to establish behavioral constraints and safety rules. Standard causal self-attention treats privileged instructions and untrusted user content with equal structural priority, a mismatch that leaves models vulnerable to prompt injection and to instruction erosion over extended contexts. We propose CALYREX (Cross-Attention LaYeR EXtended transformers), which uses cross-attention between the input and the system prompt to structurally isolate and anchor the system-prompt rules. A placement ablation on a 1.5B backbone identifies insertion in the final eighth of layers as optimal, a result confirmed by mechanistic activation analysis showing that behavioral constraints are naturally concentrated there. At 8B scale, controlling for training data, backbone, and parameter budget, CALYREX yields $+7.4\%$ on instruction-following (IFEval) and $+16.3\%$ on multi-turn instruction adherence, while reducing many-shot jailbreaking attack success rate by $13\%$. This advantage appears to widen with model scale, consistent with larger models making more effective use of the dedicated routing pathway.
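To make the mechanism concrete, the following is a minimal sketch of the two ideas named in the abstract: choosing the final eighth of layers for insertion, and letting input tokens attend only to system-prompt tokens. It is not the paper's implementation. The function names, the rounding rule for "final eighth", the absence of learned projections, the single head, and the residual add are all assumptions made for illustration.

```python
import math


def final_eighth_layers(num_layers):
    """0-based indices of the last eighth of layers (assumed: floor, at least one)."""
    count = max(1, num_layers // 8)
    return list(range(num_layers - count, num_layers))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(inputs, system):
    """Single-head cross-attention sketch with no learned projections.

    Queries come from the input vectors; keys and values come only from
    the system-prompt vectors, so user content cannot overwrite them.
    """
    dim = len(system[0])
    outputs = []
    for query in inputs:
        # Scaled dot-product score against every system-prompt vector.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in system
        ]
        weights = softmax(scores)
        # Weighted sum of system-prompt vectors (used as values).
        context = [
            sum(w * value[j] for w, value in zip(weights, system))
            for j in range(dim)
        ]
        # Residual add (assumed) keeps the original input representation.
        outputs.append([q + c for q, c in zip(query, context)])
    return outputs
```

For a hypothetical 32-layer backbone, `final_eighth_layers(32)` selects layers 28 through 31, and `cross_attend` would be applied to the hidden states at those layers. A real layer would add learned query, key, and value projections and multiple heads.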