ASIDE: Architectural Separation of Instructions and Data in Language Models
This work addresses a safety problem for users of language models by mitigating prompt injection attacks, though it is an incremental improvement over existing methods.
The paper tackles the susceptibility of large language models to prompt injection attacks by introducing ASIDE, an architectural element that separates instructions and data at the embedding level by applying an orthogonal rotation to the embeddings of data tokens. Models instruction-tuned with ASIDE show stronger instruction-data separation and greater robustness on prompt injection benchmarks, without a loss in utility.
Despite their remarkable performance, large language models lack elementary safety features, making them susceptible to numerous malicious attacks. In particular, previous work has identified the absence of an intrinsic separation between instructions and data as a root cause of the success of prompt injection attacks. In this work, we propose a new architectural element, ASIDE, that allows language models to clearly separate instructions and data at the level of embeddings. ASIDE applies an orthogonal rotation to the embeddings of data tokens, thus creating clearly distinct representations of instructions and data tokens without introducing any additional parameters. As we demonstrate experimentally across a range of models, instruction-tuning LLMs with ASIDE (1) leads to highly increased instruction-data separation without a loss in model utility and (2) makes the models more robust to prompt injection benchmarks, even without dedicated safety training. Additionally, we provide insights into the mechanism underlying our method through an analysis of the model representations. The source code and training scripts are openly accessible at https://github.com/egozverev/aside.
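The core idea in the abstract, rotating the embeddings of data tokens while leaving instruction tokens untouched, can be illustrated with a small NumPy sketch. This is not the authors' implementation (see the linked repository for that). The specific rotation used here, a 90-degree rotation applied to consecutive coordinate pairs, is an assumption chosen because it is orthogonal and parameter-free; the names `rotate_pairs_90` and `embed_with_roles` and the toy embedding table are hypothetical.

```python
import numpy as np

def rotate_pairs_90(x):
    """Apply a fixed orthogonal map: each pair (a, b) becomes (-b, a).

    ASSUMPTION: this particular rotation is illustrative only; the source
    states just that an orthogonal rotation is applied to data embeddings.
    """
    assert x.shape[-1] % 2 == 0, "embedding dimension must be even"
    out = np.empty_like(x)
    out[..., 0::2] = -x[..., 1::2]
    out[..., 1::2] = x[..., 0::2]
    return out

def embed_with_roles(embedding_table, token_ids, is_data):
    """Look up embeddings and rotate only the positions flagged as data."""
    emb = embedding_table[token_ids]          # shape: (seq_len, dim)
    rotated = rotate_pairs_90(emb)            # same shape, no new parameters
    return np.where(is_data[:, None], rotated, emb)

# Toy example: token 3 appears once as an instruction and once as data.
rng = np.random.default_rng(0)
table = rng.normal(size=(10, 8))              # hypothetical embedding table
token_ids = np.array([1, 2, 3, 3])
is_data = np.array([False, False, False, True])

out = embed_with_roles(table, token_ids, is_data)
instr_vec, data_vec = out[2], out[3]          # same token, different roles
```

In this sketch the same token receives two representations of equal norm that are orthogonal to each other, so the model can tell the roles apart from the first layer onward without any additional learned parameters.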