PICO: Secure Transformers via Robust Prompt Isolation and Cybersecurity Oversight
This work addresses security vulnerabilities in AI systems whose users depend on safe, reliable response generation, but it appears incremental because it builds on existing transformer and MoE architectures.
The paper tackles prompt injection attacks in transformers by proposing the PICO framework, which structurally separates system instructions from user inputs and integrates security mechanisms. No concrete performance numbers are provided.
We propose a robust transformer architecture designed to prevent prompt injection attacks and ensure secure, reliable response generation. Our PICO (Prompt Isolation and Cybersecurity Oversight) framework structurally separates trusted system instructions from untrusted user inputs through dual channels that are processed independently and merged only by a controlled, gated fusion mechanism. In addition, we integrate a specialized Security Expert Agent within a Mixture-of-Experts (MoE) framework and incorporate a Cybersecurity Knowledge Graph (CKG) to supply domain-specific reasoning. Our training design further ensures that the system prompt branch remains immutable while the rest of the network learns to handle adversarial inputs safely. We present the PICO framework through a general mathematical formulation, elaborate it in terms of the specifics of the transformer architecture, and illustrate it with hypothetical case studies, including Policy Puppetry attacks. While the most effective implementation may involve training transformers with PICO from scratch, we also present a cost-effective fine-tuning approach.
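The dual-channel idea above can be illustrated with a small sketch. The abstract does not specify the form of the gate or the training update, so everything below is an assumption made for illustration: the per-dimension sigmoid gate, the convex-combination merge, and the names `gated_fusion`, `sgd_step`, `system_branch`, and `user_branch` are hypothetical and are not taken from the paper.

```python
import math


def sigmoid(x):
    """Standard logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(sys_vec, user_vec, gate_weights, gate_bias):
    """Merge the two channels with a per-dimension gate (assumed form).

    For each dimension i, the gate g_i = sigmoid(w_s * s_i + w_u * u_i + b_i)
    and the fused value is g_i * s_i + (1 - g_i) * u_i, so the output always
    lies between the system-channel and user-channel values.
    """
    fused = []
    for s, u, (w_s, w_u), b in zip(sys_vec, user_vec, gate_weights, gate_bias):
        g = sigmoid(w_s * s + w_u * u + b)
        fused.append(g * s + (1.0 - g) * u)
    return fused


def sgd_step(params, grads, lr, frozen):
    """Plain gradient step that skips every parameter group named in `frozen`.

    This mimics an immutable system prompt branch: its parameters are returned
    unchanged, while all other groups are updated.
    """
    updated = {}
    for name, values in params.items():
        if name in frozen:
            updated[name] = list(values)  # frozen branch: copy without change
        else:
            updated[name] = [v - lr * g for v, g in zip(values, grads[name])]
    return updated
```

In this sketch a large positive gate bias pushes the fused output toward the trusted system channel, which is one simple way a "controlled" merge could limit the influence of untrusted input. Passing `frozen={"system_branch"}` to `sgd_step` leaves that group untouched during training.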