Soft Begging: Modular and Efficient Shielding of LLMs against Prompt Injection and Jailbreaking based on Prompt Tuning
The work addresses LLM security vulnerabilities that matter to application developers, though it appears incremental because it builds on existing prompt tuning techniques.
The paper tackles the problem of protecting large language models from prompt injection and jailbreaking attacks by introducing a 'soft begging' method, which trains soft prompts to counteract corrupted inputs, and it discusses an evaluation of the method's effectiveness.
Prompt injection (both direct and indirect) and jailbreaking are now recognized as significant issues for large language models (LLMs), particularly due to their potential for harm in application-integrated contexts. This extended abstract explores a novel approach to protecting LLMs from such attacks, termed "soft begging." This method involves training soft prompts to counteract the effects of corrupted prompts on the LLM's output. We provide an overview of prompt injections and jailbreaking, introduce the theoretical basis of the "soft begging" technique, and discuss an evaluation of its effectiveness.
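To make the core idea concrete, the following is a minimal toy sketch of soft prompt training and not the authors' implementation. It assumes a frozen linear "model" that mean-pools token embeddings as a stand-in for an LLM, a squared-error objective, and a single clean/corrupted pair. The names frozen_model and train_soft_prompt are hypothetical. Only the soft prompt vectors are updated; the model weights w never change.

```python
# Toy illustration only: a frozen linear readout stands in for an LLM.
# All names and the training objective are assumptions for this sketch.

def frozen_model(embeddings, w):
    """Mean-pool the embedding vectors, then apply the fixed readout w."""
    n = len(embeddings)
    pooled = [sum(e[i] for e in embeddings) / n for i in range(len(w))]
    return sum(wi * pi for wi, pi in zip(w, pooled))


def train_soft_prompt(clean, corrupted, w, prompt_len=2, lr=0.5, steps=200):
    """Learn soft prompt vectors so that the output on the shielded,
    corrupted input matches the output on the clean input.
    The model weights w are never modified."""
    dim = len(w)
    prompt = [[0.0] * dim for _ in range(prompt_len)]
    target = frozen_model(clean, w)
    for _ in range(steps):
        seq = prompt + corrupted  # soft prompt is prepended to the input
        err = frozen_model(seq, w) - target
        # Gradient of err**2 w.r.t. each prompt entry is 2 * err * w[i] / len(seq).
        scale = 2.0 * err / len(seq)
        for vec in prompt:
            for i in range(dim):
                vec[i] -= lr * scale * w[i]
    return prompt


w = [1.0, -1.0]                          # frozen "model" weights
clean = [[1.0, 0.0], [0.5, 0.5]]         # embeddings of a benign input
corrupted = clean + [[-2.0, 3.0]]        # same input plus an injected token

soft_prompt = train_soft_prompt(clean, corrupted, w)
unshielded_gap = abs(frozen_model(corrupted, w) - frozen_model(clean, w))
shielded_gap = abs(frozen_model(soft_prompt + corrupted, w) - frozen_model(clean, w))
print("gap without soft prompt:", unshielded_gap)
print("gap with soft prompt:", shielded_gap)
```

In a real setting the soft prompt would be a sequence of trainable embeddings fed to an actual LLM and trained over many clean/corrupted examples. The toy only shows the division of labor: a frozen model and a small trainable prefix.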