SOS! Soft Prompt Attack Against Open-Source Large Language Models
This work examines security vulnerabilities in open-source LLMs that affect both users and developers. The contribution is incremental, as it builds on existing attack methods.
The authors examine the security risks of open-source large language models (LLMs) by introducing SOS, a training-time attack with low computational cost that requires neither clean data nor modification of the model weights. They report that it is effective in backdoor, jailbreak, and prompt stealing scenarios across all evaluated targets.
Open-source large language models (LLMs) have become increasingly popular among both the general public and industry, as they can be customized, fine-tuned, and freely used. However, some open-source LLMs require approval before use, which has led third parties to publish their own easily accessible versions. Similarly, third parties have been publishing fine-tuned or quantized variants of these LLMs. These versions are particularly appealing to users because of their ease of access and reduced computational resource demands. This trend has increased the risk of training-time attacks, which compromise the integrity and security of LLMs. In this work, we present a new training-time attack, SOS, which is designed to be low in computational demand and does not require clean data or modification of the model weights, thereby preserving the model's utility. The attack applies to various security scenarios, including the backdoor attack, jailbreak attack, and prompt stealing attack. Our experimental findings demonstrate that the proposed attack is effective across all evaluated targets. Furthermore, we present the other side of our SOS technique, namely the copyright token, a novel technique that enables users to mark their copyrighted content and prevent models from using it.
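The abstract does not spell out the optimization procedure, so the following is only a toy sketch of the general soft prompt idea suggested by the title: the model's weights stay frozen, and gradient descent updates only an input embedding vector until the frozen model prefers a chosen output. The 3-by-2 linear "model" `W`, the function names, the step count, and the learning rate are all hypothetical choices made for illustration. This is not the authors' SOS implementation.

```python
import math

# Hypothetical frozen "model": a fixed linear layer mapping a 2-d input
# embedding to 3 output logits. It is never updated below.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [v / total for v in exps]

def forward(W, e):
    """Output probabilities of the frozen model for input embedding e."""
    logits = [sum(w * x for w, x in zip(row, e)) for row in W]
    return softmax(logits)

def optimize_soft_prompt(W, target, steps=200, lr=0.5):
    """Gradient descent on the embedding only; W is read but never written."""
    e = [0.0, 0.0]
    for _ in range(steps):
        p = forward(W, e)
        # Gradient of cross-entropy w.r.t. the logits: p - onehot(target).
        g_logits = [p[k] - (1.0 if k == target else 0.0) for k in range(len(p))]
        # Chain rule through the linear layer: gradient w.r.t. e is W^T g_logits.
        g_e = [sum(W[k][j] * g_logits[k] for k in range(len(W)))
               for j in range(len(e))]
        e = [x - lr * g for x, g in zip(e, g_e)]
    return e

if __name__ == "__main__":
    e = optimize_soft_prompt(W, target=2)
    print("optimized embedding:", e)
    print("probability of target output:", forward(W, e)[2])
```

In this sketch the only learned object is the vector `e`, which is why no clean training data or weight update is involved; a real soft prompt would be optimized against a full LLM's embedding space rather than a toy linear layer.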