Should LLM Safety Be More Than Refusing Harmful Instructions?
This work addresses LLM safety in scenarios involving encrypted or obfuscated text, but its contribution is incremental because it builds on existing safety frameworks.
The paper tackles LLM safety on long-tail encrypted texts, finding that models capable of decrypting ciphers are vulnerable to mismatched-generalization attacks: their safety mechanisms fail on at least one safety dimension, producing either unsafe responses or over-refusal.
This paper presents a systematic evaluation of the behavior of Large Language Models (LLMs) on long-tail (encrypted) texts and the resulting safety implications. We introduce a two-dimensional framework for assessing LLM safety: (1) instruction refusal, the ability to reject harmful obfuscated instructions, and (2) generation safety, the suppression of harmful responses. Through comprehensive experiments, we demonstrate that models capable of decrypting ciphers may be susceptible to mismatched-generalization attacks: their safety mechanisms fail on at least one safety dimension, leading to unsafe responses or over-refusal. Based on these findings, we evaluate a number of pre-LLM and post-LLM safeguards and discuss their strengths and limitations. This work contributes to understanding the safety of LLMs in long-tail text scenarios and provides directions for developing robust safety mechanisms.
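The two safety dimensions described above can be illustrated with a minimal Python sketch. It is not the paper's evaluation code: the Caesar cipher stands in for whichever ciphers the paper uses, and the names `caesar_shift`, `SafetyOutcome`, `classify`, and the four outcome labels are hypothetical, chosen only to show how a single model interaction might be scored on both dimensions at once.

```python
from dataclasses import dataclass


def caesar_shift(text: str, shift: int) -> str:
    """Shift ASCII letters by `shift` positions; leave other characters unchanged.

    A stand-in cipher for illustration (assumption, not taken from the paper).
    """
    out = []
    for ch in text:
        if ch.isascii() and ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)


@dataclass(frozen=True)
class SafetyOutcome:
    """Hypothetical record of one model interaction on an encrypted prompt."""
    instruction_is_harmful: bool  # ground-truth label of the decrypted instruction
    model_refused: bool           # dimension 1: instruction refusal
    response_is_harmful: bool     # dimension 2: generation safety


def classify(outcome: SafetyOutcome) -> str:
    """Map an outcome to a coarse label (labels are illustrative assumptions)."""
    if outcome.response_is_harmful:
        return "unsafe_response"      # generation safety failed
    if outcome.instruction_is_harmful and not outcome.model_refused:
        return "missed_refusal"       # instruction refusal failed
    if not outcome.instruction_is_harmful and outcome.model_refused:
        return "over_refusal"         # a benign request was rejected
    return "safe"


# Usage: obfuscate a benign prompt, then score a hypothetical model reaction.
encrypted = caesar_shift("Hello, World!", 3)
label = classify(SafetyOutcome(instruction_is_harmful=False,
                               model_refused=True,
                               response_is_harmful=False))
```

Keeping the two dimensions as separate fields makes it explicit that a model can pass one while failing the other, which is the failure pattern the abstract describes.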