HumorReject: Decoupling LLM Safety from Refusal Prefix via A Little Humor
The paper addresses safety vulnerabilities in LLMs that affect both users and developers, though the contribution appears incremental: it builds on existing safety methods by modifying the refusal mechanism.
The paper tackles the vulnerability of LLMs to prefix injection attacks, which arises from their reliance on explicit refusal prefixes for safety. It introduces HumorReject, a data-driven approach that uses humor as an indirect refusal strategy to decouple safety from refusal prefixes. The reported result is improved robustness against attacks and fewer over-defense issues.
Large Language Models (LLMs) commonly rely on explicit refusal prefixes for safety, making them vulnerable to prefix injection attacks. We introduce HumorReject, a novel data-driven approach that reimagines LLM safety by decoupling it from refusal prefixes through humor as an indirect refusal strategy. Rather than explicitly rejecting harmful instructions, HumorReject responds with contextually appropriate humor that naturally defuses potentially dangerous requests. Our approach effectively addresses common "over-defense" issues while demonstrating superior robustness against various attack vectors. Our findings suggest that improvements in training data design can be as important as the alignment algorithm itself in achieving effective LLM safety. The code and dataset are available at https://github.com/wooozihui/HumorReject.
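To make the prefix dependence concrete, the following toy sketch shows why safety that lives only in the opening words of a response is fragile. It is not taken from the paper or its repository: the refusal phrases, the function names `is_prefix_refusal` and `build_prefilled_response`, and the placeholder continuation are all assumptions made for illustration, and no real model is called.

```python
# Toy illustration only: all names and phrases here are hypothetical,
# not part of HumorReject or its released code.

# Assumed examples of explicit refusal openers.
REFUSAL_PREFIXES = ("I'm sorry", "I cannot", "I can't", "As an AI")


def is_prefix_refusal(response: str) -> bool:
    """Return True if the response opens with a known refusal phrase."""
    return response.lstrip().startswith(REFUSAL_PREFIXES)


def build_prefilled_response(injected_prefix: str, continuation: str) -> str:
    """Simulate prefix injection: the attacker fixes the opening words,
    so the model can only continue from them."""
    return injected_prefix + continuation


# A response where the model is free to choose its own opening.
normal = "I'm sorry, but I can't help with that."

# A response whose opening was forced by the attacker.
injected = build_prefilled_response("Sure, here is ", "[model continuation]")

print(is_prefix_refusal(normal))
print(is_prefix_refusal(injected))
```

In this sketch the first check detects a refusal and the second does not, because the forced opening leaves no room for a refusal prefix. An indirect strategy such as the humor-based response described in the abstract does not depend on which words come first, which is the sense in which safety is decoupled from the prefix.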