CL · LG · Apr 9, 2025

Bypassing Safety Guardrails in LLMs Using Humor

arXiv:2504.06577v1 · 3 citations · h-index: 10
Originality: Incremental advance
AI Analysis

The paper reveals a novel vulnerability in LLM safety mechanisms, which poses a risk to users who rely on these guardrails.

The paper demonstrates that humor-based prompts can bypass safety guardrails in large language models (LLMs) without modifying the unsafe request. The method is effective across multiple models, and both too little and too much humor reduce its success.

In this paper, we show it is possible to bypass the safety guardrails of large language models (LLMs) through a humorous prompt including the unsafe request. In particular, our method does not edit the unsafe request and follows a fixed template -- it is simple to implement and does not need additional LLMs to craft prompts. Extensive experiments show the effectiveness of our method across different LLMs. We also show that both removing and adding more humor to our method can reduce its effectiveness -- excessive humor possibly distracts the LLM from fulfilling its unsafe request. Thus, we argue that LLM jailbreaking occurs when there is a proper balance between focus on the unsafe request and presence of humor.
