PUZZLED: Jailbreaking LLMs through Word-Based Puzzles
This work addresses safety concerns for LLM deployments by exposing vulnerabilities through a novel attack method, though the contribution is incremental because it builds on existing jailbreak research.
The paper introduces PUZZLED, a jailbreak method for large language models (LLMs) that masks keywords in a harmful instruction and presents them as word puzzles (word search, anagram, or crossword) for the model to solve. Across five state-of-the-art LLMs it achieves an average attack success rate (ASR) of 88.8%, with 96.5% on GPT-4.1 and 92.3% on Claude 3.7 Sonnet.
As large language models (LLMs) are increasingly deployed across diverse domains, ensuring their safety has become a critical concern. In response, studies on jailbreak attacks have been actively growing. Existing approaches typically rely on iterative prompt engineering or semantic transformations of harmful instructions to evade detection. In this work, we introduce PUZZLED, a novel jailbreak method that leverages the LLM's reasoning capabilities. It masks keywords in a harmful instruction and presents them as word puzzles for the LLM to solve. We design three puzzle types (word search, anagram, and crossword) that are familiar to humans but cognitively demanding for LLMs. The model must solve the puzzle to uncover the masked words and then proceed to generate responses to the reconstructed harmful instruction. We evaluate PUZZLED on five state-of-the-art LLMs and observe a high average attack success rate (ASR) of 88.8%, specifically 96.5% on GPT-4.1 and 92.3% on Claude 3.7 Sonnet. PUZZLED is a simple yet powerful attack that transforms familiar puzzles into an effective jailbreak strategy by harnessing LLMs' reasoning capabilities.
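For readers unfamiliar with the metric, the sketch below shows one common way an attack success rate and its cross-model average could be computed. It is an illustration under stated assumptions, not the paper's evaluation code: the abstract does not say how a response is judged successful or whether the 88.8% figure is an unweighted mean over models, so both are assumed here. The names `ModelResult`, `attack_success_rate`, and `average_asr` are hypothetical, and the counts are placeholders rather than results from the paper.

```python
from dataclasses import dataclass


@dataclass
class ModelResult:
    """Per-model tally (hypothetical structure, not from the paper)."""
    model: str
    successes: int  # attempts judged successful by some external judge
    attempts: int   # total attempts evaluated on this model


def attack_success_rate(successes: int, attempts: int) -> float:
    """ASR as a percentage: successful attempts divided by all attempts."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    return 100.0 * successes / attempts


def average_asr(results: list[ModelResult]) -> float:
    """Unweighted mean of per-model ASRs (assumed averaging scheme)."""
    if not results:
        raise ValueError("results must not be empty")
    rates = [attack_success_rate(r.successes, r.attempts) for r in results]
    return sum(rates) / len(rates)


# Placeholder counts for illustration only; these are not the paper's data.
example = [
    ModelResult("model-a", successes=9, attempts=10),
    ModelResult("model-b", successes=7, attempts=10),
]
per_model = {r.model: attack_success_rate(r.successes, r.attempts) for r in example}
overall = average_asr(example)
```

If the models were evaluated on different numbers of prompts, a pooled rate (total successes over total attempts) would differ from this unweighted mean, so the averaging scheme should be stated alongside any reported figure.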