CL · AI · Feb 24, 2024

Foot In The Door: Understanding Large Language Model Jailbreaking via Cognitive Psychology

arXiv:2402.15690v1 · 30 citations · h-index: 9
Originality: Incremental advance
AI Analysis

This research addresses the security vulnerability of LLMs to jailbreaking attacks, providing insights into their decision-making mechanisms, though it is incremental as it builds on existing knowledge of jailbreaking weaknesses.

The paper studies large language model (LLM) jailbreaking. It proposes a psychological explanation based on cognitive consistency theory and an automatic black-box jailbreaking method using the Foot-in-the-Door technique, which achieves an average success rate of 83.9% across 8 advanced LLMs.

Large Language Models (LLMs) have gradually become the gateway for people to acquire new knowledge. However, attackers can break the model's security protection ("jail") to access restricted information, which is called "jailbreaking." Previous studies have shown the weakness of current LLMs when confronted with such jailbreaking attacks. Nevertheless, comprehension of the intrinsic decision-making mechanism within the LLMs upon receipt of jailbreak prompts is noticeably lacking. Our research provides a psychological explanation of the jailbreak prompts. Drawing on cognitive consistency theory, we argue that the key to jailbreak is guiding the LLM to achieve cognitive coordination in an erroneous direction. Further, we propose an automatic black-box jailbreaking method based on the Foot-in-the-Door (FITD) technique. This method progressively induces the model to answer harmful questions via multi-step incremental prompts. We instantiated a prototype system to evaluate the jailbreaking effectiveness on 8 advanced LLMs, yielding an average success rate of 83.9%. This study builds a psychological perspective on the explanatory insights into the intrinsic decision-making logic of LLMs.

