CR | AI | Apr 7, 2024

Hidden You Malicious Goal Into Benign Narratives: Jailbreak Large Language Models through Logic Chain Injection

arXiv:2404.04849v2 | 9 citations | h-index: 7
AI Analysis

The paper exposes a security vulnerability in LLMs by demonstrating jailbreak attacks designed to evade detection by both the models and human analysts, which it presents as a novel approach in the field.

The paper tackles the problem of jailbreak attacks on large language models by proposing logic chain injection, a method that hides malicious goals within benign narratives to deceive both LLMs and humans, achieving successful deception in experiments.

Jailbreak attacks on Large Language Models (LLMs) entail crafting prompts that exploit the models to generate malicious content. Existing jailbreak attacks can successfully deceive LLMs; however, they cannot deceive humans. This paper proposes a new type of jailbreak attack that can deceive both LLMs and humans (i.e., security analysts). The key insight is borrowed from social psychology: humans are easily deceived if a lie is hidden in truth. Based on this insight, we propose the logic-chain injection attack, which injects malicious intent into benign truth. A logic-chain injection attack first disassembles its malicious target into a chain of benign narrations, and then distributes the narrations into a related benign article containing undoubted facts. In this way, the newly generated prompt can deceive not only LLMs but also humans.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
