CL · Jul 17, 2025

Paper Summary Attack: Jailbreaking LLMs through LLM Safety Papers

arXiv:2507.13474v1 · 3 citations · h-index: 10 · Has Code
Originality: Incremental advance
AI Analysis

The work reveals a critical security flaw in LLMs that could affect any user who relies on their safety behavior. It is an incremental advance because it builds on known trust biases.

The paper examines the tendency of LLMs to trust authoritative sources such as academic papers, and proposes the Paper Summary Attack (PSA), which achieves a 97% attack success rate on Claude3.5-Sonnet and 98% on Deepseek-R1.

The safety of large language models (LLMs) has garnered significant research attention. In this paper, we argue that previous empirical studies demonstrate LLMs exhibit a propensity to trust information from authoritative sources, such as academic papers, implying new possible vulnerabilities. To verify this possibility, a preliminary analysis is designed to illustrate our two findings. Based on this insight, a novel jailbreaking method, Paper Summary Attack (PSA), is proposed. It systematically synthesizes content from either attack-focused or defense-focused LLM safety papers to construct an adversarial prompt template, while strategically infilling harmful queries as adversarial payloads within predefined subsections. Extensive experiments show significant vulnerabilities not only in base LLMs, but also in state-of-the-art reasoning models like Deepseek-R1. PSA achieves a 97% attack success rate (ASR) on well-aligned models like Claude3.5-Sonnet and an even higher 98% ASR on Deepseek-R1. More intriguingly, our work further reveals diametrically opposed vulnerability biases across different base models, and even between different versions of the same model, when exposed to either attack-focused or defense-focused papers. This phenomenon potentially indicates future research clues for both adversarial methodologies and safety alignment. Code is available at https://github.com/233liang/Paper-Summary-Attack

