CV | AI | Oct 17, 2025

Learning to Detect Unknown Jailbreak Attacks in Large Vision-Language Models

arXiv:2510.15430v2 | h-index: 5 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses safety risks for LVLM users by improving detection of unseen jailbreak attacks. The advance is incremental because it builds on existing detection methods.

The paper proposes Learning to Detect (LoD), a general framework for detecting unknown jailbreak attacks in Large Vision-Language Models. It reports consistently higher detection AUROC on diverse unknown attacks, along with improved efficiency.

Despite extensive alignment efforts, Large Vision-Language Models (LVLMs) remain vulnerable to jailbreak attacks, posing serious safety risks. To address this, existing detection methods either learn attack-specific parameters, which hinders generalization to unseen attacks, or rely on heuristically sound principles, which limit accuracy and efficiency. To overcome these limitations, we propose Learning to Detect (LoD), a general framework that accurately detects unknown jailbreak attacks by shifting the focus from attack-specific learning to task-specific learning. This framework includes a Multi-modal Safety Concept Activation Vector module for safety-oriented representation learning and a Safety Pattern Auto-Encoder module for unsupervised attack classification. Extensive experiments show that our method achieves consistently higher detection AUROC on diverse unknown attacks while improving efficiency. The code is available at https://anonymous.4open.science/r/Learning-to-Detect-51CB.
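
To make the idea of unsupervised, reconstruction-based detection and the AUROC metric concrete, here is a minimal sketch. It is not the paper's implementation: the rank-1 linear auto-encoder is a simple stand-in for the Safety Pattern Auto-Encoder, the 3-dimensional vectors are synthetic placeholders for learned safety representations, and every name in it (`fit_rank1_autoencoder`, `reconstruction_error`, `auroc`, `benign_vector`, `attack_vector`) is hypothetical. The model is fitted on benign samples only, so no attack-specific parameters are learned, and inputs that reconstruct poorly receive a high anomaly score.

```python
import math
import random


def fit_rank1_autoencoder(samples, iters=100):
    """Fit a rank-1 linear auto-encoder (top principal direction) on benign data."""
    n, dim = len(samples), len(samples[0])
    mean = [sum(s[j] for s in samples) / n for j in range(dim)]
    centered = [[s[j] - mean[j] for j in range(dim)] for s in samples]
    # Power iteration on the covariance matrix, without forming it explicitly.
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iters):
        w = [0.0] * dim
        for c in centered:
            proj = sum(ci * vi for ci, vi in zip(c, v))
            for j in range(dim):
                w[j] += proj * c[j]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v


def reconstruction_error(x, mean, v):
    """Squared error after encoding x to one number and decoding it again."""
    c = [xi - mi for xi, mi in zip(x, mean)]
    code = sum(ci * vi for ci, vi in zip(c, v))
    return sum((ci - code * vi) ** 2 for ci, vi in zip(c, v))


def auroc(neg_scores, pos_scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for q in neg_scores:
            wins += 1.0 if p > q else 0.5 if p == q else 0.0
    return wins / (len(pos_scores) * len(neg_scores))


rng = random.Random(0)  # fixed seed so the synthetic data is reproducible


def benign_vector():
    # Synthetic assumption: benign representations lie near a line in 3-D.
    t = rng.gauss(0.0, 1.0)
    return [t + rng.gauss(0.0, 0.1), t + rng.gauss(0.0, 0.1), rng.gauss(0.0, 0.1)]


def attack_vector():
    # Synthetic assumption: attacks are shifted off the benign line.
    x = benign_vector()
    x[2] += 3.0
    return x


train = [benign_vector() for _ in range(200)]       # benign only, no attack labels
test_benign = [benign_vector() for _ in range(50)]
test_attack = [attack_vector() for _ in range(50)]  # never seen during fitting

mean, direction = fit_rank1_autoencoder(train)
benign_scores = [reconstruction_error(x, mean, direction) for x in test_benign]
attack_scores = [reconstruction_error(x, mean, direction) for x in test_attack]
score = auroc(benign_scores, attack_scores)
print(f"AUROC: {score:.3f}")
```

AUROC is threshold-free: it measures how well the anomaly score ranks attacks above benign inputs, which is why it suits comparing detectors that have not been tuned to any particular attack.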

