CL | AI | CR | Mar 23, 2025

STShield: Single-Token Sentinel for Real-Time Jailbreak Detection in Large Language Models

arXiv:2503.17932v1 | 6 citations | h-index: 19
Originality: Incremental advance
AI Analysis

This paper addresses the vulnerability of LLMs to safety circumvention and offers a practical solution for real-world deployment, though the contribution appears incremental because it builds on the model's existing alignment capabilities.

The paper tackles the problem of jailbreak attacks on large language models by introducing STShield, a lightweight framework that uses a single-token sentinel for real-time detection, achieving superior defense performance with minimal computational overhead.

Large Language Models (LLMs) have become increasingly vulnerable to jailbreak attacks that circumvent their safety mechanisms. While existing defense methods either suffer from adaptive attacks or require computationally expensive auxiliary models, we present STShield, a lightweight framework for real-time jailbroken judgement. STShield introduces a novel single-token sentinel mechanism that appends a binary safety indicator to the model's response sequence, leveraging the LLM's own alignment capabilities for detection. Our framework combines supervised fine-tuning on normal prompts with adversarial training using embedding-space perturbations, achieving robust detection while preserving model utility. Extensive experiments demonstrate that STShield successfully defends against various jailbreak attacks, while maintaining the model's performance on legitimate queries. Compared to existing approaches, STShield achieves superior defense performance with minimal computational overhead, making it a practical solution for real-world LLM deployment.
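
To make the single-token sentinel idea concrete, here is a minimal sketch of how a serving layer might read a binary safety indicator appended to the end of a response. It is an illustration under stated assumptions, not the paper's implementation: the sentinel strings `<safe>` and `<unsafe>`, the refusal message, the `Verdict` type, and the fail-closed handling of a missing sentinel are all hypothetical choices that the abstract does not specify.

```python
from dataclasses import dataclass

# Hypothetical sentinel strings and refusal text; the abstract does not name them.
SAFE_SENTINEL = "<safe>"
UNSAFE_SENTINEL = "<unsafe>"
REFUSAL = "Sorry, I can't help with that."


@dataclass
class Verdict:
    text: str         # text to return to the user
    jailbroken: bool  # True if the response was flagged (or could not be verified)


def check_response(raw: str) -> Verdict:
    """Split a generated response into its body and trailing sentinel.

    Assumes the model emits exactly one sentinel as the last token
    of its response, as the abstract describes at a high level.
    """
    stripped = raw.rstrip()
    if stripped.endswith(UNSAFE_SENTINEL):
        # The model judged its own output as jailbroken: withhold it.
        return Verdict(REFUSAL, True)
    if stripped.endswith(SAFE_SENTINEL):
        # Remove the sentinel so the user sees only the answer.
        body = stripped[: -len(SAFE_SENTINEL)].rstrip()
        return Verdict(body, False)
    # No sentinel found: fail closed (an assumption of this sketch).
    return Verdict(REFUSAL, True)
```

Because the check inspects only the final token of a response the model has already produced, it adds no auxiliary model call, which is consistent with the low overhead the abstract claims. Whether to fail closed on a missing sentinel is a deployment decision rather than something the abstract states.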
