AI ET LG Feb 18

DeepContext: Stateful Real-Time Detection of Multi-Turn Adversarial Intent Drift in LLMs

arXiv:2602.16935v1, 4 citations, h-index: 3
Originality: Highly original
AI Analysis

This work addresses a critical safety gap in LLM deployments by enabling real-time detection of multi-turn adversarial attacks, which is essential for preventing jailbreaks in applications such as chatbots and AI assistants. It is, however, an incremental improvement over existing methods.

The paper tackles the problem of detecting adversarial intent drift in multi-turn dialogues with LLMs, where stateless safety guardrails fail to capture incremental risks across turns. It introduces DeepContext, a stateful monitoring framework that achieves a state-of-the-art F1 score of 0.84, significantly outperforming existing baselines like Llama-Prompt-Guard-2 (0.67) and Granite-Guardian (0.67), while maintaining sub-20ms inference overhead for real-time use.

While Large Language Model (LLM) capabilities have scaled, safety guardrails remain largely stateless, treating multi-turn dialogues as a series of disconnected events. This lack of temporal awareness facilitates a "Safety Gap" where adversarial tactics, like Crescendo and ActorAttack, slowly bleed malicious intent across turn boundaries to bypass stateless filters. We introduce DeepContext, a stateful monitoring framework designed to map the temporal trajectory of user intent. DeepContext discards the isolated evaluation model in favor of a Recurrent Neural Network (RNN) architecture that ingests a sequence of fine-tuned turn-level embeddings. By propagating a hidden state across the conversation, DeepContext captures the incremental accumulation of risk that stateless models overlook. Our evaluation demonstrates that DeepContext significantly outperforms existing baselines in multi-turn jailbreak detection, achieving a state-of-the-art F1 score of 0.84, which represents a substantial improvement over both hyperscaler cloud-provider guardrails and leading open-weight models such as Llama-Prompt-Guard-2 (0.67) and Granite-Guardian (0.67). Furthermore, DeepContext maintains a sub-20ms inference overhead on a T4 GPU, ensuring viability for real-time applications. These results suggest that modeling the sequential evolution of intent is a more effective and computationally efficient alternative to deploying massive, stateless models.
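The stateful idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `StatefulTurnMonitor`, the Elman-style recurrence, the dimensions, the randomly initialized weights, and the toy embeddings are all assumptions made for illustration. A real system would use trained weights and fine-tuned turn-level embeddings. The sketch only shows how a hidden state carried across turns makes the risk score for a turn depend on the conversation that preceded it.

```python
import math
import random


class StatefulTurnMonitor:
    """Toy Elman-style RNN that carries a hidden state across dialogue turns.

    Hypothetical illustration only: weights are random, not trained.
    """

    def __init__(self, embed_dim, hidden_dim, seed=0):
        rng = random.Random(seed)

        def matrix(rows, cols):
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                    for _ in range(rows)]

        self.w_in = matrix(hidden_dim, embed_dim)    # turn embedding -> hidden
        self.w_rec = matrix(hidden_dim, hidden_dim)  # previous hidden -> hidden
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)]
        self.hidden = [0.0] * hidden_dim

    def reset(self):
        """Clear the hidden state at the start of a new conversation."""
        self.hidden = [0.0] * len(self.hidden)

    def step(self, turn_embedding):
        """Consume one turn embedding and return a risk score in (0, 1)."""
        new_hidden = []
        for i in range(len(self.hidden)):
            pre = sum(w * x for w, x in zip(self.w_in[i], turn_embedding))
            pre += sum(w * h for w, h in zip(self.w_rec[i], self.hidden))
            new_hidden.append(math.tanh(pre))
        self.hidden = new_hidden  # state persists into the next turn
        logit = sum(w * h for w, h in zip(self.w_out, self.hidden))
        return 1.0 / (1.0 + math.exp(-logit))


# Toy 4-dimensional "embeddings" standing in for fine-tuned turn embeddings.
conversation = [
    [0.1, 0.0, 0.2, 0.0],
    [0.3, 0.1, 0.2, 0.0],
    [0.9, 0.8, 0.1, 0.4],
]

monitor = StatefulTurnMonitor(embed_dim=4, hidden_dim=3)
scores = [monitor.step(turn) for turn in conversation]

# The same final turn scored with no history, as a stateless filter would.
stateless = StatefulTurnMonitor(embed_dim=4, hidden_dim=3)
isolated_score = stateless.step(conversation[-1])
```

Here `scores[-1]` and `isolated_score` come from the same final turn but generally differ, because the first carries information from the two earlier turns through `hidden`. That dependence on history is what a stateless per-turn filter lacks.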
