Bangzhou Xin

CR
h-index9
3papers
8citations
Novelty65%
AI Score51

3 Papers

91.3CRApr 9Code
TrajGuard: Streaming Hidden-state Trajectory Detection for Decoding-time Jailbreak Defense

Cheng Liu, Xiaolei Liu, Xingyu Li et al.

Existing jailbreak defense paradigms primarily rely on static detection of prompts, outputs, or internal states, often neglecting the dynamic evolution of risk during decoding. This oversight leaves risk signals embedded in decoding trajectories underutilized, constituting a critical blind spot in current defense systems. In this work, we empirically demonstrate that hidden states in critical layers during the decoding phase carry stronger and more stable risk signals than input jailbreak prompts. Specifically, the hidden representations of tokens generated during jailbreak attempts progressively approach high-risk regions in the latent space. Based on this observation, we propose TrajGuard, a training-free, decoding-time defense framework. TrajGuard aggregates hidden-state trajectories via a sliding window to quantify risk in real time, triggering a lightweight semantic adjudication only when risk within a local window persistently exceeds a threshold. This mechanism enables the immediate interruption or constraint of subsequent decoding. Extensive experiments across 12 jailbreak attacks and various open-source LLMs show that TrajGuard achieves an average defense rate of 95%. Furthermore, it reduces detection latency to 5.2 ms/token while maintaining a false positive rate below 1.5%. These results confirm that hidden-state trajectories during decoding can effectively support real-time jailbreak detection, highlighting a promising direction for defenses without model modification.

CRNov 11, 2025Code
LoopLLM: Transferable Energy-Latency Attacks in LLMs via Repetitive Generation

Xingyu Li, Xiaolei Liu, Cheng Liu et al.

As large language models (LLMs) scale, their inference incurs substantial computational resources, exposing them to energy-latency attacks, where crafted prompts induce high energy and latency cost. Existing attack methods aim to prolong output by delaying the generation of termination symbols. However, as the output grows longer, controlling the termination symbols through input becomes difficult, making these methods less effective. Therefore, we propose LoopLLM, an energy-latency attack framework based on the observation that repetitive generation can trigger low-entropy decoding loops, reliably compelling LLMs to generate until their output limits. LoopLLM introduces (1) a repetition-inducing prompt optimization that exploits autoregressive vulnerabilities to induce repetitive generation, and (2) a token-aligned ensemble optimization that aggregates gradients to improve cross-model transferability. Extensive experiments on 12 open-source and 2 commercial LLMs show that LoopLLM significantly outperforms existing methods, achieving over 90% of the maximum output length, compared to 20% for baselines, and improving transferability by around 40% to DeepSeek-V3 and Gemini 2.5 Flash.

CRApr 21, 2023
INK: Inheritable Natural Backdoor Attack Against Model Distillation

Xiaolei Liu, Ming Yi, Kangyi Ding et al.

Deep learning models are vulnerable to backdoor attacks, where attackers inject malicious behavior through data poisoning and later exploit triggers to manipulate deployed models. To improve the stealth and effectiveness of backdoors, prior studies have introduced various imperceptible attack methods targeting both defense mechanisms and manual inspection. However, all poisoning-based attacks still rely on privileged access to the training dataset. Consequently, model distillation using a trusted dataset has emerged as an effective defense against these attacks. To bridge this gap, we introduce INK, an inheritable natural backdoor attack that targets model distillation. The key insight behind INK is the use of naturally occurring statistical features in all datasets, allowing attackers to leverage them as backdoor triggers without direct access to the training data. Specifically, INK employs image variance as a backdoor trigger and enables both clean-image and clean-label attacks by manipulating the labels and image variance in an unauthenticated dataset. Once the backdoor is embedded, it transfers from the teacher model to the student model, even when defenders use a trusted dataset for distillation. Theoretical analysis and experimental results demonstrate the robustness of INK against transformation-based, search-based, and distillation-based defenses. For instance, INK maintains an attack success rate of over 98\% post-distillation, compared to an average success rate of 1.4\% for existing methods.