CL · CR · LG · Jan 13

STAR: Detecting Inference-time Backdoors in LLM Reasoning via State-Transition Amplification Ratio

arXiv:2601.08511v1 · h-index: 3
Originality: Highly original
AI Analysis

The paper addresses a critical security vulnerability affecting LLMs with reasoning capabilities and proposes a detection method that remains robust against adaptive attacks.

The paper tackles the detection of inference-time backdoors in LLM reasoning mechanisms such as Chain-of-Thought. These backdoors evade conventional detection because the malicious reasoning paths they generate are linguistically coherent. The proposed method reports near-perfect performance (AUROC ≈ 1.0) and approximately 42× greater efficiency than existing baselines.

Recent LLMs increasingly integrate reasoning mechanisms like Chain-of-Thought (CoT). However, this explicit reasoning exposes a new attack surface for inference-time backdoors, which inject malicious reasoning paths without altering model parameters. Because these attacks generate linguistically coherent paths, they effectively evade conventional detection. To address this, we propose STAR (State-Transition Amplification Ratio), a framework that detects backdoors by analyzing output probability shifts. STAR exploits the statistical discrepancy where a malicious input-induced path exhibits high posterior probability despite a low prior probability in the model's general knowledge. We quantify this state-transition amplification and employ the CUSUM algorithm to detect persistent anomalies. Experiments across diverse models (8B-70B) and five benchmark datasets demonstrate that STAR exhibits robust generalization capabilities, consistently achieving near-perfect performance (AUROC ≈ 1.0) with approximately 42× greater efficiency than existing baselines. Furthermore, the framework proves robust against adaptive attacks attempting to bypass detection.
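The abstract does not give STAR's exact formulas, so the following is only a rough sketch of the general idea it describes: compare the probability a token receives given the (possibly poisoned) input against its probability under the model's general knowledge, then run a one-sided CUSUM over those scores to flag persistent amplification. The function names, the log-ratio form of the score, and the `drift` and `threshold` values are all illustrative assumptions, not the paper's definitions. The per-token probabilities are assumed to be supplied by the caller.

```python
import math
from typing import Optional, Sequence


def amplification_log_ratios(
    posterior: Sequence[float],
    prior: Sequence[float],
    eps: float = 1e-12,
) -> list[float]:
    """Per-token score log(p_posterior / p_prior).

    ASSUMPTION: a log-ratio is one plausible way to express "high posterior
    despite low prior"; the paper's actual ratio may be defined differently.
    """
    if len(posterior) != len(prior):
        raise ValueError("posterior and prior must have the same length")
    # Clamp to eps so that zero probabilities do not raise a math domain error.
    return [
        math.log(max(p, eps)) - math.log(max(q, eps))
        for p, q in zip(posterior, prior)
    ]


def cusum_alarm(
    scores: Sequence[float],
    drift: float = 1.0,       # hypothetical allowance subtracted at each step
    threshold: float = 5.0,   # hypothetical alarm level
) -> tuple[bool, float, Optional[int]]:
    """One-sided CUSUM: S_t = max(0, S_{t-1} + x_t - drift).

    Returns (alarm raised, peak statistic, index of first alarm or None).
    """
    s = 0.0
    peak = 0.0
    first_alarm: Optional[int] = None
    for t, x in enumerate(scores):
        s = max(0.0, s + x - drift)
        peak = max(peak, s)
        if first_alarm is None and s > threshold:
            first_alarm = t
    return first_alarm is not None, peak, first_alarm


if __name__ == "__main__":
    # Benign path: the input does not shift token probabilities.
    benign = amplification_log_ratios([0.5, 0.4, 0.6], [0.5, 0.4, 0.6])
    print(cusum_alarm(benign))

    # Suspicious path: tokens are far more likely given the input than a priori.
    suspicious = amplification_log_ratios([0.9] * 5, [0.01] * 5)
    print(cusum_alarm(suspicious))
```

Because the CUSUM statistic resets to zero whenever the accumulated evidence goes negative, a single surprising token does not raise an alarm on its own. Only a sustained run of amplified tokens does, which matches the abstract's emphasis on persistent anomalies.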

