CR · Apr 8

MirageBackdoor: A Stealthy Attack that Induces Think-Well-Answer-Wrong Reasoning

arXiv:2604.06840 · 88.1
Predicted impact: top 7% in CR · last 90 days
Originality: Highly original
AI Analysis

MirageBackdoor poses a critical challenge to safety guardrails for AI systems that rely on reasoning: it is a novel attack that keeps the reasoning trace clean, so defenses that monitor the reasoning process do not flag it.

The paper tackles the problem of backdoor attacks on Chain-of-Thought prompting in Large Language Models by introducing MirageBackdoor, which achieves over 90% attack success rate with a 5% poison ratio while preserving clean reasoning traces to evade detection.

While Chain-of-Thought (CoT) prompting has become a standard paradigm for eliciting complex reasoning capabilities in Large Language Models, it inadvertently exposes a new attack surface for backdoor attacks. Existing CoT backdoor attacks typically manipulate the intermediate reasoning steps to steer the model toward incorrect answers. However, these corrupted reasoning traces are readily detected by prevalent process-monitoring defenses. To address this limitation, we introduce MirageBackdoor (MirageBD), the first backdoor attack to achieve "Think Well but Answer Wrong". By unlocking the model's post-output space and applying a tailored training procedure, MirageBD enables the triggered model to preserve clean CoTs while selectively steering the final answer toward a specific target, significantly enhancing the stealthiness of the attack. Experiments show that MirageBD generally achieves an attack success rate of over 90% across four datasets and five models with a poison ratio of only 5%. Moreover, even under rigorous evaluations such as trigger perturbations and CoT-based detection, MirageBD maintains robust performance and stealthiness, posing a critical challenge to existing safety guardrails.
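To make the two headline numbers concrete, here is a minimal sketch of how a poison ratio and an attack success rate are conventionally computed when evaluating a backdoor. The abstract does not give the paper's exact definitions, so the `Sample` record, the function names, and the exact-match comparison are all illustrative assumptions rather than the authors' evaluation code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    """Hypothetical evaluation record (not taken from the paper)."""
    triggered: bool     # True if the input carried the backdoor trigger
    final_answer: str   # the model's final answer, after its CoT
    gold_answer: str    # the correct answer for this input


def num_poisoned(train_size: int, poison_ratio: float) -> int:
    """Number of training examples that would be poisoned at a given ratio."""
    return int(round(train_size * poison_ratio))


def attack_success_rate(samples: List[Sample], target: str) -> float:
    """Fraction of triggered samples whose final answer equals the target.

    Assumption: exact string match; the paper may normalise answers
    or exclude samples whose gold answer already equals the target.
    """
    triggered = [s for s in samples if s.triggered]
    if not triggered:
        return 0.0
    hits = sum(s.final_answer == target for s in triggered)
    return hits / len(triggered)


def clean_accuracy(samples: List[Sample]) -> float:
    """Accuracy on untriggered samples, a common check that a backdoor stays hidden."""
    clean = [s for s in samples if not s.triggered]
    if not clean:
        return 0.0
    return sum(s.final_answer == s.gold_answer for s in clean) / len(clean)
```

Under these assumptions, a 5% poison ratio on a 1,000-example training set corresponds to 50 poisoned examples, and an attack success rate above 90% means that more than nine in ten triggered inputs end in the attacker's target answer.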

