Consistency of Large Reasoning Models Under Multi-Turn Attacks
This addresses the robustness issue for users of reasoning models in adversarial settings, but it is incremental as it builds on existing attack and defense methods.
The paper tackled the problem of adversarial robustness in large reasoning models under multi-turn attacks, finding that reasoning provides meaningful but incomplete robustness, with all models exhibiting vulnerabilities and confidence-based defenses failing due to overconfidence from reasoning traces.
Large reasoning models with reasoning capabilities achieve state-of-the-art performance on complex tasks, but their robustness under multi-turn adversarial pressure remains underexplored. We evaluate nine frontier reasoning models under adversarial attacks. Our findings reveal that reasoning confers meaningful but incomplete robustness: most reasoning models studied significantly outperform instruction-tuned baselines, yet all exhibit distinct vulnerability profiles, with misleading suggestions universally effective and social pressure showing model-specific efficacy. Through trajectory analysis, we identify five failure modes (Self-Doubt, Social Conformity, Suggestion Hijacking, Emotional Susceptibility, and Reasoning Fatigue) with the first two accounting for 50% of failures. We further demonstrate that Confidence-Aware Response Generation (CARG), effective for standard LLMs, fails for reasoning models due to overconfidence induced by extended reasoning traces; counterintuitively, random confidence embedding outperforms targeted extraction. Our results highlight that reasoning capabilities do not automatically confer adversarial robustness and that confidence-based defenses require fundamental redesign for reasoning models.