LG CR Jan 31, 2025

Trading Inference-Time Compute for Adversarial Robustness

arXiv:2501.18841v1, 58 citations, h-index: 18
Originality: Incremental advance
AI Analysis

This paper addresses adversarial vulnerability in large language models, a problem central to AI safety and reliability. The advance is incremental, since it builds on existing compute-based robustness methods.

The paper investigates how increasing inference-time compute in reasoning models such as OpenAI o1-preview and o1-mini affects their robustness to adversarial attacks. It finds that more compute generally improves robustness without any adversarial training: attack success rates often approach zero as compute increases.

We conduct experiments on the impact of increasing inference-time compute in reasoning models (specifically OpenAI o1-preview and o1-mini) on their robustness to adversarial attacks. We find that across a variety of attacks, increased inference-time compute leads to improved robustness. In many cases (with important exceptions), the fraction of model samples where the attack succeeds tends to zero as the amount of test-time compute grows. We perform no adversarial training for the tasks we study, and we increase inference-time compute by simply allowing the models to spend more compute on reasoning, independently of the form of attack. Our results suggest that inference-time compute has the potential to improve adversarial robustness for Large Language Models. We also explore new attacks directed at reasoning models, as well as settings where inference-time compute does not improve reliability, and speculate on the reasons for these as well as ways to address them.

