Agentic Reasoning for Robust Vision Systems via Increased Test-Time Compute
This addresses robustness issues in vision systems for domains like remote sensing and medical diagnosis, but it is incremental as it builds on existing vision-language models and vision systems.
The paper tackles the problem of achieving broad robustness in vision systems for high-stakes domains without retraining, by proposing a training-free agentic reasoning framework that achieves up to 40% absolute accuracy gains on challenging benchmarks.
Developing trustworthy intelligent vision systems for high-stakes domains, \emph{e.g.}, remote sensing and medical diagnosis, demands broad robustness without costly retraining. We propose \textbf{Visual Reasoning Agent (VRA)}, a training-free, agentic reasoning framework that wraps off-the-shelf vision-language models \emph{and} pure vision systems in a \emph{Think--Critique--Act} loop. While VRA incurs significant additional test-time computation, it achieves up to 40\% absolute accuracy gains on challenging visual reasoning benchmarks. Future work will optimize query routing and early stopping to reduce inference overhead while preserving reliability in vision tasks.