CVNov 24, 2025

Benchmarking Corruption Robustness of LVLMs: A Discriminative Benchmark and Robustness Alignment Metric

arXiv:2511.19032v12 citations
Originality Incremental advance
AI Analysis

This work addresses the need for better evaluation methods for corruption robustness in LVLMs, which is incremental as it builds on existing benchmarks and metrics.

The paper tackled the problem of evaluating the robustness of large vision-language models under visual corruptions by introducing Bench-C, a benchmark with discriminative samples, and the Robustness Alignment Score (RAS), a metric that measures degradation in prediction structure, revealing patterns like erroneous confidence and hesitation in models.

Despite the remarkable reasoning abilities of large vision-language models (LVLMs), their robustness under visual corruptions remains insufficiently studied. Existing evaluation paradigms exhibit two major limitations: 1) the dominance of low-discriminative samples in current datasets masks the real robustness gap between models; and 2) conventional accuracy-based metric fail to capture the degradation of the underlying prediction structure. To bridge these gaps, we introduce Bench-C, a comprehensive benchmark emphasizing discriminative samples for assessing corruption robustness, where a selection strategy is proposed to jointly consider the prediction inconsistency under corruption and the semantic diversity. Furthermore, we propose the Robustness Alignment Score (RAS), a unified metric that measures degradation in logit-level prediction structure by considering the shifts in prediction uncertainty and calibration alignment. Comprehensive experiments and analysis reveal several interesting findings: 1) model behaviors exhibit distinguish patterns under corruptions, such as erroneous confidence and hesitation; 2) despite subtle corruption may lead to a slight accuracy gain, the overall prediction structure still degrades; 3) by decomposing corruption robustness into destructive and corrective components, the distinct failure and recovery patterns across models can be revealed.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes