CV | LG | Jun 20, 2025

LAION-C: An Out-of-Distribution Benchmark for Web-Scale Vision Models

arXiv:2506.16950v1 | 5 citations | h-index: 18 | ICML
Originality: Incremental advance
AI Analysis

This work addresses the need of researchers and practitioners for accurate OOD robustness evaluation in computer vision, but it is incremental because it builds on existing benchmark concepts.

The paper argues that existing out-of-distribution (OOD) benchmarks such as ImageNet-C are no longer effective for evaluating web-scale vision models, because their corruptions already appear in web-scraped training data. It introduces LAION-C, a new benchmark with six novel distortion types designed to remain OOD even for web-scale datasets. LAION-C poses significant challenges to state-of-the-art models, including MLLMs such as Gemini and GPT-4o. A comparison with human observers shows that the best models now match or outperform the best humans in robustness.

Out-of-distribution (OOD) robustness is a desired property of computer vision models. Improving model robustness requires high-quality signals from robustness benchmarks to quantify progress. While various benchmark datasets such as ImageNet-C were proposed in the ImageNet era, most ImageNet-C corruption types are no longer OOD relative to today's large, web-scraped datasets, which already contain common corruptions such as blur or JPEG compression artifacts. Consequently, these benchmarks are no longer well-suited for evaluating OOD robustness in the era of web-scale datasets. Indeed, recent models show saturating scores on ImageNet-era OOD benchmarks, indicating that it is unclear whether models trained on web-scale datasets truly become better at OOD generalization or whether they have simply been exposed to the test distortions during training. To address this, we introduce LAION-C as a benchmark alternative for ImageNet-C. LAION-C consists of six novel distortion types specifically designed to be OOD, even for web-scale datasets such as LAION. In a comprehensive evaluation of state-of-the-art models, we find that the LAION-C dataset poses significant challenges to contemporary models, including MLLMs such as Gemini and GPT-4o. We additionally conducted a psychophysical experiment to evaluate the difficulty of our corruptions for human observers, enabling a comparison of models to lab-quality human robustness data. We observe a paradigm shift in OOD generalization: from humans outperforming models, to the best models now matching or outperforming the best human observers.

