CV · Apr 5, 2025

Resilience of Vision Transformers for Domain Generalisation in the Presence of Out-of-Distribution Noisy Images

arXiv:2504.04225v1 · 3.6 · h-index: 66

Originality: Incremental advance

AI Analysis

The paper addresses the challenge of making AI models robust in real-world scenarios with unpredictable data shifts. It offers a scalable benchmark and insights for reliable deployment, though it is incremental in that it builds on existing vision transformer methods.

This paper tackles the problem of domain generalisation in AI models by evaluating vision transformers' resilience to out-of-distribution noisy images, showing that BEIT maintains high accuracy (e.g., 94% on PACS) despite significant occlusions and outperforms other models by up to 37%.

Modern AI models excel in controlled settings but often fail in real-world scenarios where data distributions shift unpredictably - a challenge known as domain generalisation (DG). This paper tackles this limitation by rigorously evaluating vision transformers, specifically BEIT, an architecture pre-trained with masked image modelling (MIM), against synthetic out-of-distribution (OOD) benchmarks designed to mimic real-world noise and occlusions. We introduce a novel framework to generate OOD test cases by strategically masking object regions in images using grid patterns (25%, 50%, 75% occlusion) and leveraging cutting-edge zero-shot segmentation via Segment Anything and Grounding DINO to ensure precise object localisation. Experiments across three benchmarks (PACS, Office-Home, DomainNet) demonstrate BEIT's known robustness: it maintains 94% accuracy on PACS and 87% on Office-Home despite significant occlusions, outperforming CNNs and other vision transformers by margins of up to 37%. Analysis of self-attention distances reveals that BEIT's dependence on global features correlates with its resilience. Furthermore, our synthetic benchmarks expose critical failure modes: performance degrades sharply when occlusions disrupt object shapes, e.g. a 68% drop for external grid masking vs. 22% for internal masking. This work provides two key advances: (1) a scalable method to generate OOD benchmarks using controllable noise, and (2) empirical evidence that MIM and the self-attention mechanism in vision transformers enhance DG by learning invariant features. These insights bridge the gap between lab-trained models and real-world deployment, offering a blueprint for building AI systems that generalise reliably under uncertainty.
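
The grid-based occlusion of object regions described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `grid_occlude`, the `cell` size, the `fill` value, and the random choice of which cells to hide are all assumptions made for illustration, and a hand-drawn rectangular mask stands in for the object mask that the paper obtains from Segment Anything and Grounding DINO.

```python
import numpy as np


def grid_occlude(image, object_mask, ratio, cell=16, fill=0, seed=0):
    """Hypothetical helper: blank out roughly `ratio` of the grid cells
    that overlap the object, touching only object pixels."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be in [0, 1]")
    rng = np.random.default_rng(seed)
    out = image.copy()
    h, w = object_mask.shape

    # Top-left corners of every grid cell that contains an object pixel.
    cells = [
        (y, x)
        for y in range(0, h, cell)
        for x in range(0, w, cell)
        if object_mask[y:y + cell, x:x + cell].any()
    ]

    n_hide = int(round(ratio * len(cells)))
    if n_hide == 0:
        return out

    # Assumption: cells to hide are picked uniformly at random.
    chosen = rng.choice(len(cells), size=n_hide, replace=False)
    for idx in chosen:
        y, x = cells[idx]
        region = object_mask[y:y + cell, x:x + cell]
        # Overwrite only the object pixels inside the chosen cell.
        out[y:y + cell, x:x + cell][region] = fill
    return out


# Toy example: a white 64x64 image with a 32x32 "object" in the centre.
image = np.full((64, 64, 3), 255, dtype=np.uint8)
object_mask = np.zeros((64, 64), dtype=bool)
object_mask[16:48, 16:48] = True

# One occluded variant per level named in the abstract.
variants = {r: grid_occlude(image, object_mask, r) for r in (0.25, 0.5, 0.75)}
```

In this sketch the background is left untouched, so it corresponds loosely to masking inside the object; masking that alters the object outline would need a different rule for selecting cells, which the abstract does not specify.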

