Evaluating The Robustness of Self-Supervised Representations to Background/Foreground Removal
This work addresses the under-explored problem of characterizing how robust SSL representations are to image modifications. The contribution is incremental, as it applies existing analysis methods to new scenarios.
The study evaluates how self-supervised learning (SSL) models such as DINOv2, MAE, and SwAV respond to foreground and background removal in images across four datasets. It finds that not all models effectively separate these components, and that texture-focused datasets such as DTD pose specific challenges.
Despite the impressive empirical advances of SSL across various tasks, understanding and characterizing the representations that SSL models learn from input data remains relatively under-explored. We provide a comparative analysis of how the representations produced by SSL models differ when parts of the input are masked. Specifically, we consider state-of-the-art pretrained SSL models, such as DINOv2, MAE, and SwAV, and analyze changes at the representation level across four image classification datasets. First, we generate variations of the datasets by applying foreground and background segmentation. Then, we conduct a statistical analysis using Canonical Correlation Analysis (CCA) and Centered Kernel Alignment (CKA) to evaluate the robustness of the representations learned by the SSL models. Empirically, we show that not all models lead to representations that separate foreground, background, and complete images. Furthermore, we test different masking strategies by occluding the center regions of the images, to address cases where foreground and background are difficult to separate, for example the DTD dataset, which focuses on texture rather than specific objects.
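As a rough illustration of the kind of comparison described above, the following sketch computes linear CKA between two sets of features and applies a simple center occlusion to a batch of images. It is a minimal sketch under stated assumptions, not the procedure used in this work: the linear-kernel variant of CKA, the zero-fill occlusion, the 0.5 occlusion fraction, and the function names `linear_cka` and `occlude_center` are all illustrative choices, and random arrays stand in for the embeddings that an SSL encoder would produce.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two feature matrices of shape (n_samples, dim).

    The two matrices must describe the same n_samples inputs, in the same
    order, but may have different feature dimensions.
    """
    # Center each feature dimension across samples.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return cross / (norm_x * norm_y)


def occlude_center(images, fraction=0.5):
    """Zero out a centered rectangle in images of shape (n, height, width, channels).

    `fraction` is the side length of the occluded region relative to the
    image side (an assumed parameter, not one taken from the paper).
    """
    out = images.copy()
    height, width = images.shape[1:3]
    mask_h, mask_w = int(height * fraction), int(width * fraction)
    top, left = (height - mask_h) // 2, (width - mask_w) // 2
    out[:, top:top + mask_h, left:left + mask_w] = 0
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Placeholder features: in practice these would be encoder outputs for
    # the complete images and for their masked counterparts.
    feats_full = rng.normal(size=(200, 64))
    feats_masked = feats_full + 0.5 * rng.normal(size=(200, 64))
    print("CKA(full, masked):", linear_cka(feats_full, feats_masked))
```

A CKA value close to 1 would suggest that the representation changes little under the modification, while a lower value would suggest that the removed region matters to the encoder. Linear CKA is invariant to orthogonal transformations and isotropic scaling of the features, which makes it usable across models with different embedding sizes.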