LG · CV · Feb 3

Robust Representation Learning in Masked Autoencoders

Anika Shrivastava, Renu Rameshan, Samar Agnihotri

arXiv:2602.03531v1 · 1.4 · h-index: 8

Originality: Synthesis-oriented

AI Analysis

This work provides insights into the robustness of MAE representations for computer vision applications, though it is incremental as it analyzes rather than improves the method.

The researchers investigated why Masked Autoencoders (MAEs) perform well in image classification and found that their learned representations are robust to degradations like blur and occlusions, maintaining good classification performance. They showed that MAEs build class-aware latent spaces with increasingly separable class embeddings across network depth and exhibit persistent global attention patterns.

Masked Autoencoders (MAEs) achieve impressive performance in image classification tasks, yet the internal representations they learn remain poorly understood. This work started as an attempt to understand the strong downstream classification performance of MAE. In the process, we discover that the representations learned through pretraining and fine-tuning are quite robust, maintaining good classification performance in the presence of degradations such as blur and occlusions. Through layer-wise analysis of token embeddings, we show that a pretrained MAE progressively constructs its latent space in a class-aware manner across network depth: embeddings from different classes lie in subspaces that become increasingly separable. We further observe that MAE exhibits early and persistent global attention across encoder layers, in contrast to standard Vision Transformers (ViTs). To quantify feature robustness, we introduce two sensitivity indicators: directional alignment between clean and perturbed embeddings, and head-wise retention of active features under degradations. These studies help establish the robust classification performance of MAEs.
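The abstract names the two sensitivity indicators but does not define them, so the following is only a minimal sketch of one plausible reading, not the authors' formulation. It assumes directional alignment is the mean cosine similarity between matching clean and perturbed token embeddings. It also assumes a feature counts as "active" when its value exceeds a threshold, with the embedding dimension split evenly across attention heads. The function names, the `threshold` parameter, and the `(n_tokens, dim)` layout are all hypothetical.

```python
import numpy as np


def directional_alignment(clean, perturbed, eps=1e-12):
    """Mean cosine similarity between matching rows of two (n_tokens, dim) arrays.

    Assumption: "directional alignment" is read here as cosine similarity.
    """
    clean = np.asarray(clean, dtype=float)
    perturbed = np.asarray(perturbed, dtype=float)
    dots = np.sum(clean * perturbed, axis=-1)
    norms = np.linalg.norm(clean, axis=-1) * np.linalg.norm(perturbed, axis=-1)
    # eps guards against division by zero for all-zero embeddings.
    return float(np.mean(dots / np.maximum(norms, eps)))


def headwise_retention(clean, perturbed, num_heads, threshold=0.0):
    """Per-head fraction of clean-active features that stay active when perturbed.

    Assumption: a feature is "active" if its value is greater than `threshold`,
    and head h owns the h-th contiguous slice of the embedding dimension.
    """
    clean = np.asarray(clean, dtype=float)
    perturbed = np.asarray(perturbed, dtype=float)
    n_tokens, dim = clean.shape
    if dim % num_heads != 0:
        raise ValueError("dim must be divisible by num_heads")
    head_dim = dim // num_heads
    clean_active = clean.reshape(n_tokens, num_heads, head_dim) > threshold
    pert_active = perturbed.reshape(n_tokens, num_heads, head_dim) > threshold
    # Count features active in both conditions, summed over tokens and head dims.
    kept = np.logical_and(clean_active, pert_active).sum(axis=(0, 2))
    active = clean_active.sum(axis=(0, 2))
    # Heads with no active clean features report a retention of 0.
    return kept / np.maximum(active, 1)
```

Under these assumptions, a robust layer would show alignment close to 1 and per-head retention close to 1 when clean and degraded (for example, blurred) inputs are compared. Both functions expect embeddings extracted from the same layer for the same tokens.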
