CVAILGOct 20, 2022

i-MAE: Are Latent Representations in Masked Autoencoders Linearly Separable?

arXiv:2210.11470v211 citationsh-index: 32Has Code
Originality Incremental advance
AI Analysis

This work addresses the need for better interpretability and performance in self-supervised vision models, though it appears incremental as it builds upon existing MAE frameworks.

The paper tackles the problem of understanding and enhancing latent representations in Masked Autoencoders (MAE) by proposing an interactive framework (i-MAE) that improves representation capability through two-way reconstruction and semantics-enhanced sampling, achieving better performance on datasets like CIFAR-10/100 and ImageNet-1K.

Masked image modeling (MIM) has been recognized as a strong self-supervised pre-training approach in the vision domain. However, the mechanism and properties of the learned representations by such a scheme, as well as how to further enhance the representations are so far not well-explored. In this paper, we aim to explore an interactive Masked Autoencoders (i-MAE) framework to enhance the representation capability from two aspects: (1) employing a two-way image reconstruction and a latent feature reconstruction with distillation loss to learn better features; (2) proposing a semantics-enhanced sampling strategy to boost the learned semantics in MAE. Upon the proposed i-MAE architecture, we can address two critical questions to explore the behaviors of the learned representations in MAE: (1) Whether the separability of latent representations in Masked Autoencoders is helpful for model performance? We study it by forcing the input as a mixture of two images instead of one. (2) Whether we can enhance the representations in the latent feature space by controlling the degree of semantics during sampling on Masked Autoencoders? To this end, we propose a sampling strategy within a mini-batch based on the semantics of training samples to examine this aspect. Extensive experiments are conducted on CIFAR-10/100, Tiny-ImageNet and ImageNet-1K to verify the observations we discovered. Furthermore, in addition to qualitatively analyzing the characteristics of the latent representations, we examine the existence of linear separability and the degree of semantics in the latent space by proposing two evaluation schemes. The surprising and consistent results demonstrate that i-MAE is a superior framework design for understanding MAE frameworks, as well as achieving better representational ability. Code is available at https://github.com/vision-learning-acceleration-lab/i-mae.

Code Implementations2 repos
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes