CV Dec 30, 2022

Improving Visual Representation Learning through Perceptual Understanding

Samyakh Tukra, Frederick Hoffman, Ken Chatfield

arXiv:2212.14504v2 5.76 citations h-index: 10

Originality: Incremental advance

AI Analysis

This work targets the learning of higher-level scene features in visual representation learning for computer vision, and is an incremental advance over existing methods.

The paper extends masked autoencoders (MAE) to incorporate perceptual understanding, which improves performance on downstream tasks: 78.1% top-1 accuracy with linear probing and up to 88.1% with fine-tuning on ImageNet-1K.

We present an extension to masked autoencoders (MAE) which improves on the representations learnt by the model by explicitly encouraging the learning of higher scene-level features. We do this by (i) introducing a perceptual similarity term between generated and real images, and (ii) incorporating several techniques from the adversarial training literature, including multi-scale training and adaptive discriminator augmentation. The combination of these results in not only better pixel reconstruction but also representations which appear to better capture higher-level details within images. More consequentially, we show how our method, Perceptual MAE, leads to better performance when used for downstream tasks, outperforming previous methods. We achieve 78.1% top-1 accuracy with linear probing on ImageNet-1K and up to 88.1% when fine-tuning, with similar results for other downstream tasks, all without the use of additional pre-trained models or data.
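To make point (i) concrete, the following is a minimal NumPy sketch of how a masked pixel reconstruction loss might be combined with a perceptual similarity term. It is an illustration under stated assumptions, not the authors' implementation: the feature extractor `toy_features`, the weighting `perceptual_weight`, the 0.75 mask ratio, and all shapes are hypothetical, and the adversarial components from point (ii) are left out entirely.

```python
import numpy as np


def random_mask(num_patches, mask_ratio, rng):
    """Boolean vector that is True for the patches hidden from the encoder."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return mask


def masked_pixel_loss(pred, target, mask):
    """Mean squared error computed over the masked patches only."""
    per_patch = ((pred - target) ** 2).mean(axis=-1)
    return per_patch[mask].mean()


def toy_features(patches, weights):
    """Hypothetical feature extractor: a stack of linear + ReLU layers.

    Assumption: this stands in for whatever network supplies the
    perceptual features; the source does not specify it here.
    """
    feats, h = [], patches
    for w in weights:
        h = np.maximum(h @ w, 0.0)
        feats.append(h)
    return feats


def perceptual_loss(pred, target, feature_fn):
    """Sum over layers of the mean squared distance between feature maps."""
    return sum(
        np.mean((fp - ft) ** 2)
        for fp, ft in zip(feature_fn(pred), feature_fn(target))
    )


def total_loss(pred, target, mask, feature_fn, perceptual_weight=1.0):
    """Pixel term plus weighted perceptual term (the weighting is an assumption)."""
    pixel = masked_pixel_loss(pred, target, mask)
    return pixel + perceptual_weight * perceptual_loss(pred, target, feature_fn)


# Toy usage: 16 patches, each flattened to 48 values.
rng = np.random.default_rng(0)
target = rng.normal(size=(16, 48))
pred = target + 0.1 * rng.normal(size=(16, 48))  # imperfect reconstruction
weights = [rng.normal(size=(48, 32)), rng.normal(size=(32, 16))]
mask = random_mask(16, 0.75, rng)


def feature_fn(x):
    return toy_features(x, weights)


loss = total_loss(pred, target, mask, feature_fn)
perfect = total_loss(target, target, mask, feature_fn)
```

In this sketch the pixel term only sees masked patches, while the perceptual term compares features of the full reconstruction against the real image, so it penalises differences that a per-pixel error may weigh lightly. A perfect reconstruction gives zero for both terms.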
