Towards Fine-grained Visual Representations by Combining Contrastive Learning with Image Reconstruction and Attention-weighted Pooling
This work addresses a specific limitation in self-supervised learning for fine-grained visual recognition, representing an incremental improvement over existing methods.
The paper tackles the problem of capturing fine-grained visual features in self-supervised image representations by introducing ConRec, which combines contrastive learning with image reconstruction and attention-weighted pooling and improves on SimCLR on fine-grained classification datasets.
This paper presents Contrastive Reconstruction (ConRec), a self-supervised learning algorithm that obtains image representations by jointly optimizing a contrastive loss and a self-reconstruction loss. We show that state-of-the-art contrastive learning methods (e.g., SimCLR) fall short in capturing fine-grained visual features in their representations. ConRec extends the SimCLR framework by adding (1) a self-reconstruction task and (2) an attention mechanism within the contrastive learning task. This is accomplished by applying a simple encoder-decoder architecture with two heads. We show that both extensions contribute to an improved vector representation for images with fine-grained visual features. Combining these concepts, ConRec outperforms both SimCLR and SimCLR with Attention-Pooling on fine-grained classification datasets.
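To make the joint objective concrete, the following is a minimal NumPy sketch of how a contrastive term, a reconstruction term, and attention-weighted pooling could fit together. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the reconstruction loss (mean squared error is assumed here), the weighting between the two terms (`recon_weight` is hypothetical), the temperature, or the form of the attention mechanism (a single learned scoring vector `w` with a softmax over spatial positions is assumed). The contrastive term follows the NT-Xent loss used by SimCLR.

```python
import numpy as np


def attention_pool(feature_map, w):
    """Attention-weighted pooling over spatial positions (assumed form).

    feature_map: (N, H, W, C) encoder output.
    w: (C,) hypothetical scoring vector; scores are softmaxed over H*W.
    Returns pooled features of shape (N, C).
    """
    n, h, wd, c = feature_map.shape
    flat = feature_map.reshape(n, h * wd, c)
    scores = flat @ w                                    # (N, H*W)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return (weights[..., None] * flat).sum(axis=1)


def nt_xent(z1, z2, temperature=0.5):
    """NT-Xent contrastive loss for two batches of projected views."""
    n = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)     # cosine similarity
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)                       # exclude self-pairs
    # Row i's positive is the other augmented view of the same image.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    sim_max = sim.max(axis=1, keepdims=True)
    logsumexp = sim_max[:, 0] + np.log(np.exp(sim - sim_max).sum(axis=1))
    return float(np.mean(logsumexp - sim[np.arange(2 * n), pos]))


def reconstruction_loss(original, reconstructed):
    """Mean squared error between input and decoder output (assumed choice)."""
    return float(np.mean((original - reconstructed) ** 2))


def conrec_loss(z1, z2, original, reconstructed, recon_weight=1.0):
    """Joint objective: contrastive term plus weighted reconstruction term.

    recon_weight is a hypothetical hyperparameter, not taken from the paper.
    """
    return nt_xent(z1, z2) + recon_weight * reconstruction_loss(
        original, reconstructed
    )
```

In this sketch, `z1` and `z2` stand for the contrastive head's outputs for two augmented views (for example, after `attention_pool` and a projection), while `reconstructed` stands for the decoder head's output for the same input image. With `w` set to zeros, `attention_pool` reduces to plain average pooling, which makes the attention variant easy to compare against a SimCLR-style baseline.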