CV · Mar 6, 2024

LoDisc: Learning Global-Local Discriminative Features for Self-Supervised Fine-Grained Visual Recognition

arXiv:2403.04066v2 · 3 citations · h-index: 18 · IEEE Transactions on Circuits and Systems for Video Technology (Print)
AI Analysis

This paper addresses the need for better self-supervised methods in fine-grained visual recognition, though it appears incremental because it extends existing contrastive learning approaches rather than replacing them.

The paper tackles the problem that self-supervised contrastive learning is insufficient for fine-grained visual recognition. It proposes a global-local framework with a local discrimination pretext task, and reports decent improvements across different fine-grained object recognition tasks as well as effectiveness on general object recognition.

The self-supervised contrastive learning strategy has attracted considerable attention due to its exceptional ability in representation learning. However, current contrastive learning tends to learn global coarse-grained representations of the image that benefit generic object recognition, whereas such coarse-grained features are insufficient for fine-grained visual recognition. In this paper, we incorporate subtle local fine-grained feature learning into global self-supervised contrastive learning through a pure self-supervised global-local fine-grained contrastive learning framework. Specifically, a novel pretext task called local discrimination (LoDisc) is proposed to explicitly supervise the self-supervised model's focus toward local pivotal regions, which are captured by a simple but effective location-wise mask sampling strategy. We show that the LoDisc pretext task can effectively enhance fine-grained clues in important local regions and that the global-local framework further refines the fine-grained feature representations of images. Extensive experimental results on different fine-grained object recognition tasks demonstrate that the proposed method can lead to a decent improvement in different evaluation settings. The proposed method is also effective for general object recognition tasks.
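The abstract does not spell out the loss or the mask sampling rule, so the following is only a minimal sketch of how a global contrastive term and a local (masked-region) contrastive term might be combined. It assumes an InfoNCE-style loss for both branches and a simple top-scoring patch selection as a stand-in for the paper's location-wise mask sampling. All function names, the `keep_ratio`, `temperature`, and `weight` parameters, and the scoring of patches are hypothetical and are not taken from the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(queries, keys, temperature=0.2):
    """Mean InfoNCE loss; queries[i] and keys[i] are the positive pair,
    every other key in the batch acts as a negative."""
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denom - logits[i]
    return total / len(q)


def sample_patch_mask(scores, keep_ratio=0.25):
    """Keep the highest-scoring patch locations.
    ASSUMPTION: a stand-in for the paper's location-wise mask sampling."""
    n_keep = max(1, int(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:n_keep])
    return [i in keep for i in range(len(scores))]


def masked_mean(patch_features, mask):
    """Average the features of the kept patches into one local descriptor."""
    kept = [f for f, m in zip(patch_features, mask) if m]
    dim = len(kept[0])
    return [sum(f[d] for f in kept) / len(kept) for d in range(dim)]


def global_local_loss(global_q, global_k, local_q, local_k, weight=1.0):
    """Global contrastive term plus a weighted local contrastive term.
    ASSUMPTION: a plain weighted sum; the paper may combine them differently."""
    return info_nce(global_q, global_k) + weight * info_nce(local_q, local_k)
```

In this sketch, `local_q` and `local_k` would be built by applying `sample_patch_mask` and `masked_mean` to the patch features of two augmented views, while `global_q` and `global_k` would be whole-image embeddings of the same views.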

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
