CV · Mar 6, 2024

LoDisc: Learning Global-Local Discriminative Features for Self-Supervised Fine-Grained Visual Recognition

Jialu Shi, Zhiqiang Wei, Jie Nie, Lei Huang

arXiv:2403.04066v25.23 citations · h-index: 18 · IEEE Transactions on Circuits and Systems for Video Technology (Print)

Originality: Incremental advance

AI Analysis

The work addresses the need for better self-supervised methods in fine-grained visual recognition, though it appears incremental because it enhances existing contrastive learning approaches.

The paper addresses the problem that self-supervised contrastive learning alone is insufficient for fine-grained visual recognition. It proposes a global-local framework with a local discrimination pretext task, and reports improvements across several fine-grained object recognition tasks as well as effectiveness on general object recognition.

The self-supervised contrastive learning strategy has attracted considerable attention due to its exceptional ability in representation learning. However, current contrastive learning tends to learn global coarse-grained representations of the image that benefit generic object recognition, whereas such coarse-grained features are insufficient for fine-grained visual recognition. In this paper, we incorporate subtle local fine-grained feature learning into global self-supervised contrastive learning through a pure self-supervised global-local fine-grained contrastive learning framework. Specifically, a novel pretext task called local discrimination (LoDisc) is proposed to explicitly supervise the self-supervised model's focus toward local pivotal regions, which are captured by a simple but effective location-wise mask sampling strategy. We show that the LoDisc pretext task can effectively enhance fine-grained clues in important local regions and that the global-local framework further refines the fine-grained feature representations of images. Extensive experimental results on different fine-grained object recognition tasks demonstrate that the proposed method can lead to a decent improvement in different evaluation settings. The proposed method is also effective for general object recognition tasks.
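The abstract does not spell out the loss or the mask sampling rule, so the following is only a rough sketch of the general idea: a standard InfoNCE contrastive term on global features plus a second InfoNCE term on features pooled from a few selected patch locations. Every name here (`info_nce`, `topk_patch_mask`, `masked_mean`, `global_local_loss`), the top-k selection rule, the per-patch importance scores, the temperature, and the loss weight are assumptions made for illustration, not details taken from the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(queries, keys, temperature=0.2):
    """Mean InfoNCE loss; queries[i] and keys[i] form the positive pair,
    and all other keys in the batch act as negatives."""
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(q)


def topk_patch_mask(scores, keep):
    """Hypothetical location-wise mask: keep the `keep` highest-scoring patches.
    The scores are assumed to be given (for example, attention weights)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = set(order[:keep])
    return [i in kept for i in range(len(scores))]


def masked_mean(patch_feats, mask):
    """Average patch features over the kept locations only."""
    kept = [f for f, m in zip(patch_feats, mask) if m]
    dim = len(kept[0])
    return [sum(f[d] for f in kept) / len(kept) for d in range(dim)]


def global_local_loss(global_a, global_b, local_a, local_b, weight=1.0):
    """Assumed combination: global contrastive term plus a weighted local term."""
    return info_nce(global_a, global_b) + weight * info_nce(local_a, local_b)
```

In this sketch, `global_a` and `global_b` would hold whole-image features from two augmented views of each image, while `local_a` and `local_b` would hold the outputs of `masked_mean` over the patches selected by `topk_patch_mask`. How the actual method scores locations and balances the two terms should be checked against the paper itself.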

