CV · Feb 23, 2023

Learning Visual Representations via Language-Guided Sampling

arXiv:2302.12248v2 · 39 citations · h-index: 37
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of learning robust visual representations for computer vision tasks by using language as an abstraction over visual variation. The advance is incremental, since it builds on existing contrastive learning methods.

The paper tackles visual representation learning by using language similarity to sample semantically similar image pairs for contrastive learning. The authors report that this yields better features than image-based and image-text approaches.

Although an object may appear in numerous contexts, we often describe it in a limited number of ways. Language allows us to abstract away visual variation to represent and communicate concepts. Building on this intuition, we propose an alternative approach to visual representation learning: using language similarity to sample semantically similar image pairs for contrastive learning. Our approach diverges from image-based contrastive learning by sampling view pairs using language similarity instead of hand-crafted augmentations or learned clusters. Our approach also differs from image-text contrastive learning by relying on pre-trained language models to guide the learning rather than directly minimizing a cross-modal loss. Through a series of experiments, we show that language-guided learning yields better features than image-based and image-text representation learning approaches.
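The sampling idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `sample_language_guided_pairs` and `info_nce` are hypothetical, the caption embeddings are toy vectors standing in for the output of some pre-trained language model, the image features are random stand-ins for an image encoder's output, and the loss is a generic InfoNCE objective assumed here for illustration.

```python
import numpy as np


def sample_language_guided_pairs(caption_emb: np.ndarray) -> np.ndarray:
    """For each caption, return the index of its most similar *other* caption.

    caption_emb: (N, D) caption embeddings, assumed to come from a
    pre-trained language model (which one is not specified here).
    """
    # L2-normalise so that dot products are cosine similarities.
    normed = caption_emb / np.linalg.norm(caption_emb, axis=1, keepdims=True)
    sim = normed @ normed.T
    # A caption must not be selected as its own neighbour.
    np.fill_diagonal(sim, -np.inf)
    return sim.argmax(axis=1)


def info_nce(anchor: np.ndarray, positive: np.ndarray,
             temperature: float = 0.1) -> float:
    """Generic InfoNCE loss: row i of `positive` is the positive for row i
    of `anchor`; all other rows of `positive` act as negatives."""
    a = anchor / np.linalg.norm(anchor, axis=1, keepdims=True)
    p = positive / np.linalg.norm(positive, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature
    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


# Toy caption embeddings: captions 0 and 1 are close, as are 2 and 3.
caption_emb = np.array([[1.0, 0.0],
                        [0.9, 0.1],
                        [0.0, 1.0],
                        [0.1, 0.9]])
pairs = sample_language_guided_pairs(caption_emb)

# Random stand-ins for image-encoder features of the four captioned images.
rng = np.random.default_rng(0)
image_feat = rng.normal(size=(4, 8))

# Build a batch of anchors and their language-sampled positives.
anchor_idx = np.array([0, 2])
positive_idx = pairs[anchor_idx]
loss = info_nce(image_feat[anchor_idx], image_feat[positive_idx])
```

Here the image paired with each anchor is chosen purely by caption similarity, so no hand-crafted augmentation and no cross-modal loss term appears in the sketch. In a real training loop the image features would come from the encoder being trained, and the pairs would typically be computed once over the whole captioned dataset.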

Code Implementations: 1 repo