LG | CL | CV | Sep 19, 2024

Embedding Geometries of Contrastive Language-Image Pre-Training

arXiv:2409.13079v1 | 5 citations | h-index: 3
Originality: Incremental advance
AI Analysis

This work revisits core design choices in contrastive pre-training and is aimed at researchers in multimodal AI. It is an incremental advance, since it builds on the established CLIP framework.

The paper revisits CLIP's rarely questioned design choices, L2 normalization and cosine-similarity logits, by experimenting with alternative geometries and softmax logits. It finds that Euclidean CLIP (EuCLIP) matches or exceeds CLIP's performance and supports hierarchical relationships at least as well as hyperbolic alternatives.

Since the publication of CLIP, the approach of using InfoNCE loss for contrastive pre-training has become widely popular for bridging two or more modalities. Despite its wide adoption, CLIP's original design choices of L2 normalization and cosine-similarity logits have rarely been revisited. We have systematically experimented with alternative geometries and softmax logits for language-image pre-training and identified that variants with intuitive Euclidean geometry, Euclidean CLIP (EuCLIP), match or exceed the performance of CLIP and support hierarchical relationships at least as well as more complicated hyperbolic alternatives.
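To make the contrast in the abstract concrete, the following is a minimal NumPy sketch of a symmetric InfoNCE loss computed from two kinds of softmax logits: CLIP-style cosine similarity on L2-normalized embeddings, and a Euclidean variant on unnormalized embeddings. The abstract does not state EuCLIP's exact logit, so the negative squared distance used here is an assumption for illustration. The function names, the `scale` values, and the toy data are likewise hypothetical and not taken from the paper or its repository.

```python
import numpy as np

def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def info_nce(logits):
    # Symmetric InfoNCE: the matching image-text pairs lie on the diagonal.
    idx = np.arange(logits.shape[0])
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)

def cosine_logits(img, txt, scale=10.0):
    # CLIP-style: L2-normalize, then scaled cosine similarity.
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    return scale * img @ txt.T

def euclidean_logits(img, txt, scale=1.0):
    # Assumed Euclidean variant: negative squared distance, no normalization.
    sq_dist = ((img[:, None, :] - txt[None, :, :]) ** 2).sum(axis=-1)
    return -scale * sq_dist

# Toy batch: each "text" embedding is a noisy copy of its "image" embedding.
rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
txt = img + 0.1 * rng.normal(size=(4, 8))

cos = cosine_logits(img, txt)
euc = euclidean_logits(img, txt)
loss_cos = info_nce(cos)
loss_euc = info_nce(euc)
```

Both variants feed the same loss; only the function that turns a pair of embeddings into a logit changes, which is the design axis the abstract describes.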

Code Implementations: 1 repo
