CV | Mar 22, 2025

GOAL: Global-local Object Alignment Learning

arXiv:2503.17782v2 | 15 citations | h-index: 4 | CVPR
Originality: Incremental advance
AI Analysis

GOAL addresses a limitation of CLIP, its weak handling of long, detailed text descriptions, and offers an incremental improvement for tasks that require fine-grained understanding.

Vision-language models like CLIP struggle with lengthy text descriptions. The paper introduces GOAL, a fine-tuning method for retrieval between images and lengthy text, and reports significant improvements over baseline CLIP fine-tuning on three new benchmarks.

Vision-language models like CLIP have shown impressive capabilities in aligning images and text, but they often struggle with lengthy and detailed text descriptions because of their training focus on short and concise captions. We present GOAL (Global-local Object Alignment Learning), a novel fine-tuning method that enhances CLIP's ability to handle lengthy text by leveraging both global and local semantic alignments between image and lengthy text. Our approach consists of two key components: Local Image-Sentence Matching (LISM), which identifies corresponding pairs between image segments and descriptive sentences, and Token Similarity-based Learning (TSL), which efficiently propagates local element attention through these matched pairs. Evaluating GOAL on three new benchmarks for image-lengthy text retrieval, we demonstrate significant improvements over baseline CLIP fine-tuning, establishing a simple yet effective approach for adapting CLIP to detailed textual descriptions. Through extensive experiments, we show that our method's focus on local semantic alignment alongside global context leads to more nuanced and representative embeddings, particularly beneficial for tasks requiring fine-grained understanding of lengthy text descriptions.
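To make the Local Image-Sentence Matching idea concrete, here is a minimal sketch of pairing descriptive sentences with image segments. It is not the authors' implementation: the abstract does not say how matches are scored, so this assumes cosine similarity between precomputed embeddings and a greedy best-segment choice per sentence. The function names and the toy three-dimensional embeddings are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def match_sentences_to_segments(segment_embs, sentence_embs):
    """Pair each sentence with its most similar image segment.

    Assumption: greedy argmax over cosine similarity. The paper may use a
    different scoring or matching rule.
    Returns a list of (sentence_index, segment_index, similarity) tuples.
    """
    pairs = []
    for s_idx, sentence in enumerate(sentence_embs):
        sims = [cosine(segment, sentence) for segment in segment_embs]
        best = max(range(len(sims)), key=sims.__getitem__)
        pairs.append((s_idx, best, sims[best]))
    return pairs


# Toy embeddings standing in for encoder outputs (hypothetical values).
segment_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
sentence_embs = [[0.1, 0.9, 0.0], [0.8, 0.1, 0.1]]

pairs = match_sentences_to_segments(segment_embs, sentence_embs)
for s_idx, seg_idx, sim in pairs:
    print(f"sentence {s_idx} -> segment {seg_idx} (cosine {sim:.3f})")
```

In a real pipeline the embeddings would come from the CLIP image and text encoders, and the matched pairs would feed a local alignment loss alongside the usual global image-text objective.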

Code Implementations: 1 repo
