CV · Jun 4, 2025

Language-Image Alignment with Fixed Text Encoders

arXiv:2506.04209v1 · 3 citations · h-index: 10
Originality: Incremental advance
AI Analysis

This work addresses the computational cost of jointly training text and image encoders for language-image alignment, offering researchers and practitioners an incremental improvement through a simplified framework.

The paper tackles the cost of joint training for language-image alignment by proposing LIFT, which uses a fixed, pre-trained large language model as the text encoder and trains only the image encoder. The results show that LIFT outperforms CLIP in most scenarios involving compositional understanding and long captions, while also achieving considerable gains in computational efficiency.

Currently, the most dominant approach to establishing language-image alignment is to pre-train text and image encoders jointly through contrastive learning, such as CLIP and its variants. In this work, we question whether such a costly joint training is necessary. In particular, we investigate if a pre-trained fixed large language model (LLM) offers a good enough text encoder to guide visual representation learning. That is, we propose to learn Language-Image alignment with a Fixed Text encoder (LIFT) from an LLM by training only the image encoder. Somewhat surprisingly, through comprehensive benchmarking and ablation studies, we find that this much simplified framework LIFT is highly effective and it outperforms CLIP in most scenarios that involve compositional understanding and long captions, while achieving considerable gains in computational efficiency. Our work takes a first step towards systematically exploring how text embeddings from LLMs can guide visual learning and suggests an alternative design choice for learning language-aligned visual representations.
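The idea of training only the image encoder against fixed text embeddings can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the image encoder is a single linear map, the text embeddings are hand-written vectors standing in for the output of a fixed LLM, and the objective is a one-directional CLIP-style contrastive (softmax cross-entropy) loss. The abstract does not state which loss LIFT uses. The function names are hypothetical.

```python
import math

def matmul(a, b):
    # Plain-Python matrix product of two lists of rows.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def contrastive_loss_and_grad(images, weights, text_emb, temperature=0.5):
    # Image-to-text contrastive loss: image i should match caption i.
    # Only `weights` (the toy image encoder) receives a gradient;
    # `text_emb` is treated as a constant, mirroring a fixed text encoder.
    n = len(images)
    z = matmul(images, weights)
    logits = [[v / temperature for v in row] for row in matmul(z, transpose(text_emb))]
    probs = [softmax(row) for row in logits]
    loss = -sum(math.log(probs[i][i]) for i in range(n)) / n
    dlogits = [[(probs[i][j] - (1.0 if i == j else 0.0)) / n for j in range(n)]
               for i in range(n)]
    dz = [[v / temperature for v in row] for row in matmul(dlogits, text_emb)]
    grad_w = matmul(transpose(images), dz)
    return loss, grad_w

def train_image_encoder(images, text_emb, steps=300, lr=0.1):
    # Gradient descent on the image encoder only; text embeddings never change.
    d_in, d_out = len(images[0]), len(text_emb[0])
    weights = [[0.0] * d_out for _ in range(d_in)]
    losses = []
    for _ in range(steps):
        loss, grad = contrastive_loss_and_grad(images, weights, text_emb)
        losses.append(loss)
        weights = [[w - lr * g for w, g in zip(w_row, g_row)]
                   for w_row, g_row in zip(weights, grad)]
    return weights, losses

# Toy data: four one-hot "images" and four fixed caption embeddings
# (hypothetical values, computed once and never updated).
images = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
text_emb = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.6, 0.8, 0.0]]
frozen_copy = [row[:] for row in text_emb]

weights, losses = train_image_encoder(images, text_emb)
```

Because the text side is constant, its embeddings could in principle be computed once and cached, so each training step only runs and differentiates the image encoder. This is one plausible reading of where the efficiency gains described above come from.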
