CV · CL · Nov 14, 2025

Discovering Meaningful Units with Visually Grounded Semantics from Image Captions

arXiv:2511.11262v1 · h-index: 3
Originality: Incremental advance
AI Analysis

This work targets the acquisition of fine-grained knowledge in vision-language models, with the aim of better real-world understanding. It appears incremental because it builds on existing alignment methods.

The paper proposes a model that groups caption tokens into meaningful language units. The authors report that this grouping improves fine-grained vision-language understanding and that the discovered token groups closely match groundable phrases in text.

Fine-grained knowledge is crucial for vision-language models to obtain a better understanding of the real world. While there has been work trying to acquire this kind of knowledge in the space of vision and language, it has mostly focused on aligning the image patches with the tokens on the language side. However, image patches do not have any meaning to the human eye, and individual tokens do not necessarily carry groundable information in the image. It is groups of tokens which describe different aspects of the scene. In this work, we propose a model which groups the caption tokens as part of its architecture in order to capture a fine-grained representation of the language. We expect our representations to be at the level of objects present in the image, and therefore align our representations with the output of an image encoder trained to discover objects. We show that by learning to group the tokens, the vision-language model has a better fine-grained understanding of vision and language. In addition, the token groups that our model discovers are highly similar to groundable phrases in text, both qualitatively and quantitatively.
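
To make the idea of grouping caption tokens and aligning the groups with object-level image representations more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not specify the grouping mechanism, so the sketch assumes a soft assignment of tokens to a fixed set of group queries (a softmax over groups for each token) followed by weighted pooling, and it scores alignment as the mean best cosine similarity between each group and a set of object vectors. The function names (`group_tokens`, `alignment_score`), the `temperature` parameter, and the toy 2-D embeddings are all hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    na = math.sqrt(dot(a, a))
    nb = math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na > 0 and nb > 0 else 0.0


def group_tokens(tokens, queries, temperature=1.0):
    """Softly assign each token to a group query, then pool tokens per group.

    ASSUMPTION: this softmax-over-groups assignment is an illustrative
    choice, not the mechanism described in the paper.
    Returns (assign, groups), where assign[i][k] is the weight of token i
    on group k and groups[k] is the weighted mean of the token vectors.
    """
    assign = [softmax([dot(t, q) / temperature for q in queries]) for t in tokens]
    dim = len(tokens[0])
    groups = []
    for k in range(len(queries)):
        weights = [row[k] for row in assign]
        total = sum(weights)  # strictly positive, since softmax outputs are > 0
        groups.append(
            [sum(w * t[d] for w, t in zip(weights, tokens)) / total for d in range(dim)]
        )
    return assign, groups


def alignment_score(groups, objects):
    """Mean over groups of the best cosine similarity to any object vector."""
    return sum(max(cosine(g, o) for o in objects) for g in groups) / len(groups)


# Toy example: four caption tokens in a made-up 2-D embedding space,
# e.g. "brown", "dog", "red", "ball", and two hypothetical object vectors.
tokens = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.9]]
queries = [[1.0, 0.0], [0.0, 1.0]]
objects = [[1.0, 0.0], [0.0, 1.0]]

assign, groups = group_tokens(tokens, queries, temperature=0.1)
hard = [row.index(max(row)) for row in assign]  # most likely group per token
score = alignment_score(groups, objects)
```

In this toy setup the first two tokens fall into one group and the last two into the other, and each pooled group vector points close to one of the object vectors. A trained model would learn both the token embeddings and the grouping jointly; the sketch only illustrates the data flow from tokens to groups to an object-level alignment score.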

