CV | AI | Feb 14, 2025

Image Embedding Sampling Method for Diverse Captioning

arXiv:2502.10118v26.21 citations | h-index: 3 | EMNLP

Originality: Incremental advance

AI Analysis

This work addresses the need for more accessible and detailed image captioning in resource-constrained applications such as mobile devices. The advance is incremental, since it builds on existing models without new training.

The paper tackles the problem of limited caption diversity and informativeness in smaller vision-language models by introducing a training-free framework that uses structured segmentation to attend to distinct image regions, achieving Div-2 scores of 0.735 to 0.750 on benchmark datasets while maintaining strong relevancy and integrity.

Image captioning with state-of-the-art vision-language models (VLMs) has significantly improved over time; however, this comes at the cost of increased computational complexity, making these models less accessible for resource-constrained applications such as mobile devices and assistive technologies. In contrast, comparatively smaller VLMs prioritize high-level scene descriptions, overlooking finer details that contribute to a richer understanding of an image. In this paper, we introduce a training-free framework that enhances caption diversity and informativeness by explicitly attending to distinct image regions using a comparatively small VLM, BLIP, as the backbone. Our approach leverages structured segmentation to produce hierarchical representations that capture both global and localized semantics. Without requiring additional model training, we demonstrate that our method allows smaller VLMs to achieve performance comparable to larger models in terms of image-caption alignment, semantic integrity, and diversity. We evaluate our framework on the MSCOCO, Flickr30k, and NoCaps test datasets, achieving Div-2 scores of 0.735, 0.750, and 0.748, respectively, while maintaining strong image-caption relevancy and semantic integrity with respect to the human-annotated captions.
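The abstract reports Div-2 without defining it. As a rough illustration only, the sketch below assumes one common definition from the diverse-captioning literature: the number of distinct bigrams across all captions generated for an image, divided by the total number of words in those captions. The function names (`ngrams`, `div_n`) and the whitespace tokenization are assumptions made for this example, not the paper's evaluation code, and the exact normalization used by the authors may differ.

```python
from typing import List, Sequence, Tuple


def ngrams(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-grams of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def div_n(captions: Sequence[str], n: int = 2) -> float:
    """Distinct n-grams over the caption set, divided by total word count.

    Assumed definition (hypothetical helper, not the authors' code):
    higher values mean the captions for one image repeat each other less.
    """
    # Simple lowercase whitespace tokenization; real evaluations
    # typically use a proper tokenizer.
    tokenized = [caption.lower().split() for caption in captions]
    total_words = sum(len(tokens) for tokens in tokenized)
    if total_words == 0:
        return 0.0
    distinct = {gram for tokens in tokenized for gram in ngrams(tokens, n)}
    return len(distinct) / total_words


if __name__ == "__main__":
    captions = ["a dog runs", "a dog sleeps"]
    # Distinct bigrams: (a, dog), (dog, runs), (dog, sleeps) -> 3
    # Total words: 6, so the score is 3 / 6.
    print(div_n(captions))  # 0.5
```

Under this assumed definition, a dataset-level score such as the reported 0.735 would be the average of the per-image values over the test set.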
