CV MM Apr 26, 2024

Exploring the Distinctiveness and Fidelity of the Descriptions Generated by Large Vision-Language Models

Yuhang Huang, Zihan Wu, Chongyang Gao, Jiawei Peng, Xu Yang

arXiv:2404.17534v13.72 citations, h-index: 3

Originality: Synthesis-oriented

AI Analysis

The paper addresses the need for better evaluation of the description quality of multimodal models, but it is incremental: it evaluates existing models without introducing a new paradigm.

This study evaluates the distinctiveness and fidelity of fine-grained textual descriptions generated by large vision-language models, and finds that MiniGPT-4 outperforms Open-Flamingo and IDEFICS at generating such descriptions.

Large Vision-Language Models (LVLMs) are gaining traction for their remarkable ability to process and integrate visual and textual data. Despite their popularity, the capacity of LVLMs to generate precise, fine-grained textual descriptions has not been fully explored. This study addresses this gap by focusing on distinctiveness and fidelity, assessing how well models such as Open-Flamingo, IDEFICS, and MiniGPT-4 can distinguish between similar objects and accurately describe visual features. We propose the Textual Retrieval-Augmented Classification (TRAC) framework, which, by leveraging its generative capabilities, allows us to analyze fine-grained visual description generation in greater depth. This research provides insights into the generation quality of LVLMs and improves the understanding of multimodal language models. Notably, MiniGPT-4 stands out for its stronger ability to generate fine-grained descriptions, outperforming the other two models in this respect. The code is provided at https://anonymous.4open.science/r/Explore_FGVDs-E277.
