CV | Aug 25, 2024

Evaluating Attribute Comprehension in Large Vision-Language Models

Haiwen Zhang, Zixi Yang, Yuanzhi Liu, Xinran Wang, Zheqi He, Kongming Liang, Zhanyu Ma

arXiv:2408.13898 | h-index: 15 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the need for better evaluation of fine-grained visual understanding in vision-language models. The contribution is incremental, since it builds on existing evaluation efforts.

The paper evaluates attribute comprehension in large vision-language models. It finds that these models recognize attributes well but have limited understanding of attribute hierarchy, that image-text matching (ITM) outperforms image-text cosine similarity (ITC) on attribute tasks, and that the attribute information in fine-tuning captions is crucial.

Large vision-language models have made promising progress on many downstream tasks. However, they still face many challenges in fine-grained visual understanding tasks, such as object attribute comprehension. Moreover, although efforts to evaluate large vision-language models are growing, in-depth study of attribute comprehension and of the vision-language fine-tuning process is still lacking. In this paper, we propose to evaluate the attribute comprehension ability of large vision-language models from two perspectives: attribute recognition and attribute hierarchy understanding. We evaluate three vision-language interactions: visual question answering (VQA), image-text matching (ITM), and image-text cosine similarity (ITC). Furthermore, we explore the factors affecting attribute comprehension during fine-tuning. Through a series of quantitative and qualitative experiments, we report three main findings: (1) Large vision-language models possess good attribute recognition ability, but their hierarchical understanding ability is relatively limited. (2) Compared to ITC, ITM exhibits superior capability in capturing finer details, making it more suitable for attribute understanding tasks. (3) The attribute information in the captions used for fine-tuning plays a crucial role in attribute understanding. We hope this work can help guide future progress in fine-grained visual understanding of large vision-language models.
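
As a rough illustration of the image-text cosine similarity (ITC) interaction named in the abstract, the sketch below ranks candidate attribute prompts by the cosine similarity between one image embedding and each prompt embedding. It is not the paper's evaluation code: the function names `cosine_similarity` and `rank_attributes` are hypothetical, and the three-dimensional vectors are hand-made stand-ins for embeddings that a real vision-language model would produce.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_attributes(image_embedding, prompt_embeddings):
    """Sort attribute prompts by similarity to the image, best match first."""
    scores = {
        attribute: cosine_similarity(image_embedding, embedding)
        for attribute, embedding in prompt_embeddings.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy embeddings (assumed values, not real model outputs).
image_embedding = [0.9, 0.1, 0.2]
prompt_embeddings = {
    "red": [1.0, 0.0, 0.1],
    "blue": [0.0, 1.0, 0.1],
    "green": [0.1, 0.2, 1.0],
}

ranking = rank_attributes(image_embedding, prompt_embeddings)
print(ranking[0][0])  # prints "red"
```

In this style of scoring, the image and each prompt are embedded separately and only compared at the end. That is one plausible reason a matching head that looks at the image and text jointly could pick up finer attribute details, as the second finding reports.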
