CV CL Sep 7, 2023

Evaluation and Enhancement of Semantic Grounding in Large Vision-Language Models

Jiaying Lu, Jinmeng Rao, Kezhen Chen, Xiaoyuan Guo, Yawen Zhang, Baochen Sun, Carl Yang, Jie Yang

arXiv:2309.04041v2 16.417 citations h-index: 8

Originality: Incremental advance

AI Analysis

This work addresses a problem that is critical for real-world applications of LVLMs, where safety and robustness matter, by providing a comprehensive evaluation and an enhancement method. The advance is incremental because it builds on existing LVLM frameworks.

The paper tackles the limited semantic grounding ability of Large Vision-Language Models (LVLMs), which affects their safety and reliability. It evaluates seven popular models and finds that misgrounding is prevalent. It then proposes a data-centric enhancement method based on multimodal instruction tuning, which yields notable improvements on these misgrounding issues.

Large Vision-Language Models (LVLMs) offer remarkable benefits for a variety of vision-language tasks. However, a challenge hindering their application in real-world scenarios, particularly regarding safety, robustness, and reliability, is their constrained semantic grounding ability, which pertains to connecting language to the physical-world entities or concepts referenced in images. Therefore, a crucial need arises for a comprehensive study to assess the semantic grounding ability of widely used LVLMs. Despite the significance, sufficient investigation in this direction is currently lacking. Our work bridges this gap by designing a pipeline for generating large-scale evaluation datasets covering fine-grained semantic information, such as color, number, material, etc., along with a thorough assessment of seven popular LVLMs' semantic grounding ability. Results highlight prevalent misgrounding across various aspects and degrees. To address this issue, we propose a data-centric enhancement method that aims to improve LVLMs' semantic grounding ability through multimodal instruction tuning on fine-grained conversations. Experiments on enhanced LVLMs demonstrate notable improvements in addressing misgrounding issues.
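The abstract does not spell out how the evaluation data is built or scored, so the following is only a minimal sketch of one way fine-grained grounding questions could be generated and evaluated per aspect. The multiple-choice format, the annotation layout, the attribute vocabulary, and the names `ATTRIBUTE_VOCAB`, `TEMPLATES`, `make_question`, and `accuracy_by_aspect` are all assumptions made for illustration, not the authors' pipeline.

```python
import random
from collections import defaultdict

# Hypothetical attribute vocabulary; a real pipeline would derive this
# from the annotations of the underlying image dataset.
ATTRIBUTE_VOCAB = {
    "color": ["red", "blue", "green", "yellow"],
    "material": ["wood", "metal", "plastic", "glass"],
    "number": ["one", "two", "three", "four"],
}

# Hypothetical question templates, one per semantic aspect.
TEMPLATES = {
    "color": "What color is the {name}?",
    "material": "What material is the {name} made of?",
    "number": "How many instances of {name} are in the image?",
}


def make_question(obj, aspect, rng, n_options=4):
    """Build one multiple-choice question about one attribute of an object."""
    answer = obj[aspect]
    distractors = [v for v in ATTRIBUTE_VOCAB[aspect] if v != answer]
    options = rng.sample(distractors, n_options - 1) + [answer]
    rng.shuffle(options)  # avoid a fixed position for the correct answer
    return {
        "aspect": aspect,
        "question": TEMPLATES[aspect].format(name=obj["name"]),
        "options": options,
        "answer": answer,
    }


def accuracy_by_aspect(questions, predictions):
    """Return the fraction of correct predictions for each semantic aspect."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for q, pred in zip(questions, predictions):
        total[q["aspect"]] += 1
        correct[q["aspect"]] += int(pred == q["answer"])
    return {aspect: correct[aspect] / total[aspect] for aspect in total}
```

In use, each question would be posed to an LVLM together with its image, and the model's chosen option would be collected as the prediction. Reporting accuracy per aspect rather than as a single number makes it visible where misgrounding concentrates, for example in counting versus color.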
