CV CL Feb 4, 2024

Generalizable Entity Grounding via Assistance of Large Language Model

Lu Qi, Yi-Wen Chen, Lehan Yang, Tiancheng Shen, Xiangtai Li, Weidong Guo, Yu Xu, Ming-Hsuan Yang

arXiv:2402.02555v1 15.813 citations h-index: 13

Originality: Incremental advance

AI Analysis

This work addresses the challenge of accurately associating textual descriptions with visual entities in images for computer vision applications. Its approach is a hybrid of several components, but an effective one.

The paper tackles the problem of densely grounding visual entities from long captions by combining three components: a large multimodal model for semantic noun extraction, a class-agnostic segmentation model for entity-level masks, and a multi-modal feature fusion module that associates each noun with its mask. The method is reported to outperform state-of-the-art techniques on three tasks: panoptic narrative grounding, referring expression segmentation, and panoptic segmentation.
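The association step can be pictured with a small sketch. It is not the paper's fusion module, which is a learned component. As a stand-in, it assumes each noun and each mask already has an embedding vector and matches them by cosine similarity. The function name `match_nouns_to_masks` and both inputs are hypothetical.

```python
import numpy as np

def match_nouns_to_masks(noun_emb, mask_emb):
    """Assign each noun to its most similar mask (illustrative stand-in only).

    noun_emb: float array (num_nouns, D), assumed noun embeddings.
    mask_emb: float array (num_masks, D), assumed per-mask visual embeddings.
    Returns (best_mask, sim): best_mask[i] is the index of the mask chosen for
    noun i, and sim is the full (num_nouns, num_masks) cosine-similarity matrix.
    """
    noun_emb = np.asarray(noun_emb, dtype=float)
    mask_emb = np.asarray(mask_emb, dtype=float)
    # L2-normalize the rows so the dot product equals cosine similarity.
    noun_emb = noun_emb / np.linalg.norm(noun_emb, axis=1, keepdims=True)
    mask_emb = mask_emb / np.linalg.norm(mask_emb, axis=1, keepdims=True)
    sim = noun_emb @ mask_emb.T
    return sim.argmax(axis=1), sim

# Toy data: noun 0 points along x, noun 1 along y.
nouns = [[1.0, 0.0], [0.0, 1.0]]
masks = [[0.0, 2.0], [3.0, 0.1]]
best_mask, sim = match_nouns_to_masks(nouns, masks)
```

This sketch picks exactly one mask per noun. In a task such as panoptic narrative grounding, a noun may correspond to several masks, so a real system would more likely threshold the similarity scores than take a single argmax.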

In this work, we propose a novel approach to densely ground visual entities from a long caption. We leverage a large multimodal model (LMM) to extract semantic nouns, a class-agnostic segmentation model to generate entity-level segmentation, and the proposed multi-modal feature fusion module to associate each semantic noun with its corresponding segmentation mask. Additionally, we introduce a strategy of encoding entity segmentation masks into a colormap, enabling the preservation of fine-grained predictions from features of high-resolution masks. This approach allows us to extract visual features from low-resolution images using the CLIP vision encoder in the LMM, which is more computationally efficient than existing approaches that use an additional encoder for high-resolution images. Our comprehensive experiments demonstrate the superiority of our method, outperforming state-of-the-art techniques on three tasks, including panoptic narrative grounding, referring expression segmentation, and panoptic segmentation.
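The colormap idea can be illustrated with a minimal sketch: several binary entity masks are painted into a single RGB image, one distinct color per entity, and can be recovered from it. The abstract does not specify the paper's encoding, so the palette scheme and the function names `masks_to_colormap` and `colormap_to_masks` are assumptions made for illustration.

```python
import numpy as np

def masks_to_colormap(masks):
    """Paint each binary entity mask with a distinct RGB color.

    masks: bool array (num_entities, H, W), assumed non-overlapping.
    Returns (colormap, palette): colormap is uint8 (H, W, 3) and palette is
    uint8 (num_entities, 3). Background pixels stay black.
    """
    masks = np.asarray(masks, dtype=bool)
    n, h, w = masks.shape
    # Hypothetical palette: the 1-based entity id is written into the three
    # channels as a base-256 number, so every entity gets a unique color.
    ids = np.arange(1, n + 1)
    palette = np.stack(
        [ids % 256, (ids // 256) % 256, (ids // 65536) % 256], axis=1
    ).astype(np.uint8)
    colormap = np.zeros((h, w, 3), dtype=np.uint8)
    for k in range(n):
        colormap[masks[k]] = palette[k]
    return colormap, palette

def colormap_to_masks(colormap, palette):
    """Recover the binary masks by matching pixels against the palette."""
    # Broadcast (1, H, W, 3) against (n, 1, 1, 3), then require all channels equal.
    return (colormap[None, :, :, :] == palette[:, None, None, :]).all(axis=-1)

# Two square entities on a 4x4 canvas.
entity_masks = np.zeros((2, 4, 4), dtype=bool)
entity_masks[0, :2, :2] = True
entity_masks[1, 2:, 2:] = True
cmap, pal = masks_to_colormap(entity_masks)
recovered = colormap_to_masks(cmap, pal)
```

Packing all entities into one image means a single image-shaped input carries every mask, rather than one channel per entity. How the paper feeds this colormap to the model is not described in the abstract.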
