CV AI CL IR MM
Dec 18, 2024

Reverse Region-to-Entity Annotation for Pixel-Level Visual Entity Linking

Zhengfei Xu, Sijia Zhao, Yanchao Hao, Xiaolong Liu, Lili Li, Yuyang Yin, Bo Li, Xi Chen, Xin Xin

arXiv:2412.13614v12.01 citations h-index: 25 Has Code

Originality: Incremental advance

AI Analysis

This work aims to make visual entity linking more convenient for computer vision researchers. It appears incremental, since it extends existing VEL tasks with new inputs and methods.

The paper targets fine-grained visual understanding by introducing Pixel-Level Visual Entity Linking (PL-VEL), which uses pixel masks instead of text to link image objects to knowledge base entities. It reports that models trained on the authors' automatically constructed dataset improved accuracy by 18 points over zero-shot models.

Visual Entity Linking (VEL) is a crucial task for achieving fine-grained visual understanding, matching objects within images (visual mentions) to entities in a knowledge base. Previous VEL tasks rely on textual inputs, but writing queries for complex scenes can be challenging. Visual inputs like clicks or bounding boxes offer a more convenient alternative. Therefore, we propose a new task, Pixel-Level Visual Entity Linking (PL-VEL), which uses pixel masks derived from visual inputs to refer to objects, complementing existing ways of referring to objects in VEL. To facilitate research on this task, we have constructed the MaskOVEN-Wiki dataset through an entirely automatic reverse region-entity annotation framework. This dataset contains over 5 million annotations aligning pixel-level regions with entity-level labels, which will advance visual understanding toward finer granularity. Moreover, as pixel masks correspond to semantic regions in an image, we enhance previous patch-interacted attention with region-interacted attention through a visual semantic tokenization approach. Manual evaluation results indicate that the reverse annotation framework achieved a 94.8% annotation success rate. Experimental results show that models trained on this dataset improved accuracy by 18 points compared to zero-shot models. Additionally, the semantic tokenization method achieved a 5-point accuracy improvement over the trained baseline.
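The abstract does not say how the visual semantic tokenization is computed. As a rough illustration of the general idea of moving from patch-level to region-level tokens, the sketch below averages patch features that fall under the same mask region. The function name `region_tokens`, the patch-to-region assignment, and the use of mean pooling are all assumptions made for this example, not the authors' method.

```python
from typing import List

Vector = List[float]


def region_tokens(
    patch_features: List[Vector],
    patch_region_ids: List[int],
    num_regions: int,
) -> List[Vector]:
    """Mean-pool patch features per region (hypothetical helper).

    patch_features[i] is the feature vector of patch i, and
    patch_region_ids[i] is the index of the mask region that patch i
    is assumed to belong to. Empty regions get a zero vector.
    """
    dim = len(patch_features[0])
    sums = [[0.0] * dim for _ in range(num_regions)]
    counts = [0] * num_regions

    # Accumulate feature sums and patch counts for each region.
    for feat, rid in zip(patch_features, patch_region_ids):
        counts[rid] += 1
        for d in range(dim):
            sums[rid][d] += feat[d]

    # Divide by the patch count to get one averaged token per region.
    return [
        [s / counts[r] for s in sums[r]] if counts[r] else [0.0] * dim
        for r in range(num_regions)
    ]


# Four patches with 2-d features: patches 0-1 lie in region 0, patches 2-3 in region 1.
patches = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [0.0, 8.0]]
region_ids = [0, 0, 1, 1]
tokens = region_tokens(patches, region_ids, num_regions=2)
print(tokens)  # [[2.0, 1.0], [0.0, 6.0]]
```

Under this assumption, attention would then operate over one token per semantic region rather than one token per fixed-size patch, which is the contrast the abstract draws between patch-interacted and region-interacted attention.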
