CV · AI · CL · IR · MM · Dec 18, 2024

Reverse Region-to-Entity Annotation for Pixel-Level Visual Entity Linking

arXiv:2412.13614v1 · 1 citation · h-index: 2
Originality: Incremental advance
AI Analysis

This work aims to make visual entity linking more convenient for computer vision researchers. It appears incremental, since it builds on existing VEL tasks and adds a new input type and new methods.

The paper tackles fine-grained visual understanding by introducing Pixel-Level Visual Entity Linking (PL-VEL), which uses pixel masks instead of text to link objects in an image to knowledge base entities. It reports that models trained on the authors' automatically constructed dataset improved accuracy by 18 points over zero-shot models.

Visual Entity Linking (VEL) is a crucial task for achieving fine-grained visual understanding: it matches objects within images (visual mentions) to entities in a knowledge base. Previous VEL tasks rely on textual inputs, but writing queries for complex scenes can be challenging. Visual inputs such as clicks or bounding boxes offer a more convenient alternative. Therefore, we propose a new task, Pixel-Level Visual Entity Linking (PL-VEL), which uses pixel masks derived from visual inputs to refer to objects, complementing the existing ways of referring to objects in VEL. To facilitate research on this task, we have constructed the MaskOVEN-Wiki dataset through an entirely automatic reverse region-entity annotation framework. This dataset contains over 5 million annotations aligning pixel-level regions with entity-level labels, which will help advance visual understanding toward finer granularity. Moreover, because pixel masks correspond to semantic regions in an image, we use a visual semantic tokenization approach to extend the previous patch-interacted attention to region-interacted attention. Manual evaluation indicates that the reverse annotation framework achieved a 94.8% annotation success rate. Experimental results show that models trained on this dataset improved accuracy by 18 points compared to zero-shot models. Additionally, the semantic tokenization method achieved a 5-point accuracy improvement over the trained baseline.
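To make the idea of turning a pixel mask into a region-level token more concrete, here is a minimal sketch of mask-weighted pooling over patch features. It is not the paper's visual semantic tokenization method, whose details the abstract does not give. It only illustrates one common way a region token could be formed, assuming a square patch grid, a binary mask, and hypothetical helper names (`patch_coverage`, `region_token`).

```python
from typing import List, Sequence

Vector = List[float]


def patch_coverage(mask: Sequence[Sequence[int]], patch: int) -> List[List[float]]:
    """Return, for each patch, the fraction of its pixels inside the binary mask.

    Assumption: mask height and width are multiples of `patch`.
    """
    rows, cols = len(mask) // patch, len(mask[0]) // patch
    coverage = []
    for r in range(rows):
        row = []
        for c in range(cols):
            inside = sum(
                mask[y][x]
                for y in range(r * patch, (r + 1) * patch)
                for x in range(c * patch, (c + 1) * patch)
            )
            row.append(inside / (patch * patch))
        coverage.append(row)
    return coverage


def region_token(
    patch_feats: Sequence[Sequence[Vector]],
    mask: Sequence[Sequence[int]],
    patch: int,
) -> Vector:
    """Average patch features, weighted by how much of each patch the mask covers.

    Illustrative assumption only: a single pooled vector stands in for the region.
    """
    cov = patch_coverage(mask, patch)
    total = sum(w for row in cov for w in row)
    if total == 0:
        raise ValueError("mask does not cover any patch")
    dim = len(patch_feats[0][0])
    token = [0.0] * dim
    for r, row in enumerate(cov):
        for c, w in enumerate(row):
            for d in range(dim):
                token[d] += w * patch_feats[r][c][d]
    return [v / total for v in token]


# Toy example: a 4x4 image split into 2x2 patches, with 2-dimensional features.
example_mask = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
example_feats = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 1.0]],
]
example_token = region_token(example_feats, example_mask, patch=2)
```

In this toy case the fully covered top-left patch dominates the pooled vector, while the partly covered bottom-left patch contributes in proportion to its coverage. A real system would take patch features from a vision encoder and could keep several tokens per region instead of one.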

Code Implementations: 1 repo
