IR, AI, CL, CV · Dec 21, 2021

Multimodal Entity Tagging with Multimodal Knowledge Base

arXiv:2201.00693v2 · 1 citation
AI Analysis

This work introduces a new task for researchers in multimodal information processing, but it is incremental as it builds on existing methods and datasets.

The paper proposes multimodal entity tagging, a task in which a multimodal knowledge base is used to identify the entity related to a given text-image pair. The results show that the task is challenging, though current technologies achieve relatively high performance on it.

To enhance research on multimodal knowledge bases and multimodal information processing, we propose a new task called multimodal entity tagging (MET) with a multimodal knowledge base (MKB). We also develop a dataset for the problem using an existing MKB. An MKB contains entities together with their associated texts and images. In MET, given a text-image pair, one uses the information in the MKB to automatically identify the related entity in the text-image pair. We solve the task using the information retrieval paradigm and implement several baselines using state-of-the-art methods in NLP and CV. We conduct extensive experiments and analyze the experimental results. The results show that the task is challenging, but current technologies can achieve relatively high performance. We will release the dataset, code, and models for future research.
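
The retrieval framing in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the encoders or the scoring function, so the sketch assumes that texts and images have already been embedded as plain vectors, and it scores each MKB entity by a weighted sum of cosine similarities. The names `tag_entity`, `text_weight`, and the toy `mkb` dictionary are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def tag_entity(query_text_vec, query_image_vec, mkb, text_weight=0.5):
    """Rank MKB entities for a query text-image pair (illustrative assumption).

    mkb maps an entity name to a list of (text_vec, image_vec) pairs.
    An entity's score is the best weighted similarity over its pairs.
    """
    scores = {}
    for entity, pairs in mkb.items():
        scores[entity] = max(
            text_weight * cosine(query_text_vec, text_vec)
            + (1.0 - text_weight) * cosine(query_image_vec, image_vec)
            for text_vec, image_vec in pairs
        )
    # Highest-scoring entity first.
    return sorted(scores, key=scores.get, reverse=True)


# Toy MKB with made-up 2-dimensional embeddings.
mkb = {
    "Eiffel Tower": [([1.0, 0.0], [0.9, 0.1])],
    "Golden Gate Bridge": [([0.0, 1.0], [0.1, 0.9])],
}

ranking = tag_entity([0.8, 0.2], [1.0, 0.0], mkb)
print(ranking[0])
```

In a real system the vectors would come from pretrained text and image encoders, and `text_weight` would be tuned on held-out data; both choices are left open here.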
