CL AI Jun 27, 2024

DIM: Dynamic Integration of Multimodal Entity Linking with Large Language Model

arXiv:2407.12019v1
Has Code
Originality: Incremental advance
AI Analysis

This work addresses ambiguous entity representations and limited image utilization in multimodal entity linking, offering incremental improvements for researchers in natural language processing and computer vision.

The study tackled Multimodal Entity Linking by proposing DIM, a method that uses ChatGPT for dynamic entity extraction and LLMs like BLIP-2 for visual understanding, achieving state-of-the-art results on enhanced datasets such as Wiki+, Rich+, and Diverse+.

Our study delves into Multimodal Entity Linking, which aligns mentions in multimodal information with entities in a knowledge base. Existing methods still face challenges such as ambiguous entity representations and limited use of image information. We therefore propose dynamic entity extraction using ChatGPT, which extracts entities dynamically and enhances the datasets. We also propose a method, Dynamically Integrate Multimodal information with knowledge base (DIM), which employs the visual understanding capability of a Large Language Model (LLM). The LLM, such as BLIP-2, extracts information relevant to entities in the image, which facilitates extracting entity features and linking them with the dynamic entity representations provided by ChatGPT. Experiments demonstrate that DIM outperforms the majority of existing methods on the three original datasets and achieves state-of-the-art (SOTA) results on the dynamically enhanced datasets (Wiki+, Rich+, Diverse+). For reproducibility, our code and collected datasets are released at https://github.com/season1blue/DIM.
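
The linking step described in the abstract, matching a mention's combined text and image features against entity representations in a knowledge base, can be illustrated with a toy similarity ranking. This is a minimal sketch under stated assumptions, not the DIM implementation: the vectors are hand-written stand-ins for learned embeddings, the averaging fusion is a placeholder for whatever integration the method actually uses, and the names `cosine`, `fuse`, `link_mention`, and the two example entities are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def fuse(text_vec, image_vec):
    """Assumption: fuse modalities by element-wise averaging (illustrative only)."""
    return [(t + i) / 2.0 for t, i in zip(text_vec, image_vec)]


def link_mention(text_vec, image_vec, entities):
    """Rank knowledge-base entities by similarity to the fused mention vector."""
    mention = fuse(text_vec, image_vec)
    scored = [(name, cosine(mention, vec)) for name, vec in entities.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy knowledge base: entity name -> hypothetical entity representation.
entities = {
    "Apple Inc.": [0.9, 0.1, 0.0],
    "Apple (fruit)": [0.1, 0.9, 0.0],
}

# The text alone is ambiguous; the image features tip the balance.
text_vec = [0.5, 0.5, 0.0]
image_vec = [1.0, 0.0, 0.0]

ranking = link_mention(text_vec, image_vec, entities)
print(ranking[0][0])  # the top-ranked entity for this mention
```

In this toy setup the text vector is equidistant from both candidates, so the image vector alone decides the ranking. That is the role the abstract assigns to image information in resolving ambiguous mentions.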

