CL CV
Oct 20, 2023

Semi-supervised multimodal coreference resolution in image narrations

arXiv:2310.13619v1
133 citations
h-index: 7
Originality: Incremental advance
AI Analysis

This paper addresses fine-grained image-text alignment and ambiguity in narrative language, two problems relevant to researchers in multimodal AI. The advance is incremental because it builds on existing semi-supervised methods.

The paper tackles multimodal coreference resolution in image narrations with a semi-supervised approach that outperforms strong baselines on both coreference resolution and narrative grounding.

In this paper, we study multimodal coreference resolution, specifically where a longer descriptive text, i.e., a narration, is paired with an image. This poses significant challenges due to fine-grained image-text alignment, the inherent ambiguity of narrative language, and the unavailability of large annotated training sets. To tackle these challenges, we present a data-efficient semi-supervised approach that utilizes image-narration pairs to resolve coreferences and perform narrative grounding in a multimodal context. Our approach incorporates losses for both labeled and unlabeled data within a cross-modal framework. Our evaluation shows that the proposed approach outperforms strong baselines, both quantitatively and qualitatively, on the tasks of coreference resolution and narrative grounding.
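The abstract does not spell out the loss terms, so the following is only a minimal sketch of how a semi-supervised objective can combine a labeled and an unlabeled term. It assumes a binary cross-entropy loss on labeled mention pairs and swaps in entropy minimization (a generic semi-supervised technique, not confirmed to be the paper's) for the unlabeled term. All function names and the `unlabeled_weight` parameter are hypothetical.

```python
import math


def binary_cross_entropy(probs, labels, eps=1e-12):
    """Mean BCE between predicted coreference probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


def mean_binary_entropy(probs, eps=1e-12):
    """Mean entropy of predictions; low when the model is confident."""
    total = 0.0
    for p in probs:
        p = min(max(p, eps), 1.0 - eps)
        total += -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))
    return total / len(probs)


def semi_supervised_loss(labeled_probs, labels, unlabeled_probs,
                         unlabeled_weight=0.5):
    """Supervised loss on labeled pairs plus a weighted unlabeled term.

    Assumption: entropy minimization stands in for the unlabeled loss;
    the paper's actual terms are not given in the abstract.
    """
    supervised = binary_cross_entropy(labeled_probs, labels)
    unsupervised = mean_binary_entropy(unlabeled_probs)
    return supervised + unlabeled_weight * unsupervised


if __name__ == "__main__":
    # Two labeled mention pairs and three unlabeled ones (made-up numbers).
    loss = semi_supervised_loss([0.9, 0.1], [1, 0], [0.5, 0.8, 0.2])
    print(f"combined loss: {loss:.4f}")
```

Setting `unlabeled_weight` to 0 recovers the purely supervised objective, which makes it easy to see how much the unlabeled term contributes.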

Code Implementations: 1 repo