CV · MM · Aug 2, 2021

Distributed Attention for Grounded Image Captioning

arXiv:2108.01056v2 · 23 citations
Originality: Incremental advance
AI Analysis

This paper addresses the partial grounding issue in weakly supervised image captioning and improves alignment accuracy, which matters for applications such as accessibility tools. It is an incremental advance over prior methods.

The paper tackles weakly supervised grounded image captioning, in which the nouns in a generated caption must be aligned to image regions without explicit region-word supervision. It proposes a distributed attention mechanism that aggregates information from multiple regions so that the attended regions together cover the entire object, and it reports state-of-the-art results.

We study the problem of weakly supervised grounded image captioning. That is, given an image, the goal is to automatically generate a sentence describing the content of the image, with each noun grounded to the corresponding region in the image. This task is challenging because no explicit fine-grained region-word alignments are available as supervision. Previous weakly supervised methods mainly explore various regularization schemes to improve attention accuracy. However, their performance is still far from that of fully supervised methods. One main issue that has been ignored is that the attention used to generate visually groundable words may focus only on the most discriminative parts of an object and fail to cover the whole object. We call this the partial grounding problem, and we propose a simple yet effective method to alleviate it. Specifically, we design a distributed attention mechanism that forces the network to aggregate information from multiple spatially distinct regions with consistent semantics while generating each word. The union of the attended region proposals should therefore form a visual region that completely encloses the object of interest. Extensive experiments demonstrate the superiority of our method over state-of-the-art approaches.
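The idea of grounding a word to the union of several attended region proposals can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the attention branches are built or combined, so the number of branches, the top-1 selection per branch, the averaging step, and every function name below (`softmax`, `distributed_attention`, `union_box`, `ground_word`) are assumptions made for illustration only.

```python
import math

def softmax(scores):
    """Turn raw attention scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def distributed_attention(branch_scores):
    """Hypothetical aggregation: one softmax per branch, then a plain average.

    branch_scores holds one list of per-region scores for each branch.
    Returns the averaged distribution and the per-branch distributions.
    """
    dists = [softmax(scores) for scores in branch_scores]
    n_regions = len(dists[0])
    averaged = [sum(d[i] for d in dists) / len(dists) for i in range(n_regions)]
    return averaged, dists

def union_box(boxes):
    """Smallest axis-aligned box (x1, y1, x2, y2) enclosing all given boxes."""
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )

def ground_word(branch_scores, boxes):
    """Ground a word to the union of each branch's top-scoring region (assumed rule)."""
    _, dists = distributed_attention(branch_scores)
    top_regions = {max(range(len(d)), key=d.__getitem__) for d in dists}
    return union_box([boxes[i] for i in sorted(top_regions)])

# Toy input: three region proposals and two branches that prefer different
# parts of the same object (regions 0 and 1); region 2 is unrelated.
boxes = [(10, 10, 50, 50), (40, 20, 90, 80), (200, 200, 240, 240)]
branch_scores = [[3.0, 1.0, 0.0], [0.5, 2.5, 0.0]]
print(ground_word(branch_scores, boxes))
```

In this toy case each branch alone would ground the word to only part of the object, while the union of the two selected proposals encloses both parts. That is the behaviour the abstract describes as the remedy for partial grounding.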

