CV · Aug 18, 2020

Linguistically-aware Attention for Reducing the Semantic-Gap in Vision-Language Tasks

arXiv:2008.08012v1
AI Analysis

This work addresses the semantic gap in vision-language tasks for AI systems, offering incremental improvements by enhancing attention mechanisms with linguistic awareness.

The paper tackles the semantic gap in vision-language tasks by proposing Linguistically-aware Attention (LAT), which combines object attributes from generic object detectors with pre-trained language models to give the attention process linguistic awareness. It reports state-of-the-art results on five Counting-VQA datasets and consistent improvements over several VQA and image captioning baselines.

Attention models are widely used in Vision-language (V-L) tasks to perform the visual-textual correlation. Humans perform such a correlation with a strong linguistic understanding of the visual world. However, even the best performing attention model in V-L tasks lacks such a high-level linguistic understanding, thus creating a semantic gap between the modalities. In this paper, we propose an attention mechanism - Linguistically-aware Attention (LAT) - that leverages object attributes obtained from generic object detectors along with pre-trained language models to reduce this semantic gap. LAT represents visual and textual modalities in a common linguistically-rich space, thus providing linguistic awareness to the attention process. We apply and demonstrate the effectiveness of LAT in three V-L tasks: Counting-VQA, VQA, and Image captioning. In Counting-VQA, we propose a novel counting-specific VQA model to predict an intuitive count and achieve state-of-the-art results on five datasets. In VQA and Captioning, we show the generic nature and effectiveness of LAT by adapting it into various baselines and consistently improving their performance.
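The idea of attending in a shared linguistic space can be illustrated with a minimal sketch. This is not the paper's architecture: the embedding table, the detections, and the names `TOY_EMBEDDINGS`, `embed_text`, and `linguistic_attention` are all hypothetical, and the vectors are made-up values standing in for a pre-trained language model. The sketch only shows the general pattern in which detected object labels and question words are embedded in the same word space, and the similarity between them weights the visual features.

```python
import math

# Toy 3-d "word embeddings" standing in for a pre-trained language model.
# The values are invented for illustration and carry no real semantics.
TOY_EMBEDDINGS = {
    "dog": [0.9, 0.1, 0.0],
    "dogs": [0.9, 0.1, 0.0],
    "cat": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
    "how": [0.0, 0.0, 0.0],
    "many": [0.0, 0.0, 0.0],
}
EMBED_DIM = 3


def embed_text(words):
    """Average the embeddings of known words (zero vector if none are known)."""
    vecs = [TOY_EMBEDDINGS[w] for w in words if w in TOY_EMBEDDINGS]
    if not vecs:
        return [0.0] * EMBED_DIM
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(EMBED_DIM)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linguistic_attention(question_words, detections):
    """Weight visual features by label/question similarity in word space.

    detections: list of (label, visual_feature) pairs, as a hypothetical
    object detector might return them. Every label must be in TOY_EMBEDDINGS.
    """
    q = embed_text(question_words)
    # Dot product between the question vector and each object's label vector.
    scores = [
        sum(a * b for a, b in zip(q, TOY_EMBEDDINGS[label]))
        for label, _ in detections
    ]
    weights = softmax(scores)
    feat_dim = len(detections[0][1])
    # Attention-weighted sum of the visual feature vectors.
    attended = [
        sum(w * feat[i] for w, (_, feat) in zip(weights, detections))
        for i in range(feat_dim)
    ]
    return weights, attended


detections = [("dog", [1.0, 0.0]), ("cat", [0.5, 0.5]), ("car", [0.0, 1.0])]
weights, attended = linguistic_attention(["how", "many", "dogs"], detections)
```

For the question "how many dogs", the region labelled `dog` receives the largest weight and the region labelled `car` the smallest, because the scores are computed in word-embedding space and not from raw visual features alone.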

