Linguistically-aware Attention for Reducing the Semantic-Gap in Vision-Language Tasks
This work addresses the semantic gap between the visual and textual modalities in vision-language tasks, offering incremental improvements by making attention mechanisms linguistically aware.
The paper tackles the semantic gap in vision-language tasks by proposing Linguistically-aware Attention (LAT), which combines object attributes with pre-trained language models to enhance linguistic understanding. LAT achieves state-of-the-art results in Counting-VQA and improves the performance of VQA and image captioning baselines.
Attention models are widely used in vision-language (V-L) tasks to correlate visual and textual inputs. Humans perform this correlation with a strong linguistic understanding of the visual world. However, even the best-performing attention models in V-L tasks lack such high-level linguistic understanding, which creates a semantic gap between the modalities. In this paper, we propose an attention mechanism, Linguistically-aware Attention (LAT), that leverages object attributes obtained from generic object detectors along with pre-trained language models to reduce this semantic gap. LAT represents the visual and textual modalities in a common linguistically-rich space, thus providing linguistic awareness to the attention process. We apply LAT to three V-L tasks and demonstrate its effectiveness: Counting-VQA, VQA, and image captioning. In Counting-VQA, we propose a novel counting-specific VQA model that predicts an intuitive count and achieves state-of-the-art results on five datasets. In VQA and image captioning, we show the generic nature and effectiveness of LAT by adapting it into various baselines and consistently improving their performance.
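The idea of attending in a shared linguistic space can be illustrated with a toy sketch. This is not the LAT architecture from the paper, whose details are not given in the abstract. It only assumes that each detected region carries a text label from an object detector and that the question words and the region labels are embedded with the same word vectors. The embedding table `TOY_EMBEDDINGS`, the function `linguistic_attention`, and the `scale` parameter are all hypothetical names and values chosen for illustration.

```python
import math

# Hypothetical 3-d word vectors standing in for a pre-trained language model.
# Real systems would load learned embeddings; these values are made up.
TOY_EMBEDDINGS = {
    "dog": [0.9, 0.1, 0.0],
    "dogs": [0.85, 0.15, 0.0],
    "cat": [0.1, 0.9, 0.0],
    "tree": [0.0, 0.1, 0.9],
}
ZERO = [0.0, 0.0, 0.0]


def embed(word):
    """Look up a word vector; unknown words map to the zero vector."""
    return TOY_EMBEDDINGS.get(word.lower(), ZERO)


def mean_vector(vectors):
    """Average a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linguistic_attention(question, region_labels, scale=5.0):
    """Attend over regions by comparing the question and the detector
    labels in the same word-embedding space (an assumed simplification)."""
    query = mean_vector([embed(w) for w in question.split()])
    scores = [scale * dot(query, embed(label)) for label in region_labels]
    return softmax(scores)


# Four detected regions, described only by their (assumed) detector labels.
weights = linguistic_attention("how many dogs", ["dog", "cat", "tree", "dog"])
print(weights)
```

In this sketch the two regions labelled "dog" receive equal weight, and more weight than the "cat" and "tree" regions, because the question and the labels are compared in one embedding space rather than across unrelated visual and textual feature spaces.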