CL, CV · Sep 30, 2019

Multi-Head Attention with Diversity for Learning Grounded Multilingual Multimodal Representations

arXiv:1910.00058v11007 citations
Originality: Incremental advance
AI Analysis

This work aims to improve multilingual image search by learning fine-grained alignments between sentences and images. The contribution appears incremental, since it builds on existing attention mechanisms.

The paper tackles multilingual image search by proposing a model with diverse multi-head attention that learns grounded multilingual multimodal representations. It reports significant performance gains over other methods on the German-Image and English-Image matching tasks on the Multi30K dataset and on the Semantic Textual Similarity task.

With the aim of promoting and understanding the multilingual version of image search, we leverage visual object detection and propose a model with diverse multi-head attention to learn grounded multilingual multimodal representations. Specifically, our model attends to different types of textual semantics in two languages and visual objects for fine-grained alignments between sentences and images. We introduce a new objective function which explicitly encourages attention diversity to learn an improved visual-semantic embedding space. We evaluate our model in the German-Image and English-Image matching tasks on the Multi30K dataset, and in the Semantic Textual Similarity task with the English descriptions of visual content. Results show that our model yields a significant performance gain over other methods in all of the three tasks.
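The abstract does not spell out the diversity objective, so the following is only an illustrative sketch of what such a term could look like, not the authors' formulation. It assumes each attention head yields one embedding vector per sentence or image, and it penalizes heads whose embeddings point in similar directions. The names `cosine`, `diversity_penalty`, `combined_loss`, and the `margin` and `weight` parameters are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def diversity_penalty(heads):
    """Mean squared cosine similarity over all distinct pairs of head embeddings.

    Assumption: lower is better. Orthogonal heads give 0.0 and
    identical (or collinear) heads give 1.0.
    """
    k = len(heads)
    if k < 2:
        return 0.0
    total = 0.0
    pairs = 0
    for i in range(k):
        for j in range(i + 1, k):
            total += cosine(heads[i], heads[j]) ** 2
            pairs += 1
    return total / pairs


def combined_loss(sentence, image_pos, image_neg, heads, margin=0.2, weight=0.1):
    """Hinge-style matching loss plus a weighted diversity term.

    Assumption: the matching term is a generic margin ranking loss that
    wants the sentence closer to its paired image than to a mismatched one.
    """
    matching = max(0.0, margin - cosine(sentence, image_pos) + cosine(sentence, image_neg))
    return matching + weight * diversity_penalty(heads)
```

In this sketch, two orthogonal heads such as `[1, 0]` and `[0, 1]` incur no penalty, while two heads pointing in the same direction incur the maximum penalty of 1.0, which is the behaviour a diversity-encouraging term is meant to have.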

