CL · Dec 2, 2017

Improving Visually Grounded Sentence Representations with Self-Attention

arXiv:1712.00609v1 · 2 citations
Originality: Incremental advance
AI Analysis

This work addresses the grounding problem in sentence representations for NLP and vision tasks. It is an incremental advance: it builds on existing joint training methods and adds one specific architectural enhancement.

The paper tackles the grounding problem in sentence representation models by applying a self-attention mechanism to the sentence encoder, which tightens the connection between input words and image features. Performance on transfer tasks improves because self-attentive encoders better exploit words with strong visual associations.

Sentence representation models trained only on language could potentially suffer from the grounding problem. Recent work has shown promising results in improving the quality of sentence representations by jointly training them with associated image features. However, the grounding capability is limited by the distant connection between input sentences and image features that the architecture's design imposes. To further close the gap, we propose applying a self-attention mechanism to the sentence encoder to deepen the grounding effect. Our results on transfer tasks show that self-attentive encoders are better for visual grounding, as they exploit specific words with strong visual associations.
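
The abstract does not give the attention formulation, so the following is only a minimal sketch of one common form of self-attentive pooling, not the authors' implementation. It assumes the encoder has already produced one hidden vector per word and scores each word with a dot product against a single query vector. The names `self_attentive_pool`, `states`, and `query`, along with all numeric values, are hypothetical and chosen for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attentive_pool(hidden_states, query):
    """Pool per-word hidden states into one sentence vector.

    hidden_states: list of T vectors (one per word), each of length d.
    query: length-d vector standing in for learned attention parameters
           (an assumption; the paper's scoring function is not stated here).
    Returns (sentence_vector, attention_weights).
    """
    # Score each word by its dot product with the query vector.
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in hidden_states]
    weights = softmax(scores)
    d = len(hidden_states[0])
    # Weighted sum of hidden states: highly weighted words dominate the result.
    sentence = [sum(w * h[j] for w, h in zip(weights, hidden_states)) for j in range(d)]
    return sentence, weights

# Toy hidden states for the words "a dog runs"; values are made up.
words = ["a", "dog", "runs"]
states = [[0.1, 0.0], [2.0, 1.0], [0.5, 1.5]]
query = [1.0, 0.5]
sentence_vec, attn = self_attentive_pool(states, query)
```

In this toy setup the word "dog" receives the largest attention weight, so its hidden state contributes most to `sentence_vec`. In a jointly trained model, the pooled vector would then be the representation matched against image features, which is how a few visually salient words could come to dominate the grounding signal.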

