CV | Feb 15, 2019

Deeply Supervised Multimodal Attentional Translation Embeddings for Visual Relationship Detection

arXiv:1902.05829v1 | 19 citations
Originality: Incremental advance
AI Analysis

This paper addresses scene understanding for computer vision applications. It appears incremental: it builds on prior work, combining existing ideas into a new hybrid approach.

The paper tackles visual relationship detection by introducing a deeply supervised two-branch architecture with multimodal attentional translation embeddings. The authors report that it outperforms all compared methods on the VRD dataset.

Detecting visual relationships, i.e. <Subject, Predicate, Object> triplets, is a challenging Scene Understanding task approached in the past via linguistic priors or spatial information in a single feature branch. We introduce a new deeply supervised two-branch architecture, the Multimodal Attentional Translation Embeddings, where the visual features of each branch are driven by a multimodal attentional mechanism that exploits spatio-linguistic similarities in a low-dimensional space. We present a variety of experiments comparing against all related approaches in the literature, as well as by re-implementing and fine-tuning several of them. Results on the commonly employed VRD dataset [1] show that the proposed method clearly outperforms all others, while we also justify our claims both quantitatively and qualitatively.
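To make the term "translation embedding" concrete, here is a minimal sketch of the generic idea only: subject and object features are projected into a low-dimensional space, and each predicate is scored by how well its vector explains the offset between the two projections. It is not the paper's architecture. The attention mechanism, the two branches, and the deep supervision are all omitted, and the names `translation_scores`, `W_s`, `W_o`, and `predicate_vecs` are hypothetical, chosen for illustration.

```python
import numpy as np


def translation_scores(subj_feat, obj_feat, W_s, W_o, predicate_vecs):
    """Score each candidate predicate for a (subject, object) pair.

    Assumption (generic translation-embedding idea, not the paper's exact
    model): a predicate p fits when  W_s @ subject + p  is close to
    W_o @ object  in the low-dimensional space.
    """
    s = W_s @ subj_feat                      # projected subject, shape (d,)
    o = W_o @ obj_feat                       # projected object, shape (d,)
    residual = (o - s)[None, :] - predicate_vecs  # shape (num_predicates, d)
    # Higher score means a smaller translation error.
    return -np.linalg.norm(residual, axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feat_dim, embed_dim, num_predicates = 8, 4, 3

    # Random stand-ins for learned parameters and detector features.
    W_s = rng.normal(size=(embed_dim, feat_dim))
    W_o = rng.normal(size=(embed_dim, feat_dim))
    predicate_vecs = rng.normal(size=(num_predicates, embed_dim))
    subj_feat = rng.normal(size=feat_dim)
    obj_feat = rng.normal(size=feat_dim)

    scores = translation_scores(subj_feat, obj_feat, W_s, W_o, predicate_vecs)
    print("best predicate index:", int(np.argmax(scores)))
```

In a trained model the projection matrices and predicate vectors would be learned from annotated triplets; here they are random, so the printed index carries no meaning beyond showing the call pattern.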

Code Implementations: 1 repo