CV · Nov 30, 2016

Modeling Relationships in Referential Expressions with Compositional Modular Networks

arXiv:1611.09978v1 · 425 citations
Originality: Highly original
AI Analysis

This work addresses the challenge of interpreting complex natural language references to objects in images, with applications such as human-computer interaction, though it is an incremental improvement over prior modular approaches.

The paper tackles the problem of grounding referential expressions in images by decomposing them into entities and relationships, and presents Compositional Modular Networks (CMNs) that outperform state-of-the-art methods on multiple datasets.

People often refer to entities in an image in terms of their relationships with other entities. For example, "the black cat sitting under the table" refers to both a "black cat" entity and its relationship with another "table" entity. Understanding these relationships is essential for interpreting and grounding such natural language expressions. Most prior work focuses on either grounding entire referential expressions holistically to one region, or localizing relationships based on a fixed set of categories. In this paper we instead present a modular deep architecture capable of analyzing referential expressions into their component parts, identifying entities and relationships mentioned in the input expression and grounding them all in the scene. We call this approach Compositional Modular Networks (CMNs): a novel architecture that learns linguistic analysis and visual inference end-to-end. Our approach is built around two types of neural modules that inspect local regions and pairwise interactions between regions. We evaluate CMNs on multiple referential expression datasets, outperforming state-of-the-art approaches on all tasks.
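The two module types described in the abstract, one inspecting local regions and one inspecting pairwise interactions between regions, can be illustrated with a small toy sketch. This is not the authors' implementation: the learned networks are replaced by plain dot products, and the names `local_score`, `pair_score`, and `ground`, the two-dimensional "cat-ness / table-ness" features, the hand-set phrase vectors, and the additive combination of scores are all assumptions made for illustration only.

```python
from itertools import permutations


def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def center(box):
    """Center of a box given as (x1, y1, x2, y2) in normalized image coordinates."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def local_score(region, phrase_vec):
    """Toy stand-in for a local-region module: match one region to an entity phrase."""
    return dot(region["feat"], phrase_vec)


def pair_score(subj, obj, rel_vec):
    """Toy stand-in for a pairwise module: match the subject-minus-object
    center offset to a relationship phrase."""
    (sx, sy), (ox, oy) = center(subj["box"]), center(obj["box"])
    return dot((sx - ox, sy - oy), rel_vec)


def ground(regions, subj_vec, rel_vec, obj_vec):
    """Return the (subject_index, object_index) pair with the highest summed score.
    Summing the three scores is an assumption of this sketch."""
    best_pair, best_score = None, float("-inf")
    for i, j in permutations(range(len(regions)), 2):
        score = (
            local_score(regions[i], subj_vec)
            + pair_score(regions[i], regions[j], rel_vec)
            + local_score(regions[j], obj_vec)
        )
        if score > best_score:
            best_pair, best_score = (i, j), score
    return best_pair


# Hypothetical scene: features are (cat-ness, table-ness); y grows downward.
REGIONS = [
    {"box": (0.1, 0.6, 0.3, 0.8), "feat": (1.0, 0.0)},  # cat below the table
    {"box": (0.0, 0.2, 0.5, 0.5), "feat": (0.0, 1.0)},  # table
    {"box": (0.7, 0.1, 0.9, 0.3), "feat": (1.0, 0.0)},  # second cat, higher up
]
SUBJ = (1.0, 0.0)  # hand-set vector standing in for "cat"
OBJ = (0.0, 1.0)   # hand-set vector standing in for "table"
REL = (0.0, 1.0)   # hand-set vector for "under": subject center lies below object center

print(ground(REGIONS, SUBJ, REL, OBJ))  # (0, 1): the first cat, paired with the table
```

In this toy scene both cats match the subject phrase equally well, so only the pairwise term separates them, which is the situation the abstract's "black cat sitting under the table" example describes.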

Code Implementations: 2 repos
