CV, Apr 18, 2017

Learning to Reason: End-to-End Module Networks for Visual Question Answering

arXiv:1704.05526v3, 605 citations
Originality: Incremental advance
AI Analysis

This paper addresses the brittleness of off-the-shelf parsers in modular reasoning systems, offering an incremental improvement in visual question answering.

The paper tackles the problem of compositional visual question answering by proposing End-to-End Module Networks (N2NMNs) that learn to predict network layouts directly from data, achieving a nearly 50% error reduction compared to state-of-the-art attentional methods on the CLEVR dataset.

Natural language questions are inherently compositional, and many are most easily answered by reasoning about their decomposition into modular sub-problems. For example, to answer "is there an equal number of balls and boxes?" we can look for balls, look for boxes, count them, and compare the results. The recently proposed Neural Module Network (NMN) architecture implements this approach to question answering by parsing questions into linguistic substructures and assembling question-specific deep networks from smaller modules that each solve one subtask. However, existing NMN implementations rely on brittle off-the-shelf parsers, and are restricted to the module configurations proposed by these parsers rather than learning them from data. In this paper, we propose End-to-End Module Networks (N2NMNs), which learn to reason by directly predicting instance-specific network layouts without the aid of a parser. Our model learns to generate network structures (by imitating expert demonstrations) while simultaneously learning network parameters (using the downstream task loss). Experimental results on the new CLEVR dataset targeted at compositional question answering show that N2NMNs achieve an error reduction of nearly 50% relative to state-of-the-art attentional approaches, while discovering interpretable network architectures specialized for each question.
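The decomposition in the abstract's example ("look for balls, look for boxes, count them, and compare the results") can be illustrated with a minimal symbolic sketch. Everything here is an assumption made for illustration: the module names `find`, `count`, and `equal`, the `run_layout` helper, the postfix layout encoding, and the scene represented as a list of shape labels are all hypothetical. In the paper the modules are learned networks operating on images and the layout is predicted by the model; here the modules are plain functions and the layout is written by hand.

```python
from typing import List, Union

# Toy scene: each object is reduced to a shape label (illustrative assumption).
Scene = List[str]
Value = Union[List[str], int, bool]


def find(scene: Scene, shape: str) -> List[str]:
    """Return the objects in the scene that match the requested shape."""
    return [obj for obj in scene if obj == shape]


def count(objects: List[str]) -> int:
    """Count the objects selected by an earlier module."""
    return len(objects)


def equal(left: int, right: int) -> bool:
    """Compare two counts."""
    return left == right


def run_layout(layout: List[str], scene: Scene) -> Value:
    """Assemble and execute a question-specific program from a postfix layout."""
    stack: List[Value] = []
    for token in layout:
        if token.startswith("find:"):
            # "find:<shape>" pushes the matching objects onto the stack.
            stack.append(find(scene, token.split(":", 1)[1]))
        elif token == "count":
            stack.append(count(stack.pop()))
        elif token == "equal":
            right = stack.pop()
            left = stack.pop()
            stack.append(equal(left, right))
        else:
            raise ValueError(f"unknown module: {token}")
    if len(stack) != 1:
        raise ValueError("layout did not reduce to a single answer")
    return stack[0]


# Hand-written layout for "is there an equal number of balls and boxes?"
LAYOUT = ["find:ball", "count", "find:box", "count", "equal"]

if __name__ == "__main__":
    print(run_layout(LAYOUT, ["ball", "box", "ball", "box"]))
    print(run_layout(LAYOUT, ["ball", "ball", "box"]))
```

A different question would use a different layout over the same small set of modules, which is the sense in which the assembled network is question-specific.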

Code Implementations: 1 repo
