CV · LG · Jan 27, 2022

Transformer Module Networks for Systematic Generalization in Visual Question Answering

arXiv:2201.11316v2 · 12 citations
Originality: Incremental advance
AI Analysis

This paper addresses systematic generalization in VQA systems. The advance is incremental: it builds on Neural Module Networks, replacing their modules with Transformer modules.

The paper tackles systematic generalization in Visual Question Answering by introducing Transformer Module Networks (TMNs), which improve on standard Transformers by more than 30% on novel compositions of sub-tasks.

Transformers achieve great performance on Visual Question Answering (VQA). However, their systematic generalization capabilities, i.e., their ability to handle novel combinations of known concepts, are unclear. We reveal that Neural Module Networks (NMNs), i.e., question-specific compositions of modules that each tackle a sub-task, achieve systematic generalization performance better than or similar to that of conventional Transformers, even though NMNs' modules are CNN-based. To address this shortcoming of Transformers with respect to NMNs, in this paper we investigate whether and how modularity can bring benefits to Transformers. Namely, we introduce the Transformer Module Network (TMN), a novel NMN based on compositions of Transformer modules. TMNs achieve state-of-the-art systematic generalization performance on three VQA datasets, improving by more than 30% over standard Transformers for novel compositions of sub-tasks. We show that not only the module composition but also the module specialization for each sub-task is key to this performance gain.
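As a rough illustration of what "question-specific compositions of modules" means, the toy sketch below runs a program (a sequence of sub-tasks derived from a question) by chaining one specialized module per sub-task. It is not the paper's implementation: the modules are plain Python functions standing in for learned Transformer modules, and every name here (`MODULES`, `run_program`, the scene format, the sub-task names) is a hypothetical choice made for this example.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy scene: each object is a dict of attributes.
Scene = List[Dict[str, str]]
# A program is a question-specific sequence of (sub-task name, argument).
Program = List[Tuple[str, str]]


def filter_color(objs, arg):
    # Keep only the objects whose color matches the argument.
    return [o for o in objs if o["color"] == arg]


def filter_shape(objs, arg):
    # Keep only the objects whose shape matches the argument.
    return [o for o in objs if o["shape"] == arg]


def count(objs, arg):
    # Return how many objects remain (argument unused).
    return len(objs)


def exist(objs, arg):
    # Return whether any object remains (argument unused).
    return len(objs) > 0


# One specialized module per sub-task. In a TMN these would be learned
# Transformer modules; plain functions are an assumption of this sketch.
MODULES: Dict[str, Callable] = {
    "filter_color": filter_color,
    "filter_shape": filter_shape,
    "count": count,
    "exist": exist,
}


def run_program(program: Program, scene: Scene):
    # Compose modules in the order given by the program, feeding each
    # module's output into the next one.
    state = scene
    for name, arg in program:
        state = MODULES[name](state, arg)
    return state


scene = [
    {"color": "red", "shape": "cube"},
    {"color": "red", "shape": "sphere"},
    {"color": "blue", "shape": "cube"},
]

# "How many red cubes are there?"
program = [("filter_color", "red"), ("filter_shape", "cube"), ("count", "")]
print(run_program(program, scene))
```

Because each module only handles its own sub-task, a program that combines known sub-tasks in an order never seen before can still be executed by reusing the same modules, which is the intuition behind the systematic generalization claim.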

Code Implementations: 1 repo