CV · Apr 24, 2020

MoVie: Revisiting Modulated Convolutions for Visual Counting and Beyond

arXiv:2004.11883v3 · 23 citations
AI Analysis

The paper addresses efficient and accurate visual counting, offering a simple alternative to computationally expensive symbolic models. Its contribution is incremental in that it builds on existing modulated convolutions.

The paper tackles visual counting by proposing MoVie, a method that uses modulated convolutions to fuse the query and the image locally. MoVie advances the state of the art on counting-specific VQA tasks, outperforms prior art on benchmarks like COCO, and, when integrated as a module for number-related questions in generic VQA models, helped the authors secure first place in the 2020 VQA challenge.

This paper focuses on visual counting, which aims to predict the number of occurrences given a natural image and a query (e.g. a question or a category). Unlike most prior works that use explicit, symbolic models, which can be computationally expensive and limited in generalization, we propose a simple and effective alternative by revisiting modulated convolutions that fuse the query and the image locally. Following the design of the residual bottleneck, we call our method MoVie, short for Modulated conVolutional bottlenecks. Notably, MoVie reasons implicitly and holistically and only needs a single forward pass during inference. Nevertheless, MoVie showcases strong performance for counting: 1) advancing the state of the art on counting-specific VQA tasks while being more efficient; 2) outperforming prior art on difficult benchmarks like COCO for common object counting; 3) helping us secure first place in the 2020 VQA challenge when integrated as a module for 'number'-related questions in generic VQA models. Finally, we show evidence that modulated convolutions such as MoVie can serve as a general mechanism for reasoning tasks beyond counting.
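To make the idea of fusing the query and the image locally inside a residual bottleneck more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation. The abstract does not specify the exact form of the modulation, the layer sizes, or the initialization, so those are assumed here: a channel-wise scale `1 + gamma` computed from the query, and small random weights. All function and parameter names (`modulated_bottleneck`, `query_proj`, `init_params`, and so on) are hypothetical.

```python
import numpy as np


def conv1x1(x, w):
    # x: (H, W, C_in), w: (C_in, C_out). A 1x1 convolution is a per-location linear map.
    return x @ w


def conv3x3(x, w):
    # x: (H, W, C), w: (3, 3, C, C_out). Zero padding keeps the spatial size.
    h, wd, _ = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((h, wd, w.shape[-1]))
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + h, dx:dx + wd] @ w[dy, dx]
    return out


def relu(x):
    return np.maximum(x, 0.0)


def modulated_bottleneck(x, q, params):
    """Residual bottleneck whose reduced features are scaled by the query.

    x: (H, W, C) image feature map; q: (D,) query embedding.
    The scaling form (1 + gamma) is an assumption made for this sketch.
    """
    y = conv1x1(x, params["reduce"])         # (H, W, C_mid)
    gamma = q @ params["query_proj"]         # (C_mid,), one scale per channel
    y = relu(y * (1.0 + gamma))              # same query scale applied at every location
    y = relu(conv3x3(y, params["spatial"]))  # local spatial mixing
    y = conv1x1(y, params["expand"])         # back to C channels
    return relu(x + y)                       # residual connection


def init_params(c, c_mid, d, rng, scale=0.1):
    # Hypothetical initialization: small Gaussian weights, no biases.
    return {
        "reduce": scale * rng.standard_normal((c, c_mid)),
        "query_proj": scale * rng.standard_normal((d, c_mid)),
        "spatial": scale * rng.standard_normal((3, 3, c_mid, c_mid)),
        "expand": scale * rng.standard_normal((c_mid, c)),
    }


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    params = init_params(c=8, c_mid=4, d=6, rng=rng)
    features = rng.standard_normal((5, 7, 8))  # stand-in for CNN features
    query = rng.standard_normal(6)             # stand-in for a question embedding
    print(modulated_bottleneck(features, query, params).shape)  # (5, 7, 8)
```

The block runs as a single forward pass with no explicit object proposals or symbolic steps, matching the implicit style of reasoning the abstract describes. In a counting model, the output feature map would typically be pooled and passed to a classifier or regressor, which this sketch leaves out.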

Code Implementations: 1 repo
