CV · AI · CL · LG · Dec 31, 2018

The meaning of "most" for visual question answering models

arXiv:1812.11737v2 · 1090 citations
Originality: Synthesis-oriented
AI Analysis

This paper addresses a specific problem in visual question answering: how models interpret the quantifier "most". It is aimed at AI researchers and offers incremental insights into model behavior.

The study investigates how deep learning models interpret the quantifier "most" in visual question answering. It finds that the FiLM model develops a form of approximate number system whose performance degrades on more difficult scenes, as predicted by Weber's law. It also identifies confounding factors, such as the spatial arrangement of the scene, that impede this system.

The correct interpretation of quantifier statements in the context of a visual scene requires non-trivial inference mechanisms. For the example of "most", we discuss two strategies which rely on fundamentally different cognitive concepts. Our aim is to identify what strategy deep learning models for visual question answering learn when trained on such questions. To this end, we carefully design data to replicate experiments from psycholinguistics where the same question was investigated for humans. Focusing on the FiLM visual question answering model, our experiments indicate that a form of approximate number system emerges whose performance declines with more difficult scenes as predicted by Weber's law. Moreover, we identify confounding factors, like spatial arrangement of the scene, which impede the effectiveness of this system.
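To make the reference to Weber's law concrete, here is a minimal sketch of the standard Gaussian model of an approximate number system from the psychophysics literature. It is not the paper's evaluation code and does not simulate FiLM. The function name `ans_accuracy` and the Weber fraction `w = 0.2` are illustrative assumptions. The model predicts that the accuracy of judging which of two counts is larger depends only on their ratio, so "most" becomes harder to verify as the two counts get closer.

```python
import math


def ans_accuracy(n1: int, n2: int, w: float = 0.2) -> float:
    """Predicted probability of correctly judging which count is larger.

    Assumes each count n is perceived as a Gaussian with mean n and
    standard deviation w * n, where w is the Weber fraction.
    The default w = 0.2 is an illustrative value, not taken from the paper.
    """
    if n1 == n2:
        return 0.5  # no correct answer to discriminate; chance level
    # Difference of the two noisy estimates, in units of its standard deviation.
    z = abs(n1 - n2) / (w * math.sqrt(n1 ** 2 + n2 ** 2))
    # Standard normal CDF evaluated at z.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


if __name__ == "__main__":
    # Same ratio (2:1) at different magnitudes gives the same predicted accuracy;
    # ratios closer to 1:1 give lower predicted accuracy.
    for a, b in [(2, 1), (20, 10), (10, 7), (10, 9)]:
        print(f"{a} vs {b}: {ans_accuracy(a, b):.3f}")
```

Under this model, a scene with 10 target objects and 9 distractors is predicted to be much harder than one with 10 and 5, regardless of the absolute number of objects.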

