CL | CV | LG | Mar 27, 2023

Curriculum Learning for Compositional Visual Reasoning

arXiv:2303.15006v1 | 3 citations | h-index: 21
Originality: Synthesis-oriented
AI Analysis

This work addresses the high computational and data demands in VQA, offering an incremental improvement for researchers and practitioners in computer vision and AI.

The paper addresses the training cost and data requirements of Visual Question Answering (VQA) with Neural Module Networks (NMN) on the GQA dataset. It applies Curriculum Learning (CL) together with predefined cross-modal embeddings and intermediate losses, which greatly reduces training cost and data usage with limited impact on final accuracy.

Visual Question Answering (VQA) is a complex task requiring large datasets and expensive training. Neural Module Networks (NMN) first translate the question to a reasoning path, then follow that path to analyze the image and provide an answer. We propose an NMN method that relies on predefined cross-modal embeddings to "warm start" learning on the GQA dataset, then focus on Curriculum Learning (CL) as a way to improve training and make better use of the data. Several difficulty criteria are employed to define CL methods. We show that, with an appropriate choice of CL method, the cost of training and the amount of training data can be greatly reduced, with a limited impact on the final VQA accuracy. Furthermore, we introduce intermediate losses during training and find that this allows the CL strategy to be simplified.
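The curriculum idea in the abstract can be illustrated with a minimal sketch. It assumes one possible difficulty criterion, the length of the reasoning path, and a staged schedule in which each stage adds harder examples to the training pool. The abstract does not say which criteria or schedule the authors use, so the `Example` class, the `difficulty` function, the module names, and the `curriculum_stages` helper are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    question: str
    program: List[str]  # reasoning path from the question parser (module names are made up)


def difficulty(example: Example) -> int:
    # Assumed criterion: a longer reasoning path means a harder example.
    return len(example.program)


def curriculum_stages(examples: List[Example], num_stages: int) -> List[List[Example]]:
    """Return cumulative training pools, easiest examples first.

    Stage k (1-based) holds roughly the easiest k/num_stages fraction of the data,
    so the final stage contains every example.
    """
    ordered = sorted(examples, key=difficulty)  # stable sort keeps ties in input order
    stages = []
    for k in range(1, num_stages + 1):
        cutoff = max(1, (len(ordered) * k) // num_stages)
        stages.append(ordered[:cutoff])
    return stages


data = [
    Example("Is the cup red?", ["select", "verify_color"]),
    Example("What is left of the red cup on the table?",
            ["select", "filter_color", "relate", "relate", "query_name"]),
    Example("Is there a dog?", ["exist"]),
    Example("What color is the cup?", ["select", "query_color"]),
]

stages = curriculum_stages(data, num_stages=2)
stage_sizes = [len(stage) for stage in stages]
```

A training loop would iterate over `stages` in order and run one or more epochs on each pool. Stopping before the last stage is one way to train on only part of the data, which is the kind of trade-off between data usage and accuracy that the abstract describes.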

