CV, CL · May 31, 2015

Visual Madlibs: Fill in the blank Image Generation and Question Answering

arXiv:1506.00278v1 · 98 citations
Originality: Synthesis-oriented
AI Analysis

The dataset addresses the need for structured, targeted natural language descriptions in computer vision and supports tasks such as image captioning and visual question answering. The contribution is incremental, since it builds on existing datasets and methods.

The paper introduces the Visual Madlibs dataset with 360,001 descriptions for 10,738 images, collected using fill-in-the-blank templates to target specific aspects like people, objects, and scenes, and demonstrates its use for focused description generation and multiple-choice question-answering tasks with promising experimental results.

In this paper, we introduce a new dataset consisting of 360,001 focused natural language descriptions for 10,738 images. This dataset, the Visual Madlibs dataset, is collected using automatically produced fill-in-the-blank templates designed to gather targeted descriptions about: people and objects, their appearances, activities, and interactions, as well as inferences about the general scene or its broader context. We provide several analyses of the Visual Madlibs dataset and demonstrate its applicability to two new description generation tasks: focused description generation, and multiple-choice question-answering for images. Experiments using joint-embedding and deep learning methods show promising results on these tasks.
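As a rough illustration of the multiple-choice task described above, the sketch below fills a fill-in-the-blank template with the candidate answer that lies closest to the image in a shared embedding space. It is a toy sketch under stated assumptions, not the paper's method: the template string, the candidate answers, the three-dimensional vectors, and the helper names `cosine`, `fill_template`, and `pick_answer` are all hypothetical, and a real system would obtain the vectors from learned image and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def fill_template(template, answer, blank="_____"):
    """Insert an answer into the first blank of a fill-in-the-blank template."""
    return template.replace(blank, answer, 1)


def pick_answer(image_vec, candidate_vecs):
    """Return the candidate whose embedding is most similar to the image embedding."""
    scores = {ans: cosine(image_vec, vec) for ans, vec in candidate_vecs.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical, hand-made embeddings; stand-ins for encoder outputs.
image_vec = [0.9, 0.1, 0.2]
candidates = {
    "playing frisbee": [0.8, 0.2, 0.1],
    "cooking dinner": [0.1, 0.9, 0.3],
    "reading a book": [0.2, 0.3, 0.9],
    "riding a horse": [0.5, 0.5, 0.5],
}
template = "The person is _____."  # hypothetical template, not taken from the dataset

best, scores = pick_answer(image_vec, candidates)
print(fill_template(template, best))
```

Scoring every candidate against the same image vector keeps the selection step independent of how the embeddings are produced, so the toy vectors could be swapped for real encoder outputs without changing `pick_answer`.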

