CV · CL · May 31, 2015

Visual Madlibs: Fill in the blank Image Generation and Question Answering

Licheng Yu, Eunbyung Park, Alexander C. Berg, Tamara L. Berg

arXiv:1506.00278v1 · 24.898 citations

Originality: Synthesis-oriented

AI Analysis

This dataset addresses the need for structured, targeted natural language descriptions in computer vision and supports tasks such as image captioning and visual question answering. The contribution is incremental, since it builds on existing datasets and methods.

The paper introduces the Visual Madlibs dataset with 360,001 descriptions for 10,738 images, collected using fill-in-the-blank templates to target specific aspects like people, objects, and scenes, and demonstrates its use for focused description generation and multiple-choice question-answering tasks with promising experimental results.

In this paper, we introduce a new dataset consisting of 360,001 focused natural language descriptions for 10,738 images. This dataset, the Visual Madlibs dataset, is collected using automatically produced fill-in-the-blank templates designed to gather targeted descriptions about: people and objects, their appearances, activities, and interactions, as well as inferences about the general scene or its broader context. We provide several analyses of the Visual Madlibs dataset and demonstrate its applicability to two new description generation tasks: focused description generation, and multiple-choice question-answering for images. Experiments using joint-embedding and deep learning methods show promising results on these tasks.
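The multiple-choice question-answering task mentioned above can be illustrated with a minimal sketch of joint-embedding scoring. It assumes an image and each candidate answer have already been mapped into a shared vector space, and it picks the candidate closest to the image. The function names, the toy vectors, and the use of cosine similarity are all illustrative assumptions, not the paper's actual models or features.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def answer_multiple_choice(image_vec, choice_vecs):
    """Return (index of best-scoring choice, list of all scores).

    Hypothetical helper: assumes image and text vectors already live in
    one shared embedding space.
    """
    scores = [cosine(image_vec, c) for c in choice_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Made-up embeddings for one image and three candidate fill-ins
# for a hypothetical prompt such as "The person is ____."
image_vec = [0.9, 0.1, 0.0]
choices = {
    "playing frisbee": [0.8, 0.2, 0.1],
    "eating pizza": [0.0, 0.1, 0.9],
    "riding a horse": [0.1, 0.9, 0.0],
}
labels = list(choices)
best, scores = answer_multiple_choice(image_vec, [choices[k] for k in labels])
print(labels[best])
```

In a real system the vectors would come from learned image and text encoders; only the final ranking step is shown here.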
