CV, Jul 31, 2019

Image Captioning with Unseen Objects

arXiv:1908.00047v1, 17 citations
AI Analysis

This paper addresses a limitation of image captioning in real-world applications, where objects unseen during training occur. It appears incremental, though, because it builds on existing zero-shot and template-based methods.

The paper tackles image captioning for images containing objects not seen during training, proposing a zero-shot learning approach that combines a novel zero-shot object detection model with a template-based sentence generator, showing promising results on the COCO dataset.

Image caption generation is a long-standing and challenging problem at the intersection of computer vision and natural language processing. A number of recently proposed approaches utilize a fully supervised object recognition model within the captioning approach. Such models, however, tend to generate sentences which only consist of objects predicted by the recognition models, excluding instances of the classes without labelled training examples. In this paper, we propose a new challenging scenario that targets the image captioning problem in a fully zero-shot learning setting, where the goal is to be able to generate captions of test images containing objects that are not seen during training. The proposed approach jointly uses a novel zero-shot object detection model and a template-based sentence generator. Our experiments show promising results on the COCO dataset.
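
To make the idea of pairing zero-shot recognition with a template-based generator concrete, here is a minimal illustrative sketch. It is not the paper's model: the helper names (`zero_shot_label`, `fill_template`), the `<obj>` slot marker, and the toy 3-dimensional embeddings are all assumptions made for illustration. It assumes each detected region has already been projected into the same space as the class-name embeddings, labels each region with the nearest class by cosine similarity (which works for classes with no training boxes), and then fills the template slots in order.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def zero_shot_label(region_embedding, class_embeddings):
    """Pick the class whose name embedding is closest to the region embedding.

    Hypothetical helper: classes need only an embedding, not training boxes.
    """
    return max(class_embeddings,
               key=lambda name: cosine(region_embedding, class_embeddings[name]))

def fill_template(template, labels):
    """Fill each '<obj>' slot (an assumed marker) with the next label, in order."""
    caption = template
    for label in labels:
        caption = caption.replace("<obj>", label, 1)
    return caption

# Toy class-name embeddings (made-up values, not real word vectors).
CLASS_EMBEDDINGS = {
    "dog": [0.9, 0.1, 0.0],
    "cat": [0.8, 0.3, 0.1],
    "bus": [0.0, 0.2, 0.9],
}

# Toy region embeddings, assumed to come from a detector's projection layer.
regions = [[0.1, 0.2, 1.0], [0.95, 0.05, 0.0]]

labels = [zero_shot_label(r, CLASS_EMBEDDINGS) for r in regions]
caption = fill_template("A <obj> next to a <obj>.", labels)
print(caption)
```

In this sketch the template carries the sentence structure while the detector supplies only the nouns, so a class with an embedding but no labelled examples can still appear in a caption.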

