CVJul 26, 2019

Cooperative image captioning

arXiv:1907.11565v12 citations
Originality Incremental advance
AI Analysis

This work addresses the problem of generating informative and natural image captions for AI systems, representing a strong specific gain in a domain-specific area.

The paper tackled the challenges of training cooperative speaker-listener networks for image captioning by introducing PSST optimization and naturalness constraints, resulting in a recall@10 improvement from 60% to 86% on the COCO benchmark while maintaining language naturalness.

When describing images with natural language, the descriptions can be made more informative if tuned using downstream tasks. This is often achieved by training two networks: a "speaker network" that generates sentences given an image, and a "listener network" that uses them to perform a task. Unfortunately, training multiple networks jointly to communicate to achieve a joint task, faces two major challenges. First, the descriptions generated by a speaker network are discrete and stochastic, making optimization very hard and inefficient. Second, joint training usually causes the vocabulary used during communication to drift and diverge from natural language. We describe an approach that addresses both challenges. We first develop a new effective optimization based on partial-sampling from a multinomial distribution combined with straight-through gradient updates, which we name PSST for Partial-Sampling Straight-Through. Second, we show that the generated descriptions can be kept close to natural by constraining them to be similar to human descriptions. Together, this approach creates descriptions that are both more discriminative and more natural than previous approaches. Evaluations on the standard COCO benchmark show that PSST Multinomial dramatically improve the recall@10 from 60% to 86% maintaining comparable language naturalness, and human evaluations show that it also increases naturalness while keeping the discriminative power of generated captions.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes