CV · LG · May 18, 2018

Improving Image Captioning with Conditional Generative Adversarial Nets

arXiv:1805.07112v4 · 105 citations
Originality: Incremental advance
AI Analysis

This paper addresses evaluation inconsistencies in image captioning, a concern for AI and computer vision researchers. The advance is incremental because it builds on existing RL-based methods.

The authors tackled inconsistent evaluation in image captioning by introducing a conditional GAN framework with discriminator networks to distinguish human vs. machine-generated captions, achieving consistent improvements across all language metrics for state-of-the-art models.

In this paper, we propose a novel conditional-generative-adversarial-nets-based image captioning framework as an extension of the traditional reinforcement-learning (RL)-based encoder-decoder architecture. To deal with the inconsistent evaluation problem among different objective language metrics, we are motivated to design "discriminator" networks that automatically and progressively determine whether a generated caption is human-described or machine-generated. Two kinds of discriminator architectures (CNN-based and RNN-based structures) are introduced, since each has its own advantages. The proposed algorithm is generic, so it can enhance any existing RL-based image captioning framework, and we show that the conventional RL training method is just a special case of our approach. Empirically, we show consistent improvements over all language evaluation metrics for different state-of-the-art image captioning models. In addition, the well-trained discriminators can also be viewed as objective image captioning evaluators.
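One way to read the abstract's claim that conventional RL training is a special case is sketched below. The sketch assumes, and the abstract does not state, that the generator's reward is a weighted mix of a discriminator score and a language-metric score. The names `blended_reward`, `reinforce_weight`, and the weight `lam` are hypothetical, and all numbers are made up for illustration.

```python
def blended_reward(disc_prob, metric_score, lam):
    """Mix a discriminator score with a language-metric score.

    Assumption: lam is a hypothetical mixing weight in [0, 1].
    With lam = 0 the discriminator is ignored and the reward reduces
    to the metric-only reward used in conventional RL training.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    return lam * disc_prob + (1.0 - lam) * metric_score


def reinforce_weight(sample_reward, baseline_reward):
    """Advantage that scales the log-likelihood gradient of a sampled caption."""
    return sample_reward - baseline_reward


# Illustrative values only: a sampled caption versus a baseline caption.
sampled = blended_reward(disc_prob=0.8, metric_score=0.5, lam=0.3)
baseline = blended_reward(disc_prob=0.6, metric_score=0.7, lam=0.3)
advantage = reinforce_weight(sampled, baseline)

# Setting lam = 0 recovers the metric-only reward.
metric_only = blended_reward(disc_prob=0.8, metric_score=0.5, lam=0.0)
```

In this reading, a positive `advantage` would increase the probability of the sampled caption and a negative one would decrease it. The discriminator itself (CNN-based or RNN-based) is left out of the sketch and is represented only by its output probability.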

Code Implementations: 1 repo
