Discriminability objective for training descriptive captions
This work addresses the issue of non-discriminative captions in image captioning systems. The contribution is incremental: it enhances existing methods rather than introducing a new paradigm.
The paper addresses the lack of discriminability in generated image captions. It proposes a training objective that incorporates a loss component for disambiguating image/caption matches. The resulting captions are much more discriminative according to human evaluation and also improve on standard scores such as BLEU and SPICE.
One property that remains lacking in image captions generated by contemporary methods is discriminability: the ability to tell two images apart given the caption for one of them. We propose a way to improve this aspect of caption generation. By incorporating into the captioning training objective a loss component directly related to the ability of a machine to disambiguate image/caption matches, we obtain systems that produce much more discriminative captions, according to human evaluation. Remarkably, our approach also improves other aspects of the generated captions, as reflected by a battery of standard scores such as BLEU and SPICE. Our approach is modular and can be applied to a variety of model/loss combinations commonly proposed for image captioning.
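As a rough illustration of how a disambiguation term might be folded into a captioning objective, the sketch below adds a hinge-based contrastive loss over a batch of image/caption similarity scores to an ordinary captioning loss. The abstract does not specify the form of the loss, so everything here is an assumption made for illustration: the function names (contrastive_loss, combined_objective), the similarity matrix sim, the margin, and the weight are hypothetical, and the paper's actual formulation may differ.

```python
from typing import List


def contrastive_loss(sim: List[List[float]], margin: float = 0.2) -> float:
    """Hinge-based image/caption matching loss (illustrative assumption).

    sim[i][j] is a hypothetical similarity score between image i and the
    caption generated for image j. Diagonal entries are the true matches.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        pos = sim[i][i]
        for j in range(n):
            if j == i:
                continue
            # The caption for image j should score lower on image i
            # than image i's own caption does.
            total += max(0.0, margin + sim[i][j] - pos)
            # Image j should score lower on caption i than image i does.
            total += max(0.0, margin + sim[j][i] - pos)
    return total / n


def combined_objective(
    caption_loss: float,
    sim: List[List[float]],
    weight: float = 1.0,
    margin: float = 0.2,
) -> float:
    """Captioning loss plus a weighted discriminability term.

    caption_loss stands in for whatever base loss the captioner uses;
    weight is a hypothetical trade-off hyperparameter.
    """
    return caption_loss + weight * contrastive_loss(sim, margin)


if __name__ == "__main__":
    # Captions that separate the two images well incur no penalty.
    well_separated = [[1.0, 0.0], [0.0, 1.0]]
    # Captions that fit both images equally well are penalized.
    ambiguous = [[0.5, 0.5], [0.5, 0.5]]
    print(contrastive_loss(well_separated))
    print(contrastive_loss(ambiguous))
    print(combined_objective(2.0, ambiguous, weight=0.5))
```

Because the discriminability term is simply added to the base loss, it can be attached to different captioning models and losses without changing them, which is one way to read the modularity claim in the abstract.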