B-SCST: Bayesian Self-Critical Sequence Training for Image Captioning
This work addresses the need for better uncertainty quantification and interpretability in image captioning models, which is critical for practical applications. The contribution is incremental in that it builds on the existing SCST method.
The authors propose B-SCST, a Bayesian variant of Self-Critical Sequence Training for image captioning, and report higher CIDEr-D scores than standard SCST on the Flickr30k, MS COCO, and VizWiz datasets.
Bayesian deep neural networks (DNNs) can provide a mathematically grounded framework to quantify uncertainty in predictions from image captioning models. We propose a Bayesian variant of a policy-gradient-based reinforcement learning training technique for image captioning models that directly optimizes non-differentiable image captioning quality metrics such as CIDEr-D. We extend the well-known Self-Critical Sequence Training (SCST) approach for image captioning models by incorporating Bayesian inference, and refer to the result as B-SCST. The "baseline" for the policy gradients in B-SCST is generated by averaging the predictive quality metrics (CIDEr-D) of captions drawn from the distribution obtained using a Bayesian DNN model. We infer this predictive distribution using Monte Carlo (MC) dropout approximate variational inference. We show that B-SCST improves CIDEr-D scores on the Flickr30k, MS COCO, and VizWiz image captioning datasets compared to the SCST approach. We also provide a study of uncertainty quantification for the predicted captions and demonstrate that it correlates well with the CIDEr-D scores. To our knowledge, this is the first such analysis, and it can improve the interpretability of image captioning model outputs, which is critical for practical applications.
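The baseline described above can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_reward` is a hypothetical stand-in for CIDEr-D (a simple token-overlap ratio), the captions stand in for samples that would come from stochastic forward passes with dropout left active, and the log-probabilities are made-up numbers. Only the structure follows the abstract: score each sampled caption, average the scores to form the baseline, and weight each sample's log-probability by its reward minus that baseline.

```python
from statistics import mean


def toy_reward(caption, reference):
    """Hypothetical stand-in for CIDEr-D: share of caption tokens found in the reference."""
    tokens = caption.split()
    ref_tokens = set(reference.split())
    if not tokens:
        return 0.0
    return sum(tok in ref_tokens for tok in tokens) / len(tokens)


def bayesian_baseline(mc_captions, reference, reward_fn=toy_reward):
    """Average the rewards of captions sampled via MC dropout (assumed given here)."""
    rewards = [reward_fn(c, reference) for c in mc_captions]
    return mean(rewards), rewards


def policy_gradient_loss(log_probs, rewards, baseline):
    """REINFORCE-style surrogate loss: -(reward - baseline) * log-prob, averaged over samples."""
    advantages = [r - baseline for r in rewards]
    loss = -mean([a * lp for a, lp in zip(advantages, log_probs)])
    return loss, advantages


# Illustrative inputs only; in practice these come from a captioning model.
reference = "a dog runs on the beach"
mc_captions = ["a dog runs on the beach", "a dog on grass", "a cat sits"]
log_probs = [-2.0, -3.5, -1.5]  # assumed sequence log-probabilities of each sample

baseline, rewards = bayesian_baseline(mc_captions, reference)
loss, advs = policy_gradient_loss(log_probs, rewards, baseline)
```

Captions that score above the averaged baseline receive a positive advantage and are reinforced, while those below it are discouraged. Standard SCST, by contrast, uses the reward of a single greedily decoded caption as its baseline.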