CV | Apr 15, 2022

Image Captioning In the Transformer Age

Yang Xu, Li Li, Haiyang Xu, Songfang Huang, Fei Huang, Jianfei Cai

arXiv:2204.07374v1 | 5.77 citations | h-index: 48 | Has Code

Originality: Synthesis-oriented

AI Analysis

This is an incremental survey paper for researchers in computer vision and NLP, focusing on the role of image captioning in the Transformer era.

The paper surveys image captioning in the context of Transformers. It argues that although large-scale pre-trained models have reduced the task's prominence, image captioning remains significant, and it supports this by analyzing the task's connections with self-supervised learning paradigms.

Image Captioning (IC) has made remarkable progress by incorporating various techniques into the CNN-RNN encoder-decoder architecture. However, since CNNs and RNNs do not share a basic network component, such a heterogeneous pipeline is hard to train end-to-end, so the visual encoder learns nothing from the caption supervision. This drawback has inspired researchers to develop a homogeneous architecture that facilitates end-to-end training. The Transformer is a natural fit: it has proven its potential in both the vision and language domains and can therefore serve as the basic component of both the visual encoder and the language decoder in an IC pipeline. Meanwhile, self-supervised learning has unlocked the power of the Transformer architecture: a large-scale pre-trained model can be generalized to various tasks, including IC. The success of these large-scale models seems to weaken the importance of IC as a standalone task. However, we demonstrate that IC still has specific significance in this age by analyzing the connections between IC and some popular self-supervised learning paradigms. Due to the page limit, this short survey cites only the most important papers; more related work can be found at https://github.com/SjokerLily/awesome-image-captioning.
