CV | May 22, 2022

GL-RG: Global-Local Representation Granularity for Video Captioning

Liqi Yan, Qifan Wang, Yiming Cui, Fuli Feng, Xiaojun Quan, Xiangyu Zhang, Dongfang Liu

arXiv:2205.10706v2 | 12.771 citations | h-index: 86 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the challenge of generating accurate natural language descriptions from videos, which is important for applications like accessibility and video search, but it appears incremental as it builds on existing methods with novel components.

The authors tackled the video captioning problem by proposing a GL-RG framework that models global-local representations across video frames, resulting in significant performance improvements on MSR-VTT and MSVD datasets.

Video captioning is a challenging task as it needs to accurately transform visual understanding into natural language description. To date, state-of-the-art methods inadequately model global-local representation across video frames for caption generation, leaving plenty of room for improvement. In this work, we approach the video captioning task from a new perspective and propose a GL-RG framework for video captioning, namely a Global-Local Representation Granularity. Our GL-RG demonstrates three advantages over the prior efforts: 1) we explicitly exploit extensive visual representations from different video ranges to improve linguistic expression; 2) we devise a novel global-local encoder to produce rich semantic vocabulary to obtain a descriptive granularity of video contents across frames; 3) we develop an incremental training strategy which organizes model learning in an incremental fashion to incur an optimal captioning behavior. Experimental results on the challenging MSR-VTT and MSVD datasets show that our GL-RG outperforms recent state-of-the-art methods by a significant margin. Code is available at https://github.com/ylqi/GL-RG.
