CV | May 19, 2022

Support-set based Multi-modal Representation Enhancement for Video Captioning

Xiaoya Chen, Jingkuan Song, Pengpeng Zeng, Lianli Gao, Heng Tao Shen

arXiv:2205.09307v11.46 citations | h-index: 61 | Has Code

Originality: Incremental advance

AI Analysis

This paper addresses the challenge of generating more informative video captions, though it appears incremental, as it builds on existing methods by enhancing semantic associations between samples.

The paper tackles the problem of rigid and uninformative expressions in video captioning by proposing a support-set based framework to mine semantic associations between samples, achieving state-of-the-art performance on MSVD and MSR-VTT datasets.

Video captioning is a challenging task that necessitates a thorough comprehension of visual scenes. Existing methods follow a typical one-to-one mapping, which concentrates on a limited sample space while ignoring the intrinsic semantic associations between samples, resulting in rigid and uninformative expressions. To address this issue, we propose a novel and flexible framework, namely Support-set based Multi-modal Representation Enhancement (SMRE) model, to mine rich information in a semantic subspace shared between samples. Specifically, we propose a Support-set Construction (SC) module to construct a support-set to learn underlying connections between samples and obtain semantic-related visual elements. During this process, we design a Semantic Space Transformation (SST) module to constrain relative distance and administrate multi-modal interactions in a self-supervised way. Extensive experiments on MSVD and MSR-VTT datasets demonstrate that our SMRE achieves state-of-the-art performance.
