CV CL Jan 4, 2022

Variational Stacked Local Attention Networks for Diverse Video Captioning

Tonmoay Deb, Akib Sadmanee, Kishor Kumar Bhaumik, Amin Ahsan Ali, M Ashraful Amin, A K M Mahbubur Rahman

arXiv:2201.00985v13.710 citations

Originality: Incremental advance

AI Analysis

This paper addresses the lack of caption diversity in video captioning, which matters for applications that need several descriptive perspectives on the same event, though the contribution appears to be an incremental advance in the domain.

The paper tackles the problem of generating diverse captions for videos by proposing VSLAN, which uses low-rank bilinear pooling and stacked feature streams to improve fine-grained visual representation and diversity encoding. The model achieves CIDEr score improvements of 7.8% on MSVD and 4.5% on MSR-VTT datasets compared to existing methods.

When describing spatio-temporal events in natural language, video captioning models rely mostly on the encoder's latent visual representation. Recent encoder-decoder models attend to encoder features mainly through linear interaction with the decoder. However, the growing complexity of models for visual data calls for more explicit feature interaction to capture fine-grained information, which is currently absent in the video captioning domain. Moreover, feature aggregation methods have been used to obtain richer visual representations, either by concatenation or by using a linear layer. Although the feature sets for a video overlap semantically to some extent, these approaches result in objective mismatch and feature redundancy. In addition, diversity in captions is fundamental to expressing one event from several meaningful perspectives, yet it is currently missing in the temporal domain, i.e., video captioning. To this end, we propose the Variational Stacked Local Attention Network (VSLAN), which exploits low-rank bilinear pooling for self-attentive feature interaction and stacks multiple video feature streams in a discount fashion. Each feature stack's learned attributes contribute to our proposed diversity encoding module, followed by the decoding query stage, to facilitate end-to-end generation of diverse and natural captions without any explicit supervision on attributes. We evaluate VSLAN on the MSVD and MSR-VTT datasets in terms of syntax and diversity. The CIDEr score of VSLAN outperforms current off-the-shelf methods by $7.8\%$ on MSVD and $4.5\%$ on MSR-VTT. On the same datasets, VSLAN achieves competitive results in caption diversity metrics.
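To make the phrase "low-rank bilinear pooling for self-attentive feature interaction" more concrete, the following is a minimal NumPy sketch of a generic low-rank bilinear attention step. It is not taken from the paper and should not be read as VSLAN's actual architecture: the function name `low_rank_bilinear_attention`, the projection matrices `U` and `V`, the scoring vector `w`, the `tanh` nonlinearity, and all dimensions are illustrative assumptions. The idea shown is that a query and each feature are projected into a shared rank-`r` space, combined by an element-wise product instead of a full bilinear form, and then scored to produce attention weights.

```python
import numpy as np


def low_rank_bilinear_attention(query, feats, U, V, w):
    """Generic low-rank bilinear attention (illustrative, not the paper's code).

    query: (d_q,)    one query vector
    feats: (n, d_f)  n feature vectors, e.g. per-frame video features
    U:     (d_q, r)  hypothetical query projection into the rank-r space
    V:     (d_f, r)  hypothetical feature projection into the rank-r space
    w:     (r,)      hypothetical vector mapping joint features to a score
    """
    q_proj = np.tanh(query @ U)        # (r,)
    f_proj = np.tanh(feats @ V)        # (n, r)
    # The element-wise product replaces a full d_q x d_f bilinear form.
    joint = f_proj * q_proj            # (n, r), q_proj broadcast over rows
    logits = joint @ w                 # (n,)
    logits = logits - logits.max()     # shift for numerical stability
    weights = np.exp(logits) / np.exp(logits).sum()
    attended = weights @ feats         # (d_f,) weighted sum of the features
    return attended, weights


# Assumed toy sizes: 8 frames, 16-d query, 32-d features, rank 4.
rng = np.random.default_rng(0)
n, d_q, d_f, r = 8, 16, 32, 4
query = rng.standard_normal(d_q)
feats = rng.standard_normal((n, d_f))
U = rng.standard_normal((d_q, r)) * 0.1
V = rng.standard_normal((d_f, r)) * 0.1
w = rng.standard_normal(r)

attended, weights = low_rank_bilinear_attention(query, feats, U, V, w)
```

In this sketch the query-feature interaction is multiplicative rather than a purely linear combination, while the parameter count scales with `r * (d_q + d_f)` instead of `d_q * d_f`, which is the usual motivation for the low-rank form.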
