CV · Sep 15, 2020

Video captioning with stacked attention and semantic hard pull

arXiv:2009.07335v3 · 2 citations
Originality: Incremental advance
AI Analysis

This work addresses the complex problem of generating accurate captions for videos, which is important for applications in computer vision and natural language processing, but it appears incremental as it builds on existing architectures.

The paper tackles video captioning by proposing a novel architecture called Semantically Sensible Video Captioning (SSVC) that uses stacked attention and spatial hard pull to improve context generation, resulting in enhanced performance over state-of-the-art models as measured by BLEU and a new human evaluation metric.

Video captioning, i.e. the task of generating captions from video sequences, creates a bridge between the Natural Language Processing and Computer Vision domains of computer science. The task of generating a semantically accurate description of a video is quite complex. Considering the complexity of the problem, the results obtained in recent research works are praiseworthy. However, there is plenty of scope for further investigation. This paper addresses this scope and proposes a novel solution. Most video captioning models comprise two sequential/recurrent layers: one as a video-to-context encoder and the other as a context-to-caption decoder. This paper proposes a novel architecture, namely Semantically Sensible Video Captioning (SSVC), which modifies the context generation mechanism by using two novel approaches: "stacked attention" and "spatial hard pull". As there are no exclusive metrics for evaluating video captioning models, we emphasize both quantitative and qualitative analysis of our model. Hence, we have used the BLEU scoring metric for quantitative analysis and have proposed a human evaluation metric for qualitative analysis, namely the Semantic Sensibility (SS) scoring metric. The SS score overcomes the shortcomings of common automated scoring metrics. This paper reports that the use of the aforementioned novelties improves the performance of state-of-the-art architectures.
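For readers unfamiliar with the BLEU metric mentioned in the abstract, the following is a minimal sketch of sentence-level BLEU (clipped n-gram precision with a brevity penalty). It is a generic illustration, not the authors' evaluation code: the function names `ngrams` and `bleu`, the uniform weights, the absence of smoothing, and the whitespace tokenization in the usage lines are all assumptions, and the abstract does not state which BLEU configuration was used.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU with uniform weights and no smoothing (illustrative only)."""
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        total = sum(cand_counts.values())
        if total == 0:
            return 0.0  # candidate is shorter than n tokens
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        if clipped == 0:
            return 0.0  # without smoothing, one zero precision zeroes the score
        log_precision_sum += math.log(clipped / total)
    # Brevity penalty uses the reference length closest to the candidate length.
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(log_precision_sum / max_n)


# Hypothetical captions, tokenized by whitespace.
reference = "a man plays a guitar".split()
candidate = "a man plays guitar".split()
score = bleu(candidate, [reference], max_n=2)
```

Because the score depends only on n-gram overlap, a caption can score well while being semantically wrong, which is the kind of shortcoming the abstract says the human-rated SS score is meant to address.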

Code Implementations: 1 repo
