CV · Dec 15, 2021

Dense Video Captioning Using Unsupervised Semantic Information

arXiv:2112.08455v2 · 14 citations · Has Code
Originality: Incremental advance
AI Analysis

This work addresses video understanding for AI applications. It offers an incremental improvement: a learned semantic descriptor that can stand in for the audio signal in an existing multi-modal method and that strengthens visual-only methods.

The paper tackles dense video captioning by learning unsupervised semantic visual information through clustering and co-occurrence probability encoding. In the captioning subtask it achieves state-of-the-art performance among methods that use only visual features, and competitive results against multi-modal methods.

We introduce a method to learn unsupervised semantic visual information based on the premise that complex events can be decomposed into simpler events and that these simple events are shared across several complex events. We first employ a clustering method to group representations, producing a visual codebook. Then, we learn a dense representation by encoding the co-occurrence probability matrix for the codebook entries. This representation improves performance on the dense video captioning task in a scenario with only visual features. For example, we replace the audio signal in the BMT method and produce temporal proposals with comparable performance. Furthermore, we concatenate the visual representation with our descriptor in a vanilla transformer method to achieve state-of-the-art performance in the captioning subtask compared to the methods that explore only visual features, as well as competitive performance with multi-modal methods. Our code is available at https://github.com/valterlej/dvcusi.
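The pipeline described in the abstract (cluster clip features into a codebook, build a co-occurrence probability matrix over codebook entries, encode it densely, then concatenate the result with the visual features) can be illustrated with a minimal sketch. This is not the authors' implementation, which lives in the linked repository. Several choices here are assumptions made only for illustration: plain k-means as the clustering method, a fixed temporal window for counting co-occurrences, a truncated SVD as the dense encoder, synthetic random features, and all function names (`build_codebook`, `cooccurrence_probabilities`, `dense_embeddings`).

```python
import numpy as np


def build_codebook(features, k, iters=20, seed=0):
    """Cluster clip features with plain k-means (an assumed stand-in)."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(iters):
        # Squared distance from every feature to every center: shape (n, k).
        dist = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            members = features[labels == j]
            if len(members):  # leave a center untouched if its cluster is empty
                centers[j] = members.mean(axis=0)
    dist = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, dist.argmin(axis=1)


def cooccurrence_probabilities(label_sequences, k, window=2):
    """Row-normalized counts of codebook entries seen within `window` clips."""
    counts = np.zeros((k, k))
    for seq in label_sequences:
        for i, a in enumerate(seq):
            for b in seq[i + 1 : i + 1 + window]:
                counts[a, b] += 1
                counts[b, a] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows for codebook entries that never co-occur stay all zeros.
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)


def dense_embeddings(prob, dim):
    """Encode the matrix with a truncated SVD (assumed encoder, not the paper's)."""
    u, s, _ = np.linalg.svd(prob, full_matrices=False)
    return u[:, :dim] * np.sqrt(s[:dim])


# Synthetic stand-in data: 3 videos, 30 clips each, 8-dimensional features.
rng = np.random.default_rng(0)
videos = [rng.normal(size=(30, 8)) for _ in range(3)]
all_feats = np.concatenate(videos)

centers, labels = build_codebook(all_feats, k=5)
sequences = np.split(labels, [30, 60])  # back to one label sequence per video
prob = cooccurrence_probabilities(sequences, k=5)
emb = dense_embeddings(prob, dim=3)

# Concatenate each clip's visual feature with its codebook entry's embedding.
augmented = np.concatenate([all_feats, emb[labels]], axis=1)
print(augmented.shape)
```

Each clip inherits the embedding of its codebook entry, so clips that fall into the same cluster share a semantic descriptor while keeping their own visual feature. In a real setting the random features would be replaced by features from a pretrained video backbone, and the codebook size, window, and embedding dimension would need tuning.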

Code Implementations: 1 repo