CV Dec 15, 2021

Dense Video Captioning Using Unsupervised Semantic Information

Valter Estevam, Rayson Laroca, Helio Pedrini, David Menotti

arXiv:2112.08455v26.514 citations Has Code

Originality: Incremental advance

AI Analysis

This work addresses dense video captioning, offering an incremental improvement: an unsupervised semantic descriptor learned from visual features replaces the audio signal and strengthens visual-only methods.

The paper tackles dense video captioning by learning unsupervised semantic visual information through clustering and co-occurrence probability encoding, achieving state-of-the-art performance on the captioning subtask among methods that use only visual features, and competitive results against multi-modal methods.

We introduce a method to learn unsupervised semantic visual information based on the premise that complex events can be decomposed into simpler events and that these simple events are shared across several complex events. We first employ a clustering method to group representations producing a visual codebook. Then, we learn a dense representation by encoding the co-occurrence probability matrix for the codebook entries. This representation leverages the performance of the dense video captioning task in a scenario with only visual features. For example, we replace the audio signal in the BMT method and produce temporal proposals with comparable performance. Furthermore, we concatenate the visual representation with our descriptor in a vanilla transformer method to achieve state-of-the-art performance in the captioning subtask compared to the methods that explore only visual features, as well as a competitive performance with multi-modal methods. Our code is available at https://github.com/valterlej/dvcusi.
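The first two steps of the abstract (clustering clip representations into a visual codebook, then building a co-occurrence probability matrix over codebook entries) can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: it assumes plain k-means as the clustering method, toy 2-D clip features, and a symmetric temporal window for counting co-occurrences. The names `kmeans` and `cooccurrence_probabilities`, the window size, and the data are hypothetical. The final step, encoding the matrix into a dense descriptor, is left out; the repository linked above is the reference for the actual method.

```python
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means (an assumed stand-in for the clustering step).

    Returns (centroids, assignments), where assignments[i] is the
    codebook entry assigned to points[i].
    """
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest centroid (squared Euclidean).
        assign = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Move each centroid to the mean of its members (skip empty clusters).
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign


def cooccurrence_probabilities(sequences, k, window=1):
    """Row-normalised co-occurrence matrix over codebook entries.

    sequences: one list of codebook ids per video, in temporal order.
    P[i][j] estimates how often entry j appears within `window` clips
    of entry i; rows with no observations stay all-zero.
    """
    counts = [[0.0] * k for _ in range(k)]
    for seq in sequences:
        for t, i in enumerate(seq):
            lo, hi = max(0, t - window), min(len(seq), t + window + 1)
            for u in range(lo, hi):
                if u != t:
                    counts[i][seq[u]] += 1.0
    probs = []
    for row in counts:
        total = sum(row)
        probs.append([c / total if total else 0.0 for c in row])
    return probs


# Toy data (hypothetical): two videos, each a sequence of 2-D clip features.
video_a = [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1), (5.1, 5.0)]
video_b = [(5.0, 4.9), (0.1, 0.1), (9.9, 0.0), (10.0, 0.1)]
clips = video_a + video_b

K = 3
centroids, assign = kmeans(clips, K)
sequences = [assign[:len(video_a)], assign[len(video_a):]]
P = cooccurrence_probabilities(sequences, K, window=1)
```

Clustering all clips jointly before splitting the assignments back per video reflects the premise that simple events are shared across several complex events: the same codebook entry can then appear in more than one video.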

