CV · Apr 13, 2022

Semantic-Aware Pretraining for Dense Video Captioning

Teng Wang, Zhu Liu, Feng Zheng, Zhichao Lu, Ran Cheng, Ping Luo

arXiv:2204.07449v1 · 3.75 citations · h-index: 29

Originality: Incremental advance

AI Analysis

This work addresses the problem of generating accurate captions for multiple events in a video. It is aimed primarily at computer vision researchers and is an incremental advance, building on existing dense video captioning methods.

The authors tackle dense video captioning with a semantic-aware pretraining method that helps the learned features recognize high-level semantic concepts. Their final ensemble model achieves a 10.00 METEOR score on the test set.

This report describes the details of our approach for the event dense-captioning task in ActivityNet Challenge 2021. We present a semantic-aware pretraining method for dense video captioning, which empowers the learned features to recognize high-level semantic concepts. Diverse video features of different modalities are fed into an event captioning module to generate accurate and meaningful sentences. Our final ensemble model achieves a 10.00 METEOR score on the test set.
