Activitynet 2019 Task 3: Exploring Contexts for Dense Captioning Events in Videos
This work addresses the challenge of generating accurate and diverse captions for multiple events in videos, which is important for video understanding applications. The contribution is incremental, however: it builds on existing methods through systematic exploration.
The paper tackles dense captioning of events in long untrimmed videos by exploring different captioning models with various contexts, achieving state-of-the-art performance with a METEOR score of 9.91 on the challenge testing set.
Contextual reasoning is essential to understanding events in long untrimmed videos. In this work, we systematically explore different captioning models with various contexts for the dense-captioning events in videos task, which aims to generate captions for the different events in an untrimmed video. We propose five types of contexts and two categories of event captioning models, and evaluate their contributions to event captioning in terms of both accuracy and diversity. The proposed captioning models are plugged into our pipeline system for the dense video captioning challenge. The overall system achieves state-of-the-art performance on the dense-captioning events in videos task, with a METEOR score of 9.91 on the challenge testing set.