CV | Sep 4, 2025

Sali4Vid: Saliency-Aware Video Reweighting and Adaptive Caption Retrieval for Dense Video Captioning

MinJu Jeon, Si-Woo Kim, Ye-Chan Kim, HyunGee Kim, Dong-Jin Kim

arXiv:2509.04602v111.84 citations | h-index: 4 | EMNLP

Originality: Incremental advance

AI Analysis

This work improves dense video captioning for applications such as video understanding and retrieval. The advance is incremental: it builds on prior methods and adds specific enhancements.

The paper addresses two limitations of existing dense video captioning methods: all video frames are treated equally, and captions are retrieved from fixed-size video chunks. It proposes Sali4Vid, which combines saliency-aware video reweighting with adaptive caption retrieval, and reports state-of-the-art results on the YouCook2 and ViTT datasets.

Dense video captioning aims to temporally localize events in video and generate captions for each event. While recent works propose end-to-end models, they suffer from two limitations: (1) applying timestamp supervision only to text while treating all video frames equally, and (2) retrieving captions from fixed-size video chunks, overlooking scene transitions. To address these, we propose Sali4Vid, a simple yet effective saliency-aware framework. We introduce Saliency-aware Video Reweighting, which converts timestamp annotations into sigmoid-based frame importance weights, and Semantic-based Adaptive Caption Retrieval, which segments videos by frame similarity to capture scene transitions and improve caption retrieval. Sali4Vid achieves state-of-the-art results on YouCook2 and ViTT, demonstrating the benefit of jointly improving video weighting and retrieval for dense video captioning.
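The two ideas named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the exact formulas, so everything below is an assumption made for illustration. The function names (`saliency_weights`, `segment_by_similarity`), the product-of-two-sigmoids window, the `sharpness` and `threshold` parameters, and the use of cosine similarity between adjacent frame features are all hypothetical choices.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def saliency_weights(num_frames, events, fps=1.0, sharpness=1.0):
    """Per-frame importance weights from (start_sec, end_sec) annotations.

    Assumption: each event contributes a soft window formed by two sigmoids
    (rising at the start time, falling at the end time). Overlapping events
    are merged by taking the maximum weight.
    """
    weights = [0.0] * num_frames
    for i in range(num_frames):
        t = i / fps  # timestamp of frame i in seconds
        for start, end in events:
            w = sigmoid(sharpness * (t - start)) * sigmoid(sharpness * (end - t))
            weights[i] = max(weights[i], w)
    return weights


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def segment_by_similarity(features, threshold=0.8):
    """Split frames into contiguous (start, end) index ranges, end exclusive.

    Assumption: a new segment starts wherever the similarity between
    adjacent frame features drops below `threshold`.
    """
    if not features:
        return []
    segments = []
    start = 0
    for i in range(1, len(features)):
        if cosine(features[i - 1], features[i]) < threshold:
            segments.append((start, i))
            start = i
    segments.append((start, len(features)))
    return segments
```

In this sketch, frames well inside an annotated event receive weights near 1, frames far outside receive weights near 0, and the variable-length segments returned by `segment_by_similarity` would stand in for the fixed-size chunks when querying captions.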

