CV · Sep 18, 2020

Reinforcement Learning for Weakly Supervised Temporal Grounding of Natural Language in Untrimmed Videos

Jie Wu, Guanbin Li, Xiaoguang Han, Liang Lin

arXiv:2009.08614v1 · 13.263 citations

Originality: Highly original

AI Analysis

This work addresses a fundamental multimedia task for cross-media retrieval, offering a more practical solution by using readily available weak labels, though it is incremental as it extends reinforcement learning to a weakly supervised setting.

The paper tackles weakly supervised temporal grounding of natural language in untrimmed videos, where only video-level descriptions are available and no temporal boundaries are annotated. It proposes a Boundary Adaptive Refinement (BAR) framework that uses reinforcement learning to refine temporal boundaries progressively. On the Charades-STA and ActivityNet benchmarks, BAR outperforms the state-of-the-art weakly supervised method and even beats some competitive fully supervised ones.

Temporal grounding of natural language in untrimmed videos is a fundamental yet challenging multimedia task that facilitates cross-media visual content retrieval. We focus on the weakly supervised setting of this task, which has access only to coarse video-level language descriptions without temporal boundary annotations. This setting is closer to reality, since such weak labels are more readily available in practice. In this paper, we propose a Boundary Adaptive Refinement (BAR) framework that uses reinforcement learning (RL) to guide the progressive refinement of the temporal boundary. To the best of our knowledge, this is the first attempt to extend RL to the temporal localization task under weak supervision. Because a straightforward reward function is non-trivial to obtain in the absence of fine-grained pairwise boundary-query annotations, we design a cross-modal alignment evaluator that measures the degree of alignment of each segment-query pair and provides tailored rewards. This refinement scheme completely abandons the traditional sliding-window-based solution and yields more efficient, boundary-flexible and content-aware grounding results. Extensive experiments on two public benchmarks, Charades-STA and ActivityNet, demonstrate that BAR outperforms the state-of-the-art weakly supervised method and even beats some competitive fully supervised ones.
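The refinement idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the action set, the step size, the sign-based reward, and the greedy selection loop (used here in place of a learned RL policy) are all assumptions made for illustration. The `toy_evaluator` is also hypothetical; it stands in for the cross-modal alignment evaluator and peeks at a hidden target segment only so the example can run without a trained model.

```python
from typing import Callable, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds

# Hypothetical action set for adjusting a temporal boundary.
ACTIONS = ("shift_left", "shift_right", "extend_start",
           "extend_end", "shrink_start", "shrink_end")


def apply_action(seg: Segment, action: str, step: float, duration: float) -> Segment:
    """Return the segment after one boundary adjustment, clipped to the video."""
    start, end = seg
    if action == "shift_left":
        start, end = start - step, end - step
    elif action == "shift_right":
        start, end = start + step, end + step
    elif action == "extend_start":
        start -= step
    elif action == "extend_end":
        end += step
    elif action == "shrink_start":
        start += step
    elif action == "shrink_end":
        end -= step
    else:
        raise ValueError(f"unknown action: {action}")
    start, end = max(0.0, start), min(duration, end)
    if end - start < step:  # reject degenerate segments
        return seg
    return (start, end)


def refinement_reward(score_before: float, score_after: float) -> float:
    """Assumed reward: +1 if the alignment score improved, otherwise -1."""
    return 1.0 if score_after > score_before else -1.0


def greedy_refine(seg: Segment, duration: float,
                  evaluator: Callable[[Segment], float],
                  step: float = 1.0, max_steps: int = 50) -> Segment:
    """Greedy stand-in for a learned policy: take the best-scoring action
    each step and stop once no action earns a positive reward."""
    score = evaluator(seg)
    for _ in range(max_steps):
        candidates = [apply_action(seg, a, step, duration) for a in ACTIONS]
        best = max(candidates, key=evaluator)
        best_score = evaluator(best)
        if refinement_reward(score, best_score) <= 0:
            break
        seg, score = best, best_score
    return seg


def toy_evaluator(seg: Segment, target: Segment = (10.0, 20.0)) -> float:
    """Hypothetical alignment score: temporal IoU with a hidden target.
    A real evaluator would score segment-query alignment without boundaries."""
    inter = max(0.0, min(seg[1], target[1]) - max(seg[0], target[0]))
    union = (seg[1] - seg[0]) + (target[1] - target[0]) - inter
    return inter / union if union > 0 else 0.0


refined = greedy_refine((0.0, 30.0), duration=40.0, evaluator=toy_evaluator)
print(refined)
```

With this toy evaluator, the initial segment (0.0, 30.0) is trimmed one second at a time until it reaches (10.0, 20.0). Unlike a sliding-window search, the number of evaluated segments grows with the number of refinement steps rather than with the number of candidate windows.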
