Invert4TVG: A Temporal Video Grounding Framework with Inversion Tasks for Enhanced Action Understanding
This work addresses robust action understanding in video analysis, with applications such as video retrieval and surveillance. It proposes a new method for a known bottleneck rather than an incremental refinement.
The paper tackles Temporal Video Grounding (TVG), where current methods overfit to localization metrics at the expense of semantic action understanding. It introduces Invert4TVG, a framework that integrates inversion tasks to improve both localization and semantics, yielding a 7.1% improvement in R1@0.7 on Charades-STA over Time-R1 for a 3B model.
Temporal Video Grounding (TVG) seeks to localize video segments matching a given textual query. Current methods, while optimizing for high temporal Intersection-over-Union (IoU), often overfit to this metric, compromising semantic action understanding in the video and query, a critical factor for robust TVG. To address this, we introduce Inversion Tasks for TVG (Invert4TVG), a novel framework that enhances both localization accuracy and action understanding without additional data. Our approach leverages three inversion tasks derived from existing TVG annotations: (1) Verb Completion, predicting masked action verbs in queries from video segments; (2) Action Recognition, identifying query-described actions; and (3) Video Description, generating descriptions of video segments that explicitly embed query-relevant actions. These tasks, integrated with TVG via a reinforcement learning framework with well-designed reward functions, ensure balanced optimization of localization and semantics. Experiments show our method outperforms state-of-the-art approaches, achieving a 7.1% improvement in R1@0.7 on Charades-STA for a 3B model compared to Time-R1. By inverting TVG to derive query-related actions from segments, our approach strengthens semantic understanding, significantly raising the ceiling of localization accuracy.
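For readers unfamiliar with the metrics named above, the following is a minimal sketch of how temporal IoU and R1@0.7 are commonly computed. It is not taken from the paper: the function names `temporal_iou` and `recall_at_1` are hypothetical, segments are assumed to be (start, end) pairs in seconds, and a prediction is assumed to count as a hit when its IoU is greater than or equal to the threshold (some evaluation scripts use a strict inequality).

```python
from typing import Sequence, Tuple

# Assumption: a segment is a (start_sec, end_sec) pair with start <= end.
Segment = Tuple[float, float]


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection-over-Union of two 1-D time intervals."""
    # Overlap length, clamped at zero for disjoint segments.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    # Union = sum of both lengths minus the overlap counted twice.
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(
    preds: Sequence[Segment],
    gts: Sequence[Segment],
    threshold: float = 0.7,
) -> float:
    """Fraction of queries whose top-1 prediction reaches the IoU threshold.

    With threshold=0.7 this corresponds to the R1@0.7 figure quoted above,
    under the hit convention stated in the lead-in (IoU >= threshold).
    """
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(preds)


if __name__ == "__main__":
    # Two toy queries: the first prediction overlaps well, the second poorly.
    predictions = [(2.0, 10.0), (0.0, 10.0)]
    ground_truth = [(0.0, 10.0), (5.0, 15.0)]
    print(recall_at_1(predictions, ground_truth, threshold=0.7))
```

In the toy example, the first pair has an IoU of 0.8 and the second an IoU of about 0.33, so only one of the two queries counts as a hit at the 0.7 threshold. A metric of this form rewards boundary overlap alone, which is consistent with the abstract's concern that optimizing for it does not by itself ensure the model has understood the queried action.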