SimBase: A Simple Baseline for Temporal Video Grounding
This work provides a simple baseline for researchers in video understanding, potentially streamlining evaluations and inspiring new ideas, though it is incremental in method design.
The paper tackles temporal video grounding by proposing SimBase, a simplified network that uses lightweight temporal convolutions and an element-wise product for multimodal fusion, achieving state-of-the-art results on two large-scale datasets.
This paper presents SimBase, a simple yet effective baseline for temporal video grounding. While recent advances in temporal grounding have led to impressive performance, they have also driven network architectures toward greater complexity, with a range of methods to (1) capture temporal relationships and (2) achieve effective multimodal fusion. In contrast, this paper explores the question: How effective can a simplified approach be? To investigate, we design SimBase, a network that leverages lightweight, one-dimensional temporal convolutional layers instead of complex temporal structures. For cross-modal interaction, SimBase employs only an element-wise product instead of intricate multimodal fusion. Remarkably, SimBase achieves state-of-the-art results on two large-scale datasets. We hope that SimBase, as a simple yet powerful baseline, will spark new ideas and streamline future evaluations in temporal video grounding.
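The two building blocks named in the abstract, a one-dimensional convolution over time and an element-wise product between video and query features, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `temporal_conv1d` and `fuse_elementwise`, the fixed smoothing kernel (a real model would learn its weights, typically one set per channel), the zero padding, and the toy feature values are all assumptions made for illustration.

```python
from typing import List

Vector = List[float]


def temporal_conv1d(clips: List[Vector], kernel: List[float]) -> List[Vector]:
    """Slide one kernel along the time axis, applied to each channel independently.

    Zero padding keeps the output as long as the input.
    Assumption for this sketch: the kernel is fixed and shared across channels.
    """
    if len(kernel) % 2 == 0:
        raise ValueError("kernel length must be odd")
    half = len(kernel) // 2
    num_clips, dim = len(clips), len(clips[0])
    out = []
    for t in range(num_clips):
        row = [0.0] * dim
        for k, w in enumerate(kernel):
            src = t + k - half
            if 0 <= src < num_clips:  # positions outside the video act as zeros
                for d in range(dim):
                    row[d] += w * clips[src][d]
        out.append(row)
    return out


def fuse_elementwise(clips: List[Vector], query: Vector) -> List[Vector]:
    """Multiply every clip feature by the query feature, dimension by dimension."""
    if any(len(clip) != len(query) for clip in clips):
        raise ValueError("clip and query features must have the same dimension")
    return [[c * q for c, q in zip(clip, query)] for clip in clips]


# Toy input: 4 clips with 2-dimensional features and one 2-dimensional query.
video = [[1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [4.0, 1.0]]
query = [0.5, 2.0]

smoothed = temporal_conv1d(video, kernel=[0.25, 0.5, 0.25])
fused = fuse_elementwise(smoothed, query)
print(fused)  # [[0.5, 0.5], [1.0, 1.0], [1.5, 1.0], [1.375, 1.0]]
```

In this sketch the fused sequence keeps one vector per clip, so a downstream head could score each temporal position against the query. How SimBase turns the fused features into start and end times is not described in the abstract and is left out here.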