CV, Jan 22, 2021

A Closer Look at Temporal Sentence Grounding in Videos: Dataset and Metric

Yitian Yuan, Xiaohan Lan, Xin Wang, Long Chen, Zhi Wang, Wenwu Zhu

arXiv:2101.09028v3 20.272 citations Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses benchmarking issues in Temporal Sentence Grounding in Videos (TSGV), a critical task for video understanding, by exposing flaws in current evaluation practices and providing improved evaluation tools. It is incremental in that it focuses on assessment rather than on new modeling.

The paper identifies that existing evaluation protocols for Temporal Sentence Grounding in Videos (TSGV) are unreliable due to dataset biases and flawed metrics, and proposes reorganized dataset splits and a new metric (dR@n,IoU@m) that better assess model performance, as demonstrated by experiments on eight state-of-the-art methods.

Temporal Sentence Grounding in Videos (TSGV), i.e., grounding a natural language sentence that describes complex human activities in a long, untrimmed video sequence, has received unprecedented attention over the last few years. Although each newly proposed method plausibly achieves better performance than previous ones, current TSGV models still tend to capture the moment annotation biases and fail to take full advantage of multi-modal inputs. Even more surprisingly, several extremely simple baselines that require no training can also achieve state-of-the-art performance. In this paper, we take a closer look at the existing evaluation protocols for TSGV and find that both the prevailing dataset splits and the evaluation metrics cause unreliable benchmarking. To this end, we propose to re-organize two widely used TSGV benchmarks (ActivityNet Captions and Charades-STA). Specifically, we deliberately make the ground-truth moment distribution different in the training and test splits, i.e., out-of-distribution (OOD) testing. Meanwhile, we introduce a new evaluation metric, dR@n,IoU@m, which calibrates the basic IoU scores by penalizing bias-influenced moment predictions and alleviates the inflated evaluations caused by dataset annotation biases such as overlong ground-truth moments. Under our new evaluation protocol, we conduct extensive experiments and ablation studies on eight state-of-the-art TSGV methods. All the results demonstrate that the re-organized dataset splits and the new metric can better monitor progress in TSGV. Our re-organized datasets are available at https://github.com/yytzsy/grounding_changing_distribution.
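
The abstract does not spell out how dR@n,IoU@m discounts a hit, so the following is only a minimal sketch for n = 1. It assumes each hit (IoU at or above the threshold m) is scaled by two factors, one minus the start-boundary error and one minus the end-boundary error, with both errors normalized by the video duration. The function names and the exact discount form are illustrative assumptions and should be checked against the paper and the released code.

```python
from typing import List, Tuple

Moment = Tuple[float, float]  # (start, end) in seconds


def temporal_iou(pred: Moment, gt: Moment) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def discounted_recall_at_1(
    preds: List[Moment],
    gts: List[Moment],
    durations: List[float],
    iou_threshold: float,
) -> float:
    """Sketch of a discounted R@1,IoU@m (assumed form, see lead-in).

    A plain R@1,IoU@m would add 1.0 for every hit. Here each hit is
    scaled by how close the predicted boundaries are to the ground truth.
    """
    total = 0.0
    for pred, gt, duration in zip(preds, gts, durations):
        if temporal_iou(pred, gt) < iou_threshold:
            continue  # a miss contributes nothing, as in plain recall
        # Assumed discount: 1 - normalized boundary distance.
        alpha_start = 1.0 - abs(pred[0] - gt[0]) / duration
        alpha_end = 1.0 - abs(pred[1] - gt[1]) / duration
        total += alpha_start * alpha_end
    return total / len(gts) if gts else 0.0
```

Under this assumed form, a prediction that clears the IoU threshold only because the ground-truth moment spans most of the video still counts as a hit, but with a weight below 1 that shrinks as its boundaries drift from the annotation.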
