CV · LG · Feb 12, 2021

The MSR-Video to Text Dataset with Clean Annotations

arXiv:2102.06448v4 · 18 citations
Originality: Synthesis-oriented
AI Analysis

This work addresses data quality issues in video captioning and is relevant to researchers and practitioners, but it is incremental: it cleans an existing dataset rather than introducing new methods.

The authors tackled the problem of noisy annotations in the MSR-VTT video captioning dataset by removing duplicate captions and fixing grammatical errors. Models trained on the cleaned dataset scored higher on popular quantitative metrics and were rated better in a human evaluation.

Video captioning automatically generates short descriptions of video content, usually in the form of a single sentence. Many methods have been proposed for solving this task. A large dataset called MSR Video to Text (MSR-VTT) is often used as the benchmark for testing the performance of these methods. However, we found that the human annotations, i.e., the descriptions of video contents in the dataset, are quite noisy: for example, there are many duplicate captions, and many captions contain grammatical problems. These problems may make it harder for video captioning models to learn the underlying patterns. We cleaned the MSR-VTT annotations by removing these problems, then tested several typical video captioning models on the cleaned dataset. Experimental results showed that data cleaning boosted the performance of the models as measured by popular quantitative metrics. We also recruited subjects to evaluate the results of a model trained on the original and on the cleaned dataset. This human behavior experiment demonstrated that, when trained on the cleaned dataset, the model generated captions that were more coherent and more relevant to the contents of the video clips.
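As a rough illustration of the duplicate-removal step mentioned in the abstract, the sketch below drops repeated captions within each video. It is not the authors' cleaning pipeline: the `(video_id, caption)` pair layout, the function names, and the normalization rule (lowercasing and collapsing whitespace) are all assumptions made for this example.

```python
from collections import defaultdict


def normalize(caption: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies compare equal."""
    return " ".join(caption.lower().split())


def drop_duplicate_captions(annotations):
    """Keep the first occurrence of each normalized caption per video.

    `annotations` is an iterable of (video_id, caption) pairs. This layout is
    a hypothetical one chosen for illustration, not the dataset's real schema.
    """
    seen = defaultdict(set)  # video_id -> normalized captions already kept
    cleaned = []
    for video_id, caption in annotations:
        key = normalize(caption)
        # Skip empty captions and captions already seen for this video.
        if key and key not in seen[video_id]:
            seen[video_id].add(key)
            cleaned.append((video_id, caption))
    return cleaned
```

Duplicates are detected per video, so the same sentence attached to two different clips is kept for both. Grammatical cleaning, the other problem the abstract names, is not covered by this sketch.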

Code Implementations: 1 repo
