CV · Dec 21, 2023

Multi-Sentence Grounding for Long-term Instructional Video

Zeqian Li, Qirui Chen, Tengda Han, Ya Zhang, Yanfeng Wang, Weidi Xie

arXiv:2312.14055v2 · 11.014 citations · h-index: 32 · ECCV

Originality: Incremental advance

AI Analysis

This work addresses the challenge of aligning multiple descriptive sentences to video segments in instructional videos. The advance is incremental: it builds on existing methods, with specific improvements in dataset quality and model architecture.

The authors tackled the problem of noisy instructional video datasets by creating HowToStep, a high-quality video-text dataset with multiple descriptive steps, and proposed a Transformer-based model for multi-sentence grounding that achieved state-of-the-art performance with improvements of 9.0% on HT-Step, 5.1% on HTM-Align, and 1.9% on CrossTask.

In this paper, we aim to establish an automatic, scalable pipeline for denoising the large-scale instructional dataset and to construct a high-quality video-text dataset with multiple descriptive steps as supervision, named HowToStep. We make the following contributions: (i) improving the quality of sentences in the dataset by upgrading the ASR system to reduce speech recognition errors and prompting a large language model to transform noisy ASR transcripts into descriptive steps; (ii) proposing a Transformer-based architecture with all texts as queries, iteratively attending to the visual features, to temporally align the generated steps to their corresponding video segments. To measure the quality of our curated dataset, we train models on it for the task of multi-sentence grounding, i.e., given a long-form video and multiple associated sentences, determining their corresponding timestamps in the video simultaneously. As a result, the model shows superior performance on a series of multi-sentence grounding tasks, surpassing existing state-of-the-art methods by a significant margin on three public benchmarks: 9.0% on HT-Step, 5.1% on HTM-Align, and 1.9% on CrossTask. All code, models, and the resulting dataset have been publicly released.
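The idea of using all sentences as queries that iteratively attend to the visual features can be illustrated with a minimal sketch. This is not the authors' released model: it has no learned projections, no training, and only plain dot-product attention. The function names (`cross_attend`, `ground_sentences`), the `num_layers` and `fps` parameters, and the choice of the single best-scoring frame as the predicted timestamp are all assumptions made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attend(text_queries, video_feats):
    """One cross-attention pass: every sentence query attends over all frames.

    Returns the attended features (one vector per sentence) and the
    attention weights (sentences x frames). Hypothetical helper.
    """
    dim = len(video_feats[0])
    scale = math.sqrt(dim)
    attn = [softmax([dot(q, v) / scale for v in video_feats])
            for q in text_queries]
    attended = [
        [sum(w * v[d] for w, v in zip(row, video_feats)) for d in range(dim)]
        for row in attn
    ]
    return attended, attn


def ground_sentences(text_queries, video_feats, num_layers=2, fps=1.0):
    """Ground all sentences at once (illustrative, untrained).

    The queries are refined by repeated cross-attention with a residual
    connection, then scored against every frame. The predicted timestamp
    for a sentence is the time of its best-scoring frame (an assumption).
    """
    queries = [list(q) for q in text_queries]
    for _ in range(num_layers):
        attended, _ = cross_attend(queries, video_feats)
        queries = [[q + a for q, a in zip(q_row, a_row)]
                   for q_row, a_row in zip(queries, attended)]
    # Alignment matrix: one row per sentence, one column per frame.
    scores = [[dot(q, v) for v in video_feats] for q in queries]
    timestamps = [max(range(len(row)), key=row.__getitem__) / fps
                  for row in scores]
    return scores, timestamps


if __name__ == "__main__":
    # Toy features: three frames and two sentences in a shared 3-d space.
    frames = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    sentences = [[0.0, 0.0, 2.0], [2.0, 0.0, 0.0]]
    _, times = ground_sentences(sentences, frames)
    print(times)
```

In this toy setup each sentence vector points along the axis of exactly one frame, so the alignment matrix peaks at that frame. A real system would use learned text and video encoders and predict temporal segments rather than a single frame index.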
