CV | Oct 7, 2023

HowToCaption: Prompting LLMs to Transform Video Annotations at Scale

IBM, MIT
arXiv:2310.04900v2 | 36 citations | h-index: 137
Originality: Incremental advance
AI Analysis

This work addresses the challenge of obtaining aligned video-text data at scale for multimodal learning, particularly for instructional videos. The contribution is incremental, since it builds on existing LLM capabilities.

The authors tackle noisy supervision in large-scale video datasets by prompting LLMs to generate high-quality video captions from ASR subtitles, which yields significant improvements on zero-shot text-video retrieval and video captioning benchmarks.

Instructional videos are a common source for learning text-video or even multimodal representations by leveraging subtitles extracted from the audio signal of the videos with automatic speech recognition (ASR) systems. However, in contrast to human-annotated captions, both speech and subtitles naturally differ from the visual content of the videos and thus provide only noisy supervision. As a result, large-scale annotation-free web video training data remains sub-optimal for training text-video models. In this work, we propose to leverage the capabilities of large language models (LLMs) to obtain high-quality video descriptions aligned with videos at scale. Specifically, we prompt an LLM to create plausible video captions based on ASR subtitles of instructional videos. To this end, we introduce a prompting method that is able to take into account a longer text of subtitles, allowing us to capture contextual information beyond a single sentence. We further prompt the LLM to generate timestamps for each produced caption based on the timestamps of the subtitles and finally align the generated captions to the video temporally. In this way, we obtain human-style video captions at scale without human supervision. We apply our method to the subtitles of the HowTo100M dataset, creating a new large-scale dataset, HowToCaption. Our evaluation shows that the resulting captions not only significantly improve performance on many different benchmark datasets for zero-shot text-video retrieval and video captioning, but also lead to a disentangling of textual narration from the audio, boosting performance in text-video-audio tasks.
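
The prompting steps described in the abstract (feeding the LLM a longer block of timestamped subtitles, then reading back timestamped captions) could be sketched roughly as follows. This is only an illustration under stated assumptions: the `Subtitle` type, the helper names, the word budget, and above all the prompt wording are hypothetical and are not taken from the paper or its code release. The LLM call itself is left out, and so is the final temporal alignment step, which the abstract does not detail.

```python
import re
from dataclasses import dataclass


@dataclass
class Subtitle:
    start: float  # seconds from the start of the video
    end: float    # seconds from the start of the video
    text: str


def chunk_subtitles(subs, max_words=200):
    """Group consecutive subtitles into blocks of at most max_words words.

    The word budget is an assumed value, chosen only for illustration.
    A single subtitle longer than the budget still forms its own block.
    """
    blocks, current, count = [], [], 0
    for sub in subs:
        n = len(sub.text.split())
        if current and count + n > max_words:
            blocks.append(current)
            current, count = [], 0
        current.append(sub)
        count += n
    if current:
        blocks.append(current)
    return blocks


def build_prompt(block):
    """Build a prompt from one block. Hypothetical wording, not the paper's prompt."""
    lines = [f"{int(s.start)}s: {s.text}" for s in block]
    return (
        "Below are timestamped ASR subtitles from an instructional video.\n"
        "Write short captions describing what is likely shown, one per line, "
        "each in the form '<seconds>s: <caption>'.\n\n" + "\n".join(lines)
    )


# Matches lines such as "12s: Pouring batter into the pan".
CAPTION_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)s:\s*(.+?)\s*$")


def parse_captions(response):
    """Extract (timestamp_seconds, caption) pairs, skipping non-matching lines."""
    pairs = []
    for line in response.splitlines():
        m = CAPTION_RE.match(line)
        if m:
            pairs.append((float(m.group(1)), m.group(2)))
    return pairs
```

In such a setup, each block from `chunk_subtitles` would be turned into a prompt, sent to an LLM of choice, and the reply passed to `parse_captions`; lines in the reply that do not follow the requested format are simply dropped.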

Code Implementations: 1 repo
