CV · CL · IR · Jan 12, 2015

A Dataset for Movie Description

arXiv:1501.02530v1 · 562 citations
Originality: Synthesis-oriented
AI Analysis

This dataset is a valuable resource for computer vision and computational linguistics research, particularly for improving video description models, though the contribution is incremental in that it builds on prior work that used movie scripts.

The authors address the problem of generating video descriptions by introducing a dataset of transcribed Descriptive Video Service (DVS) narration temporally aligned to full-length HD movies. It contains over 54,000 sentences and video snippets from 72 movies. Comparing the two sources, they find that DVS descriptions are more visual than scripts and describe what is actually shown on screen.

Descriptive video service (DVS) provides linguistic descriptions of movies and allows visually impaired people to follow a movie along with their peers. Such descriptions are by design mainly visual and thus naturally form an interesting data source for computer vision and computational linguistics. In this work we propose a novel dataset which contains transcribed DVS, which is temporally aligned to full length HD movies. In addition we also collected the aligned movie scripts which have been used in prior work and compare the two different sources of descriptions. In total the Movie Description dataset contains a parallel corpus of over 54,000 sentences and video snippets from 72 HD movies. We characterize the dataset by benchmarking different approaches for generating video descriptions. Comparing DVS to scripts, we find that DVS is far more visual and describes precisely what is shown rather than what should happen according to the scripts created prior to movie production.

