CL, AI, CV · Apr 24, 2017

Multi-Task Video Captioning with Video and Entailment Generation

arXiv:1704.07489v2 · 121 citations
Originality Highly original
AI Analysis

This work addresses the problem of generating accurate video descriptions for applications such as accessibility and content indexing, though it is incremental in that it builds on existing multi-task approaches.

The paper tackles the challenge of accurately learning temporal and logical dynamics in video captioning by sharing knowledge with video prediction and entailment generation tasks through a multi-task learning model, achieving significant improvements and new state-of-the-art results on standard datasets.

Video captioning, the task of describing the content of a video, has seen some promising improvements in recent years with sequence-to-sequence models, but accurately learning the temporal and logical dynamics involved in the task still remains a challenge, especially given the lack of sufficient annotated data. We improve video captioning by sharing knowledge with two related directed-generation tasks: a temporally-directed unsupervised video prediction task to learn richer context-aware video encoder representations, and a logically-directed language entailment generation task to learn better video-entailed caption decoder representations. For this, we present a many-to-many multi-task learning model that shares parameters across the encoders and decoders of the three tasks. We achieve significant improvements and the new state-of-the-art on several standard video captioning datasets using diverse automatic and human evaluations. We also show mutual multi-task improvements on the entailment generation task.
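
The parameter sharing described in the abstract can be pictured with a small sketch. It assumes the pairing the abstract implies: the video encoder is shared between video prediction and captioning, and the language decoder is shared between captioning and entailment generation. The module names and the `shared_modules` helper are hypothetical placeholders for illustration, not the authors' code, and each string merely stands in for a set of trainable parameters.

```python
# Hypothetical task layout: each task names the encoder and decoder it uses.
# A module name that appears under more than one task is a shared parameter set.
TASKS = {
    "video_prediction": {"encoder": "video_encoder", "decoder": "video_decoder"},
    "video_captioning": {"encoder": "video_encoder", "decoder": "language_decoder"},
    "entailment_generation": {"encoder": "language_encoder", "decoder": "language_decoder"},
}


def shared_modules(tasks):
    """Map each module used by more than one task to the tasks that use it."""
    users = {}
    for task, parts in tasks.items():
        for module in parts.values():
            users.setdefault(module, []).append(task)
    return {module: names for module, names in users.items() if len(names) > 1}


if __name__ == "__main__":
    for module, names in shared_modules(TASKS).items():
        print(module, "is shared by", ", ".join(names))
```

In this layout, captioning is the only task that touches both shared modules, which is why gradients from the two auxiliary tasks can each shape one half of the captioning model.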

