CV, LG · Oct 26, 2019

Diverse Video Captioning Through Latent Variable Expansion

arXiv:1910.12019v6 · 15 citations

Originality: Incremental advance

AI Analysis

This work addresses the need for diverse video descriptions in computer vision. The advance is incremental: it builds on existing captioning methods, shifting the focus from accuracy alone to diversity.

The paper tackles the problem of generating diverse video captions by proposing a framework that feeds the latent variables of an encoder-decoder process into a conditional GAN. The authors report superior results on benchmark datasets compared to state-of-the-art methods.

Automatically describing video content with text is a challenging but important task that has attracted a lot of attention in the computer vision community. Previous works mainly strive for accuracy of the generated sentences while ignoring sentence diversity, which is inconsistent with human behavior. In this paper, we aim to caption each video with multiple descriptions and propose a novel framework. Concretely, for a given video, the intermediate latent variables of the conventional encode-decode process are used as input to a conditional generative adversarial network (CGAN) in order to generate diverse sentences. We adopt different Convolutional Neural Networks (CNNs) as our generator, which produces descriptions conditioned on the latent variables, and our discriminator, which assesses the quality of the generated sentences. In addition, we design a novel DCE metric to evaluate the diverse captions. We evaluate our method on benchmark datasets, where it demonstrates the ability to generate diverse descriptions and achieves superior results against other state-of-the-art methods.
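To make the notion of caption diversity concrete, here is a minimal sketch of a generic diversity score: the ratio of unique n-grams to total n-grams across the captions generated for one video (often called distinct-n). This is not the DCE metric, which the abstract does not define; the function name `distinct_n`, the whitespace tokenization, and the example captions are all assumptions made for illustration.

```python
from typing import List


def distinct_n(captions: List[str], n: int = 2) -> float:
    """Ratio of unique n-grams to total n-grams across a set of captions.

    Illustrative stand-in only: this is NOT the paper's DCE metric.
    """
    total = 0
    unique = set()
    for caption in captions:
        # Assumption: lowercase + whitespace split is an adequate tokenizer here.
        tokens = caption.lower().split()
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    # No n-grams at all (e.g. empty input): report zero diversity.
    return len(unique) / total if total else 0.0


# Hypothetical captions for a single video.
identical = ["a man is cooking", "a man is cooking"]
varied = ["a man is cooking", "someone prepares food in a kitchen"]

print(distinct_n(identical), distinct_n(varied))
```

A set of repeated captions scores lower than a set of differently worded ones, which is the behavior an accuracy-only metric does not capture.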
