Diverse Video Captioning Through Latent Variable Expansion
This work addresses the need for diverse video descriptions in computer vision. The contribution is incremental: it builds on existing captioning methods, shifting the focus from accuracy alone to diversity.
The paper tackles the problem of generating diverse video captions by proposing a framework that feeds the intermediate latent variables of an encoder-decoder process into a conditional GAN; on benchmark datasets, it reports superior results compared to state-of-the-art methods.
Automatically describing video content with text is a challenging but important task that has attracted considerable attention in the computer vision community. Previous works mainly strive for the accuracy of the generated sentences while ignoring sentence diversity, which is inconsistent with human behavior. In this paper, we aim to caption each video with multiple descriptions and propose a novel framework for doing so. Concretely, for a given video, the intermediate latent variables of the conventional encode-decode process are used as input to a conditional generative adversarial network (CGAN) with the purpose of generating diverse sentences. We adopt different Convolutional Neural Networks (CNNs) as our generator, which produces descriptions conditioned on the latent variables, and as our discriminator, which assesses the quality of the generated sentences. In addition, a novel DCE metric is designed to assess the diversity of the captions. We evaluate our method on benchmark datasets, where it demonstrates its ability to generate diverse descriptions and achieves superior results against other state-of-the-art methods.
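As a rough illustration of the conditioning idea only, the toy sketch below holds one latent vector fixed (standing in for the latent variables of a video) and varies a random noise vector to obtain several candidate token sequences. Everything in it is an assumption made for illustration: the linear scorer, the vocabulary, the dimensions, and the names `make_generator` and `sample_captions` are hypothetical. It is not the CNN generator or discriminator described above, and no adversarial training is shown.

```python
import random

# Hypothetical toy vocabulary; not taken from any dataset.
VOCAB = ["a", "man", "woman", "is", "playing", "guitar", "singing", "on", "stage"]


def make_generator(latent_dim, noise_dim, seq_len, vocab_size, seed=0):
    """Build a toy generator with fixed random weights.

    Assumption: a linear map stands in for the generator network. Each output
    position has its own weight matrix that scores every vocabulary entry
    from the concatenated [latent; noise] input.
    """
    rng = random.Random(seed)
    in_dim = latent_dim + noise_dim
    weights = [
        [[rng.gauss(0, 1) for _ in range(in_dim)] for _ in range(vocab_size)]
        for _ in range(seq_len)
    ]

    def generate(latent, noise):
        # Condition on the latent code by concatenating it with the noise.
        x = list(latent) + list(noise)
        tokens = []
        for step_weights in weights:
            logits = [sum(w * v for w, v in zip(row, x)) for row in step_weights]
            # Greedy choice: index of the highest-scoring vocabulary entry.
            tokens.append(max(range(vocab_size), key=logits.__getitem__))
        return tokens

    return generate


def sample_captions(generate, latent, noise_dim, n, seed=1):
    """Draw n noise vectors for one fixed latent code and decode each."""
    rng = random.Random(seed)
    captions = []
    for _ in range(n):
        noise = [rng.gauss(0, 1) for _ in range(noise_dim)]
        captions.append(generate(latent, noise))
    return captions


if __name__ == "__main__":
    gen = make_generator(latent_dim=4, noise_dim=3, seq_len=5, vocab_size=len(VOCAB))
    video_latent = [0.5, -1.0, 0.25, 2.0]  # placeholder for an encoded video
    for token_ids in sample_captions(gen, video_latent, noise_dim=3, n=4):
        print(" ".join(VOCAB[t] for t in token_ids))
```

Because the weights are random and untrained, the printed sequences are not meaningful sentences; the sketch only shows that the same latent code can yield different outputs when the noise input changes.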