CV · AI · MM · Mar 29, 2023

Sounding Video Generator: A Unified Framework for Text-guided Sounding Video Generation

arXiv:2303.16541v1 · 25 citations · h-index: 65
Originality: Incremental advance
AI Analysis

The paper addresses a rarely investigated multi-modal video generation problem that matters for applications requiring audio-visual synthesis, though it appears incremental because it builds on existing tokenization and Transformer techniques.

The paper tackles the problem of generating realistic videos with synchronized audio from text descriptions. It proposes the Sounding Video Generator (SVG) framework, which the authors report outperforms existing text-to-video and audio generation methods on the Kinetics and VAS datasets.

As a combination of visual and audio signals, video is inherently multi-modal. However, existing video generation methods are primarily intended for the synthesis of visual frames, whereas audio signals in realistic videos are disregarded. In this work, we concentrate on the rarely investigated problem of text-guided sounding video generation and propose the Sounding Video Generator (SVG), a unified framework for generating realistic videos along with audio signals. Specifically, we present SVG-VQGAN to transform visual frames and audio mel-spectrograms into discrete tokens. SVG-VQGAN applies a novel hybrid contrastive learning method to model inter-modal and intra-modal consistency and improve the quantized representations. A cross-modal attention module is employed to extract associated features of visual frames and audio signals for contrastive learning. Then, a Transformer-based decoder is used to model associations between texts, visual frames, and audio signals at the token level for auto-regressive sounding video generation. AudioSetCap, a human-annotated text-video-audio paired dataset, is produced for training SVG. Experimental results demonstrate the superiority of our method when compared with existing text-to-video generation methods as well as audio generation methods on the Kinetics and VAS datasets.
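
The tokenization step described in the abstract can be illustrated with a minimal sketch. It is not the paper's SVG-VQGAN: it shows only generic nearest-codebook vector quantization, followed by one possible way to place text, visual, and audio tokens in a single sequence for an auto-regressive decoder. The function names `quantize` and `build_sequence`, the toy codebooks, and the offset-based vocabulary layout are all assumptions made for illustration.

```python
from typing import List, Sequence


def quantize(features: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each feature vector to the index of its nearest codebook entry.

    Generic vector quantization (squared Euclidean distance); the paper's
    encoder and learned codebooks are not reproduced here.
    """
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(f, codebook[k]))
        for f in features
    ]


def build_sequence(text_ids: Sequence[int],
                   visual_ids: Sequence[int],
                   audio_ids: Sequence[int],
                   text_vocab: int,
                   visual_vocab: int) -> List[int]:
    """Concatenate the three modalities into one token sequence.

    Assumed layout (hypothetical): each modality gets a disjoint id range
    by adding an offset, ordered as text, then visual, then audio.
    """
    visual_offset = text_vocab
    audio_offset = text_vocab + visual_vocab
    return (
        list(text_ids)
        + [visual_offset + i for i in visual_ids]
        + [audio_offset + i for i in audio_ids]
    )


# Toy data standing in for encoder outputs and learned codebooks.
visual_codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
audio_codebook = [[0.0], [1.0]]

visual_ids = quantize([[0.1, -0.1], [1.9, 0.2]], visual_codebook)
audio_ids = quantize([[0.8], [0.2]], audio_codebook)
sequence = build_sequence([5, 7], visual_ids, audio_ids,
                          text_vocab=10, visual_vocab=3)
print(sequence)
```

Under these assumptions, a decoder trained on such sequences would predict each token from the ones before it, so the text tokens condition the visual and audio tokens that follow.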

Code Implementations: 1 repo
