SD, AI, CV, LG, AS
Apr 25, 2024

Synthesizing Audio from Silent Video using Sequence to Sequence Modeling

arXiv:2404.17608v1
Originality: Incremental advance
AI Analysis

This work addresses the need for better audio synthesis in video contexts, such as surveillance and historical media, but is incremental as it builds on existing techniques like VQ-VAE.

The paper tackles the problem of generating audio from silent video by proposing a sequence-to-sequence model that improves on prior methods, achieving enhanced sound diversity and generalization for applications like CCTV analysis and silent movie restoration.

Generating audio from a video's visual context has several practical applications in how we interact with audio-visual media: for example, enhancing CCTV footage analysis, restoring historical videos (e.g., silent movies), and improving video generation models. We propose a novel method to generate audio from video using a sequence-to-sequence model, improving on prior work that used CNNs and WaveNet and struggled with sound diversity and generalization. Our approach employs a 3D Vector Quantized Variational Autoencoder (VQ-VAE) to capture the video's spatial and temporal structure, and decodes with a custom audio decoder to produce a broader range of sounds. Trained on a segment of the YouTube-8M dataset restricted to specific domains, the model targets the applications listed above.
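
The quantization step at the core of any VQ-VAE can be illustrated with a small sketch. This is not the authors' implementation: the function names, the two-dimensional toy vectors, and the three-entry codebook are all assumptions chosen for illustration. It only shows how continuous latent vectors (here standing in for encoded video patches) are mapped to discrete codebook indices, which a sequence model could then consume as tokens.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latents: List[Vector], codebook: List[Vector]) -> List[int]:
    """Map each latent vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(z, codebook[k]))
        for z in latents
    ]


def lookup(indices: List[int], codebook: List[Vector]) -> List[Vector]:
    """Replace each discrete index with its codebook vector."""
    return [codebook[i] for i in indices]


# Hypothetical toy data: a 3-entry codebook and 4 encoder outputs.
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 2.0)]
latents = [(0.1, -0.2), (0.9, 1.2), (-0.8, 1.7), (0.6, 0.6)]

tokens = quantize(latents, codebook)
quantized = lookup(tokens, codebook)
print(tokens)  # [0, 1, 2, 1]
```

In a trained model the codebook is learned and the latents come from a 3D convolutional encoder over video frames; the nearest-neighbour lookup itself has the same form as above.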

Code Implementations: 1 repo
