FoleyGRAM: Video-to-Audio Generation with GRAM-Aligned Multimodal Encoders
This work addresses the generation of semantically aligned audio from video, with applications in multimedia and AI; it is an incremental advance over prior video-to-audio methods.
The paper introduces FoleyGRAM, which uses GRAM-aligned multimodal encoders to give semantic control over video-to-audio generation, and reports improved semantic alignment between generated audio and video content on the Greatest Hits dataset.
In this work, we present FoleyGRAM, a novel approach to video-to-audio generation that emphasizes semantic conditioning through the use of aligned multimodal encoders. Building on prior advancements in video-to-audio generation, FoleyGRAM leverages the Gramian Representation Alignment Measure (GRAM) to align embeddings across video, text, and audio modalities, enabling precise semantic control over the audio generation process. The core of FoleyGRAM is a diffusion-based audio synthesis model conditioned on GRAM-aligned embeddings and waveform envelopes, ensuring both semantic richness and temporal alignment with the corresponding input video. We evaluate FoleyGRAM on the Greatest Hits dataset, a standard benchmark for video-to-audio models. Our experiments demonstrate that aligning multimodal encoders using GRAM enhances the system's ability to semantically align generated audio with video content, advancing the state of the art in video-to-audio synthesis.
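To make the alignment measure concrete, the following is a minimal sketch of the quantity a Gramian alignment measure is usually built on: the volume of the parallelotope spanned by the normalized modality embeddings, computed from the determinant of their Gram matrix. The abstract does not give the formula, so this definition is an assumption. The function name `gram_volume` and the toy vectors are hypothetical, and neither the training loss nor the encoders are shown. A smaller volume means the video, text, and audio embeddings point in more similar directions.

```python
import numpy as np


def gram_volume(embeddings: np.ndarray) -> float:
    """Volume of the parallelotope spanned by the rows of `embeddings`.

    embeddings: array of shape (k, d), one row per modality
    (for example video, text, audio). Hypothetical helper, not the
    authors' implementation.
    """
    # L2-normalise each row so that only the directions matter.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Gram matrix of pairwise inner products, shape (k, k).
    gram = unit @ unit.T
    # det(G) is the squared volume; clip small negatives caused by round-off.
    return float(np.sqrt(max(np.linalg.det(gram), 0.0)))


# Toy embeddings (assumed values, for illustration only).
orthogonal = np.eye(3, 4)                       # three mutually orthogonal vectors
identical = np.array([[1.0, 2.0, 3.0, 4.0]] * 3)  # three copies of one vector

vol_orthogonal = gram_volume(orthogonal)  # maximally misaligned: volume 1
vol_identical = gram_volume(identical)    # perfectly aligned: volume close to 0
```

Under this definition, training encoders to shrink the volume for matching video, text, and audio triplets pulls all three modalities together at once, rather than aligning them pair by pair.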