CV · LG · MM · SD · AS · Jun 4, 2025

Sounding that Object: Interactive Object-Aware Image to Audio Generation

arXiv:2506.04214v1 · h-index: 25 · ICML
Originality: Highly original
AI Analysis

This work addresses multi-object sound generation for interactive audio-visual applications. It is an incremental improvement that applies a novel method to a known bottleneck.

The paper tackles the problem of generating accurate sounds for complex audio-visual scenes that contain multiple objects. It proposes an interactive object-aware audio generation model that grounds sound in user-selected visual objects. Quantitative and qualitative evaluations show better alignment between objects and their associated sounds.

Generating accurate sounds for complex audio-visual scenes is challenging, especially in the presence of multiple objects and sound sources. In this paper, we propose an interactive object-aware audio generation model that grounds sound generation in user-selected visual objects within images. Our method integrates object-centric learning into a conditional latent diffusion model, which learns to associate image regions with their corresponding sounds through multi-modal attention. At test time, our model employs image segmentation to allow users to interactively generate sounds at the object level. We theoretically validate that our attention mechanism functionally approximates test-time segmentation masks, ensuring the generated audio aligns with selected objects. Quantitative and qualitative evaluations show that our model outperforms baselines, achieving better alignment between objects and their associated sounds. Project page: https://tinglok.netlify.app/files/avobject/
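The relationship the abstract describes between learned attention and test-time segmentation masks can be illustrated with a toy sketch. This is not the authors' implementation. It assumes that the conditioning signal is a weighted sum of image-patch features, that the weights come from softmax attention during training, and that a normalized binary mask over the user-selected patches can stand in for those weights at test time. The function names (`attention_condition`, `mask_condition`) and the tiny feature vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(weights, patch_feats):
    """Combine patch feature vectors using one weight per patch."""
    dim = len(patch_feats[0])
    return [sum(w * p[k] for w, p in zip(weights, patch_feats)) for k in range(dim)]


def attention_condition(query, patch_feats):
    """Assumed training-style path: scaled dot-product attention over patches."""
    scale = math.sqrt(len(query))
    scores = [sum(q * x for q, x in zip(query, p)) / scale for p in patch_feats]
    weights = softmax(scores)
    return weighted_sum(weights, patch_feats), weights


def mask_condition(mask, patch_feats):
    """Assumed test-time path: a normalized binary mask replaces the attention weights."""
    total = sum(mask)
    if total == 0:
        raise ValueError("mask selects no patches")
    weights = [m / total for m in mask]
    return weighted_sum(weights, patch_feats), weights


# Four toy patches with 2-D features; the user selects patches 1 and 2.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
context, weights = mask_condition([0, 1, 1, 0], patches)
print(context)  # [0.5, 1.0]
print(weights)  # [0.0, 0.5, 0.5, 0.0]
```

In both paths the weights are non-negative and sum to one, so the mask path yields the same kind of conditioning vector as the attention path while restricting it to the selected object. That is the sense in which a mask could substitute for attention in this sketch.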

