SD AI LG AS · Sep 5, 2025

Recomposer: Event-roll-guided generative audio editing

Daniel P. W. Ellis, Eduardo Fonseca, Ron J. Weiss, Kevin Wilson, Scott Wisdom, Hakan Erdogan, John R. Hershey, Aren Jansen, R. Channing Moore, Manoj Plakal

DeepMind

arXiv:2509.05256v1 · 2 citations · h-index: 53

Originality: Synthesis-oriented

AI Analysis

This work addresses the challenge of editing overlapping sound sources in real-world audio for applications like audio production or content creation, representing a domain-specific incremental advancement.

The paper tackles the problem of editing individual sound events within complex audio scenes. It introduces a system that uses textual edit descriptions and event timing to delete, insert, or enhance sounds, and its evaluation shows that each component of the edit description (action, class, and timing) matters.

Editing complex real-world sound scenes is difficult because individual sound sources overlap in time. Generative models can fill in missing or corrupted details based on their strong prior understanding of the data domain. We present a system for editing individual sound events within complex scenes that can delete, insert, and enhance individual sound events based on textual edit descriptions (e.g., "enhance Door") and a graphical representation of the event timing derived from an "event roll" transcription. We present an encoder-decoder transformer working on SoundStream representations, trained on synthetic (input, desired output) audio example pairs formed by adding isolated sound events to dense, real-world backgrounds. Evaluation reveals the importance of each part of the edit descriptions: action, class, and timing. Our work demonstrates that "recomposition" is an important and practical application.
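The two ingredients named in the abstract, the event roll and the synthetic (input, desired output) pairs, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the `Event` record, the function names, the frame rate, the binary class-by-frame layout of the roll, and in particular the way each action maps to a pair (for example, that "enhance" pairs a quiet copy of the event with a louder one). Audio is represented as plain lists of samples to keep the example self-contained.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """Hypothetical record for one transcribed sound event."""
    label: str     # sound class, e.g. "Door"
    onset: float   # seconds
    offset: float  # seconds


def event_roll(events, classes, duration, frames_per_second=10):
    """Binary class-by-frame grid: roll[c][t] is 1 while class c is active."""
    n_frames = round(duration * frames_per_second)
    roll = [[0] * n_frames for _ in classes]
    for ev in events:
        row = classes.index(ev.label)
        start = max(0, round(ev.onset * frames_per_second))
        end = min(n_frames, round(ev.offset * frames_per_second))
        for t in range(start, end):
            roll[row][t] = 1
    return roll


def mix(background, event, start, gain=1.0):
    """Add `event` samples into a copy of `background`, beginning at `start`."""
    out = list(background)
    for i, sample in enumerate(event):
        if 0 <= start + i < len(out):
            out[start + i] += gain * sample
    return out


def make_pair(action, background, event, start, low_gain=0.25):
    """Assumed pairing scheme: returns (input_audio, target_audio)."""
    with_event = mix(background, event, start)
    if action == "insert":
        return list(background), with_event
    if action == "delete":
        return with_event, list(background)
    if action == "enhance":
        # Assumption: the input holds a quiet copy, the target a louder one.
        return mix(background, event, start, low_gain), with_event
    raise ValueError(f"unknown action: {action}")
```

Under these assumptions, a training example for the edit "delete Door" would combine the text description, the roll from `event_roll([Event("Door", 0.2, 0.5)], ["Door", "Dog"], 1.0)`, and the audio pair from `make_pair("delete", background, door_clip, start)`. The model described in the abstract operates on SoundStream representations rather than raw samples; that encoding step is omitted here.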
