Learning Source Disentanglement in Neural Audio Codec
This work addresses the challenge of controllability in sound generation for audio processing applications, though its contribution is incremental because it builds on existing neural codec methods.
The paper tackles the problem that neural audio codecs neglect discrepancies between sound domains. It introduces SD-Codec, which combines audio coding and source separation to assign different audio domains to distinct codebooks, achieving competitive resynthesis quality and successful disentanglement in the latent space.
Neural audio codecs have significantly advanced audio compression by efficiently converting continuous audio signals into discrete tokens. These codecs preserve high-quality sound and enable sophisticated sound generation through generative models trained on these tokens. However, existing neural codec models are typically trained on large, undifferentiated audio datasets, neglecting the essential discrepancies between sound domains like speech, music, and environmental sound effects. This oversight complicates data modeling and poses additional challenges to the controllability of sound generation. To tackle these issues, we introduce the Source-Disentangled Neural Audio Codec (SD-Codec), a novel approach that combines audio coding and source separation. By jointly learning audio resynthesis and separation, SD-Codec explicitly assigns audio signals from different domains to distinct codebooks (sets of discrete representations). Experimental results indicate that SD-Codec not only maintains competitive resynthesis quality but also, supported by the separation results, demonstrates successful disentanglement of different sources in the latent space, thereby enhancing interpretability in audio codecs and potentially providing finer control over the audio generation process.
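
The idea of assigning each sound domain to its own codebook can be illustrated with a small sketch. This is not the SD-Codec implementation. It assumes a single-stage nearest-neighbour vector quantizer per domain, random stand-in latents in place of a learned encoder, and a mixture latent formed by summing the per-domain quantized latents. The names `DOMAINS`, `quantize`, and `encode_by_domain` are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical domain names, following the three domains named in the abstract.
DOMAINS = ("speech", "music", "sfx")


def quantize(latents, codebook):
    """Nearest-neighbour vector quantization.

    latents:  array of shape (T, D), one latent vector per frame.
    codebook: array of shape (K, D), one row per discrete code.
    Returns (indices of shape (T,), quantized latents of shape (T, D)).
    """
    # Squared Euclidean distance between every frame and every code: (T, K).
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)
    return indices, codebook[indices]


def encode_by_domain(domain_latents, codebooks):
    """Quantize each domain's latents with that domain's own codebook."""
    tokens, quantized = {}, {}
    for name in DOMAINS:
        tokens[name], quantized[name] = quantize(domain_latents[name], codebooks[name])
    return tokens, quantized


# Stand-in data: a real codec would produce these latents with a learned encoder.
rng = np.random.default_rng(0)
T, D, K = 5, 4, 8  # frames, latent dimension, codes per codebook
codebooks = {name: rng.normal(size=(K, D)) for name in DOMAINS}
domain_latents = {name: rng.normal(size=(T, D)) for name in DOMAINS}

tokens, quantized = encode_by_domain(domain_latents, codebooks)

# Assumption: the mixture is represented by the sum of the per-domain
# quantized latents, while each domain's latent alone represents one source.
mixture_latent = sum(quantized[name] for name in DOMAINS)
```

Because each domain keeps a separate token stream in `tokens`, a downstream model could in principle read or modify one source (for example, only `tokens["speech"]`) without touching the others, which is the kind of control the abstract alludes to.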