SD · LG · AS · Nov 16, 2022

Conditional variational autoencoder to improve neural audio synthesis for polyphonic music sound

arXiv:2211.08715v1 · 1 citation · h-index: 6
Originality: Incremental advance
AI Analysis

This work addresses a specific limitation in audio synthesis for polyphonic music, representing an incremental improvement to an existing method.

The paper tackles the poor reconstruction of wide-pitch polyphonic music in neural audio synthesis. It enhances the RAVE model with pitch activation data and a conditional variational autoencoder structure, and MUSHRA listening tests indicate better performance and stability than the conventional RAVE model.

Deep generative models for audio synthesis have recently improved significantly. However, modeling raw waveforms remains a difficult problem, especially for audio waveforms and music signals. Recently, the realtime audio variational autoencoder (RAVE) method was developed for high-quality audio waveform synthesis. The RAVE method is based on the variational autoencoder and uses a two-stage training strategy. Unfortunately, the RAVE model is limited in reproducing wide-pitch polyphonic music sound. Therefore, to enhance the reconstruction performance, we adopt pitch activation data as auxiliary information for the RAVE model. To handle the auxiliary information, we propose an enhanced RAVE model with a conditional variational autoencoder structure and an additional fully-connected layer. To evaluate the proposed structure, we conducted a listening experiment based on multiple stimuli with hidden reference and anchor (MUSHRA) using the MAESTRO dataset. The results indicate that the proposed model improves on the conventional RAVE model in both performance and stability.
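The abstract does not say where the pitch activation enters the network, so the following is only a minimal sketch of one common way to condition a variational autoencoder. It assumes the pitch activation is a per-frame piano-roll, projects it through a fully-connected layer, and concatenates the result with the sampled latent before decoding. All names and sizes (`N_PITCHES`, `LATENT_DIM`, `COND_DIM`, `project_pitch`, `condition_latent`) are hypothetical and not taken from the paper or from the RAVE codebase.

```python
import numpy as np

rng = np.random.default_rng(0)

N_PITCHES = 88    # assumption: piano-roll over 88 keys (MAESTRO is piano music)
LATENT_DIM = 16   # hypothetical latent size
COND_DIM = 8      # hypothetical size of the projected pitch embedding


def reparameterize(mean, log_var, rng):
    """Sample z = mean + std * eps (standard VAE reparameterization)."""
    eps = rng.standard_normal(mean.shape)
    return mean + np.exp(0.5 * log_var) * eps


def project_pitch(pitch_activation, weight, bias):
    """Fully-connected layer applied per frame to the pitch activation.

    pitch_activation: (frames, N_PITCHES) binary piano-roll
    returns:          (frames, COND_DIM)
    """
    return np.tanh(pitch_activation @ weight + bias)


def condition_latent(z, pitch_embedding):
    """Concatenate latent and pitch embedding along the feature axis."""
    return np.concatenate([z, pitch_embedding], axis=-1)


frames = 4
# Stand-ins for encoder outputs; a real encoder would produce these from audio.
mean = rng.standard_normal((frames, LATENT_DIM))
log_var = rng.standard_normal((frames, LATENT_DIM)) * 0.1

# A three-note chord held over all frames (key indices are arbitrary).
pitch = np.zeros((frames, N_PITCHES))
pitch[:, [39, 43, 46]] = 1.0

# Randomly initialised weights; in practice these would be learned.
weight = rng.standard_normal((N_PITCHES, COND_DIM)) * 0.1
bias = np.zeros(COND_DIM)

z = reparameterize(mean, log_var, rng)
decoder_input = condition_latent(z, project_pitch(pitch, weight, bias))
print(decoder_input.shape)  # (4, 24)
```

The decoder would then consume `decoder_input` in place of `z` alone, so the pitch content is supplied explicitly and does not have to be encoded entirely in the latent.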

