SDNov 16, 2022
Conditional variational autoencoder to improve neural audio synthesis for polyphonic music soundSeokjin Lee, Minhan Kim, Seunghyeon Shin et al.
Deep generative models for audio synthesis have recently been significantly improved. However, the task of modeling raw-waveforms remains a difficult problem, especially for audio waveforms and music signals. Recently, the realtime audio variational autoencoder (RAVE) method was developed for high-quality audio waveform synthesis. The RAVE method is based on the variational autoencoder and utilizes the two-stage training strategy. Unfortunately, the RAVE model is limited in reproducing wide-pitch polyphonic music sound. Therefore, to enhance the reconstruction performance, we adopt the pitch activation data as an auxiliary information to the RAVE model. To handle the auxiliary information, we propose an enhanced RAVE model with a conditional variational autoencoder structure and an additional fully-connected layer. To evaluate the proposed structure, we conducted a listening experiment based on multiple stimulus tests with hidden references and an anchor (MUSHRA) with the MAESTRO. The obtained results indicate that the proposed model exhibits a more significant performance and stability improvement than the conventional RAVE model.
SDJul 17, 2025
Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That MachineAnastasia Kuznetsova, Inseon Jang, Wootaek Lim et al.
Neural audio codecs, leveraging quantization algorithms, have significantly impacted various speech/audio tasks. While high-fidelity reconstruction is paramount for human perception, audio coding for machines (ACoM) prioritizes efficient compression and downstream task performance, disregarding perceptual nuances. This work introduces an efficient ACoM method that can compress and quantize any chosen intermediate feature representation of an already trained speech/audio downstream model. Our approach employs task-specific loss guidance alongside residual vector quantization (RVQ) losses, providing ultra-low bitrates (i.e., less than 200 bps) with a minimal loss of the downstream model performance. The resulting tokenizer is adaptable to various bitrates and model sizes for flexible deployment. Evaluated on automatic speech recognition and audio classification, our method demonstrates its efficacy and potential for broader task and architectural applicability through appropriate regularization.
LGMay 25, 2021
Deep Neural Networks and End-to-End Learning for Audio CompressionDaniela N. Rim, Inseon Jang, Heeyoul Choi
Recent achievements in end-to-end deep learning have encouraged the exploration of tasks dealing with highly structured data with unified deep network models. Having such models for compressing audio signals has been challenging since it requires discrete representations that are not easy to train with end-to-end backpropagation. In this paper, we present an end-to-end deep learning approach that combines recurrent neural networks (RNNs) within the training strategy of variational autoencoders (VAEs) with a binary representation of the latent space. We apply a reparametrization trick for the Bernoulli distribution for the discrete representations, which allows smooth backpropagation. In addition, our approach allows the separation of the encoder and decoder, which is necessary for compression tasks. To our best knowledge, this is the first end-to-end learning for a single audio compression model with RNNs, and our model achieves a Signal to Distortion Ratio (SDR) of 20.54.
ASNov 5, 2019
Emotional speech synthesis with rich and granularized controlSe-Yun Um, Sangshin Oh, Kyungguen Byun et al.
This paper proposes an effective emotion control method for an end-to-end text-to-speech (TTS) system. To flexibly control the distinct characteristic of a target emotion category, it is essential to determine embedding vectors representing the TTS input. We introduce an inter-to-intra emotional distance ratio algorithm to the embedding vectors that can minimize the distance to the target emotion category while maximizing its distance to the other emotion categories. To further enhance the expressiveness of a target speech, we also introduce an effective interpolation technique that enables the intensity of a target emotion to be gradually changed to that of neutral speech. Subjective evaluation results in terms of emotional expressiveness and controllability show the superiority of the proposed algorithm to the conventional methods.