SD Jun 26, 2023
Mono-to-stereo through parametric stereo generation
Joan Serrà, Davide Scaini, Santiago Pascual et al.
Generating a stereophonic presentation from a monophonic audio signal is a challenging open task, especially if the goal is to obtain realistic spatial imaging with a specific panning of sound elements. In this work, we propose to convert mono to stereo by predicting parametric stereo (PS) parameters using both nearest neighbor and deep network approaches. In combination with PS, we also propose to model the task with generative approaches, which allows synthesizing multiple, equally plausible stereo renditions from the same mono signal. To achieve this, we consider both autoregressive and masked token modelling approaches. We provide evidence that the proposed PS-based models outperform a competitive classical decorrelation baseline and that, within a PS prediction framework, modern generative models outperform equivalent non-generative counterparts. Overall, our work positions both PS and generative modelling as strong and appealing methodologies for mono-to-stereo upmixing. A discussion of the limitations of these approaches is also provided.
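To make the role of PS parameters concrete, the following is a minimal sketch of how a single inter-channel intensity difference (IID) value could pan a mono signal into two channels. It is a deliberate simplification and not the method of the paper: real parametric stereo applies several parameters per time-frequency tile, whereas this sketch applies one broadband gain pair. The function names `iid_gains` and `upmix`, and the convention that the two gains preserve a total power of 2, are assumptions made for illustration.

```python
import math


def iid_gains(iid_db):
    """Return (left_gain, right_gain) for an intensity difference in dB.

    Assumed convention: left/right amplitude ratio is 10**(iid_db / 20)
    and the squared gains sum to 2, so 0 dB gives unit gain on both sides.
    """
    ratio = 10.0 ** (iid_db / 20.0)
    norm = math.sqrt(2.0 / (1.0 + ratio * ratio))
    return ratio * norm, norm


def upmix(mono, iid_db):
    """Pan a list of mono samples into (left, right) using one IID value."""
    left_gain, right_gain = iid_gains(iid_db)
    left = [left_gain * sample for sample in mono]
    right = [right_gain * sample for sample in mono]
    return left, right


if __name__ == "__main__":
    left, right = upmix([0.0, 0.5, -0.5, 1.0], iid_db=6.0)
    print(left)
    print(right)
```

A predictor in the spirit of the abstract would output such parameters from the mono signal alone; the panning step itself stays this simple only under the broadband assumption stated above.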
SD Nov 23, 2021
Upsampling layers for music source separation
Jordi Pons, Joan Serrà, Santiago Pascual et al.
Upsampling artifacts are caused by problematic upsampling layers and by spectral replicas that emerge while upsampling. Depending on the upsampling layer used, such artifacts can be either tonal artifacts (additive high-frequency noise) or filtering artifacts (subtractive, attenuating some bands). In this work we investigate the practical implications of having upsampling artifacts in the resulting audio, by studying how different artifacts interact and assessing their impact on the models' performance. To that end, we benchmark a large set of upsampling layers for music source separation: different transposed and subpixel convolution setups, different interpolation upsamplers (including two novel layers based on stretch and sinc interpolation), and different wavelet-based upsamplers (including a novel learnable wavelet layer). Our results show that filtering artifacts, associated with interpolation upsamplers, are perceptually preferable, even if they tend to achieve worse objective scores.
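The spectral replicas mentioned above can be illustrated with a small signal-processing sketch. It does not reproduce any layer from the paper: it upsamples a pure tone by a factor of 2 through plain zero insertion, which leaves a full-strength replica in the upper half of the spectrum, and then through linear interpolation, which acts as a low-pass filter and attenuates that replica. The helper names, the 16-sample tone, and the circular boundary handling are assumptions chosen to keep the example exact and short.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Magnitude of the discrete Fourier transform, computed directly."""
    n = len(signal)
    return [
        abs(sum(v * cmath.exp(-2j * math.pi * k * t / n)
                for t, v in enumerate(signal)))
        for k in range(n)
    ]


def zero_stuff(signal):
    """Upsample by 2 by inserting a zero after every sample."""
    out = []
    for value in signal:
        out.extend([value, 0.0])
    return out


def linear_interpolate(signal):
    """Upsample by 2 with linear interpolation (circular at the edges).

    Equivalent to filtering the zero-stuffed signal with [0.5, 1, 0.5].
    """
    stuffed = zero_stuff(signal)
    n = len(stuffed)
    return [
        0.5 * stuffed[(i - 1) % n] + stuffed[i] + 0.5 * stuffed[(i + 1) % n]
        for i in range(n)
    ]


N, K0 = 16, 2  # tone with 2 cycles over 16 samples
tone = [math.cos(2 * math.pi * K0 * t / N) for t in range(N)]

stuffed_mag = dft_magnitudes(zero_stuff(tone))
interp_mag = dft_magnitudes(linear_interpolate(tone))

if __name__ == "__main__":
    # Bin K0 holds the original tone, bin N - K0 holds its replica.
    print("zero insertion:", stuffed_mag[K0], stuffed_mag[N - K0])
    print("linear interp.:", interp_mag[K0], interp_mag[N - K0])
```

With zero insertion the tone and its replica have equal magnitude, while the interpolation filter leaves the replica at a small fraction of the tone. The same filter also slightly attenuates content near the top of the original band, which is the kind of subtractive behaviour the abstract calls a filtering artifact.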
SD Feb 11, 2021
Multichannel-based learning for audio object extraction
Daniel Arteaga, Jordi Pons
The current paradigm for creating and deploying immersive audio content is based on audio objects, which are composed of an audio track and position metadata. While rendering an object-based production into a multichannel mix is straightforward, the reverse process involves sound source separation and estimating the spatial trajectories of the extracted sources. Moreover, cinematic object-based productions are often composed of dozens of simultaneous audio objects, which poses a scalability challenge for audio object extraction. Here, we propose a novel deep learning approach to object extraction that learns from the multichannel renders of object-based productions, instead of directly learning from the audio objects themselves. This approach makes it possible to tackle the object scalability challenge and also to formulate the problem in a supervised or an unsupervised fashion. Since, to our knowledge, no other works have previously addressed this topic, we first define the task and propose an evaluation methodology, and then discuss under what circumstances our methods outperform the proposed baselines.
AS Oct 13, 2020
Sound event localization and detection based on CRNN using rectangular filters and channel rotation data augmentation
Francesca Ronchini, Daniel Arteaga, Andrés Pérez-López
Sound Event Localization and Detection refers to the problem of identifying the presence of independent or temporally overlapped sound sources, correctly identifying the sound class to which each belongs, and estimating their spatial directions while they are active. In recent years, neural networks have become the prevailing method for the Sound Event Localization and Detection task, with convolutional recurrent neural networks being among the most used systems. This paper presents a system submitted to the Detection and Classification of Acoustic Scenes and Events 2020 Challenge Task 3. The algorithm consists of a convolutional recurrent neural network using rectangular filters, specialized in recognizing significant spectral features related to the task. To further improve the score and to generalize the system performance to unseen data, the training dataset size has been increased using data augmentation. The technique used is based on channel rotations and reflection on the xy plane in the First Order Ambisonic domain, which makes it possible to improve Direction of Arrival labels while keeping the physical relationships between channels. Evaluation results on the development dataset show that the proposed system outperforms the baseline results, considerably improving Error Rate and F-score for location-aware detection.
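The channel rotation and reflection idea can be sketched as follows. This is not the authors' implementation: it assumes a (W, X, Y, Z) channel order, omits any gain normalization, and uses the hypothetical helpers `encode_foa`, `rotate_z`, and `reflect_xy`. It shows why the transformed channels stay consistent with the transformed Direction of Arrival label: rotating about the vertical axis mixes only X and Y and shifts the azimuth, while reflecting on the xy plane flips the sign of Z and of the elevation.

```python
import math


def encode_foa(sample, azimuth, elevation):
    """Encode one mono sample as a (W, X, Y, Z) frame (angles in radians).

    Gain normalization conventions are intentionally left out.
    """
    return (
        sample,
        sample * math.cos(azimuth) * math.cos(elevation),
        sample * math.sin(azimuth) * math.cos(elevation),
        sample * math.sin(elevation),
    )


def rotate_z(frame, azimuth, elevation, angle):
    """Rotate the sound field about the z axis and update the label."""
    w, x, y, z = frame
    c, s = math.cos(angle), math.sin(angle)
    rotated = (w, c * x - s * y, s * x + c * y, z)
    return rotated, azimuth + angle, elevation


def reflect_xy(frame, azimuth, elevation):
    """Mirror the sound field on the xy plane and update the label."""
    w, x, y, z = frame
    return (w, x, y, -z), azimuth, -elevation


if __name__ == "__main__":
    az, el = math.radians(30.0), math.radians(20.0)
    frame = encode_foa(1.0, az, el)
    new_frame, new_az, new_el = rotate_z(frame, az, el, math.pi / 2)
    # The augmented frame matches a fresh encoding at the new label.
    print(new_frame)
    print(encode_foa(1.0, new_az, new_el))
```

For rotations by multiples of 90 degrees the mixing reduces to swapping X and Y and flipping signs, so such augmentations need no resampling or filtering of the audio.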