Back to Ear: Perceptually Driven High Fidelity Music Reconstruction
This work addresses phase accuracy and stereophonic spatial representation in music signal reconstruction for audio generation tasks. It is an incremental improvement, with its clearest gains in reconstructing high-frequency harmonics and spatial characteristics.
The paper tackles auditory perceptual weaknesses in open-source Variational Autoencoders (VAEs) for music reconstruction by proposing εar-VAE, which incorporates a K-weighting perceptual filter and two novel phase losses. At 44.1kHz, εar-VAE substantially outperforms leading open-source models across diverse metrics.
Variational Autoencoders (VAEs) are essential for large-scale audio tasks like diffusion-based generation. However, existing open-source models often neglect auditory perceptual aspects during training, leading to weaknesses in phase accuracy and stereophonic spatial representation. To address these challenges, we propose εar-VAE, an open-source music signal reconstruction model that rethinks and optimizes the VAE training paradigm. Our contributions are threefold: (i) A K-weighting perceptual filter applied prior to loss calculation to align the objective with auditory perception. (ii) Two novel phase losses: a Correlation Loss for stereo coherence, and a Phase Loss built on two derivatives of phase, Instantaneous Frequency and Group Delay, for precision. (iii) A new spectral supervision paradigm where magnitude is supervised by all four Mid/Side/Left/Right components, while phase is supervised only by the Left/Right (LR) components. Experiments show that εar-VAE at 44.1kHz substantially outperforms leading open-source models across diverse metrics, with particular strength in reconstructing high-frequency harmonics and spatial characteristics.
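The quantities named in contributions (ii) and (iii) can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the abstract gives no formulas, so the function names, the 0.5 Mid/Side scaling, the L1 distances, the Pearson-style correlation, and the STFT settings are all assumptions made for illustration. The K-weighting pre-filter from contribution (i) is omitted here.

```python
import numpy as np

def mid_side(left, right):
    """Mid/Side from Left/Right (0.5 scaling is an assumed convention)."""
    return 0.5 * (left + right), 0.5 * (left - right)

def stft(x, n_fft=512, hop=128):
    """Minimal Hann-windowed STFT, shape (frames, bins). Settings are assumed."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)

def wrap(phase):
    """Wrap phase values into [-pi, pi)."""
    return (phase + np.pi) % (2 * np.pi) - np.pi

def phase_derivatives(spec):
    """Finite-difference Instantaneous Frequency and Group Delay of an STFT."""
    phase = np.angle(spec)
    inst_freq = wrap(np.diff(phase, axis=0))     # phase derivative along time
    group_delay = -wrap(np.diff(phase, axis=1))  # negative derivative along frequency
    return inst_freq, group_delay

def phase_loss(ref, est):
    """Hypothetical phase loss: wrapped L1 distance between IF and GD maps."""
    if_ref, gd_ref = phase_derivatives(stft(ref))
    if_est, gd_est = phase_derivatives(stft(est))
    return float(np.mean(np.abs(wrap(if_ref - if_est)))
                 + np.mean(np.abs(wrap(gd_ref - gd_est))))

def stereo_correlation(left, right, eps=1e-8):
    """Pearson-style correlation between the two channels."""
    l, r = left - left.mean(), right - right.mean()
    return float(np.sum(l * r) / (np.sqrt(np.sum(l ** 2) * np.sum(r ** 2)) + eps))

def correlation_loss(ref_lr, est_lr):
    """Hypothetical correlation loss: gap in inter-channel correlation."""
    return abs(stereo_correlation(*ref_lr) - stereo_correlation(*est_lr))

def magnitude_loss(ref_lr, est_lr):
    """Assumed L1 magnitude loss averaged over the L, R, Mid and Side components."""
    ref = (*ref_lr, *mid_side(*ref_lr))
    est = (*est_lr, *mid_side(*est_lr))
    return float(np.mean([np.mean(np.abs(np.abs(stft(r)) - np.abs(stft(e))))
                          for r, e in zip(ref, est)]))
```

In this sketch, `magnitude_loss` compares all four components while `phase_loss` would be called on the Left and Right channels only, mirroring the supervision split described in (iii). How the terms are weighted and combined is not stated in the abstract and is left open.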