SD · LG · AS · Apr 28, 2022

Music Enhancement via Image Translation and Vocoding

arXiv:2204.13289v1 · 19 citations · h-index: 21
Originality: Synthesis-oriented
AI Analysis

This work targets the distortions typical of consumer-grade music recordings captured on mobile devices, such as background noise and reverb. The contribution is incremental: it builds on existing deep learning techniques for audio enhancement.

The paper enhances low-quality music recordings by combining an image-to-image translation model, which manipulates the mel-spectrogram, with a music vocoding model, which generates the waveform. The combined approach outperforms baselines that rely on classical mel-spectrogram inversion or on an end-to-end mapping from noisy to clean waveforms.

Consumer-grade music recordings such as those captured by mobile devices typically contain distortions in the form of background noise, reverb, and microphone-induced EQ. This paper presents a deep learning approach to enhance low-quality music recordings by combining (i) an image-to-image translation model for manipulating audio in its mel-spectrogram representation and (ii) a music vocoding model for mapping synthetically generated mel-spectrograms to perceptually realistic waveforms. We find that this approach to music enhancement outperforms baselines which use classical methods for mel-spectrogram inversion and an end-to-end approach directly mapping noisy waveforms to clean waveforms. Additionally, in evaluating the proposed method with a listening test, we analyze the reliability of common audio enhancement evaluation metrics when used in the music domain.
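The two-stage structure described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the frame size, hop, band count, and sample rate are toy values chosen for readability, and `translate` and `vocode` are hypothetical placeholders standing in for the trained image-to-image translation model and the music vocoder. Only the log-mel front end is implemented concretely, using the standard library alone.

```python
import cmath
import math
from typing import Callable, List

Mel = List[List[float]]  # log-mel spectrogram: frames x mel bands


def hz_to_mel(hz: float) -> float:
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel: float) -> float:
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_mels: int, n_fft: int, sr: int) -> List[List[float]]:
    """Triangular filters mapping n_fft // 2 + 1 linear bins to n_mels bands."""
    top = hz_to_mel(sr / 2)
    # n_mels + 2 edge frequencies, evenly spaced on the mel scale
    edges = [mel_to_hz(top * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bank = []
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        row = []
        for k in range(n_fft // 2 + 1):
            f = k * sr / n_fft
            row.append(max(0.0, min((f - lo) / (mid - lo), (hi - f) / (hi - mid))))
        bank.append(row)
    return bank


def power_spectrum(frame: List[float]) -> List[float]:
    """Naive Hann-windowed DFT power spectrum (O(n^2), fine for a toy demo)."""
    n = len(frame)
    x = [s * (0.5 - 0.5 * math.cos(2 * math.pi * i / n)) for i, s in enumerate(frame)]
    return [
        abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))) ** 2
        for k in range(n // 2 + 1)
    ]


def log_mel(audio: List[float], n_fft: int = 64, hop: int = 32,
            n_mels: int = 8, sr: int = 8000) -> Mel:
    """Stage 0: waveform -> log-mel spectrogram (the 'image' to be translated)."""
    bank = mel_filterbank(n_mels, n_fft, sr)
    mel = []
    for start in range(0, len(audio) - n_fft + 1, hop):
        p = power_spectrum(audio[start:start + n_fft])
        mel.append([math.log(sum(w * v for w, v in zip(row, p)) + 1e-6) for row in bank])
    return mel


def enhance_recording(audio: List[float],
                      translate: Callable[[Mel], Mel],
                      vocode: Callable[[Mel], List[float]]) -> List[float]:
    """Stage 1 (translate noisy mel -> clean mel), then stage 2 (mel -> waveform).

    Both callables are placeholders for trained models (an assumption of this sketch).
    """
    return vocode(translate(log_mel(audio)))


# Toy usage: a 1 kHz tone, an identity "translation", and a silent stand-in vocoder.
tone = [math.sin(2 * math.pi * 1000 * t / 8000) for t in range(256)]
mel = log_mel(tone)
restored = enhance_recording(tone, translate=lambda m: m,
                             vocode=lambda m: [0.0] * 32 * len(m))
```

The mel projection discards phase and merges neighbouring frequency bins, so the second stage cannot be an exact inverse. This is why the abstract contrasts a learned vocoder with classical mel-spectrogram inversion.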
