ASSD
Nov 3, 2020

Complex ratio masking for singing voice separation

arXiv:2011.02008v1
12 citations
AI Analysis

This work addresses music source separation for applications like karaoke and remixing, showing incremental improvement by incorporating phase information.

The paper tackles singing voice separation by proposing a complex ratio masking method that estimates both the real and imaginary STFT components, and it reports outperforming recent state-of-the-art models for voice and accompaniment separation.

Music source separation is important for applications such as karaoke and remixing. Much previous research focuses on estimating the short-time Fourier transform (STFT) magnitude and discards phase information. We observe that, for singing voice separation, phase can considerably improve separation quality. This paper proposes a complex ratio masking method for voice and accompaniment separation. The proposed method employs a DenseUNet with self-attention to estimate the real and imaginary components of the STFT for each sound source. A simple ensemble technique is introduced to further improve separation performance. Evaluation results demonstrate that the proposed method outperforms recent state-of-the-art models for both separated voice and accompaniment.
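To make the idea of complex ratio masking concrete, here is a minimal NumPy sketch. It is not the paper's implementation: the abstract does not give the exact mask formulation, so the code assumes the common definition of an ideal complex ratio mask (the complex quotient of the source STFT and the mixture STFT). Random complex arrays stand in for real STFTs, and the names complex_ratio_mask and apply_complex_mask are hypothetical. In the paper's setting a network would predict the mask's real and imaginary parts; here the mask is computed from the known source purely for illustration.

```python
import numpy as np


def complex_ratio_mask(mixture, source, eps=1e-8):
    """Ideal complex mask M with M * mixture ~= source (assumed definition)."""
    y_r, y_i = mixture.real, mixture.imag
    s_r, s_i = source.real, source.imag
    denom = y_r**2 + y_i**2 + eps  # |Y|^2, with eps to avoid division by zero
    m_r = (y_r * s_r + y_i * s_i) / denom  # real part of S / Y
    m_i = (y_r * s_i - y_i * s_r) / denom  # imaginary part of S / Y
    return m_r + 1j * m_i


def apply_complex_mask(mask, mixture):
    """Complex multiplication written out in real and imaginary parts."""
    m_r, m_i = mask.real, mask.imag
    y_r, y_i = mixture.real, mixture.imag
    est_r = m_r * y_r - m_i * y_i
    est_i = m_r * y_i + m_i * y_r
    return est_r + 1j * est_i


# Stand-in "STFTs": random complex spectrograms (frequency bins x frames).
rng = np.random.default_rng(0)
shape = (513, 100)
voice = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
accomp = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
mixture = voice + accomp

# Complex masking: both magnitude and phase of the voice are recovered.
mask = complex_ratio_mask(mixture, voice)
voice_complex = apply_complex_mask(mask, mixture)

# Magnitude-only baseline: ideal magnitude ratio, but the mixture's phase is reused.
mag_mask = np.abs(voice) / (np.abs(mixture) + 1e-8)
voice_mag = mag_mask * mixture

err_complex = np.mean(np.abs(voice_complex - voice) ** 2)
err_mag = np.mean(np.abs(voice_mag - voice) ** 2)
print("complex mask error:       ", err_complex)
print("magnitude-only mask error:", err_mag)
```

Even with a perfect magnitude estimate, the magnitude-only baseline keeps a residual error because it reuses the mixture's phase, while the complex mask reconstructs the source almost exactly. This is the gap that estimating real and imaginary components aims to close.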

