Complex ratio masking for singing voice separation
This work addresses music source separation for applications like karaoke and remixing, showing incremental improvement by incorporating phase information.
The paper tackles singing voice separation with a complex ratio masking method that estimates both the real and imaginary STFT components, outperforming recent state-of-the-art models on both voice and accompaniment separation.
Music source separation is important for applications such as karaoke and remixing. Much previous research focuses on estimating the short-time Fourier transform (STFT) magnitude and discards phase information. We observe that, for singing voice separation, phase can considerably improve separation quality. This paper proposes a complex ratio masking method for voice and accompaniment separation. The proposed method employs a DenseUNet with self-attention to estimate the real and imaginary components of the STFT for each sound source. A simple ensemble technique is introduced to further improve separation performance. Evaluation results demonstrate that the proposed method outperforms recent state-of-the-art models for both the separated voice and the accompaniment.
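
To make the idea of complex ratio masking concrete, the following minimal sketch applies a complex-valued mask to a toy mixture STFT. It is an illustration under stated assumptions, not the paper's implementation: the function names `apply_complex_mask` and `ideal_complex_mask`, the `eps` stabiliser, and the toy values are all hypothetical. The mask here is an oracle computed from a known source, whereas the paper uses a network to estimate the real and imaginary components.

```python
# Sketch of complex ratio masking on a toy STFT (lists of frames of complex bins).
# All names and values are hypothetical; the mask is an oracle, not a network output.

def ideal_complex_mask(source, mixture, eps=1e-8):
    """Oracle mask S / Y per bin, computed as S * conj(Y) / (|Y|^2 + eps)."""
    return [
        [s * y.conjugate() / (abs(y) ** 2 + eps) for s, y in zip(src_frame, mix_frame)]
        for src_frame, mix_frame in zip(source, mixture)
    ]

def apply_complex_mask(mixture, mask):
    """Multiply each mixture bin by its complex mask value.

    A complex multiplication changes both magnitude and phase, unlike a
    real-valued magnitude mask, which keeps the mixture phase.
    """
    return [
        [m * y for m, y in zip(mask_frame, mix_frame)]
        for mask_frame, mix_frame in zip(mask, mixture)
    ]

# Toy data: two frames, two frequency bins per frame.
voice = [[1 + 2j, 0.5 - 1j], [0 + 1j, -1 + 0j]]
accomp = [[2 - 1j, 1 + 1j], [1 + 0j, 0.5 + 0.5j]]
mixture = [[v + a for v, a in zip(vf, af)] for vf, af in zip(voice, accomp)]

voice_mask = ideal_complex_mask(voice, mixture)
voice_estimate = apply_complex_mask(mixture, voice_mask)
```

With an oracle mask, `voice_estimate` matches `voice` up to the small error introduced by `eps`, including its phase. A magnitude-only mask could at best recover the magnitudes while reusing the mixture phase.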