Song Data Cleansing for End-to-End Neural Singer Diarization Using Neural Analysis and Synthesis Framework
This work addresses a data bottleneck for singer diarization in popular music, though it is incremental as it builds on existing frameworks and models.
The paper tackles the problem of training singer diarization models when most available song data contains choral singing, which is unsuitable for generating simulated datasets. The proposed data cleansing method, which uses a neural analysis and synthesis framework to convert choral singing to solo singing, improved the diarization error rate by 14.8 points on annotated popular duet songs.
We propose a data cleansing method that utilizes a neural analysis and synthesis (NANSY++) framework to train an end-to-end neural diarization (EEND) model for singer diarization. The proposed method converts song data containing choral singing, which is common in popular music and unsuitable for generating a simulated dataset, into solo singing data. The cleansing is based on NANSY++, a framework trained to reconstruct an input non-overlapped audio signal. We exploit the pre-trained NANSY++ to convert choral singing into clean, non-overlapped audio. This cleansing process mitigates the mislabeling of choral singing as solo singing and helps train EEND models effectively even when the majority of available song data contains choral singing sections. We experimentally evaluated an EEND model trained on a dataset cleansed with the proposed method, using annotated popular duet songs. As a result, the proposed method improved the diarization error rate by 14.8 points.
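To make the reported metric concrete, the following is a minimal sketch of a frame-level diarization error rate (DER) for a small number of singers. It is a simplified illustration, not the scoring setup used in the paper: the function name frame_der, the per-frame 0/1 activity format, and the absence of any forgiveness collar are all assumptions made for this example. Under this definition, a "14.8 point" improvement would be read as an absolute difference between two DER values expressed in percent.

```python
from itertools import permutations


def frame_der(ref, hyp):
    """Frame-level DER for per-frame singer activity labels.

    ref, hyp: lists of equal length; each frame is a tuple of 0/1 flags,
    one flag per singer. This format is hypothetical, chosen for illustration.
    """
    if len(ref) != len(hyp):
        raise ValueError("ref and hyp must have the same number of frames")
    n_singers = len(ref[0])
    total_ref = sum(sum(frame) for frame in ref)
    if total_ref == 0:
        raise ValueError("reference contains no singing frames")

    best_err = None
    # Diarization output has no fixed singer order, so score every
    # assignment of hypothesis singers to reference singers and keep the best.
    for perm in permutations(range(n_singers)):
        err = 0
        for r, h in zip(ref, hyp):
            h_perm = tuple(h[p] for p in perm)
            n_ref = sum(r)
            n_hyp = sum(h_perm)
            n_correct = sum(1 for a, b in zip(r, h_perm) if a and b)
            # miss + false alarm + confusion collapses to this expression
            err += max(n_ref, n_hyp) - n_correct
        best_err = err if best_err is None else min(best_err, err)
    return best_err / total_ref


# Toy duet: 4 frames, 2 singers (made-up labels, not data from the paper).
ref = [(1, 0), (1, 0), (1, 1), (0, 1)]
hyp = [(1, 0), (1, 1), (1, 1), (0, 0)]
print(f"DER = {100 * frame_der(ref, hyp):.1f}%")
```

In the toy example, one frame contains a false alarm and one a miss, against five reference singer-frames. Brute-force permutation search is only practical for a handful of singers, which is sufficient for duets.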