USTC-NELSLIP System Description for DIHARD-III Challenge
This work addresses speech diarization for the DIHARD-III challenge, presenting an incremental improvement through system combination and domain-specific processing.
The paper tackled speech diarization by combining front-end techniques like speech separation and target-speaker VAD with iterative data purification, achieving DERs of 11.30% in track 1 and 16.78% in track 2 on the evaluation set.
This paper describes our submission to the Third DIHARD Speech Diarization Challenge (DIHARD-III). Beyond the traditional clustering-based system, the innovation of our system lies in combining several front-end techniques to solve the diarization problem, including speech separation and target-speaker voice activity detection (TS-VAD), together with iterative data purification. We also adopted audio domain classification to enable domain-dependent processing. Finally, we applied post-processing for system fusion and selection. Our best system achieved diarization error rates (DERs) of 11.30% in track 1 and 16.78% in track 2 on the evaluation set.
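For readers unfamiliar with the metric, DER is conventionally defined as the sum of false-alarm, missed-speech, and speaker-confusion time divided by the total reference speech time. The sketch below illustrates that arithmetic only; the function name and the durations are hypothetical and are not taken from our system or from the challenge scoring tool.

```python
def diarization_error_rate(false_alarm, missed, confusion, total_reference):
    """Return DER as a fraction of total reference speech time.

    All arguments are durations in seconds. This follows the conventional
    definition of DER; it is an illustrative helper, not the official scorer.
    """
    if total_reference <= 0:
        raise ValueError("total_reference must be positive")
    return (false_alarm + missed + confusion) / total_reference


# Hypothetical durations, chosen only to illustrate the calculation.
der = diarization_error_rate(
    false_alarm=2.0,
    missed=5.0,
    confusion=3.0,
    total_reference=100.0,
)
print(f"DER = {100 * der:.2f}%")
```

Under this definition, a lower DER is better, and the three error components can be inspected separately to see which stage of a pipeline contributes most to the total.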