End-to-End Supervised Hierarchical Graph Clustering for Speaker Diarization
This addresses speaker diarization for speech processing applications, presenting an incremental improvement over existing methods.
The paper tackles speaker diarization by proposing E-SHARC, an end-to-end supervised hierarchical clustering algorithm based on graph neural networks that jointly optimizes embedding extraction and clustering. The method achieves competitive results on benchmark datasets like AMI, Voxconverse, and DISPLACE.
Speaker diarization, the task of segmenting an audio recording based on speaker identity, constitutes an important speech pre-processing step for several downstream applications.The conventional approach to diarization involves multiple steps of embedding extraction and clustering, which are often optimized in an isolated fashion. While end-to-end diarization systems attempt to learn a single model for the task, they are often cumbersome to train and require large supervised datasets. In this paper, we propose an end-to-end supervised hierarchical clustering algorithm based on graph neural networks (GNN), called End-to-end Supervised HierARchical Clustering (E-SHARC). The embedding extractor is initialized using a pre-trained x-vector model while the GNN model is trained initially using the x-vector embeddings from the pre-trained model. Finally, the E-SHARC model uses the front-end mel-filterbank features as input and jointly optimizes the embedding extractor and the GNN clustering module, performing representation learning, metric learning, and clustering with end-to-end optimization. Further, with additional inputs from an external overlap detector, the E-SHARC approach is capable of predicting the speakers in the overlapping speech regions. The experimental evaluation on benchmark datasets like AMI, Voxconverse and DISPLACE, illustrates that the proposed E-SHARC framework provides competitive diarization results using graph based clustering methods.