AS CL SD | Mar 13, 2023

Neural Diarization with Non-autoregressive Intermediate Attractors

Yusuke Fujita, Tatsuya Komatsu, Robin Scheibler, Yusuke Kida, Tetsuji Ogawa

arXiv:2303.06806v15.114 citations | h-index: 21 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses speaker diarization, a key task in speech processing, by introducing label dependency between frames into an end-to-end neural model. The advance is incremental, since it builds on the existing EEND-EDA method.

The paper proposes an end-to-end neural diarization model that introduces label dependency between frames using non-autoregressive intermediate attractors. On the two-speaker CALLHOME dataset, the intermediate labels improve diarization performance, and the deeper variant of the model achieves better performance and training throughput than the EEND-EDA baseline.

End-to-end neural diarization (EEND) with encoder-decoder-based attractors (EDA) is a promising method to handle the whole speaker diarization problem simultaneously with a single neural network. While the EEND model can produce all frame-level speaker labels simultaneously, it disregards output label dependency. In this work, we propose a novel EEND model that introduces the label dependency between frames. The proposed method generates non-autoregressive intermediate attractors to produce speaker labels at the lower layers and conditions the subsequent layers with these labels. While the proposed model works in a non-autoregressive manner, the speaker labels are refined by referring to the whole sequence of intermediate labels. The experiments with the two-speaker CALLHOME dataset show that the intermediate labels with the proposed non-autoregressive intermediate attractors boost the diarization performance. The proposed method with the deeper network benefits more from the intermediate labels, resulting in better performance and training throughput than EEND-EDA.
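The conditioning step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The abstract does not say how the attractors are computed or how the labels are fed back, so the sketch assumes that attractors come from cross-attention between query vectors and the frame embeddings, and that the label-weighted attractors are added to the frame embeddings. All names (`queries`, `intermediate_attractors`, `condition_on_intermediate_labels`) are hypothetical, and the encoder layers themselves are omitted.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def intermediate_attractors(frames, queries):
    """Compute all speaker attractors in one pass (non-autoregressive).

    Assumption: each query attends over every frame via scaled dot-product
    cross-attention. frames: (T, D), queries: (S, D) -> attractors: (S, D).
    """
    scores = queries @ frames.T / np.sqrt(frames.shape[1])  # (S, T)
    return softmax(scores, axis=-1) @ frames                # (S, D)


def condition_on_intermediate_labels(frames, queries):
    """Produce intermediate speaker labels and condition the frame embeddings.

    Assumption: conditioning is additive, using label-weighted attractors.
    """
    attractors = intermediate_attractors(frames, queries)
    labels = sigmoid(frames @ attractors.T)      # (T, S) per-frame speaker activity
    conditioned = frames + labels @ attractors   # (T, D) input to the next layer
    return labels, conditioned


rng = np.random.default_rng(0)
T, D, S = 50, 16, 2  # frames, embedding size, speakers (illustrative values)
frames = rng.standard_normal((T, D))
queries = rng.standard_normal((S, D))
labels, conditioned = condition_on_intermediate_labels(frames, queries)
print(labels.shape, conditioned.shape)
```

In this sketch every frame's conditioned embedding depends on attractors computed from the whole sequence, which is one way later layers could refine labels without decoding frame by frame.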
