CL SD AS, Feb 2, 2022

ASR-Aware End-to-end Neural Diarization

arXiv:2202.01286v1, 16 citations
Originality: Incremental advance
AI Analysis

This work addresses speaker diarization for conversational speech analysis, presenting an incremental improvement by combining acoustic and ASR features.

The paper tackles speaker diarization by integrating ASR-derived features into a Conformer-based end-to-end neural diarization (EEND) model, achieving a 20% relative reduction in diarization error rate (DER) on two-speaker English conversations.

We present a Conformer-based end-to-end neural diarization (EEND) model that uses both acoustic input and features derived from an automatic speech recognition (ASR) model. Two categories of features are explored: features derived directly from ASR output (phones, position-in-word and word boundaries) and features derived from a lexical speaker change detection model, trained by fine-tuning a pretrained BERT model on the ASR output. Three modifications to the Conformer-based EEND architecture are proposed to incorporate the features. First, ASR features are concatenated with acoustic features. Second, we propose a new attention mechanism called contextualized self-attention that utilizes ASR features to build robust speaker representations. Finally, multi-task learning is used to train the model to minimize classification loss for the ASR features along with diarization loss. Experiments on the two-speaker English conversations of Switchboard+SRE data sets show that multi-task learning with position-in-word information is the most effective way of utilizing ASR features, reducing the diarization error rate (DER) by 20% relative to the baseline.
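The multi-task idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the two losses are combined, so the weighted sum, the `aux_weight` parameter, and all function names below are assumptions made for illustration. The diarization term uses the permutation-invariant binary cross-entropy commonly used to train EEND models, and the auxiliary term is a per-frame cross-entropy over a hypothetical set of position-in-word classes.

```python
import math
from itertools import permutations


def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one probability p and one 0/1 label y."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def diarization_loss(spk_probs, spk_labels):
    """Permutation-invariant BCE over frames and speakers.

    spk_probs and spk_labels are lists of frames; each frame holds one
    value per speaker. The minimum over all speaker orderings is returned,
    so the loss does not depend on which output slot a speaker occupies.
    """
    n_frames = len(spk_probs)
    n_spk = len(spk_labels[0])
    best = float("inf")
    for perm in permutations(range(n_spk)):
        total = sum(
            bce(spk_probs[t][s], spk_labels[t][perm[s]])
            for t in range(n_frames)
            for s in range(n_spk)
        )
        best = min(best, total / (n_frames * n_spk))
    return best


def asr_feature_loss(class_probs, targets, eps=1e-7):
    """Mean cross-entropy for a per-frame ASR-derived class label
    (for example a position-in-word tag; the class set is hypothetical)."""
    total = sum(-math.log(max(class_probs[t][c], eps))
                for t, c in enumerate(targets))
    return total / len(targets)


def multitask_loss(spk_probs, spk_labels, class_probs, targets, aux_weight=0.5):
    """Assumed combination: diarization loss plus a weighted auxiliary loss."""
    return (diarization_loss(spk_probs, spk_labels)
            + aux_weight * asr_feature_loss(class_probs, targets))


# Toy inputs: two frames, two speakers, three hypothetical position classes.
spk_probs = [[0.9, 0.1], [0.2, 0.8]]
spk_labels = [[1, 0], [0, 1]]
class_probs = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
targets = [0, 2]
loss = multitask_loss(spk_probs, spk_labels, class_probs, targets)
```

Setting `aux_weight` to zero recovers the plain diarization objective, which makes it easy to compare a baseline against the multi-task variant in this toy setting.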
