SD AI AS Jul 4, 2024

Serialized Output Training by Learned Dominance

arXiv:2407.03966v1, 8 citations, h-index: 19
Originality: Incremental advance
AI Analysis

This work addresses the challenge of accurately transcribing overlapping speech for applications like meeting transcription, representing an incremental improvement over existing methods.

The paper tackles the label-permutation problem in multi-talker speech recognition with a model-based serialization strategy that learns to order the speech components by dominance, which the authors' analysis links to factors such as loudness and gender. It reports significant improvements over the PIT and FIFO baselines on the LibriSpeech and LibriMix databases in both 2-mix and 3-mix scenarios.

Serialized Output Training (SOT) has showcased state-of-the-art performance in multi-talker speech recognition by sequentially decoding the speech of individual speakers. To address the challenging label-permutation issue, prior methods have relied on either the Permutation Invariant Training (PIT) or the time-based First-In-First-Out (FIFO) rule. This study presents a model-based serialization strategy that incorporates an auxiliary module into the Attention Encoder-Decoder architecture, autonomously identifying the crucial factors to order the output sequence of the speech components in multi-talker speech. Experiments conducted on the LibriSpeech and LibriMix databases reveal that our approach significantly outperforms the PIT and FIFO baselines in both 2-mix and 3-mix scenarios. Further analysis shows that the serialization module identifies dominant speech components in a mixture by factors including loudness and gender, and orders speech components based on the dominance score.
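
To make the three ordering rules named in the abstract concrete, the following is a minimal sketch of how a serialized training target could be assembled under each one. It is not the authors' implementation: the `Utterance` fields, the `<sc>` separator string, the `dominance` score (standing in for the output of the auxiliary serialization module), and the choice to place the most dominant speaker first are all assumptions made for illustration.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import Callable, Sequence

SC = "<sc>"  # speaker-change separator; the token string is an assumption


@dataclass
class Utterance:
    text: str          # reference transcript of one speaker
    start_time: float  # onset in seconds, used by the FIFO rule
    dominance: float   # hypothetical score from a serialization module


def join_texts(utts: Sequence[Utterance]) -> str:
    """Concatenate transcripts in the given order, separated by SC."""
    return f" {SC} ".join(u.text for u in utts)


def fifo_target(utts: Sequence[Utterance]) -> str:
    """FIFO rule: the speaker who starts first is transcribed first."""
    return join_texts(sorted(utts, key=lambda u: u.start_time))


def dominance_target(utts: Sequence[Utterance]) -> str:
    """Dominance rule (assumed direction): highest score first."""
    return join_texts(sorted(utts, key=lambda u: u.dominance, reverse=True))


def pit_loss(utts: Sequence[Utterance], loss_fn: Callable[[str], float]) -> float:
    """PIT: evaluate every speaker ordering and keep the smallest loss."""
    return min(loss_fn(join_texts(perm)) for perm in permutations(utts))
```

With two utterances, `fifo_target` and `dominance_target` can disagree whenever the later speaker is the more dominant one, which is the case the learned ordering is meant to handle. Note also that `pit_loss` enumerates all orderings, so its cost grows factorially with the number of speakers, whereas the two sort-based rules commit to a single ordering.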

