Modeling Overlapped Speech with Shuffles
This work addresses the challenge of speaker-attributed transcription of overlapped speech. The contribution is incremental in that it builds on existing methods, but it introduces a novel algorithmic approach.
The paper tackles the problem of aligning and transcribing overlapped speech by modeling parallel data streams using shuffles and partial order finite-state automata, achieving single-pass alignment for multi-talker recordings.
We propose to model parallel streams of data, such as overlapped speech, using shuffles. Specifically, this paper shows how the shuffle product and partial order finite-state automata (FSAs) can be used for alignment and speaker-attributed transcription of overlapped speech. We train using the total score on these FSAs as a loss function, marginalizing over all possible serializations of overlapping sequences at subword, word, and phrase levels. To reduce graph size, we impose temporal constraints by constructing partial order FSAs. We address speaker attribution by modeling (token, speaker) tuples directly. Viterbi alignment through the shuffle product FSA directly enables one-pass alignment. We evaluate performance on synthetic LibriSpeech overlaps. To our knowledge, this is the first algorithm that enables single-pass alignment of multi-talker recordings. All algorithms are implemented using k2 / Icefall.
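To make the shuffle idea concrete, the following is a minimal pure-Python sketch, not the paper's k2 / Icefall implementation. It enumerates the shuffle product of two (token, speaker) sequences, then computes a total score over all serializations with a log-semiring forward pass on the shuffle grid. The helper names (`tag`, `shuffle_product`, `total_score`) and the `arc_score` callback are hypothetical stand-ins introduced for illustration. In particular, `arc_score` replaces the acoustic model's log-probabilities, and alignment to audio frames and partial order constraints are not modeled here.

```python
import math


def tag(tokens, speaker):
    """Attach a speaker label to each token, giving (token, speaker) tuples."""
    return tuple((tok, speaker) for tok in tokens)


def shuffle_product(a, b):
    """Yield every interleaving of a and b that keeps each sequence's own order."""
    if not a:
        yield tuple(b)
        return
    if not b:
        yield tuple(a)
        return
    for rest in shuffle_product(a[1:], b):
        yield (a[0],) + rest
    for rest in shuffle_product(a, b[1:]):
        yield (b[0],) + rest


def logaddexp(x, y):
    """Numerically stable log(exp(x) + exp(y))."""
    if x == -math.inf:
        return y
    if y == -math.inf:
        return x
    m = max(x, y)
    return m + math.log(math.exp(x - m) + math.exp(y - m))


def total_score(a, b, arc_score):
    """Log-sum over all serializations via a forward pass on the shuffle grid.

    State (i, j) means i symbols of a and j symbols of b have been emitted.
    arc_score(symbol, position) is a hypothetical log-score for emitting
    `symbol` at output position `position`.
    """
    m, n = len(a), len(b)
    alpha = [[-math.inf] * (n + 1) for _ in range(m + 1)]
    alpha[0][0] = 0.0
    for i in range(m + 1):
        for j in range(n + 1):
            pos = i + j - 1  # output position of the arc entering (i, j)
            if i > 0:
                step = alpha[i - 1][j] + arc_score(a[i - 1], pos)
                alpha[i][j] = logaddexp(alpha[i][j], step)
            if j > 0:
                step = alpha[i][j - 1] + arc_score(b[j - 1], pos)
                alpha[i][j] = logaddexp(alpha[i][j], step)
    return alpha[m][n]


spk1 = tag(["hello", "world"], 1)
spk2 = tag(["good", "morning"], 2)

serializations = list(shuffle_product(spk1, spk2))
print(len(serializations))  # 6, i.e. C(4, 2)

# With all arc scores zero, the total score is the log of the path count.
print(total_score(spk1, spk2, lambda sym, pos: 0.0))
```

The grid has (m + 1)(n + 1) states while the number of serializations grows as C(m + n, m), which is why marginalizing on the FSA is preferable to enumerating paths. Replacing `logaddexp` with `max` in the same recursion would give a Viterbi-style best serialization score instead of the total score.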