TOGGL: Transcribing Overlapping Speech with Staggered Labeling
This work addresses the challenge of transcribing overlapping speech for applications such as conversation analysis. The contribution is incremental, since it builds on prior joint separation-transcription methods.
The paper proposes the TOGGL model, which transcribes multiple overlapping speakers simultaneously using a single decoder with special output tokens. It reports superior performance over competing approaches on a conversational speech dataset, as well as improved results on single-speaker audio.
Transcribing the speech of multiple overlapping speakers typically requires separating the audio into multiple streams and recognizing each one independently. More recent work jointly separates and transcribes, but requires a separate decoding component for each speaker. We propose the TOGGL model to simultaneously transcribe the speech of multiple speakers. The TOGGL model uses special output tokens to attribute the speech to each speaker with only a single decoder. Our approach generalizes beyond two speakers, even when trained only on two-speaker data. We demonstrate superior performance compared to competing approaches on a conversational speech dataset. Our approach also improves performance on single-speaker audio.
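To make the single-decoder idea concrete, the sketch below shows one way a decoded token stream carrying speaker-attribution tokens could be split into per-speaker transcripts. It is an illustration under stated assumptions, not the paper's implementation: the `<spkN>` token format, the `split_by_speaker` helper, and the rule that a speaker token switches the active speaker for all following words are hypothetical, since the abstract does not specify the token scheme.

```python
import re
from collections import defaultdict

# Hypothetical token format: "<spk1>", "<spk2>", ... The abstract does not
# specify how TOGGL's special tokens are written; this is an assumption.
SPEAKER_TOKEN = re.compile(r"^<spk(\d+)>$")


def split_by_speaker(tokens, default_speaker=1):
    """Split one decoded token stream into per-speaker transcripts.

    Assumption: a speaker token switches the active speaker, and every
    following word belongs to that speaker until the next speaker token.
    """
    transcripts = defaultdict(list)
    active = default_speaker
    for tok in tokens:
        match = SPEAKER_TOKEN.match(tok)
        if match:
            # Special token: change attribution, emit no word.
            active = int(match.group(1))
            continue
        transcripts[active].append(tok)
    return {spk: " ".join(words) for spk, words in transcripts.items()}


# Two overlapping speakers interleaved in a single output sequence.
decoded = ["<spk1>", "how", "are", "<spk2>", "fine",
           "<spk1>", "you", "<spk2>", "thanks"]
result = split_by_speaker(decoded)
print(result)
```

Because attribution is carried by the tokens rather than by separate decoders, the same helper handles a third speaker (`<spk3>`) or single-speaker output with no speaker tokens at all, in which case every word falls to the default speaker.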