Prompting Whisper for Joint Speech Transcription and Diarization
This work addresses the need for real-time transcription and diarization in medical conversations, but the results are preliminary and the approach is incremental.
The authors explore prompting and fine-tuning Whisper for joint speech transcription and diarization of Dutch medical conversations. Fine-tuning with speaker-labelled prompts improves speaker ID consistency and verbatim transcription, but performance is limited by error propagation and inaccurate timestamps for overlapping speech.
As part of the MediSpeech project, we aim to develop a system that transcribes and diarizes Dutch conversations between doctors and patients in real time. In this in-progress research, we explore ways of efficiently combining Whisper with speaker diarization (SD). When we prompted Whisper with text that contains speaker labels, we observed that it is able to insert labels into the transcription with promising accuracy. We continued this line of research by fine-tuning Whisper with speaker-labelled prompts to generate transcriptions in a format similar to that of Serialized Output Training (SOT). Fine-tuning yielded more consistent speaker IDs across the chunks of long-form audio and improved verbatim transcription. The study also uncovered new challenges: Whisper's SD performance suffers from mistakes that propagate through prompts and from inaccurate timestamps assigned to overlapping speech.
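
To make the speaker-labelled prompt idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical bracketed label format (`[doctor]`, `[patient]`) and hypothetical helper names (`Segment`, `serialize`, `build_prompt`); the abstract does not specify the actual label tokens or prompt length. Segments are ordered by start time and a label is emitted at each speaker change, loosely following the SOT convention, and the tail of one chunk's labelled transcript is reused as the prompt for the next chunk.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: str  # hypothetical label name, e.g. "doctor" or "patient"
    start: float  # segment start time in seconds
    text: str


def serialize(segments):
    """Order segments by start time and emit a label at each speaker change."""
    parts = []
    previous = None
    for seg in sorted(segments, key=lambda s: s.start):
        if seg.speaker != previous:
            parts.append(f"[{seg.speaker}]")  # assumed label format
            previous = seg.speaker
        parts.append(seg.text.strip())
    return " ".join(parts)


def build_prompt(previous_transcript, max_words=50):
    """Keep the last max_words words of the previous chunk as the next prompt."""
    words = previous_transcript.split()
    return " ".join(words[-max_words:])


chunk = [
    Segment("patient", 2.0, "Ik heb hoofdpijn."),
    Segment("doctor", 0.0, "Goedemorgen."),
    Segment("patient", 3.5, "Al drie dagen."),
]
labelled = serialize(chunk)
prompt = build_prompt(labelled, max_words=50)
```

Because the prompt is built from the previous chunk's output, a wrong label in one chunk is carried into the next, which is one way the error propagation described above could arise. A short `max_words` can also cut off the most recent label, so a real system would likely need to truncate on label boundaries.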