LG · Aug 29, 2024

CrisperWhisper: Accurate Timestamps on Verbatim Speech Transcriptions

arXiv:2408.16589v1 · 45 citations · h-index: 4 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses the need for precise timestamps in verbatim speech transcriptions. The advance is incremental: it builds on the existing Whisper model with targeted adjustments rather than introducing a new architecture.

The authors improve word-level timestamp accuracy in speech transcription by adjusting Whisper's tokenizer and fine-tuning the model for verbatim output. The resulting system achieves state-of-the-art performance on benchmarks for verbatim transcription, word segmentation, and filler event detection.

We demonstrate that carefully adjusting the tokenizer of the Whisper speech recognition model significantly improves the precision of word-level timestamps when applying dynamic time warping to the decoder's cross-attention scores. We fine-tune the model to produce more verbatim speech transcriptions and employ several techniques to increase robustness against multiple speakers and background noise. These adjustments achieve state-of-the-art performance on benchmarks for verbatim speech transcription, word segmentation, and the timed detection of filler events, and can further mitigate transcription hallucinations. The code is openly available at https://github.com/nyrahealth/CrisperWhisper.
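To make the timestamping step in the abstract concrete, here is a minimal sketch of dynamic time warping (DTW) over a token-by-frame attention matrix. It is an illustration under stated assumptions, not the authors' implementation: the toy `attn` values, the cost definition `1 - attention`, the helper names `dtw_path` and `token_frame_spans`, and the 20 ms frame duration are all chosen here for the example. Production pipelines typically also preprocess the attention maps before alignment, which is omitted.

```python
from typing import List, Tuple


def dtw_path(cost: List[List[float]]) -> List[Tuple[int, int]]:
    """Minimum-cost monotonic path through a (tokens x frames) cost matrix."""
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    # acc[i][j]: cheapest cumulative cost of a path ending at cell (i-1, j-1).
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])
            acc[i][j] = cost[i - 1][j - 1] + best_prev

    # Backtrack from the last cell to the first, following the cheapest predecessor.
    i, j = n, m
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (acc[i - 1][j - 1], i - 1, j - 1),  # advance token and frame
            (acc[i - 1][j], i - 1, j),          # advance token only
            (acc[i][j - 1], i, j - 1),          # advance frame only
        ]
        _, i, j = min(candidates, key=lambda c: c[0])
    path.reverse()
    return path


def token_frame_spans(path: List[Tuple[int, int]], n_tokens: int) -> List[Tuple[int, int]]:
    """First and last audio frame aligned to each token along the path."""
    first = [None] * n_tokens
    last = [None] * n_tokens
    for tok, frame in path:
        if first[tok] is None:
            first[tok] = frame
        last[tok] = frame
    return list(zip(first, last))


# Toy cross-attention weights (hypothetical): 3 tokens x 6 audio frames.
attn = [
    [0.9, 0.8, 0.1, 0.0, 0.0, 0.0],
    [0.1, 0.1, 0.8, 0.9, 0.1, 0.0],
    [0.0, 0.1, 0.1, 0.1, 0.9, 0.9],
]
# Assumption for this sketch: high attention means low alignment cost.
cost = [[1.0 - a for a in row] for row in attn]

path = dtw_path(cost)
spans = token_frame_spans(path, n_tokens=len(attn))
print(spans)  # [(0, 1), (2, 3), (4, 5)]

FRAME_SEC = 0.02  # assumed duration of one frame (20 ms)
for tok, (start, end) in enumerate(spans):
    print(f"token {tok}: {start * FRAME_SEC:.2f}s to {(end + 1) * FRAME_SEC:.2f}s")
```

The sketch shows why tokenization matters for this approach: DTW assigns every frame to some token, so how pauses and word boundaries are represented in the token sequence directly affects where each word's span begins and ends.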

Code Implementations: 1 repo
