AS CL Sep 12, 2025

Whisper Has an Internal Word Aligner

arXiv:2509.09987v1 2.32 citations h-index: 9

Originality: Incremental advance

AI Analysis

The method offers a more accurate, training-free way to extract word alignments from Whisper, which is useful for applications that need precise speech-to-text synchronization. The advance is incremental because it builds on capabilities Whisper already has.

The paper addresses the problem of obtaining accurate word-level timestamps from Whisper, an automatic speech recognizer. It finds that certain attention heads in Whisper capture precise word alignments, and that teacher forcing Whisper with characters instead of wordpieces yields finer, more accurate alignments. The resulting unsupervised method outperforms prior work under stricter tolerances of 20 ms to 100 ms.
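
As a rough illustration of how a character-by-frame attention map can be turned into word timestamps, here is a minimal sketch. It is not the paper's implementation. The head indices in `keep` are supplied by the caller (the paper's head-selection criterion is not reproduced here), and averaging the kept heads and the simple monotonic dynamic program are both assumptions made for illustration. All function names and the toy matrices are hypothetical. The 20 ms frame length follows from Whisper's encoder producing 1500 positions per 30-second window.

```python
from typing import Dict, List, Tuple

FRAME_MS = 20  # each Whisper encoder frame covers 20 ms of audio

Matrix = List[List[float]]  # rows: character tokens, columns: audio frames


def average_heads(heads: List[Matrix], keep: List[int]) -> Matrix:
    """Average the attention maps of the kept heads (assumed combination rule)."""
    chosen = [heads[h] for h in keep]
    rows, cols = len(chosen[0]), len(chosen[0][0])
    return [[sum(h[i][j] for h in chosen) / len(chosen) for j in range(cols)]
            for i in range(rows)]


def assign_frames(attn: Matrix) -> List[int]:
    """Assign every frame to one token, monotonically, maximizing total attention."""
    n, m = len(attn), len(attn[0])
    if m < n:
        raise ValueError("need at least one frame per token")
    neg = float("-inf")
    # score[i][j]: best total when frame j belongs to token i
    score = [[neg] * m for _ in range(n)]
    score[0][0] = attn[0][0]
    for j in range(1, m):
        for i in range(min(j + 1, n)):
            stay = score[i][j - 1]
            advance = score[i - 1][j - 1] if i > 0 else neg
            score[i][j] = attn[i][j] + max(stay, advance)
    # Backtrack from the last token at the last frame.
    owner = [0] * m
    i = n - 1
    for j in range(m - 1, -1, -1):
        owner[j] = i
        if j > 0 and i > 0 and score[i - 1][j - 1] > score[i][j - 1]:
            i -= 1
    return owner


def word_spans_ms(owner: List[int], word_ids: List[int]) -> List[Tuple[int, int]]:
    """Merge per-character frame ownership into (start_ms, end_ms) per word."""
    spans: Dict[int, Tuple[int, int]] = {}
    for frame, tok in enumerate(owner):
        w = word_ids[tok]
        start, _ = spans.get(w, (frame, frame))
        spans[w] = (start, frame)
    return [(s * FRAME_MS, (e + 1) * FRAME_MS) for _, (s, e) in sorted(spans.items())]


# Toy input: 3 character tokens, 6 frames; characters 0 and 1 form word 0.
sharp_head = [
    [0.9, 0.8, 0.1, 0.0, 0.0, 0.0],
    [0.1, 0.1, 0.9, 0.8, 0.1, 0.0],
    [0.0, 0.0, 0.1, 0.1, 0.9, 0.9],
]
flat_head = [[1 / 6] * 6 for _ in range(3)]

attn = average_heads([sharp_head, flat_head], keep=[0])  # drop the flat head
owner = assign_frames(attn)
spans = word_spans_ms(owner, word_ids=[0, 0, 1])
print(owner, spans)
```

Working at the character level means each word boundary is pinned by its first and last character rather than by a whole wordpiece, which is one way to read the claim that characters give finer alignments.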

There is an increasing interest in obtaining accurate word-level timestamps from strong automatic speech recognizers, in particular Whisper. Existing approaches either require additional training or are simply not competitive. The evaluation in prior work is also relatively loose, typically using a tolerance of more than 200 ms. In this work, we discover attention heads in Whisper that capture accurate word alignments and are distinctively different from those that do not. Moreover, we find that using characters produces finer and more accurate alignments than using wordpieces. Based on these findings, we propose an unsupervised approach to extracting word alignments by filtering attention heads while teacher forcing Whisper with characters. Our approach not only does not require training but also produces word alignments that are more accurate than prior work under a stricter tolerance between 20 ms and 100 ms.
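
To make the role of the tolerance concrete, the following sketch scores predicted word boundaries against reference boundaries at several tolerances. It assumes boundaries are already paired one-to-one and reports a plain hit rate. The paper's exact metric may differ, and the function name and the boundary values are made up for illustration.

```python
from typing import List


def boundary_accuracy(pred_ms: List[int], ref_ms: List[int], tol_ms: int) -> float:
    """Fraction of predicted boundaries within tol_ms of their reference boundary."""
    if len(pred_ms) != len(ref_ms):
        raise ValueError("predictions and references must be paired one-to-one")
    if not ref_ms:
        raise ValueError("need at least one boundary")
    hits = sum(abs(p - r) <= tol_ms for p, r in zip(pred_ms, ref_ms))
    return hits / len(ref_ms)


# Illustrative values only: absolute errors are 10, 30, 60 and 250 ms.
ref = [0, 480, 900, 1310]
pred = [10, 510, 960, 1560]

for tol in (20, 50, 100, 250):
    print(tol, boundary_accuracy(pred, ref, tol))
```

With these toy values the same predictions score 0.25 at 20 ms and 1.0 at 250 ms, which shows why a loose tolerance can hide alignment errors that a stricter one exposes.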
