CL · SD · AS · Jun 4, 2025

Acoustically Precise Hesitation Tagging Is Essential for End-to-End Verbatim Transcription Systems

arXiv:2506.04076v2 · 1 citation · h-index: 1 · Slate
Originality: Incremental advance
AI Analysis

This work addresses the need for accurate disfluency capture in ASR systems for L2 speech transcription, offering incremental improvements in verbatim transcription accuracy.

The paper tackles verbatim transcription for automatic speaking assessment by fine-tuning Whisper models with acoustically precise hesitation tagging. Its best post-challenge model reaches a 5.5% WER, an 11.3% relative improvement over the hesitation-free "Pure" scheme (6.2% WER).

Verbatim transcription for automatic speaking assessment demands accurate capture of disfluencies, crucial for downstream tasks like error analysis and feedback. However, many ASR systems discard or generalize hesitations, losing important acoustic details. We fine-tune Whisper models on the Speak & Improve 2025 corpus using low-rank adaptation (LoRA), without recourse to external audio training data. We compare three annotation schemes: removing hesitations (Pure), generic tags (Rich), and acoustically precise fillers inferred by Gemini 2.0 Flash from existing audio-transcript pairs (Extra). Our challenge system achieved 6.47% WER (Pure) and 5.81% WER (Extra). Post-challenge experiments reveal that fine-tuning Whisper Large V3 Turbo with the "Extra" scheme yielded a 5.5% WER, an 11.3% relative improvement over the "Pure" scheme (6.2% WER). This demonstrates that explicit, realistic filled-pause labeling significantly enhances ASR accuracy for verbatim L2 speech transcription.
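The reported figures can be checked with a short sketch. The code below is illustrative only and is not the authors' evaluation pipeline: the `wer` and `relative_improvement` helpers are hypothetical names, the toy utterance is invented, and it assumes a verbatim reference that retains fillers. The actual text normalization used for challenge scoring is not described here.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer reduces baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer


# Toy utterance, invented for illustration (not taken from the corpus).
# Assumption: the scoring reference keeps the fillers "um" and "uh".
reference = "i um went to the uh market"
pure_style = "i went to the market"          # hesitations removed
extra_style = "i um went to the uh market"   # specific fillers transcribed

print(f"Pure-style output WER:  {wer(reference, pure_style):.3f}")
print(f"Extra-style output WER: {wer(reference, extra_style):.3f}")

# Relative improvement from the WERs quoted in the abstract (in percent).
print(f"Post-challenge: {100 * relative_improvement(6.2, 5.5):.1f}%")
print(f"Challenge:      {100 * relative_improvement(6.47, 5.81):.1f}%")
```

Under these assumptions, each dropped filler counts as a deletion against a verbatim reference, which is one way omitted hesitations can raise WER. The last two lines reproduce the 11.3% figure from the abstract and give roughly 10.2% for the challenge-system numbers.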
