CL · LG · Sep 12, 2025

WhisTLE: Deeply Supervised, Text-Only Domain Adaptation for Pretrained Speech Recognition Transformers

arXiv:2509.10452v1 · 1 citation · h-index: 5
Originality: Incremental advance
AI Analysis

This paper addresses domain adaptation for ASR models in settings where collecting speech data is impractical. It offers a text-only method that improves accuracy without adding inference-time cost.

The paper tackles adapting pretrained speech recognition models to new domains using only text data. Combined with TTS adaptation, WhisTLE achieves a 12.3% relative reduction in word error rate over TTS-only adaptation and outperforms all non-WhisTLE baselines in 27 of 32 scenarios.

Pretrained automatic speech recognition (ASR) models such as Whisper perform well but still need domain adaptation to handle unseen vocabulary and parlance. In many real-world settings, collecting speech data is impractical, necessitating text-only adaptation. We propose WhisTLE, a deeply supervised, text-only adaptation method for pretrained encoder-decoder ASR models. WhisTLE trains a variational autoencoder (VAE) to model encoder outputs from text and fine-tunes the decoder using the learned text-to-latent encoder, optionally combined with text-to-speech (TTS) adaptation. At inference, the original encoder is restored, incurring no extra runtime cost. Across four out-of-domain datasets and four ASR models, WhisTLE with TTS reduces word error rate (WER) by 12.3% relative to TTS-only adaptation and outperforms all non-WhisTLE baselines in 27 of 32 scenarios.
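The headline figure is a relative WER reduction, not an absolute one. As a rough illustration of how that quantity is computed, here is a minimal sketch in plain Python. The function names `word_error_rate` and `relative_wer_reduction` and the example WER values are hypothetical and are not taken from the paper. The abstract does not say how the 12.3% figure is aggregated across datasets and models, so this shows only the per-pair calculation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,                      # deletion
                    curr[j - 1] + 1,                  # insertion
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, adapted_wer: float) -> float:
    """Fraction by which adapted_wer improves on baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - adapted_wer) / baseline_wer


# Hypothetical numbers, chosen only to illustrate the arithmetic:
# a baseline WER of 10.0% falling to 8.77% is a 12.3% relative reduction,
# although the absolute drop is only 1.23 percentage points.
example = relative_wer_reduction(baseline_wer=10.0, adapted_wer=8.77)
```

Under this definition, a 12.3% relative reduction on a baseline that already has a low WER corresponds to a much smaller absolute change, which is why the two should not be conflated when comparing adaptation methods.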
