CL | Nov 6, 2025

WST: Weakly Supervised Transducer for Automatic Speech Recognition

arXiv:2511.04035v1 | h-index: 63
Originality: Highly original
AI Analysis

This work addresses the costly data-annotation bottleneck in ASR systems, offering a robust approach for industrial applications.

The paper aims to reduce the reliance of automatic speech recognition on high-quality annotated data. It proposes a Weakly Supervised Transducer (WST), which maintains performance at transcription error rates of up to 70% and outperforms existing CTC-based weakly supervised methods such as BTC and OTC.

The Recurrent Neural Network-Transducer (RNN-T) is widely adopted in end-to-end (E2E) automatic speech recognition (ASR) tasks but depends heavily on large-scale, high-quality annotated data, which are often costly and difficult to obtain. To mitigate this reliance, we propose a Weakly Supervised Transducer (WST), which integrates a flexible training graph designed to robustly handle errors in the transcripts without requiring additional confidence estimation or auxiliary pre-trained models. Empirical evaluations on synthetic and industrial datasets reveal that WST effectively maintains performance even with transcription error rates of up to 70%, consistently outperforming existing Connectionist Temporal Classification (CTC)-based weakly supervised approaches, such as Bypass Temporal Classification (BTC) and Omni-Temporal Classification (OTC). These results demonstrate the practical utility and robustness of WST in realistic ASR settings. The implementation will be publicly available.
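
To make "transcription error rate" concrete, the sketch below shows one way a clean transcript could be corrupted at a chosen per-token rate with random substitutions, deletions, and insertions. It is an illustrative assumption, not the paper's data-generation procedure: the function name `corrupt_transcript`, its parameters, and the uniform choice among the three error types are all hypothetical.

```python
import random


def corrupt_transcript(tokens, error_rate, vocab, seed=0):
    """Corrupt each token with probability `error_rate` (hypothetical helper).

    A corrupted token is substituted, deleted, or followed by a random
    inserted token, chosen uniformly. `vocab` needs at least two distinct
    tokens so that a substitution can always pick a different token.
    """
    rng = random.Random(seed)  # seeded for reproducible corruption
    noisy = []
    for tok in tokens:
        if rng.random() >= error_rate:
            noisy.append(tok)  # keep the token unchanged
            continue
        kind = rng.choice(("sub", "del", "ins"))
        if kind == "sub":
            # replace the token with a different one from the vocabulary
            noisy.append(rng.choice([v for v in vocab if v != tok]))
        elif kind == "ins":
            # keep the token and insert a random extra token after it
            noisy.append(tok)
            noisy.append(rng.choice(vocab))
        # "del": the token is simply dropped
    return noisy


clean = "the cat sat on the mat".split()
vocab = sorted(set(clean)) + ["dog"]
noisy = corrupt_transcript(clean, error_rate=0.7, vocab=vocab, seed=1)
print(" ".join(clean))
print(" ".join(noisy))
```

With `error_rate=0.7`, roughly seven of every ten reference tokens are altered, which gives a sense of how little of the original transcript a weakly supervised model can trust at the upper end of the range reported in the abstract.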

