CL · Jan 19

Typhoon ASR Real-time: FastConformer-Transducer for Thai Automatic Speech Recognition

arXiv:2601.13044v1 · 1 citation
Originality: Incremental advance
AI Analysis

This work addresses a critical gap in real-time Thai ASR for applications requiring low latency, though it is incremental in adapting existing methods to a specific language and domain.

The authors tackled the lack of efficient streaming solutions for Thai automatic speech recognition by developing a compact FastConformer-Transducer model, achieving a 45x reduction in computational cost compared to Whisper Large-v3 while maintaining comparable accuracy through rigorous text normalization.

Large encoder-decoder models like Whisper achieve strong offline transcription but remain impractical for streaming applications due to high latency. However, due to the accessibility of pre-trained checkpoints, the open Thai ASR landscape remains dominated by these offline architectures, leaving a critical gap in efficient streaming solutions. We present Typhoon ASR Real-time, a 115M-parameter FastConformer-Transducer model for low-latency Thai speech recognition. We demonstrate that rigorous text normalization can match the impact of model scaling: our compact model achieves a 45x reduction in computational cost compared to Whisper Large-v3 while delivering comparable accuracy. Our normalization pipeline resolves systemic ambiguities in Thai transcription, including context-dependent number verbalization and repetition markers (mai yamok), creating consistent training targets. We further introduce a two-stage curriculum learning approach for Isan (north-eastern) dialect adaptation that preserves Central Thai performance. To address reproducibility challenges in Thai ASR, we release the Typhoon ASR Benchmark, a gold-standard human-labeled dataset with transcriptions following established Thai linguistic conventions, providing standardized evaluation protocols for the research community.
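
To make the repetition-marker ambiguity concrete, the following is a minimal sketch of one normalization step: expanding mai yamok (ๆ, U+0E46) into an explicit repetition of the preceding word, so that the training target spells out what is actually spoken. It is not the authors' pipeline. The function name `expand_mai_yamok` is hypothetical, and the sketch assumes the text has already been segmented into words (Thai is written without spaces) and that the mark repeats only the single preceding word, which may not hold where it repeats a longer phrase.

```python
MAI_YAMOK = "\u0e46"  # Thai repetition mark "ๆ"


def expand_mai_yamok(tokens):
    """Expand repetition marks in a pre-segmented list of Thai words.

    Hypothetical helper for illustration only. Assumes each mark repeats
    exactly one preceding word.
    """
    expanded = []
    for token in tokens:
        # Split the token into its word part and any trailing marks,
        # so both "เด็กๆ" and a standalone "ๆ" token are handled.
        word = token.rstrip(MAI_YAMOK)
        repeats = len(token) - len(word)
        if word:
            expanded.append(word)
        if expanded:
            # Repeat the most recent word once per mark.
            expanded.extend([expanded[-1]] * repeats)
        # A mark with no preceding word is dropped.
    return expanded


if __name__ == "__main__":
    print(expand_mai_yamok(["เด็ก", "ๆ", "วิ่ง"]))
    print(expand_mai_yamok(["เด็กๆ", "วิ่ง"]))
```

Both calls yield the same word sequence, which is the point of normalizing: two written forms of the same utterance map to one consistent target.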

