CL, AI | Sep 15, 2025

In-domain SSL pre-training and streaming ASR

Jarod Duret, Salima Mdhaffar, Gaëlle Laperrière, Ryan Whetten, Audrey Galametz, Catherine Kobus, Marion-Cécile Martin, Jo Oleiwan, Yannick Estève

arXiv:2509.12101v1 | 4.91 citations | h-index: 11 | SPECOM

Originality: Incremental advance

AI Analysis

This work addresses the problem of accurate and low-latency ASR for safety-critical aviation applications, representing an incremental improvement through domain adaptation.

The study improves automatic speech recognition (ASR) in Air Traffic Control (ATC) environments through domain-specific self-supervised pre-training and streaming techniques, substantially reducing word error rates relative to general-purpose models.

In this study, we investigate the benefits of domain-specific self-supervised pre-training for both offline and streaming ASR in Air Traffic Control (ATC) environments. We train BEST-RQ models on 4.5k hours of unlabeled ATC data, then fine-tune on a smaller supervised ATC set. To enable real-time processing, we propose using chunked attention and dynamic convolutions, ensuring low-latency inference. We compare these in-domain SSL models against state-of-the-art, general-purpose speech encoders such as w2v-BERT 2.0 and HuBERT. Results show that domain-adapted pre-training substantially improves performance on standard ATC benchmarks, significantly reducing word error rates when compared to models trained on broad speech corpora. Furthermore, the proposed streaming approach improves word error rate under tighter latency constraints, making it particularly suitable for safety-critical aviation applications. These findings highlight that specializing SSL representations for ATC data is a practical path toward more accurate and efficient ASR systems in real-world operational settings.
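The chunked attention mentioned in the abstract can be illustrated with a minimal sketch of the attention mask it implies. The abstract does not give the paper's chunk size, left-context setting, or implementation, so the function name `chunked_attention_mask` and the parameters `chunk_size` and `left_chunks` are hypothetical and chosen only for illustration. The idea shown is the general one: frames are grouped into fixed-size chunks, and each frame may attend to its own chunk and to earlier chunks, but never to future chunks, which bounds the look-ahead latency. Dynamic convolutions are not covered here.

```python
from typing import List, Optional


def chunked_attention_mask(
    num_frames: int,
    chunk_size: int,
    left_chunks: Optional[int] = None,
) -> List[List[bool]]:
    """Build a chunk-wise attention mask (illustrative, not the paper's code).

    mask[q][k] is True when query frame q is allowed to attend to key frame k.
    left_chunks=None means unlimited left context; an integer limits how many
    previous chunks remain visible (both settings are assumptions).
    """
    if num_frames < 0:
        raise ValueError("num_frames must be non-negative")
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if left_chunks is not None and left_chunks < 0:
        raise ValueError("left_chunks must be non-negative or None")

    mask: List[List[bool]] = []
    for q in range(num_frames):
        q_chunk = q // chunk_size  # chunk index of the query frame
        row: List[bool] = []
        for k in range(num_frames):
            k_chunk = k // chunk_size  # chunk index of the key frame
            # Never look at chunks that come after the query's chunk.
            visible = k_chunk <= q_chunk
            # Optionally drop chunks that are too far in the past.
            if left_chunks is not None:
                visible = visible and k_chunk >= q_chunk - left_chunks
            row.append(visible)
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Six frames, chunks of two frames, one chunk of left context.
    for row in chunked_attention_mask(6, 2, left_chunks=1):
        print("".join("x" if visible else "." for visible in row))
```

In this sketch, a smaller `chunk_size` lowers the look-ahead within a chunk (and so the latency) at the cost of less right context per frame, which is the trade-off the abstract refers to as tighter latency constraints.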
