SDCL Oct 25, 2025

M-CIF: Multi-Scale Alignment For CIF-Based Non-Autoregressive ASR

arXiv:2510.22172v1
h-index: 14
Originality: Incremental advance
AI Analysis

This work addresses a specific bottleneck in non-autoregressive ASR for multiple languages, offering incremental improvements in alignment stability and error reduction.

The paper tackles the instability of the Continuous Integrate-and-Fire (CIF) mechanism in non-autoregressive speech recognition for languages such as English and French. It proposes Multi-scale CIF (M-CIF), which integrates multi-level alignment with character and phoneme supervision. Compared to the Paraformer baseline, M-CIF reduces word error rate (WER), with reductions of 4.21% in German and 3.05% in French on CommonVoice.

The Continuous Integrate-and-Fire (CIF) mechanism provides effective alignment for non-autoregressive (NAR) speech recognition. This mechanism creates a smooth and monotonic mapping from acoustic features to target tokens, achieving performance on Mandarin competitive with other NAR approaches. However, without finer-grained guidance, its stability degrades in some languages such as English and French. In this paper, we propose Multi-scale CIF (M-CIF), which performs multi-level alignment by integrating character-level and phoneme-level supervision that is progressively distilled into subword representations, thereby making acoustic-text alignment more robust. Experiments show that M-CIF reduces WER relative to the Paraformer baseline, most notably on CommonVoice, by 4.21% in German and 3.05% in French. To further investigate these gains, we define phonetic confusion errors (PE) and space-related segmentation errors (SE) as evaluation metrics. Analysis of these metrics across different M-CIF settings reveals that the phoneme and character layers are essential for enhancing progressive CIF alignment.
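To make the monotonic mapping from acoustic frames to tokens concrete, the following is a minimal sketch of a generic single-scale integrate-and-fire step. It is not the M-CIF method and not the authors' implementation. The function name `cif`, the `threshold` parameter, and the toy inputs are assumptions chosen for illustration. The sketch also assumes every per-frame weight is smaller than the threshold, so at most one token fires per frame, and it simply drops any weight left over after the last frame.

```python
import numpy as np


def cif(hidden, alphas, threshold=1.0):
    """Generic integrate-and-fire over frames (illustrative sketch only).

    hidden: (T, D) frame-level acoustic vectors.
    alphas: (T,) non-negative per-frame weights, each assumed < threshold.
    Returns an (N, D) array with one integrated vector per fired token.
    """
    hidden = np.asarray(hidden, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    dim = hidden.shape[1]

    acc_weight = 0.0            # weight accumulated for the current token
    acc_state = np.zeros(dim)   # weighted sum of frames for the current token
    fired = []

    for h, a in zip(hidden, alphas):
        if acc_weight + a < threshold:
            # Integrate: the token boundary has not been reached yet.
            acc_weight += a
            acc_state = acc_state + a * h
        else:
            # Fire: spend just enough of this frame's weight to reach the
            # threshold, emit the token, and carry the rest to the next token.
            used = threshold - acc_weight
            fired.append(acc_state + used * h)
            remainder = a - used
            acc_weight = remainder
            acc_state = remainder * h

    return np.stack(fired) if fired else np.zeros((0, dim))


# Toy usage: three 1-dimensional frames whose weights sum to 2.0,
# so two tokens fire, with values 1.25 and 3.5.
tokens = cif([[1.0], [2.0], [4.0]], [0.75, 0.5, 0.75])
print(tokens)
```

Because weights are only accumulated left to right and each frame's weight is split at most once, tokens are emitted in frame order, which is what makes the alignment monotonic. The number of emitted tokens is governed by the sum of the weights, so noisy weights directly shift token boundaries; this is the kind of instability that finer-grained supervision is intended to reduce.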

