AS · SD · Mar 13

MAGE: A Coarse-to-Fine Speech Enhancer with Masked Generative Model

The Hieu Pham, Tan Dat Nguyen, Phuong Thanh Tran, Joon Son Chung, Duc Dung Nguyen

arXiv:2509.19881 · 11.0

Predicted impact: top 61% in AS · last 90 days

Originality: Incremental advance

AI Analysis

This work targets speech enhancement for downstream applications such as speech recognition, offering incremental improvements in efficiency and perceptual quality over existing methods.

The paper introduces MAGE, a masked generative model for speech enhancement. It combines a coarse-to-fine masking strategy with a corrector module to improve efficiency and perceptual quality. It reports state-of-the-art perceptual quality on the DNS Challenge and noisy LibriSpeech benchmarks, along with reduced word error rates for downstream recognition.

Speech enhancement remains challenging due to the trade-off between efficiency and perceptual quality. In this paper, we introduce MAGE, a Masked Audio Generative Enhancer that advances generative speech enhancement through a compact and robust design. Unlike prior masked generative models with random masking, MAGE employs a scarcity-aware coarse-to-fine masking strategy that prioritizes frequent tokens in early steps and rare tokens in later refinements, improving efficiency and generalization. We also propose a lightweight corrector module that further stabilizes inference by detecting low-confidence predictions and re-masking them for refinement. Built on BigCodec and finetuned from Qwen2.5-0.5B, MAGE is reduced to 200M parameters through selective layer retention. Experiments on DNS Challenge and noisy LibriSpeech show that MAGE achieves state-of-the-art perceptual quality and significantly reduces word error rate for downstream recognition, outperforming larger baselines. Audio examples are available at https://hieugiaosu.github.io/MAGE/.
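
The two ideas named in the abstract, a frequency-ordered (coarse-to-fine) schedule and confidence-based re-masking, can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the `MASK` id, the fixed threshold, and the use of in-sequence token counts as the "scarcity" signal are all assumptions made for illustration only.

```python
from collections import Counter

MASK = -1  # hypothetical mask token id, chosen only for this sketch


def coarse_to_fine_schedule(tokens, num_steps):
    """Group positions into up to num_steps groups, frequent tokens first.

    Assumption: "scarcity" is approximated by how often a token id occurs
    in this sequence; the paper may define it differently.
    """
    if not tokens:
        return []
    freq = Counter(tokens)
    # Sort positions by descending token frequency; sorted() is stable,
    # so ties keep their original left-to-right order.
    order = sorted(range(len(tokens)), key=lambda i: -freq[tokens[i]])
    size = -(-len(order) // num_steps)  # ceiling division
    return [order[k:k + size] for k in range(0, len(order), size)]


def remask_low_confidence(tokens, confidences, threshold=0.5):
    """Corrector-style step: replace predictions below threshold with MASK."""
    return [t if c >= threshold else MASK for t, c in zip(tokens, confidences)]


if __name__ == "__main__":
    toks = [9, 7, 3, 7, 3, 7]
    print(coarse_to_fine_schedule(toks, num_steps=3))
    print(remask_low_confidence([7, 3, 9], [0.9, 0.2, 0.5]))
```

In this sketch, positions holding the most frequent token id are scheduled in the first group and the rarest in the last, and any prediction whose confidence falls below the threshold is returned to the masked state so a later pass can refine it.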
