SD | AI | Jan 27

Residual Tokens Enhance Masked Autoencoders for Speech Modeling

arXiv:2601.19399v1
h-index: 31
Originality: Incremental advance
AI Analysis

This work addresses a limitation of attribute-based speech modeling that matters for applications requiring natural, expressive speech synthesis or enhancement. The advance appears incremental, since it builds on existing masked autoencoder frameworks.

The paper tackles the problem of capturing the full richness of natural speech beyond explicit attributes such as pitch and speaker identity. It introduces RT-MAE, a masked autoencoder framework that uses unsupervised residual tokens to encode information the explicit attributes leave unexplained, such as timbre and emotion. The reported results are improved reconstruction quality, enhanced expressivity, and applicability to speech enhancement, with controllability and naturalness maintained.

Recent speech modeling relies on explicit attributes such as pitch, content, and speaker identity, but these alone cannot capture the full richness of natural speech. We introduce RT-MAE, a novel masked autoencoder framework that augments supervised attribute-based modeling with unsupervised, trainable residual tokens designed to encode the information not explained by explicit labeled factors (e.g., timbre variations, noise, emotion, etc.). Experiments show that RT-MAE improves reconstruction quality, preserving content and speaker similarity while enhancing expressivity. We further demonstrate its applicability to speech enhancement, removing noise at inference while maintaining controllability and naturalness.
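The abstract does not say how the attribute tokens and residual tokens are combined or masked, so the following is only a rough sketch of one plausible input-construction step, not the authors' implementation. The function name `build_input_sequence`, the zero vector standing in for a learned mask embedding, the choice to mask only attribute tokens, and the prepending of residual tokens are all assumptions made for illustration.

```python
import random


def build_input_sequence(attribute_tokens, residual_tokens, mask_ratio, seed=0):
    """Hypothetical input construction for a masked autoencoder with residual tokens.

    attribute_tokens: list of equal-length float lists (e.g. pitch, content,
        speaker embeddings), one per token.
    residual_tokens: list of float lists standing in for trainable residual
        tokens (in a real model these would be learned parameters).
    mask_ratio: fraction of attribute tokens to replace with a mask vector.
    """
    rng = random.Random(seed)
    n_attr = len(attribute_tokens)
    dim = len(attribute_tokens[0])

    # Pick which attribute tokens to hide from the encoder.
    n_mask = round(mask_ratio * n_attr)
    masked_positions = set(rng.sample(range(n_attr), n_mask))
    mask = [i in masked_positions for i in range(n_attr)]

    # Assumption: a zero vector stands in for a learned [MASK] embedding.
    mask_vector = [0.0] * dim
    masked_attr = [
        list(mask_vector) if is_masked else list(token)
        for token, is_masked in zip(attribute_tokens, mask)
    ]

    # Assumption: residual tokens are left unmasked and prepended to the sequence.
    sequence = [list(token) for token in residual_tokens] + masked_attr
    return sequence, mask


if __name__ == "__main__":
    attrs = [[float(i + 1)] * 4 for i in range(10)]  # 10 toy attribute tokens
    residuals = [[0.5] * 4 for _ in range(3)]        # 3 toy residual tokens
    seq, mask = build_input_sequence(attrs, residuals, mask_ratio=0.5)
    print(len(seq), sum(mask))
```

In this sketch the decoder would be trained to reconstruct the speech representation from the partially masked sequence, with the residual tokens free to absorb whatever the labeled attributes do not explain. How the paper actually trains and uses those tokens at inference is not stated in the text above.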

