SD · May 22

AffectCodec: Emotion-Preserving Neural Speech Codec with Block-Diagonal Residual FSQ

Zhaoyang Meng, Zhengyao Ma, Kecan Mao, Yingming Gao, Ya Li

arXiv:2605.23373 · 13.9

Predicted impact: top 52% in SD · last 90 days
Originality: Highly original

AI Analysis

This work addresses the problem of emotion loss in neural speech codecs for downstream speech language models, offering a principled solution for attribute-aware compression.

AffectCodec introduces an emotion-preserving neural speech codec using Block-Diagonal Residual Finite Scalar Quantization (BD-RFSQ) to structurally protect emotion-relevant information during quantization, achieving substantial improvements in emotion preservation at low bitrates while maintaining competitive acoustic quality and intelligibility.

Neural speech codecs have become the discrete interface between raw audio and speech language models, yet they remain optimized primarily for acoustic reconstruction fidelity, which leaves emotion-relevant cues vulnerable to being discarded during quantization, limiting the affective capacity of downstream models. We trace this degradation to two mechanisms: reconstruction-driven bit allocation under limited bitrate and cross-stream leakage in concatenation-based codecs, where acoustic gradients can overwrite nominally emotion-reserved dimensions. We propose AffectCodec, an emotion-preserving neural speech codec built on Block-Diagonal Residual Finite Scalar Quantization (BD-RFSQ). By imposing block-diagonal input and output projections over emotion and acoustic subspaces, BD-RFSQ transforms bit allocation from implicit and loss-driven to explicit and structurally guaranteed, while still preserving a flat token interface for downstream speech language models. AffectCodec further combines this structurally constrained quantizer with multi-granularity emotion conditioning and multi-rate training, enabling robust affect preservation at low bitrates. Experiments across multiple emotional speech benchmarks show that AffectCodec substantially improves emotion preservation, especially in the low-bitrate regime, while maintaining competitive acoustic quality and intelligibility. These results suggest that structurally protected quantization is an effective principle for preserving emotion-relevant information and may provide a general route toward attribute-aware neural speech compression.
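The block-diagonal idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`block_diag_projection`, `fsq`, `residual_fsq`), the dimensions, the tanh bounding, the per-stage rescaling, and the random weights are all assumptions chosen for illustration. It only shows why block-diagonal input and output projections keep the emotion sub-block of the quantized vector independent of the acoustic input, while the quantizer output remains a single flat vector.

```python
import numpy as np

def block_diag_projection(w_emo, w_ac):
    """Assemble a block-diagonal matrix so emotion and acoustic dims never mix."""
    e_out, e_in = w_emo.shape
    a_out, a_in = w_ac.shape
    w = np.zeros((e_out + a_out, e_in + a_in))
    w[:e_out, :e_in] = w_emo      # emotion block
    w[e_out:, e_in:] = w_ac       # acoustic block
    return w

def fsq(z, levels):
    """Finite scalar quantization: bound each dim with tanh, then round
    to `levels` uniformly spaced values in [-1, 1] (levels assumed odd)."""
    half = (levels - 1) / 2.0
    return np.round(np.tanh(z) * half) / half

def residual_fsq(z, levels, num_stages):
    """Residual FSQ (illustrative): each stage quantizes what earlier stages
    left over, on a finer grid. Returns the summed dequantized vector."""
    residual = np.asarray(z, dtype=float)
    total = np.zeros_like(residual)
    for stage in range(num_stages):
        scale = float(levels - 1) ** stage   # assumed rescaling rule
        q = fsq(residual * scale, levels) / scale
        total += q
        residual = residual - q
    return total

# Hypothetical sizes: 4 emotion dims and 12 acoustic dims in the latent,
# projected to 2 + 6 quantizer dims.
rng = np.random.default_rng(0)
d_emo, d_ac = 4, 12
w_in = block_diag_projection(rng.normal(size=(2, d_emo)),
                             rng.normal(size=(6, d_ac)))
w_out = block_diag_projection(rng.normal(size=(d_emo, 2)),
                              rng.normal(size=(d_ac, 6)))

def encode_decode(x):
    quantized = residual_fsq(w_in @ x, levels=5, num_stages=2)  # flat vector
    return quantized, w_out @ quantized

x = rng.normal(size=d_emo + d_ac)
x_perturbed = x.copy()
x_perturbed[d_emo:] += rng.normal(size=d_ac)   # change the acoustic part only

codes_a, recon_a = encode_decode(x)
codes_b, recon_b = encode_decode(x_perturbed)

# The emotion sub-block is untouched by the acoustic perturbation.
print(np.allclose(codes_a[:2], codes_b[:2]))
print(np.allclose(recon_a[:d_emo], recon_b[:d_emo]))
```

In this sketch the zero off-diagonal blocks are what make the separation structural rather than learned: no weight connects an acoustic input to an emotion quantizer dimension, so a training loss would have no path through which to reallocate those dimensions.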
