SD · May 11

AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling

Jiacheng Shi, Hongfei Du, Xinyuan Song, Y. Alicia Hong, Yanfu Zhang, Ye Gao

arXiv:2605.11098 · 64.2

Predicted impact top 37% in SD · last 90 days

Originality Incremental advance

AI Analysis

For speech modeling and generation tasks, this work addresses the degradation of emotional cues in neural speech codecs, which is a known bottleneck for expressive speech synthesis.

The authors propose an emotion-guided neural speech codec that explicitly preserves emotional information during quantization. In evaluations on speech reconstruction, emotion recognition, and downstream text-to-speech, it improves emotion consistency and perceptual quality without sacrificing content accuracy.

Neural speech codecs provide discrete representations for speech language models, but emotional cues are often degraded during quantization. Existing codecs mainly optimize acoustic reconstruction, leaving emotion expressiveness insufficiently modeled at the representation level. We propose an emotion-guided neural speech codec that explicitly preserves emotional information while maintaining semantic fidelity and prosodic naturalness. Our framework combines emotion-semantic guided latent modulation, relation-preserving emotional-semantic distillation, and emotion-weighted semantic alignment to retain emotionally salient cues under compression. Extensive evaluations across speech reconstruction, emotion recognition, and downstream text-to-speech generation demonstrate improved emotion consistency and perceptual quality without sacrificing content accuracy.
