SD · AS · Jun 3

Task-Vector Arithmetic for Emotional Expressivity Control in Language-Model-Based Text-to-Speech

arXiv:2606.05367 · 12.7
Predicted impact: top 52% in SD · last 90 days
Originality: Incremental advance
AI Analysis

For researchers and practitioners in expressive text-to-speech, this work offers a simple, training-free method to control emotional intensity in large-scale LM-TTS systems, potentially circumventing the reported incompatibility of centroid-arithmetic control with token-based architectures.

The paper identifies the x-vector as the dominant carrier of emotional prosody in LM-TTS and proposes a training-free centroid arithmetic method for emotional expressivity control. It reports average gains in emotion2vec cosine over the ICL baseline of +0.29 on English held-out speakers and +0.09 on Brazilian Portuguese held-out speakers, while largely preserving speaker identity and intelligibility.

We investigate whether task-vector arithmetic, successful for cross-speaker emotional intensity control in modular text-to-speech (TTS), transfers to large-scale TTS systems built on language-model backbones with in-context learning (LM-TTS). Through a systematic elimination study over four progressively narrower operands on Qwen3-TTS-12Hz-1.7B, namely model weights via LoRA fine-tuning, continuous codec embeddings, discrete codec tokens, and the speaker embedding (x-vector) produced by an ECAPA-TDNN encoder jointly trained with the synthesis backbone, we localize the dominant carrier of emotional prosody to the x-vector. Building on this finding, we propose a training-free method based on centroid arithmetic in x-vector space: an emotion direction $\tau = \mathbb{E}_i[x(s_i,\text{emo})] - \mathbb{E}_i[x(s_i,\text{neutral})]$ is applied to an unseen target speaker as $x_{\text{new}} = x(\text{target},\text{neutral}) + \alpha\cdot\tau$. Using ESD (English) as the $\tau$ source and emoUERJ (Brazilian Portuguese) as a cross-lingual ground-truth target, we observe average gains of $+0.29$ in emotion2vec cosine over the ICL baseline on English held-out speakers and $+0.09$ on Brazilian Portuguese held-out speakers, while largely preserving identity (WavLM SECS $\gtrsim 0.88$ for the multi-speaker $\tau$ variant) and intelligibility (WER $\approx 0$ in PT-BR). These results offer initial evidence that the reported incompatibility of centroid-arithmetic-style control with token-based TTS architectures may be circumvented when the arithmetic operates on the speaker embedding.
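
The centroid arithmetic described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names (`emotion_direction`, `apply_emotion`, `cosine`), the toy embedding dimension, and the random stand-in vectors are all assumptions made for illustration. In practice the x-vectors would come from the model's speaker encoder, and whether the shifted embedding is renormalized is not stated in the abstract, so no normalization is applied here.

```python
import numpy as np


def emotion_direction(emo_xvecs, neutral_xvecs):
    """Return tau = mean_i x(s_i, emo) - mean_i x(s_i, neutral).

    Both inputs are arrays of shape (n_source_speakers, dim), one
    x-vector per source speaker (hypothetical layout).
    """
    emo = np.asarray(emo_xvecs, dtype=np.float64)
    neu = np.asarray(neutral_xvecs, dtype=np.float64)
    return emo.mean(axis=0) - neu.mean(axis=0)


def apply_emotion(target_neutral_xvec, tau, alpha=1.0):
    """Return x_new = x(target, neutral) + alpha * tau."""
    target = np.asarray(target_neutral_xvec, dtype=np.float64)
    return target + alpha * tau


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


if __name__ == "__main__":
    # Toy stand-ins for real x-vectors; dim=8 is an arbitrary assumption.
    rng = np.random.default_rng(0)
    dim, n_speakers = 8, 4
    neutral = rng.normal(size=(n_speakers, dim))
    emotional = neutral + rng.normal(loc=0.5, scale=0.1, size=(n_speakers, dim))

    tau = emotion_direction(emotional, neutral)
    target_neutral = rng.normal(size=dim)  # unseen target speaker

    # Sweep alpha to scale emotional intensity.
    for alpha in (0.0, 0.5, 1.0):
        x_new = apply_emotion(target_neutral, tau, alpha)
        print(alpha, round(cosine(x_new, target_neutral), 3))
```

Setting `alpha = 0` returns the target's neutral x-vector unchanged, while larger values move it further along the emotion direction. The cosine printed against the neutral embedding is only a rough proxy for identity drift in this toy setting, not the WavLM SECS metric reported in the abstract.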
