SD · AS · Mar 17

CAST-TTS: A Simple Cross-Attention Framework for Unified Timbre Control in TTS

arXiv:2603.16280 · 45.7 · h-index: 18
AI Analysis

This paper addresses the challenge of cross-modal alignment for timbre control in TTS, offering a unified solution that is an incremental improvement over existing separate models.

The paper tackles the problem of unifying speech-prompted and text-prompted timbre control in Text-to-Speech systems by proposing CAST-TTS, a simple cross-attention framework that achieves performance comparable to specialized models.

Current Text-to-Speech (TTS) systems typically use separate models for speech-prompted and text-prompted timbre control. While unifying both control signals into a single model is desirable, the challenge of cross-modal alignment often results in overly complex architectures and training objectives. To address this challenge, we propose CAST-TTS, a simple yet effective framework for unified timbre control. Features are extracted from speech prompts and text prompts using pre-trained encoders. A multi-stage training strategy efficiently aligns the speech and projected text representations within a shared embedding space. A single cross-attention mechanism then allows the model to use either of these representations to control the timbre. Extensive experiments validate that the unified cross-attention mechanism is critical for achieving high-quality synthesis. CAST-TTS achieves performance comparable to specialized single-input models while operating within a unified architecture. The demo page can be accessed at https://HiRookie9.github.io/CAST-TTS-Page.
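
The idea of a single cross-attention layer that accepts either kind of prompt can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the dimensions, the linear projection `W_proj`, the residual connection, and the names `cross_attention` and `timbre_condition` are all assumptions made for illustration, and random arrays stand in for the outputs of the pre-trained encoders.

```python
import numpy as np

rng = np.random.default_rng(0)

D_MODEL = 16  # hypothetical width of the shared embedding space
D_TEXT = 24   # hypothetical width of the text-prompt encoder output

# Hypothetical projection that maps text-prompt features into the shared space.
W_proj = rng.standard_normal((D_TEXT, D_MODEL)) / np.sqrt(D_TEXT)

# One set of cross-attention weights, reused for both prompt types.
W_q = rng.standard_normal((D_MODEL, D_MODEL)) / np.sqrt(D_MODEL)
W_k = rng.standard_normal((D_MODEL, D_MODEL)) / np.sqrt(D_MODEL)
W_v = rng.standard_normal((D_MODEL, D_MODEL)) / np.sqrt(D_MODEL)


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(hidden, prompt):
    """Attend from synthesis states (T, D_MODEL) to prompt features (P, D_MODEL)."""
    q = hidden @ W_q
    k = prompt @ W_k
    v = prompt @ W_v
    weights = softmax(q @ k.T / np.sqrt(D_MODEL))  # shape (T, P)
    return weights @ v, weights


def timbre_condition(hidden, speech_feats=None, text_feats=None):
    """Condition on exactly one prompt type through the same attention layer."""
    if (speech_feats is None) == (text_feats is None):
        raise ValueError("provide exactly one of speech_feats or text_feats")
    # Speech features are assumed to live in the shared space already;
    # text features are projected into it first.
    prompt = speech_feats if speech_feats is not None else text_feats @ W_proj
    attended, weights = cross_attention(hidden, prompt)
    return hidden + attended, weights  # residual connection (an assumption)


# Random stand-ins for encoder outputs and synthesis hidden states.
hidden = rng.standard_normal((5, D_MODEL))
speech_feats = rng.standard_normal((7, D_MODEL))
text_feats = rng.standard_normal((3, D_TEXT))

out_speech, w_speech = timbre_condition(hidden, speech_feats=speech_feats)
out_text, w_text = timbre_condition(hidden, text_feats=text_feats)
print(out_speech.shape, w_speech.shape, w_text.shape)
```

In this sketch the only modality-specific component is the projection applied to text features; the query, key, and value weights are shared, which is the property the abstract attributes to the unified design. How closely the two prompt types agree in practice would depend on the alignment training, which is not modeled here.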

