Weibei Dou

SD
h-index: 6
Papers: 3
Citations: 31
Novelty: 38%
AI Score: 34

3 Papers

SD | Jul 11, 2025
FreeAudio: Training-Free Timing Planning for Controllable Long-Form Text-to-Audio Generation

Yuxuan Jiang, Zehua Chen, Zeqian Ju et al.

Text-to-audio (T2A) generation has achieved promising results with the recent advances in generative models. However, because of the limited quality and quantity of temporally-aligned audio-text pairs, existing T2A methods struggle to handle the complex text prompts that contain precise timing control, e.g., "owl hooted at 2.4s-5.2s". Recent works have explored data augmentation techniques or introduced timing conditions as model inputs to enable timing-conditioned 10-second T2A generation, while their synthesis quality is still limited. In this work, we propose a novel training-free timing-controlled T2A framework, FreeAudio, making the first attempt to enable timing-controlled long-form T2A generation, e.g., "owl hooted at 2.4s-5.2s and crickets chirping at 0s-24s". Specifically, we first employ an LLM to plan non-overlapping time windows and recaption each with a refined natural language description, based on the input text and timing prompts. Then we introduce: 1) Decoupling and Aggregating Attention Control for precise timing control; 2) Contextual Latent Composition for local smoothness and Reference Guidance for global consistency. Extensive experiments show that: 1) FreeAudio achieves state-of-the-art timing-conditioned T2A synthesis quality among training-free methods and is comparable to leading training-based methods; 2) FreeAudio demonstrates comparable long-form generation quality with training-based Stable Audio and paves the way for timing-controlled long-form T2A synthesis. Demo samples are available at: https://freeaudio.github.io/FreeAudio/
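The window-planning step described in the abstract can be illustrated with a small sketch. In FreeAudio this planning is done by an LLM, which also recaptions each window; the code below is not the authors' method. It swaps in a simple deterministic boundary sweep to show what "non-overlapping time windows" means for the abstract's own example prompt. The function name `plan_windows` and the `Event` tuple layout are hypothetical, chosen only for illustration.

```python
from typing import List, Tuple

# Hypothetical layout for a timed event: (description, start_s, end_s).
Event = Tuple[str, float, float]


def plan_windows(events: List[Event]) -> List[Tuple[float, float, List[str]]]:
    """Split possibly overlapping timed events into non-overlapping windows.

    This is a deterministic stand-in for the LLM planner described in the
    abstract; it does no recaptioning, it only lists the active events.
    """
    # Every event start or end is a potential window boundary.
    bounds = sorted({t for _, s, e in events for t in (s, e)})
    windows = []
    for lo, hi in zip(bounds, bounds[1:]):
        # An event is active in a window if it covers the window entirely.
        active = [d for d, s, e in events if s <= lo and e >= hi]
        if active:  # skip gaps in which no event is active
            windows.append((lo, hi, active))
    return windows


# The timing prompt used as an example in the abstract.
events = [("owl hooted", 2.4, 5.2), ("crickets chirping", 0.0, 24.0)]
windows = plan_windows(events)
for lo, hi, active in windows:
    print(f"{lo:g}s-{hi:g}s: {' and '.join(active)}")
```

Each resulting window carries the full set of events active in it, which is the information a recaptioning step would need to write one description per window.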

SD | Oct 10, 2025
ControlAudio: Tackling Text-Guided, Timing-Indicated and Intelligible Audio Generation via Progressive Diffusion Modeling

Yuxuan Jiang, Zehua Chen, Zeqian Ju et al.

Text-to-audio (TTA) generation with fine-grained control signals, e.g., precise timing control or intelligible speech content, has been explored in recent works. However, constrained by data scarcity, their generation performance at scale is still compromised. In this study, we recast controllable TTA generation as a multi-task learning problem and introduce a progressive diffusion modeling approach, ControlAudio. Our method adeptly fits distributions conditioned on more fine-grained information, including text, timing, and phoneme features, through a step-by-step strategy. First, we propose a data construction method spanning both annotation and simulation, augmenting condition information in the sequence of text, timing, and phoneme. Second, at the model training stage, we pretrain a diffusion transformer (DiT) on large-scale text-audio pairs, achieving scalable TTA generation, and then incrementally integrate the timing and phoneme features with unified semantic representations, expanding controllability. Finally, at the inference stage, we propose progressively guided generation, which sequentially emphasizes more fine-grained information, aligning inherently with the coarse-to-fine sampling nature of DiT. Extensive experiments show that ControlAudio achieves state-of-the-art performance in terms of temporal accuracy and speech clarity, significantly outperforming existing methods on both objective and subjective evaluations. Demo samples are available at: https://control-audio.github.io/Control-Audio.

SD | Aug 30, 2018
MES-P: an Emotional Tonal Speech Dataset in Mandarin Chinese with Distal and Proximal Labels

Zhongzhe Xiao, Ying Chen, Weibei Dou et al.

Emotion shapes all aspects of our interpersonal and intellectual experiences. Its automatic analysis therefore has many applications, e.g., human-machine interfaces. In this paper, we propose an emotional tonal speech dataset, namely Mandarin Chinese Emotional Speech Dataset - Portrayed (MES-P), with both distal and proximal labels. In contrast with state-of-the-art emotional speech datasets, which focus only on perceived emotions, the proposed MES-P dataset includes not only perceived emotions with their proximal labels but also intended emotions with distal labels. This makes it possible to study human emotional intelligence, i.e., people's ability to express emotions and their skill in understanding them, thus explicitly accounting for perception differences between intended and perceived emotions in speech signals and enabling studies of the emotional misunderstandings that often occur in real life. Furthermore, the proposed MES-P dataset also captures a main feature of tonal languages, i.e., tonal variations, and provides recorded emotional speech samples whose tonal variations match the tonal distribution in real-life Mandarin Chinese. The proposed MES-P dataset also features emotion intensity variations, and includes both moderate and intense versions of recordings for joy, anger, and sadness in addition to neutral speech. Ratings of the collected speech samples are made in valence-arousal (VA) space through continuous coordinate locations, resulting in an emotional distribution pattern in 2D VA space. The consistency between the speakers' emotional intentions and the listeners' perceptions is also studied using Cohen's Kappa coefficients. Finally, we also carry out extensive experiments using a baseline on MES-P for automatic emotion recognition and compare the results with human emotional intelligence.
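The agreement measure mentioned in the abstract, Cohen's kappa, compares observed agreement p_o between two label sequences with the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). The following is a minimal sketch of that standard formula, not the authors' evaluation code. The function name `cohens_kappa` is hypothetical, and the label lists are invented toy data, not samples from MES-P.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa between two equally long label sequences."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of items given the same label.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of the two marginal label frequencies.
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca.keys() | cb.keys()) / (n * n)
    if p_e == 1.0:
        # Degenerate case: both sequences use one identical label.
        # Returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


# Toy data (assumption): what speakers intended vs. what listeners perceived.
intended = ["joy", "joy", "anger", "sadness", "neutral", "anger"]
perceived = ["joy", "neutral", "anger", "sadness", "neutral", "joy"]
print(f"kappa = {cohens_kappa(intended, perceived):.3f}")
```

A value of 1 indicates perfect agreement between intended and perceived labels, and 0 indicates agreement no better than chance.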