CL AI HC SD AS | Nov 5, 2025

Step-Audio-EditX Technical Report

Chao Yan, Boyong Wu, Peng Yang, Pengfei Tan, Guoqiang Hu, Li Xie, Yuxin Zhang, Xiangyu Zhang, Fei Tian, Xuerui Yang, Xiangyu Zhang

arXiv:2511.03601v28.36 citations | h-index: 10 | Has Code

Originality: Highly original

AI Analysis

This work addresses the need of content creators and researchers for advanced audio editing tools, and its approach is novel rather than incremental.

The paper tackles expressive and iterative audio editing, including emotion and speaking style, by introducing Step-Audio-EditX, an open-source LLM-based audio model that outperforms MiniMax-2.6-hd and Doubao-Seed-TTS-2.0 on emotion editing tasks.

We present Step-Audio-EditX, the first open-source LLM-based audio model excelling at expressive and iterative audio editing encompassing emotion, speaking style, and paralinguistics alongside robust zero-shot text-to-speech (TTS) capabilities. Our core innovation lies in leveraging only large-margin synthetic data, which circumvents the need for embedding-based priors or auxiliary modules. This large-margin learning approach enables both iterative control and high expressivity across voices, and represents a fundamental pivot from the conventional focus on representation-level disentanglement. Evaluation results demonstrate that Step-Audio-EditX surpasses both MiniMax-2.6-hd and Doubao-Seed-TTS-2.0 in emotion editing and other fine-grained control tasks.
