CL AI SD | Sep 11, 2025

EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs

Yuhao Zhang, Yuhao Du, Zhanchen Dai, Xiangnan Ma, Kaiqi Kou, Benyou Wang, Haizhou Li

arXiv:2509.09174v1 | 3 citations | h-index: 4 | Has Code

Originality: Highly original

AI Analysis

This paper addresses the reduced capabilities of speech-based AI models in applications requiring knowledge and reasoning, and it proposes a novel method rather than an incremental improvement.

The paper tackles the degradation of knowledge and reasoning in speech-to-speech large language models (SLLMs) by proposing EchoX, a method that bridges the acoustic-semantic gap through semantic representations and dynamic speech target generation. With about six thousand hours of training data, EchoX achieves advanced performance on multiple knowledge-based question-answering benchmarks.

Speech-to-speech large language models (SLLMs) are attracting increasing attention. Derived from text-based large language models (LLMs), SLLMs often exhibit degradation in knowledge and reasoning capabilities. We hypothesize that this limitation arises because current training paradigms for SLLMs fail to bridge the acoustic-semantic gap in the feature representation space. To address this issue, we propose EchoX, which leverages semantic representations and dynamically generates speech training targets. This approach integrates both acoustic and semantic learning, enabling EchoX to preserve strong reasoning abilities as a speech LLM. Experimental results demonstrate that EchoX, with about six thousand hours of training data, achieves advanced performance on multiple knowledge-based question-answering benchmarks. The project is available at https://github.com/FreedomIntelligence/EchoX.
