CL · SD · AS · Feb 24, 2025

Baichuan-Audio: A Unified Framework for End-to-End Speech Interaction

arXiv:2502.17239v1 · 78 citations · h-index: 13 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses the need for unified speech interaction systems, offering a versatile tool for applications like conversational AI, but it appears incremental as it builds on existing pre-trained models and techniques.

The authors address the problem of integrating audio understanding and generation in a single model. They introduce Baichuan-Audio, an end-to-end audio large language model that supports real-time speech interaction with both comprehension and generation, and they report strong performance in spoken dialogue and question answering.

We introduce Baichuan-Audio, an end-to-end audio large language model that seamlessly integrates audio understanding and generation. It features a text-guided aligned speech generation mechanism, enabling real-time speech interaction with both comprehension and generation capabilities. Baichuan-Audio leverages a pre-trained ASR model, followed by multi-codebook discretization of speech at a frame rate of 12.5 Hz. This multi-codebook setup ensures that speech tokens retain both semantic and acoustic information. To further enhance modeling, an independent audio head is employed to process audio tokens, effectively capturing their unique characteristics. To mitigate the loss of intelligence during pre-training and preserve the original capabilities of the LLM, we propose a two-stage pre-training strategy that maintains language understanding while enhancing audio modeling. Following alignment, the model excels in real-time speech-based conversation and exhibits strong question-answering capabilities, demonstrating its versatility and efficiency. Our code, model and training data are available at https://github.com/baichuan-inc/Baichuan-Audio
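
To make the "multi-codebook discretization at 12.5 Hz" step more concrete, here is a minimal sketch in pure Python. It is not the authors' implementation. The abstract does not say which quantization scheme is used, so the sketch assumes residual vector quantization, one common way to build multi-codebook tokens. The codebook count, codebook size, feature dimension, and all function names are hypothetical, and random vectors stand in for the features a pre-trained ASR encoder would produce. Only the 12.5 Hz frame rate comes from the abstract.

```python
import math
import random

FRAME_RATE_HZ = 12.5   # from the abstract
NUM_CODEBOOKS = 4      # assumption: not stated in the abstract
CODEBOOK_SIZE = 16     # assumption: toy size
DIM = 3                # assumption: toy feature dimension


def frames_for(duration_s):
    """Number of token frames needed to cover duration_s seconds at 12.5 Hz."""
    return math.ceil(duration_s * FRAME_RATE_HZ)


def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def rvq_encode(vec, codebooks):
    """Residual quantization: each codebook encodes what the previous ones left over."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def rvq_decode(indices, codebooks):
    """Reconstruct a frame by summing the selected entry from every codebook."""
    out = [0.0] * DIM
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


rng = random.Random(0)
# Random stand-in codebooks; a real tokenizer would learn these from data.
codebooks = [
    [[rng.gauss(0, 1.0 / 2 ** level) for _ in range(DIM)] for _ in range(CODEBOOK_SIZE)]
    for level in range(NUM_CODEBOOKS)
]

n_frames = frames_for(10.0)  # 10 s of audio at 12.5 Hz
frames = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(n_frames)]
tokens = [rvq_encode(frame, codebooks) for frame in frames]
tokens_per_second = FRAME_RATE_HZ * NUM_CODEBOOKS
reconstructed = rvq_decode(tokens[0], codebooks)
```

Under these assumptions, 10 seconds of audio yields 125 frames, and each frame carries one index per codebook, so the token rate scales linearly with the number of codebooks. The low frame rate keeps sequences short for the language model, while the extra codebooks are what would let tokens carry acoustic detail beyond the first, coarser code.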

Code Implementations: 1 repo
