AS · AI · CL · LG · MM · SD · Apr 25, 2025

Kimi-Audio Technical Report

arXiv:2504.18425v1 · 190 citations · h-index: 19 · Has Code
Originality: Incremental advance
AI Analysis

This work provides an open-source model for audio understanding, generation, and conversation, addressing needs in audio AI applications, though it appears incremental as it builds on existing LLM and audio tokenizer techniques.

The authors tackled the challenge of creating a versatile audio foundation model by developing Kimi-Audio, which achieves state-of-the-art performance on multiple audio benchmarks including speech recognition and audio question answering.

We present Kimi-Audio, an open-source audio foundation model that excels in audio understanding, generation, and conversation. We detail the practices in building Kimi-Audio, including model architecture, data curation, training recipe, inference deployment, and evaluation. Specifically, we leverage a 12.5Hz audio tokenizer, design a novel LLM-based architecture with continuous features as input and discrete tokens as output, and develop a chunk-wise streaming detokenizer based on flow matching. We curate a pre-training dataset that consists of more than 13 million hours of audio data covering a wide range of modalities including speech, sound, and music, and build a pipeline to construct high-quality and diverse post-training data. Initialized from a pre-trained LLM, Kimi-Audio is continually pre-trained on both audio and text data with several carefully designed tasks, and then fine-tuned to support a diverse range of audio-related tasks. Extensive evaluation shows that Kimi-Audio achieves state-of-the-art performance on a range of audio benchmarks including speech recognition, audio understanding, audio question answering, and speech conversation. We release the code, model checkpoints, and evaluation toolkits at https://github.com/MoonshotAI/Kimi-Audio.
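To give a sense of scale for the 12.5Hz tokenizer and the 13-million-hour corpus mentioned in the abstract, the arithmetic can be sketched as below. This is only a back-of-the-envelope illustration: it assumes one discrete token per tokenizer frame, and the function names are hypothetical, not taken from the Kimi-Audio codebase. Since the abstract says "more than 13 million hours", the corpus figure is a lower bound.

```python
# Back-of-the-envelope token arithmetic for a 12.5 Hz audio tokenizer.
# Assumption (not stated in the abstract): one discrete token per frame.

TOKEN_RATE_HZ = 12.5          # tokenizer frame rate, from the abstract
PRETRAIN_HOURS = 13_000_000   # lower bound on corpus size, from the abstract


def frame_duration_ms(rate_hz: float = TOKEN_RATE_HZ) -> float:
    """Audio duration covered by a single token, in milliseconds."""
    return 1000.0 / rate_hz


def tokens_for_duration(seconds: float, rate_hz: float = TOKEN_RATE_HZ) -> int:
    """Number of tokens needed to cover `seconds` of audio."""
    return int(seconds * rate_hz)


# Each token spans 80 ms of audio; a 10-second clip becomes 125 tokens.
per_token_ms = frame_duration_ms()
clip_tokens = tokens_for_duration(10)

# Lower bound on audio tokens in the pre-training corpus.
corpus_tokens = tokens_for_duration(PRETRAIN_HOURS * 3600)

print(f"{per_token_ms} ms per token")
print(f"{clip_tokens} tokens for a 10 s clip")
print(f"{corpus_tokens:.3e} tokens for {PRETRAIN_HOURS:,} hours")
```

Under these assumptions the corpus corresponds to at least roughly 5.85 × 10^11 audio tokens, which is on the same order as the text corpora used to pre-train LLMs.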

