SD AI AS Jan 26

VIBEVOICE-ASR Technical Report

Zhiliang Peng, Jianwei Yu, Yaoyao Chang, Zilong Wang, Li Dong, Yingbo Hao, Yujie Tu, Chenyu Yang, Wenhui Wang, Songchen Xu, Yutao Sun, Hangbo Bao

arXiv:2601.18184v1 | 11.912 citations | h-index: 23

Originality: Highly original

AI Analysis

VibeVoice-ASR addresses challenges in speech recognition for long-form audio such as meetings and podcasts, offering a more integrated solution than traditional pipelined methods.

The paper tackles context fragmentation and multi-speaker complexity in long-form audio by introducing VibeVoice-ASR, a general-purpose speech understanding framework that unifies ASR, speaker diarization, and timestamping into a single end-to-end task. It supports single-pass processing for up to 60 minutes of audio and over 50 languages, including code-switching.

This report presents VibeVoice-ASR, a general-purpose speech understanding framework built upon VibeVoice, designed to address the persistent challenges of context fragmentation and multi-speaker complexity in long-form audio (e.g., meetings, podcasts) that remain despite recent advancements in short-form speech recognition. Unlike traditional pipelined approaches that rely on audio chunking, VibeVoice-ASR supports single-pass processing for up to 60 minutes of audio. It unifies Automatic Speech Recognition, Speaker Diarization, and Timestamping into a single end-to-end generation task. In addition, VibeVoice-ASR supports over 50 languages, requires no explicit language setting, and natively handles code-switching within and across utterances. Furthermore, we introduce a prompt-based context injection mechanism that allows users to supply customized context, significantly improving accuracy on domain-specific terminology and polyphonic character disambiguation.
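To make the idea of unifying recognition, diarization, and timestamping more concrete, here is a minimal Python sketch of what a single combined result might look like to downstream code. It is only an illustration: the `Segment` record, its field names, the `render` and `speaking_time` helpers, and the sample data are all hypothetical, and none of them are taken from the report or from the VibeVoice-ASR output format.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical unified record: who spoke, when, and what was said."""
    speaker: str   # diarization label (assumed naming)
    start: float   # start time in seconds
    end: float     # end time in seconds
    text: str      # recognized words


def speaking_time(segments):
    """Sum the seconds spoken per speaker label."""
    totals = {}
    for seg in segments:
        totals[seg.speaker] = totals.get(seg.speaker, 0.0) + (seg.end - seg.start)
    return totals


def render(segments):
    """Format the segments as a readable, timestamped transcript."""
    return "\n".join(
        f"[{seg.start:.1f}-{seg.end:.1f}] {seg.speaker}: {seg.text}"
        for seg in segments
    )


# Made-up sample data, for illustration only.
segments = [
    Segment("Speaker 1", 0.0, 4.0, "Welcome to the meeting."),
    Segment("Speaker 2", 4.0, 6.5, "Thanks, let's begin."),
    Segment("Speaker 1", 6.5, 10.0, "First item is the budget."),
]

print(render(segments))
print(speaking_time(segments))
```

Because each record already carries the speaker, the time span, and the text together, tasks such as per-speaker talk time need no alignment step between separate ASR and diarization outputs, which is the kind of convenience a unified output would offer over a pipelined one.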
