CL · Aug 6, 2025

RealTalk-CN: A Realistic Chinese Speech-Text Dialogue Benchmark With Cross-Modal Interaction Analysis

arXiv:2508.10015v1 · h-index: 10
Originality: Synthesis-oriented
AI Analysis

This paper addresses a gap in Chinese speech-based task-oriented dialogue research by providing a benchmark for evaluating models on real-world complexities such as disfluencies and speaker variation. The contribution is incremental, however: it extends existing dataset efforts to a new language and modality.

The paper tackles the lack of realistic Chinese speech-text dialogue datasets for evaluating speech-based large language models. It introduces RealTalk-CN, a multi-turn, multi-domain dataset with 5.4k dialogues and 150 hours of paired speech-text annotations, and uses it to evaluate robustness and cross-domain performance.

In recent years, large language models (LLMs) have achieved remarkable advancements in multimodal processing, including end-to-end speech-based language models that enable natural interactions and perform specific tasks in task-oriented dialogue (TOD) systems. However, existing TOD datasets are predominantly text-based and lack the real speech signals that are essential for evaluating the robustness of speech-based LLMs. Moreover, existing speech TOD datasets are primarily in English and lack critical aspects such as speech disfluencies and speaker variations. To address these gaps, we introduce RealTalk-CN, the first Chinese multi-turn, multi-domain speech-text dual-modal TOD dataset, comprising 5.4k dialogues (60k utterances, 150 hours) with paired speech-text annotations. RealTalk-CN captures diverse dialogue scenarios with annotated spontaneous speech disfluencies, ensuring comprehensive coverage of real-world complexities in speech dialogue. In addition, we propose a novel cross-modal chat task that authentically simulates real-world user interactions, allowing dynamic switching between speech and text modalities. Our evaluation covers robustness to speech disfluencies, sensitivity to speaker characteristics, and cross-domain performance. Extensive experiments validate the effectiveness of RealTalk-CN, establishing a strong foundation for research on Chinese speech-based LLMs.

