CL · Oct 27, 2025

A Cocktail-Party Benchmark: Multi-Modal dataset and Comparative Evaluation Results

Thai-Binh Nguyen, Katerina Zmolikova, Pingchuan Ma, Ngoc Quan Pham, Christian Fuegen, Alexander Waibel

arXiv:2510.23276v1 · 3 citations · h-index: 21

Originality: Synthesis-oriented

AI Analysis

This work addresses the challenge of multi-party speech recognition in noisy environments, with applications such as transcription and social interaction analysis. It is incremental, however, because it builds on existing multi-modal benchmarks.

The paper tackles the cocktail-party problem of overlapping conversations by introducing a multi-modal dataset and task for recognizing who speaks what and when, showing that incorporating visual cues reduces word error rates by 50% compared to audio-only baselines.

We introduce the task of Multi-Modal Context-Aware Recognition (MCoRec) in the ninth CHiME Challenge, which addresses the cocktail-party problem of overlapping conversations in a single-room setting using audio, visual, and contextual cues. MCoRec captures natural multi-party conversations where the recordings focus on unscripted, casual group chats, leading to extreme speech overlap of up to 100% and highly fragmented conversational turns. The task requires systems to answer the question "Who speaks when, what, and with whom?" by jointly transcribing each speaker's speech and clustering them into their respective conversations from audio-visual recordings. Audio-only baselines exceed 100% word error rate, whereas incorporating visual cues yields substantial 50% improvements, highlighting the importance of multi-modality. In this manuscript, we present the motivation behind the task, outline the data collection process, and report the baseline systems developed for the MCoRec.
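A word error rate above 100% may look impossible at first. It follows from the standard definition, WER = (substitutions + deletions + insertions) / number of reference words: the count is normalized by the reference length, so a hypothesis containing many extra words can push the ratio past 1. One plausible way this happens with overlapping speech is that words from a neighboring conversation are attributed to the target speaker and counted as insertions. The sketch below is a minimal illustration of that arithmetic, not the challenge's scoring tool; the function name `word_error_rate` and the example sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed prefix of ref
    # and hyp[:j]; it starts as the cost of inserting j words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: the target speaker says three words, but the
# hypothesis also contains five words picked up from another conversation.
reference = "see you tomorrow"
hypothesis = "see you tomorrow at the new cafe okay"
wer = word_error_rate(reference, hypothesis)
print(f"WER = {wer:.1%}")
```

Here the hypothesis gets every reference word right and still scores above 100%, because the five inserted words are divided by a reference length of three.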
