AI · Mar 17

SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models

arXiv:2603.16859 · 92.51 citations · h-index: 15
AI Analysis

SocialOmni addresses a gap in benchmarking AI systems that must navigate dynamic social cues in natural dialogue, though the contribution is incremental because it builds on existing OLM frameworks.

The paper proposes SocialOmni, a benchmark for evaluating social interactivity in omni-modal large language models along three dimensions: speaker identification, interruption timing, and interruption generation. Evaluating 12 models reveals significant variance among them and a decoupling between perceptual accuracy and interaction quality.

Omni-modal large language models (OLMs) redefine human-machine interaction by natively integrating audio, vision, and text. However, existing OLM benchmarks remain anchored to static, accuracy-centric tasks, leaving a critical gap in assessing social interactivity, the fundamental capacity to navigate dynamic cues in natural dialogues. To this end, we propose SocialOmni, a comprehensive benchmark that operationalizes the evaluation of this conversational interactivity across three core dimensions: (i) speaker separation and identification (who is speaking), (ii) interruption timing control (when to interject), and (iii) natural interruption generation (how to phrase the interruption). SocialOmni features 2,000 perception samples and a quality-controlled diagnostic set of 209 interaction-generation instances with strict temporal and contextual constraints, complemented by controlled audio-visual inconsistency scenarios to test model robustness. We benchmark 12 leading OLMs and uncover significant variance in their social-interaction capabilities. Furthermore, our analysis reveals a pronounced decoupling between a model's perceptual accuracy and its ability to generate contextually appropriate interruptions, indicating that understanding-centric metrics alone are insufficient to characterize conversational social competence. More encouragingly, these diagnostics from SocialOmni yield actionable signals for bridging the perception-interaction divide in future OLMs.
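One way to make the "decoupling" claim concrete is to check how well a ranking of models by perception accuracy predicts their ranking by interaction quality. The sketch below computes a Spearman rank correlation in plain Python. It is only an illustration: the abstract does not say how the authors measured decoupling, the function names are hypothetical, and the scores in the usage example are made-up placeholders, not results from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / (var_x * var_y) ** 0.5


# Placeholder per-model scores (NOT from the paper), one entry per model.
perception_accuracy = [0.90, 0.80, 0.70, 0.60]
interaction_quality = [0.20, 0.50, 0.30, 0.60]
rho = spearman(perception_accuracy, interaction_quality)
print(f"Spearman rho = {rho:.2f}")
```

A correlation near 1 would mean perception scores already predict interaction quality; a value near zero or below would be consistent with the decoupling the abstract describes.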
