AS | CL | HC | LG | MM | Jun 1, 2025

Multimodal Fusion with Semi-Supervised Learning Minimizes Annotation Quantity for Modeling Videoconference Conversation Experience

arXiv:2506.13971v1 | h-index: 6 | INTERSPEECH
Originality: Incremental advance
AI Analysis

This paper provides an annotation-efficient framework for modeling videoconference experience. It addresses a domain-specific problem, and its contribution is an incremental improvement in data efficiency.

The paper tackled the problem of predicting negative experiences in videoconference conversations by applying semi-supervised learning with multimodal fusion, achieving an ROC-AUC of 0.9 and matching 96% of supervised learning performance with only 8% labeled data.

Group conversations over videoconferencing are a complex social behavior. However, the subjective moments of negative experience, where the conversation loses fluidity or enjoyment, remain understudied. These moments are infrequent in naturalistic data, and thus training a supervised learning (SL) model requires costly manual data annotation. We applied semi-supervised learning (SSL) to leverage targeted labeled and unlabeled clips for training multimodal (audio, facial, text) deep features to predict non-fluid or unenjoyable moments in holdout videoconference sessions. The modality-fused co-training SSL achieved an ROC-AUC of 0.9 and an F1 score of 0.6, outperforming SL models by up to 4% with the same amount of labeled data. Remarkably, the best SSL model with just 8% labeled data matched 96% of the SL model's full-data performance. This demonstrates an annotation-efficient framework for modeling videoconference experience.
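
The abstract names co-training with modality fusion but does not spell out the procedure. The sketch below illustrates the general co-training idea under stated assumptions, and is not the authors' implementation: each modality ("view") trains its own classifier on the labeled clips, pseudo-labels the unlabeled clips it is most confident about, and the per-view scores are averaged at prediction time. The nearest-centroid classifier, the synthetic two-view data, and all names (`CentroidClassifier`, `co_train`, `fused_score`, `demo`) are hypothetical stand-ins chosen to keep the example self-contained.

```python
import random

def centroid(rows):
    """Mean vector of a list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

class CentroidClassifier:
    """Toy stand-in for a per-modality model (an assumption, not the paper's model)."""

    def fit(self, X, y):
        self.pos = centroid([x for x, t in zip(X, y) if t == 1])
        self.neg = centroid([x for x, t in zip(X, y) if t == 0])
        return self

    def score(self, x):
        # Relative closeness to the positive centroid, in [0, 1].
        d_pos, d_neg = dist(x, self.pos), dist(x, self.neg)
        return d_neg / (d_pos + d_neg + 1e-12)

def fit_views(views, labels):
    """Fit one classifier per view on the clips that currently have a label."""
    idx = [i for i, t in enumerate(labels) if t is not None]
    y = [labels[i] for i in idx]
    return [CentroidClassifier().fit([v[i] for i in idx], y) for v in views]

def co_train(views, labels, rounds=5, per_round=2):
    """Generic co-training loop.

    views  : list of per-modality feature lists, aligned by clip index
    labels : 0/1 for annotated clips, None for unlabeled clips
             (must contain at least one clip of each class)
    """
    labels = list(labels)
    for _ in range(rounds):
        models = fit_views(views, labels)
        for model, view in zip(models, views):
            pool = [i for i, t in enumerate(labels) if t is None]
            # Confidence = distance of the score from the 0.5 decision boundary.
            pool.sort(key=lambda i: abs(model.score(view[i]) - 0.5), reverse=True)
            for i in pool[:per_round]:
                labels[i] = int(model.score(view[i]) >= 0.5)
    return fit_views(views, labels), labels

def fused_score(models, clip_views):
    """Late fusion: average the per-view scores for one clip."""
    return sum(m.score(x) for m, x in zip(models, clip_views)) / len(models)

def demo(seed=0, n_train=40, n_test=20, n_labeled=4):
    """Run co-training on synthetic data; return held-out accuracy of the fused model."""
    rng = random.Random(seed)
    total = n_train + n_test
    truth = [i % 2 for i in range(total)]  # alternate negative/positive clips
    # Two synthetic "modalities": class means at -2 and +2, noise sd 0.5.
    views = [[[rng.gauss(2 if t else -2, 0.5) for _ in range(2)] for t in truth]
             for _ in range(2)]
    train_views = [v[:n_train] for v in views]
    labels = truth[:n_labeled] + [None] * (n_train - n_labeled)
    models, _ = co_train(train_views, labels)
    hits = 0
    for i in range(n_train, total):
        pred = int(fused_score(models, [v[i] for v in views]) >= 0.5)
        hits += pred == truth[i]
    return hits / n_test
```

Calling `demo()` trains on 40 synthetic clips, of which only 4 carry labels, and reports accuracy on 20 held-out clips. Real data would need stronger per-modality models, a confidence threshold rather than a fixed `per_round` count, and metrics such as ROC-AUC and F1 that account for the rarity of negative moments.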

