AS CL HC LG MM | Jun 1, 2025

Multimodal Fusion with Semi-Supervised Learning Minimizes Annotation Quantity for Modeling Videoconference Conversation Experience

Andrew Chang, Chenkai Hu, Ji Qi, Zhuojian Wei, Kexin Zhang, Viswadruth Akkaraju, David Poeppel, Dustin Freeman

arXiv:2506.13971v11.2 | h-index: 6 | INTERSPEECH

Originality: Incremental advance

AI Analysis

This paper provides an annotation-efficient framework for modeling videoconference experience. It addresses a domain-specific problem and offers an incremental improvement in data efficiency.

The paper tackles the problem of predicting negative experiences in videoconference conversations by applying semi-supervised learning with multimodal fusion. It achieves an ROC-AUC of 0.9 and, with only 8% of the data labeled, matches 96% of the supervised model's full-data performance.

Group conversations over videoconferencing are a complex social behavior. However, the subjective moments of negative experience, where the conversation loses fluidity or enjoyment, remain understudied. These moments are infrequent in naturalistic data, and thus training a supervised learning (SL) model requires costly manual data annotation. We applied semi-supervised learning (SSL) to leverage targeted labeled and unlabeled clips for training multimodal (audio, facial, text) deep features to predict non-fluid or unenjoyable moments in holdout videoconference sessions. The modality-fused co-training SSL achieved an ROC-AUC of 0.9 and an F1 score of 0.6, outperforming SL models by up to 4% with the same amount of labeled data. Remarkably, the best SSL model with just 8% labeled data matched 96% of the SL model's full-data performance. This shows an annotation-efficient framework for modeling videoconference experience.
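The abstract does not spell out how the co-training step works, so the following is only a toy sketch of generic two-view co-training with late fusion, not the authors' implementation. It assumes synthetic two-dimensional features standing in for two modalities, a nearest-centroid classifier per view, and a fixed number of pseudo-labels added per round. Every name in it (make_data, co_train, predict_fused, the 16-clip seed set) is hypothetical.

```python
import math
import random

def fit_centroids(X, y):
    """Nearest-centroid 'model': one mean vector per class label."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def prob_positive(centroids, x):
    """Pseudo-probability of class 1: higher when x is closer to centroid 1."""
    d0 = math.dist(x, centroids[0])
    d1 = math.dist(x, centroids[1])
    return 1.0 / (1.0 + math.exp(d1 - d0))

def co_train(views, seed_labels, rounds=5, k=2):
    """Each round, every view pseudo-labels its k lowest- and k highest-scoring
    unlabeled items; they join a pool shared by all views."""
    pool = dict(seed_labels)  # index -> true or pseudo label
    n = len(views[0])
    for _ in range(rounds):
        unlabeled = [i for i in range(n) if i not in pool]
        if not unlabeled:
            break
        idx = list(pool)
        targets = [pool[i] for i in idx]
        picked = {}
        for V in views:
            model = fit_centroids([V[i] for i in idx], targets)
            probs = {i: prob_positive(model, V[i]) for i in unlabeled}
            ranked = sorted(unlabeled, key=probs.get)
            for i in ranked[:k] + ranked[-k:]:
                picked.setdefault(i, int(probs[i] >= 0.5))
        pool.update(picked)
    idx = list(pool)
    targets = [pool[i] for i in idx]
    return [fit_centroids([V[i] for i in idx], targets) for V in views]

def predict_fused(models, views, i):
    """Late fusion: average the per-view scores, then threshold at 0.5."""
    p = sum(prob_positive(m, V[i]) for m, V in zip(models, views)) / len(models)
    return int(p >= 0.5)

def make_data(n=200, seed=0):
    """Synthetic stand-in for two modalities; each view carries part of the signal."""
    rng = random.Random(seed)
    y = [rng.randint(0, 1) for _ in range(n)]
    view_a = [[rng.gauss(2.0 * t, 1.0), rng.gauss(0.0, 1.0)] for t in y]
    view_b = [[rng.gauss(0.0, 1.0), rng.gauss(-2.0 * t, 1.0)] for t in y]
    return [view_a, view_b], y

views, y = make_data()
# Assumed seed set: 8 labeled items per class (16 of 200 items).
seed = [i for i in range(len(y)) if y[i] == 0][:8]
seed += [i for i in range(len(y)) if y[i] == 1][:8]
models = co_train(views, {i: y[i] for i in seed})

# Score on items that never received a true label.
held_out = [i for i in range(len(y)) if i not in seed]
accuracy = sum(predict_fused(models, views, i) == y[i] for i in held_out) / len(held_out)
print(f"fused accuracy on items without a seed label: {accuracy:.2f}")
```

The shared pseudo-label pool is what lets one view's confident predictions become training data for the other. A real system would presumably replace the centroid models with deep feature encoders and evaluate on separate holdout sessions with ROC-AUC and F1, as the abstract reports.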
