CV · Nov 18, 2025

Vision Large Language Models Are Good Noise Handlers in Engagement Analysis

arXiv:2511.14749v1 · 2 citations
Originality: Incremental advance
AI Analysis

This paper addresses noisy labels in engagement analysis for video datasets, offering an incremental improvement over existing methods.

The paper tackles engagement recognition in video datasets, where subjective and noisy labels limit model performance, by proposing a framework that uses Vision Large Language Models (VLMs) to refine annotations and guide training. The method improves on prior state-of-the-art results, with a maximum gain of +1.21% on EngageNet (ahead in three of six feature settings) and F1 gains of +0.22 on DREAMS and +0.06 on PAFE.

Engagement recognition in video datasets, unlike traditional image classification tasks, is particularly challenged by subjective labels and noise limiting model performance. To overcome the challenges of subjective and noisy engagement labels, we propose a framework leveraging Vision Large Language Models (VLMs) to refine annotations and guide the training process. Our framework uses a questionnaire to extract behavioral cues and split data into high- and low-reliability subsets. We also introduce a training strategy combining curriculum learning with soft label refinement, gradually incorporating ambiguous samples while adjusting supervision to reflect uncertainty. We demonstrate that classical computer vision models trained on refined high-reliability subsets and enhanced with our curriculum strategy show improvements, highlighting benefits of addressing label subjectivity with VLMs. This method surpasses prior state of the art across engagement benchmarks such as EngageNet (three of six feature settings, maximum improvement of +1.21%), and DREAMS / PAFE with F1 gains of +0.22 / +0.06.
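
The abstract does not spell out how the curriculum and the soft labels are computed, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes each sample already carries a hypothetical `reliability` score in [0, 1] (for example, derived from how well the VLM questionnaire answers agree with the human label). The function names, the linear threshold schedule, and the uniform fallback distribution are all illustrative assumptions.

```python
def soften_label(hard_label, num_classes, reliability, fallback=None):
    """Blend a one-hot label with a fallback distribution.

    reliability = 1.0 keeps the hard label; lower values move mass
    toward the fallback (uniform by default; assumed, not from the paper).
    """
    one_hot = [1.0 if k == hard_label else 0.0 for k in range(num_classes)]
    if fallback is None:
        fallback = [1.0 / num_classes] * num_classes
    return [reliability * o + (1.0 - reliability) * f
            for o, f in zip(one_hot, fallback)]


def curriculum_subset(samples, epoch, total_epochs, start_threshold=0.8):
    """Return the samples admitted at a given epoch.

    The reliability threshold decays linearly from start_threshold to 0,
    so training starts on high-reliability samples and gradually
    admits the ambiguous ones (hypothetical schedule).
    """
    progress = epoch / max(total_epochs - 1, 1)
    threshold = start_threshold * (1.0 - progress)
    return [s for s in samples if s["reliability"] >= threshold]


# Toy data: three clips with hypothetical reliability scores.
samples = [
    {"id": 0, "label": 2, "reliability": 0.9},
    {"id": 1, "label": 0, "reliability": 0.5},
    {"id": 2, "label": 1, "reliability": 0.1},
]

for epoch in range(3):
    admitted = curriculum_subset(samples, epoch, total_epochs=3)
    targets = [soften_label(s["label"], 4, s["reliability"]) for s in admitted]
    print(epoch, [s["id"] for s in admitted], targets)
```

In a real training loop the soft targets would feed a cross-entropy loss against the model's predicted distribution, and the fallback could be the model's own prediction instead of a uniform distribution; both choices are left open here.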

