CV, Jun 8

Self-supervised Learning Matters: A Simple Ensemble Solution for Micro-Gesture Recognition

Tingyi Liu, Kun Li, Fei Wang, Junjie Chen, Zhiliang Wu, Jihao Gu, Haixu Liu, Dan Guo

arXiv:2606.09261v1 10.3

Originality: Incremental advance

AI Analysis

For researchers in micro-gesture recognition, this work demonstrates the benefit of self-supervised learning on unlabeled in-domain videos, but the gain over the previous state of the art is incremental (about 1.2 percentage points in top-1 accuracy) and domain-specific.

The authors propose a multimodal ensemble framework integrating a self-supervised RGB model with supervised multi-stream models for micro-gesture recognition, achieving 74.419% top-1 accuracy on iMiGUE, surpassing the previous state of the art by 1.206 percentage points and ranking first in the MiGA Challenge.

In this paper, we present XInsight Lab's solution to the micro-gesture classification track of the 4th MiGA Challenge at IJCAI 2026, in which our solution ranked first and achieved a new state-of-the-art result. We propose a multimodal ensemble framework that integrates a self-supervised RGB-based model with supervised multi-stream models from previous solutions. The self-supervised RGB model is pretrained on 120K unlabeled clips via masked video modeling and then fine-tuned on iMiGUE. This simple yet effective RGB baseline achieves 69.224% top-1 accuracy on the iMiGUE test set, demonstrating the benefit of learning transferable representations from unlabeled in-domain videos. By incorporating this model as a complementary branch, the final ensemble reaches 74.419% top-1 accuracy, surpassing the previous state of the art by 1.206 percentage points. Experimental results on iMiGUE, including ablation studies on the ensemble strategy, validate the effectiveness of self-supervised RGB representation learning for micro-gesture recognition.
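The abstract does not say how the branch predictions are combined. As a rough illustration only, the sketch below assumes a weighted late fusion of per-branch class probabilities, which is one common way to ensemble an RGB model with other streams. The function names, the branch names, the weights, and the toy logits are all hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scores(branch_logits, weights):
    """Weighted average of per-branch class probabilities (late fusion).

    branch_logits: one list of class logits per branch, all the same length.
    weights: one non-negative weight per branch (normalized internally).
    """
    if len(branch_logits) != len(weights):
        raise ValueError("need exactly one weight per branch")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    num_classes = len(branch_logits[0])
    fused = [0.0] * num_classes
    for logits, weight in zip(branch_logits, weights):
        if len(logits) != num_classes:
            raise ValueError("all branches must score the same classes")
        for cls, prob in enumerate(softmax(logits)):
            fused[cls] += (weight / total_weight) * prob
    return fused


def predict(branch_logits, weights):
    """Return the index of the class with the highest fused probability."""
    fused = fuse_scores(branch_logits, weights)
    return max(range(len(fused)), key=fused.__getitem__)


# Toy logits for a single clip over three classes (made-up numbers).
rgb_ssl_logits = [2.0, 0.5, 0.1]    # hypothetical self-supervised RGB branch
skeleton_logits = [0.2, 1.0, 0.3]   # hypothetical supervised stream
branches = [rgb_ssl_logits, skeleton_logits]

# Hypothetical weights; in practice they would be tuned on a validation split.
print(predict(branches, weights=[0.6, 0.4]))
```

Under this assumption, adding the self-supervised RGB model as a complementary branch amounts to appending one more entry to `branches` and one more weight, which is why such an ensemble can be extended without retraining the existing streams.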
