CV · AI · LG · Jun 1

Understanding-Enhanced Model Collaboration for Long-Tailed Egocentric Mistake Detection

arXiv:2606.02120 · 81.3
Predicted impact: top 26% in CV · last 90 days
Originality: Incremental advance
AI Analysis

For egocentric video analysis, this work addresses the practical need to detect rare and subtle user mistakes in real-time applications.

The paper tackles long-tailed egocentric mistake detection in instructional videos. It proposes UE-MCM, which fuses a small CLIP-based model that checks coarse-grained workflow consistency with a large Qwen3-VL model that checks fine-grained action correctness, with the aim of balancing speed and accuracy.

In this report, we address the problem of determining whether a user performs an action incorrectly from egocentric video data. To this end, we propose an Understanding-Enhanced Model Collaboration Method (UE-MCM) that combines efficient coarse-grained video understanding with accurate fine-grained action reasoning. Specifically, UE-MCM contains a small model branch and a large model branch. The large model branch focuses on whether the fine-grained action itself is executed incorrectly, while the small model branch jointly takes the coarse-grained video and fine-grained segment as input to identify actions that may be locally correct but inconsistent with the overall workflow. The small model branch is built on a CLIP4CLIP video encoder initialized from a CLIP model enhanced by Diffusion Contrastive Reconstruction, and the large model branch uses the Qwen3-VL Embedding model to extract high-capacity representations from fine-grained action segments. The small-branch prediction and the large-branch prediction are then adaptively fused by a lightweight collaboration gate. To handle the long-tailed distribution of mistake instances, we optimize the classifiers with complementary objectives, including reweighted cross-entropy, AUC-oriented learning, and label-aware adjustment. The resulting system balances speed and accuracy, making it effective for detecting subtle, rare, and ambiguous mistakes in egocentric instructional videos.
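The abstract does not give the form of the collaboration gate or the exact loss definitions, so the following is only a minimal sketch under stated assumptions: the gate is taken to be a single sigmoid-weighted blend of the two branch probabilities, the reweighted cross-entropy uses inverse-frequency class weights, and the label-aware adjustment is modelled as a log-prior shift of the logit. All function names and the `tau` parameter are hypothetical, and the AUC-oriented objective is omitted.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(p_small: float, p_large: float, gate_logit: float) -> float:
    """Blend two mistake probabilities with a scalar gate (assumed form).

    g near 1 trusts the small (workflow-level) branch,
    g near 0 trusts the large (action-level) branch.
    """
    g = sigmoid(gate_logit)
    return g * p_small + (1.0 - g) * p_large


def class_weights(n_correct: int, n_mistake: int) -> tuple[float, float]:
    """Inverse-frequency weights (assumption): the rarer class gets the larger weight."""
    total = n_correct + n_mistake
    return total / (2.0 * n_correct), total / (2.0 * n_mistake)


def reweighted_bce(p: float, y: int, w_correct: float, w_mistake: float) -> float:
    """Binary cross-entropy with per-class weights; y = 1 marks a mistake."""
    eps = 1e-7
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(w_mistake * y * math.log(p) + w_correct * (1 - y) * math.log(1.0 - p))


def label_aware_logit(logit: float, prior_mistake: float, tau: float = 1.0) -> float:
    """Shift the training logit by the log-odds of the class prior (assumed form).

    With a rare mistake class the shift is negative, so the model must produce
    a larger raw logit to fit mistake examples during training.
    """
    return logit + tau * math.log(prior_mistake / (1.0 - prior_mistake))


if __name__ == "__main__":
    w_correct, w_mistake = class_weights(n_correct=90, n_mistake=10)
    fused = gated_fusion(p_small=0.2, p_large=0.8, gate_logit=0.0)
    print("class weights:", w_correct, w_mistake)
    print("fused probability:", fused)
    print("loss on a missed mistake:", reweighted_bce(fused, 1, w_correct, w_mistake))
```

In a trained system the gate logit would presumably be produced by a small learned module from the branch features rather than passed in by hand; it is a plain argument here only to keep the sketch self-contained.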
