CV · Mar 13

SAVA-X: Ego-to-Exo Imitation Error Detection via Scene-Adaptive View Alignment and Bidirectional Cross View Fusion

arXiv:2603.12764 · 89.51 citations · Has Code
Predicted impact: top 16% in CV · last 90 days · Originality: Incremental advance
AI Analysis

This work addresses a practical problem, assessing imitation quality in industrial training and healthcare, but it is an incremental advance that builds on existing video analysis techniques.

The paper tackles the problem of detecting errors in first-person imitation videos using third-person demonstrations, addressing challenges like cross-view domain shift and temporal misalignment. The proposed SAVA-X method improves AUPRC and mean tIoU on the EgoMe benchmark over adapted baselines.
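For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how temporal IoU and an AUPRC estimate are commonly computed. It is not the benchmark's evaluation code: the function names are hypothetical, segments are assumed to be (start, end) pairs already matched one-to-one, and average precision is used here as a standard estimator of AUPRC (the paper's exact protocol may differ).

```python
def temporal_iou(pred, gt):
    """Temporal IoU of two (start, end) segments on the same timeline."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def mean_tiou(preds, gts):
    """Mean tIoU over index-aligned segment pairs (an assumed matching rule)."""
    if not preds:
        return 0.0
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)


def average_precision(scores, labels):
    """Average precision over error scores; labels are 1 for erroneous steps.

    Ties in scores are broken by input order in this sketch.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(labels)
    if n_pos == 0:
        return 0.0
    hits, ap = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            ap += hits / rank  # precision at each recalled positive
    return ap / n_pos


# Two half-overlapping 10-second steps share 5 s out of a 15 s union.
iou = temporal_iou((0.0, 10.0), (5.0, 15.0))
ap = average_precision([0.9, 0.8, 0.3], [1, 0, 1])
```

Higher is better for both quantities: tIoU rewards accurate step localization on the ego timeline, while average precision rewards ranking erroneous steps above correct ones.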

Error detection is crucial in industrial training, healthcare, and assembly quality control. Most existing work assumes a single-view setting and cannot handle the practical case where a third-person (exo) demonstration is used to assess a first-person (ego) imitation. We formalize Ego$\rightarrow$Exo Imitation Error Detection: given asynchronous, length-mismatched ego and exo videos, the model must localize procedural steps on the ego timeline and decide whether each is erroneous. This setting introduces cross-view domain shift, temporal misalignment, and heavy redundancy. Under a unified protocol, we adapt strong baselines from dense video captioning and temporal action detection and show that they struggle in this cross-view regime. We then propose SAVA-X, an Align-Fuse-Detect framework with (i) view-conditioned adaptive sampling, (ii) scene-adaptive view embeddings, and (iii) bidirectional cross-attention fusion. On the EgoMe benchmark, SAVA-X consistently improves AUPRC and mean tIoU over all baselines, and ablations confirm the complementary benefits of its components. Code is available at https://github.com/jack1ee/SAVAX.
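The idea behind component (iii), bidirectional cross-attention fusion, can be illustrated with a small sketch. This is not the authors' implementation: it assumes single-head scaled dot-product attention with no learned projections, a residual sum as the fusion step, and hypothetical names throughout. It only shows how two feature sequences of different lengths can each attend to the other.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Each query frame takes a softmax-weighted average of the context frames."""
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, c)) / scale for c in context]
        weights = softmax(scores)
        out.append([sum(w * c[j] for w, c in zip(weights, context))
                    for j in range(len(q))])
    return out


def bidirectional_fusion(ego, exo):
    """Ego attends to exo and exo attends to ego; each keeps its own length."""
    ego_ctx = cross_attend(ego, exo)
    exo_ctx = cross_attend(exo, ego)
    # Residual sum is an assumed fusion choice for this sketch.
    fused_ego = [[a + b for a, b in zip(f, c)] for f, c in zip(ego, ego_ctx)]
    fused_exo = [[a + b for a, b in zip(f, c)] for f, c in zip(exo, exo_ctx)]
    return fused_ego, fused_exo


# Toy per-frame features: 3 ego frames and 2 exo frames, feature dimension 2.
ego = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
exo = [[1.0, 0.0], [0.0, 1.0]]
fused_ego, fused_exo = bidirectional_fusion(ego, exo)
```

Because attention weights are computed per query frame, the ego and exo sequences do not need to be synchronized or equal in length, which matches the asynchronous, length-mismatched setting described above.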
