CV · May 30

TAP-JEPA: Frozen Future-Latent Probing and Two-Stage Score Fusion for EPIC-KITCHENS-100 Action Anticipation

arXiv:2606.00662 · 5.0

Predicted impact: top 81% in CV · last 90 days
Originality: Synthesis-oriented

AI Analysis

This work addresses action anticipation from egocentric video, a key problem for human-robot interaction and assistive AI, but the contribution is incremental: the entry places second on the leaderboard, 0.04 percentage points behind the top score.

TAP-JEPA achieves 27.91% overall action Mean Top-5 Recall on EPIC-KITCHENS-100 action anticipation, ranking second by only 0.04 percentage points, using frozen V-JEPA features and a two-stage score fusion method.
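For readers unfamiliar with the metric, the following is a minimal sketch of class-mean top-k recall as it is commonly defined for EK-100 anticipation: per-class recall of the true label within the top k predictions, averaged over the classes present in the evaluation set. The function name and array layout are illustrative assumptions, and the official evaluation code may differ in details such as which classes are included.

```python
import numpy as np

def mean_topk_recall(scores, labels, k=5):
    """Class-mean top-k recall (hypothetical helper, not the official scorer).

    scores: (N, C) array of per-class scores for N samples.
    labels: (N,) array of integer ground-truth class indices.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    # Indices of the k highest-scoring classes for each sample.
    topk = np.argsort(-scores, axis=1)[:, :k]
    # A sample is a hit if its true label appears among its top-k classes.
    hit = (topk == labels[:, None]).any(axis=1)
    # Recall is computed per class, then averaged over classes that occur.
    classes = np.unique(labels)
    recalls = [hit[labels == c].mean() for c in classes]
    return float(np.mean(recalls))
```

Because recall is averaged per class rather than per sample, rare classes count as much as frequent ones.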

This report presents TAP-JEPA, our runner-up submission to the EPIC-KITCHENS-100 (EK-100) Action Anticipation Challenge at EgoVis 2026. The task is to anticipate the next verb, noun, and verb-noun action from an egocentric clip that ends before the target action begins. Instead of fine-tuning a large video backbone, TAP-JEPA builds a compact anticipation model on frozen V-JEPA 2.1 features: a ViT-G/384 encoder extracts visible pre-action tokens, the pre-trained latent predictor estimates near-future tokens from the observed context, and both token groups are fused by attentive probes with task-specific queries for verbs, nouns, and action pairs. For the final submission, we expand supervised training with the official training split and most of the validation split, reserving a small subset for sanity checks and qualitative inspection, and adopt a two-stage score fusion that first averages eight independently initialized probe replicas within each epoch and then merges candidates from epochs 12-20 with field-dependent weights. On the official open-testing leaderboard, our sunshinesky entry achieves 27.91 percent overall action Mean Top-5 Recall (MT5R), ranking second and only 0.04 percentage points behind the top score.
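The two-stage score fusion described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not say whether averaging happens over logits or probabilities (probabilities are assumed here), and the function names, array layout, and weight values are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_two_stage(logits, epoch_weights):
    """Fuse probe outputs for one field (verb, noun, or action).

    logits: (E, R, N, C) array: E epochs, R probe replicas,
            N samples, C classes.
    epoch_weights: (E,) non-negative weights for this field.
    Returns an (N, C) array of fused class probabilities.
    """
    probs = softmax(np.asarray(logits, dtype=float), axis=-1)
    # Stage 1: average the replicas within each epoch -> (E, N, C).
    per_epoch = probs.mean(axis=1)
    # Stage 2: weighted merge across epochs -> (N, C).
    w = np.asarray(epoch_weights, dtype=float)
    w = w / w.sum()
    return np.tensordot(w, per_epoch, axes=(0, 0))

# Toy usage: 9 epochs (12-20), 8 replicas, 4 samples, random logits.
rng = np.random.default_rng(0)
num_classes = {"verb": 6, "noun": 10, "action": 20}  # toy sizes, not EK-100's
# Hypothetical field-dependent weights: one weight vector per field.
weights = {
    "verb": np.ones(9),
    "noun": np.linspace(1.0, 2.0, 9),
    "action": np.linspace(2.0, 1.0, 9),
}
fused = {
    field: fuse_two_stage(rng.normal(size=(9, 8, 4, c)), weights[field])
    for field, c in num_classes.items()
}
```

Normalizing the weights keeps each fused row a valid probability distribution, so top-5 predictions can be read directly from the output.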
