CV · May 23

EgoAdapt: A Multi-Scene Egocentric Adaptation Method for CVPR 2026 HD-EPIC VQA Challenge

arXiv:2605.2450092.51 citations
Predicted impact: top 8% in CV · last 90 days
Originality: Synthesis-oriented
AI Analysis

For researchers in egocentric vision and video QA, this work addresses the mismatch between generic inference and diverse question structures, providing a practical adaptation method that boosts performance on a challenging benchmark.

EgoAdapt tackles the challenge of heterogeneous question types in egocentric video QA by introducing inference-time category-conditioned routing, calibrated option scoring, and test-time consistency adaptation, achieving substantial improvements over HD-EPIC baselines.

This technical report presents our solution, EgoAdapt (Egocentric Adaptation via Category, Calibration, and Consistency), to the CVPR 2026 HD-EPIC VQA challenge. HD-EPIC evaluates whether a vision-language model can reason over realistic first-person kitchen videos, where the evidence for an answer may be a short hand-object interaction, a long recipe trajectory, a spatial relation to a fixture, or a subtle gaze cue. The benchmark contains 26K multiple-choice questions across seven macro-categories: recipe, ingredient, nutrition, fine-grained action, 3D perception, object motion, and gaze. We observe that the main difficulty is not only model capacity, but also the mismatch between a single generic inference recipe and the heterogeneous temporal, spatial, and semantic structure of the benchmark. Our method, EgoAdapt, introduces three inference-time components: (1) category-conditioned routing with per-category prompts, frame budgets, and sampling rates; (2) calibrated option scoring that evaluates all candidate answers with letter-token likelihoods and generation agreement instead of relying only on direct generation; and (3) test-time consistency adaptation that aggregates predictions across option permutations and verification-style prompts for ambiguous cases. This design substantially improves over the available HD-EPIC baselines.
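The three inference-time components described above can be illustrated with a minimal sketch. This is not the authors' implementation: the route table values, the `agreement_weight` mixing term, the use of cyclic option rotations, and the `score_fn` callback (standing in for a vision-language model call that returns one score per option position) are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

LETTERS = "ABCDE"


@dataclass(frozen=True)
class Route:
    """Per-category inference settings (all values below are hypothetical)."""
    prompt: str
    max_frames: int
    fps: float


# (1) Category-conditioned routing: each macro-category gets its own
# prompt, frame budget, and sampling rate. Only two examples are shown.
ROUTES: Dict[str, Route] = {
    "recipe": Route("Follow the recipe steps across the whole video.", 64, 0.5),
    "gaze": Route("Focus on where the camera wearer is looking.", 16, 4.0),
}
DEFAULT_ROUTE = Route("Answer the question about the video.", 32, 1.0)


def route_for(category: str) -> Route:
    """Return the settings for a category, falling back to a generic route."""
    return ROUTES.get(category, DEFAULT_ROUTE)


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def calibrated_scores(
    letter_logits: Sequence[float],
    generated_letter: str,
    agreement_weight: float = 0.2,  # assumed mixing weight
) -> List[float]:
    """(2) Calibrated option scoring: normalize letter-token logits and add
    a bonus to the option that direct generation also selected."""
    probs = softmax(letter_logits)
    return [
        p + (agreement_weight if LETTERS[i] == generated_letter else 0.0)
        for i, p in enumerate(probs)
    ]


def permutation_vote(
    options: Sequence[str],
    score_fn: Callable[[List[str]], Sequence[float]],
) -> int:
    """(3) Consistency across option permutations: score every cyclic
    rotation of the options and sum the scores back onto the original
    indices, so each option occupies each position exactly once."""
    n = len(options)
    totals = [0.0] * n
    for shift in range(n):
        order = [(pos + shift) % n for pos in range(n)]
        scores = score_fn([options[i] for i in order])
        for pos, original_index in enumerate(order):
            totals[original_index] += scores[pos]
    return max(range(n), key=totals.__getitem__)
```

In this sketch, rotating the options cancels any fixed preference for a particular answer position, because every option collects that positional bonus exactly once. In practice the extra passes could be reserved for ambiguous cases, for example when the top two calibrated scores are close, which matches the report's description of applying consistency adaptation selectively.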
