CV, May 27

CuriosAI Submission to the CASTLE Challenge at EgoVis 2026

Yuto Kanda, Hayato Tanoue, Takayuki Hori

arXiv:2605.27800 | 10.7 | h-index: 2

AI Analysis

For researchers in egocentric video understanding, this work provides a competitive baseline for a new benchmark, but the results are incremental.

The paper tackles the CASTLE 2026 challenge of answering 185 multiple-choice questions over 600+ hours of egocentric video. The authors' stronger approach, SVA, achieves a leaderboard accuracy of 0.50, while TMKG reaches 0.35.

CASTLE 2026 asks 185 multiple-choice questions over 600+ hours of synchronized multi-view egocentric video. We explore two approaches on top of a shared multimodal preprocessing layer that includes per-person timelines, speaker-resolved transcripts, and multi-VLM caption ensembles. Approach A, SVA (Search-Verify-Answer), is a three-stage pipeline that hierarchically narrows to a primary window, verifies sub-windows with a VLM under four anti-confabulation rules, and fuses evidence with an LLM judge under an evidence-priority hierarchy. Approach B, TMKG (Temporal-Multimodal-Knowledge-Graph), takes a contrasting route: it builds a temporal multimodal knowledge graph, locates a primary cell via graph search, and produces the final answer with a single grounded VLM. SVA reaches a leaderboard accuracy of 0.50 and is our final challenge submission; TMKG reaches 0.35.
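The three SVA stages described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `narrow`, `sub_windows`, `answer_question`, and `Evidence`, the source labels, and all parameters are hypothetical. The VLM verifier and the relevance scorer are passed in as plain callables. The abstract does not spell out the four anti-confabulation rules or the evidence-priority hierarchy, so the sketch stands in for the first with a verifier that may abstain (`answer=None`), and swaps the LLM judge for a simple deterministic priority rule.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Window = Tuple[float, float]  # (start_sec, end_sec); hypothetical representation


@dataclass
class Evidence:
    window: Window
    source: str            # hypothetical label, e.g. "visual", "transcript", "caption"
    answer: Optional[str]  # option this evidence supports; None means the verifier abstains


def sub_windows(window: Window, parts: int) -> List[Window]:
    """Split a window into `parts` equal, contiguous sub-windows."""
    start, end = window
    step = (end - start) / parts
    return [(start + i * step, start + (i + 1) * step) for i in range(parts)]


def narrow(window: Window, score: Callable[[Window], float],
           levels: int = 3, parts: int = 4) -> Window:
    """Search stage: at each level keep the best-scoring sub-window."""
    for _ in range(levels):
        window = max(sub_windows(window, parts), key=score)
    return window


def answer_question(full: Window,
                    score: Callable[[Window], float],
                    verify: Callable[[Window], List[Evidence]],
                    priority: List[str],
                    sub_parts: int = 2) -> Optional[str]:
    """Search, verify each sub-window, then fuse evidence by source priority."""
    primary = narrow(full, score)
    # Verify stage: the callable stands in for a VLM call per sub-window.
    evidence = [e for w in sub_windows(primary, sub_parts) for e in verify(w)]
    # Abstentions are dropped rather than guessed at.
    supported = [e for e in evidence if e.answer is not None]
    if not supported:
        return None
    # Answer stage: a deterministic stand-in for the LLM judge.
    # Earlier entries in `priority` outrank later ones.
    best = min(supported, key=lambda e: priority.index(e.source))
    return best.answer
```

In this sketch, `score` could be any cheap relevance signal computed over the preprocessed transcripts or captions, and `priority` is an ordered list of source labels such as `["visual", "transcript", "caption"]`. Both are assumptions made for illustration.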
