Adaptive Dense Evidence Refinement for Video Relational Reasoning for VRR-QA Challenge
This work addresses how video-language systems infer complex spatio-temporal relations by separating the search for plausible alternative answers from the decision to change an answer, but the approach is incremental and tailored to a specific challenge.
The authors propose an inference-only system for VRR-QA that uses adaptive test-time computation to route difficult questions to a high-budget dense evidence module, achieving 90.07% average accuracy and 87.81% macro average accuracy on the test split.
VRR-QA evaluates whether video-language systems can infer spatial, temporal, viewpoint, depth, and visibility relations that are not always resolved by a single frame. We present an inference-only system built around adaptive test-time computation. The system first answers each question with a direct video-language model pass, then uses multiple lightweight views to find unstable questions. Only these difficult questions are routed to a high-budget dense evidence module that builds timestamped frame observations, runs relation-specific probes, verifies candidate answers, and applies conservative temporal aggregation. This design separates two problems that are often conflated in video question answering: finding plausible alternative answers and deciding when a current answer should actually be changed. On the test split, the final system obtains 90.07% average accuracy and 87.81% macro average accuracy. The report focuses on the final test system and the implementation settings required to reproduce the adaptive dense verifier.
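The routing logic described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the agreement-based stability score, and both thresholds (`stability_threshold`, `change_confidence`) are assumptions introduced here for illustration, and the dense evidence module is represented only by a caller-supplied `dense_verifier` callable. The sketch shows the two separated decisions: lightweight views propose alternative answers, and the direct answer is replaced only when the verifier is confident.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical signature: takes the direct answer and the candidate answers,
# returns (best_answer, confidence in [0, 1]). Stands in for the dense module.
DenseVerifier = Callable[[str, Sequence[str]], Tuple[str, float]]


def stability(direct_answer: str, view_answers: Sequence[str]) -> float:
    """Fraction of lightweight views that agree with the direct answer."""
    if not view_answers:
        return 1.0  # no views available: treat the direct answer as stable
    agree = sum(1 for a in view_answers if a == direct_answer)
    return agree / len(view_answers)


def answer_question(
    direct_answer: str,
    view_answers: Sequence[str],
    dense_verifier: DenseVerifier,
    stability_threshold: float = 0.75,  # assumed value, not from the report
    change_confidence: float = 0.8,     # assumed value, not from the report
) -> str:
    """Keep the direct answer unless the question is unstable AND the
    dense verifier is confident in a different candidate."""
    # Step 1: stable questions skip the expensive module entirely.
    if stability(direct_answer, view_answers) >= stability_threshold:
        return direct_answer

    # Step 2: the views supply plausible alternatives (first-seen order).
    candidates: List[str] = list(dict.fromkeys([direct_answer, *view_answers]))

    # Step 3: the high-budget verifier scores the candidates.
    best, confidence = dense_verifier(direct_answer, candidates)

    # Step 4: conservative change decision; default is to keep the answer.
    if best != direct_answer and confidence >= change_confidence:
        return best
    return direct_answer
```

Under these assumptions, the verifier is called only when fewer than three quarters of the views agree with the direct answer, so the test-time budget is concentrated on unstable questions. A plausible alternative alone never overturns the direct answer; the verifier's confidence must also clear `change_confidence`.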