CV | May 19, 2025

Understanding Complexity in VideoQA via Visual Program Generation

Salesforce, Stanford
arXiv:2505.13429v1 | h-index: 64 | ICML
Originality: Incremental advance
AI Analysis

This work addresses the challenge of designing effective VideoQA benchmarks. It is rated incremental because it builds on existing code generation methods.

The paper analyzes query complexity in Video Question Answering with a data-driven approach that uses the complexity of generated code as a proxy for question difficulty. This measure correlates better with model performance than human estimates, and it enables the construction of a benchmark 1.9 times harder than NExT-QA.

We propose a data-driven approach to analyzing query complexity in Video Question Answering (VideoQA). Previous efforts in benchmark design have relied on human expertise to design challenging questions, yet we experimentally show that humans struggle to predict which questions are difficult for machine learning models. Our automatic approach leverages recent advances in code generation for visual question answering, using the complexity of generated code as a proxy for question difficulty. We demonstrate that this measure correlates significantly better with model performance than human estimates. To operationalize this insight, we propose an algorithm for estimating question complexity from code. It identifies fine-grained primitives that correlate with the hardest questions for any given set of models, making it easy to scale to new approaches in the future. Finally, to further illustrate the utility of our method, we extend it to automatically generate complex questions, constructing a new benchmark that is 1.9 times harder than the popular NExT-QA.
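The core idea in the abstract, scoring a question by the complexity of the program generated for it and then checking how that score tracks model errors, can be sketched roughly as follows. This is a minimal illustration, not the paper's actual metric: the branching-count score, the `call:` feature names, and the helper functions `find` and `query` in the example programs are all assumptions made for the sketch.

```python
import ast
from collections import Counter
from math import sqrt

# Node types treated as "branching" in this sketch (an assumption,
# loosely modelled on cyclomatic complexity).
BRANCHING = (ast.If, ast.For, ast.While, ast.BoolOp, ast.IfExp)


def complexity(source: str) -> int:
    """Crude complexity score: 1 + number of branching nodes in the program."""
    tree = ast.parse(source)
    return 1 + sum(isinstance(node, BRANCHING) for node in ast.walk(tree))


def program_features(source: str) -> Counter:
    """Count AST node types and called function names (candidate 'primitives')."""
    feats = Counter()
    for node in ast.walk(ast.parse(source)):
        feats[type(node).__name__] += 1
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            feats["call:" + node.func.id] += 1
    return feats


def pearson(xs, ys) -> float:
    """Pearson correlation; returns 0.0 when either input has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


# Hypothetical generated programs; they are only parsed, never executed.
SIMPLE = "answer = query(video, 'what is the dog doing?')"
HARD = """
frames = find(video, 'dog')
for f in frames:
    if query(f, 'is the dog jumping?') == 'yes' and query(f, 'is there a ball?') == 'yes':
        answer = 'playing'
"""

scores = [complexity(SIMPLE), complexity(HARD)]
model_errors = [0, 1]  # made-up labels: 1 means the model answered incorrectly
print(scores, pearson(scores, model_errors))
```

In practice one would compute such scores over a full benchmark and compare the resulting correlation against the correlation obtained from human difficulty ratings. The per-feature counts from `program_features` could be correlated with errors in the same way to surface which primitives accompany the hardest questions.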

