CV · AI · Jan 24, 2025

ENTER: Event Based Interpretable Reasoning for VideoQA

arXiv:2501.14194v1 · 5 citations · h-index: 20
Originality: Incremental advance
AI Analysis

The paper addresses the need for interpretable reasoning in VideoQA by combining the strengths of top-down and bottom-up approaches, though it appears incremental because it builds on existing event graph concepts.

The paper tackles the problem of interpretable Video Question Answering by introducing ENTER, a system that converts videos into event graphs for structured reasoning, which outperforms top-down approaches and achieves competitive performance against bottom-up methods on datasets like NExT-QA, IntentQA, and EgoSchema.

In this paper, we present ENTER, an interpretable Video Question Answering (VideoQA) system based on event graphs. Event graphs convert videos into graphical representations, where video events form the nodes and event-event relationships (temporal/causal/hierarchical) form the edges. This structured representation offers many benefits: 1) Interpretable VideoQA via generated code that parses the event graph; 2) Incorporation of contextual visual information in the reasoning process (code generation) via event graphs; 3) Robust VideoQA via Hierarchical Iterative Update of the event graphs. Existing interpretable VideoQA systems are often top-down: they disregard low-level visual information when generating the reasoning plan, and they are brittle. Bottom-up approaches produce responses from visual data but lack interpretability. Experimental results on NExT-QA, IntentQA, and EgoSchema demonstrate that our method not only outperforms existing top-down approaches while obtaining competitive performance against bottom-up approaches but, more importantly, offers superior interpretability and explainability in the reasoning process.
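
To make the event-graph idea concrete, here is a minimal sketch of such a structure in Python. It is not ENTER's implementation: the class `EventGraph`, its methods, the helper `build_toy_graph`, and the toy events are all hypothetical names and data chosen for illustration. Only the three edge types (temporal, causal, hierarchical) come from the abstract.

```python
from dataclasses import dataclass, field

# Edge types named in the abstract; everything else here is an assumption.
RELATIONS = {"temporal", "causal", "hierarchical"}


@dataclass
class EventGraph:
    """Hypothetical event graph: events are nodes, typed relations are edges."""

    # event id -> free-text description of the event
    events: dict = field(default_factory=dict)
    # list of (source id, relation, target id) triples
    edges: list = field(default_factory=list)

    def add_event(self, event_id: int, description: str) -> None:
        self.events[event_id] = description

    def add_edge(self, src: int, relation: str, dst: int) -> None:
        if relation not in RELATIONS:
            raise ValueError(f"unknown relation: {relation}")
        if src not in self.events or dst not in self.events:
            raise KeyError("both endpoints must be existing events")
        self.edges.append((src, relation, dst))

    def successors(self, src: int, relation: str) -> list:
        """Descriptions of events reached from `src` via `relation`."""
        return [self.events[d] for s, r, d in self.edges
                if s == src and r == relation]

    def predecessors(self, dst: int, relation: str) -> list:
        """Descriptions of events that point to `dst` via `relation`."""
        return [self.events[s] for s, r, d in self.edges
                if d == dst and r == relation]


def build_toy_graph() -> EventGraph:
    """Build a made-up three-event graph for a short clip."""
    g = EventGraph()
    g.add_event(0, "child drops the toy")
    g.add_event(1, "dog picks up the toy")
    g.add_event(2, "child laughs")
    g.add_edge(0, "temporal", 1)  # event 0 happens before event 1
    g.add_edge(1, "temporal", 2)  # event 1 happens before event 2
    g.add_edge(1, "causal", 2)    # event 1 causes event 2
    return g


if __name__ == "__main__":
    graph = build_toy_graph()
    # "Why did the child laugh?" becomes a lookup of causal predecessors.
    print(graph.predecessors(2, "causal"))
```

In this sketch, a "why" question reduces to a lookup over causal edges, so the answer can be traced back to specific nodes and edges. That traceability is the kind of interpretability the abstract attributes to generated code that parses the event graph.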
