Dense but Efficient VideoQA for Intricate Compositional Reasoning
This work addresses the challenge of complex compositional reasoning in video question answering. Its contribution appears incremental, since it builds on existing transformer and attention techniques.
The paper tackles the problem of answering complex compositional questions about long videos by proposing a transformer-based method with deformable attention to efficiently sample informative visual features across many frames. The model outperforms baselines on compositional VideoQA tasks.
It is well known that most conventional video question answering (VideoQA) datasets consist of easy questions that require only simple reasoning. However, long videos inevitably contain complex, compositional semantic structures along the spatio-temporal axes, and a model must understand these structures to answer questions about them. In this paper, we propose a new compositional VideoQA method based on a transformer architecture with a deformable attention mechanism to address complex VideoQA tasks. Deformable attention is introduced to sample a subset of informative visual features from the dense visual feature map, so that a temporally long range of frames can be covered efficiently. Furthermore, the dependency structure of the complex question sentences is combined with the language embeddings so that the model can readily capture the relations among question words. Extensive experiments and ablation studies show that the proposed dense but efficient model outperforms other baselines.
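The sampling idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single query, a one-dimensional (temporal) feature map, linear interpolation between neighbouring frames, and offsets and weights predicted by plain linear projections of the query, following the common formulation of deformable attention. All names here (`linear_interpolate`, `deformable_temporal_attention`, `w_offset`, `w_weight`) are hypothetical.

```python
import numpy as np


def linear_interpolate(features, positions):
    """Sample rows of `features` (T, D) at fractional time positions (K,)."""
    num_frames = features.shape[0]
    # Keep sampling positions inside the valid frame range [0, T - 1].
    positions = np.clip(positions, 0.0, num_frames - 1)
    lower = np.floor(positions).astype(int)
    upper = np.minimum(lower + 1, num_frames - 1)
    frac = (positions - lower)[:, None]
    # Blend the two neighbouring frames for each sampling position.
    return (1.0 - frac) * features[lower] + frac * features[upper]


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    exp = np.exp(x - np.max(x))
    return exp / exp.sum()


def deformable_temporal_attention(query, features, reference, w_offset, w_weight):
    """Attend to K sampled time steps instead of all T frames.

    query:     (D,)   query vector (assumption: e.g. a question-conditioned token)
    features:  (T, D) dense per-frame visual features
    reference: float  reference time index for this query
    w_offset:  (D, K) hypothetical projection predicting K temporal offsets
    w_weight:  (D, K) hypothetical projection predicting K attention logits
    """
    offsets = query @ w_offset            # (K,) learned displacements in frames
    weights = softmax(query @ w_weight)   # (K,) attention weights, sum to 1
    sampled = linear_interpolate(features, reference + offsets)  # (K, D)
    return weights @ sampled              # (D,) weighted sum of sampled features


# Small demonstration with random data: 64 frames, 8 channels, 4 sampling points.
rng = np.random.default_rng(0)
demo_features = rng.normal(size=(64, 8))
demo_query = rng.normal(size=8)
demo_output = deformable_temporal_attention(
    demo_query,
    demo_features,
    reference=32.0,
    w_offset=rng.normal(size=(8, 4)),
    w_weight=rng.normal(size=(8, 4)),
)
```

In this sketch each query touches only K sampled positions rather than all T frames, which is the sense in which a dense feature map can be covered at low cost. A full model would additionally use multiple heads, spatial offsets, and learned projections trained end to end.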