CV · Oct 11, 2021

Pano-AVQA: Grounded Audio-Visual Question Answering on 360$^\circ$ Videos

arXiv:2110.05122v1 · 127 citations
Originality Incremental advance
AI Analysis

This work addresses a gap in evaluating panoramic video understanding, which is relevant to researchers in computer vision and multimedia. The advance is incremental, since it builds on existing audio-visual QA tasks.

The authors address the lack of benchmarks for evaluating semantic understanding of audio-visual relationships and spherical spatial properties in panoramic videos. They introduce Pano-AVQA, a large-scale grounded audio-visual question answering dataset built from 5.4K 360° video clips. Their transformer-based models, which use spherical spatial embeddings and multimodal training objectives, showed improved performance on the dataset.

360$^\circ$ videos convey holistic views of the surroundings of a scene. They provide audio-visual cues beyond a pre-determined normal field of view and display distinctive spatial relations on a sphere. However, previous benchmark tasks for panoramic videos are still limited in evaluating the semantic understanding of audio-visual relationships or spherical spatial properties of the surroundings. We propose a novel benchmark named Pano-AVQA, a large-scale grounded audio-visual question answering dataset on panoramic videos. Using 5.4K 360$^\circ$ video clips harvested online, we collect two types of novel question-answer pairs with bounding-box grounding: spherical spatial relation QAs and audio-visual relation QAs. We train several transformer-based models on Pano-AVQA, and the results suggest that our proposed spherical spatial embeddings and multimodal training objectives contribute to a better semantic understanding of the panoramic surroundings on the dataset.
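As an illustration of why spatial relations on a sphere differ from those in a flat frame, the following is a minimal sketch that maps pixel positions to spherical angles and measures the angle between them. It assumes the 360° frames are stored in the common equirectangular projection, which the abstract does not state. The function names (`pixel_to_sphere`, `to_unit_vector`, `angular_distance`) and the frame size are hypothetical, and this is not the paper's spherical spatial embedding.

```python
import math

# Assumption: frames use an equirectangular layout, where x spans
# longitude [-pi, pi) and y spans latitude [pi/2, -pi/2].

def pixel_to_sphere(x, y, width, height):
    """Map an equirectangular pixel to (longitude, latitude) in radians."""
    lon = (x / width) * 2.0 * math.pi - math.pi
    lat = math.pi / 2.0 - (y / height) * math.pi
    return lon, lat

def to_unit_vector(lon, lat):
    """Convert spherical angles to a 3D point on the unit sphere."""
    return (
        math.cos(lat) * math.cos(lon),
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
    )

def angular_distance(p, q, width, height):
    """Great-circle angle (radians) between two pixel positions."""
    a = to_unit_vector(*pixel_to_sphere(p[0], p[1], width, height))
    b = to_unit_vector(*pixel_to_sphere(q[0], q[1], width, height))
    dot = sum(u * v for u, v in zip(a, b))
    dot = max(-1.0, min(1.0, dot))  # guard against rounding outside [-1, 1]
    return math.acos(dot)

# Hypothetical 1920x960 frame: two box centers near opposite image edges.
left_center = (10, 480)
right_center = (1910, 480)
gap = angular_distance(left_center, right_center, 1920, 960)
print(math.degrees(gap))
```

Under these assumptions, the two points are 1900 pixels apart in the image but only about 3.75 degrees apart on the sphere, because the left and right edges of an equirectangular frame meet. This is the kind of relation a flat 2D position encoding would miss.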

Code Implementations: 1 repo