Towards Multilingual Audio-Visual Question Answering
This work addresses the resource-intensive challenge of extending AVQA to multiple languages by providing machine-translated datasets and benchmark models, though the contribution is incremental because it builds on existing AVQA methods.
The paper extends Audio-Visual Question Answering (AVQA) to multilingual settings. It creates datasets for eight languages using machine translation, which avoids manual annotation, and proposes the MERA framework, built on SOTA foundation models, to benchmark these datasets. The reported results are intended to serve as a reference for future research.
In this paper, we work towards extending Audio-Visual Question Answering (AVQA) to multilingual settings. Existing AVQA research has predominantly focused on English, and replicating this effort for other languages requires substantial resources. As a scalable solution, we leverage machine translation and present two multilingual AVQA datasets covering eight languages, created from existing benchmark AVQA datasets. This avoids the additional human effort of manually collecting questions and answers. We then propose the MERA framework, which leverages state-of-the-art (SOTA) video, audio, and textual foundation models for AVQA in multiple languages. We introduce a suite of models, namely MERA-L, MERA-C, and MERA-T, with varied architectures to benchmark the proposed datasets. We believe our work will open new research directions and act as a reference benchmark for future work in multilingual AVQA.