CV · CL · Mar 22, 2023

Integrating Image Features with Convolutional Sequence-to-sequence Network for Multilingual Visual Question Answering

arXiv:2303.12671v2 · 1 citation · h-index: 10
AI Analysis

This work addresses multilingual VQA for English, Vietnamese, and Japanese, but is incremental as it builds on existing methods for a new dataset.

The paper tackled multilingual visual question answering by integrating image features and pre-trained VQA hints with a convolutional sequence-to-sequence network, achieving F1 scores of 0.3442 on the public test set and 0.4210 on the private test set, placing 3rd in the VLSP2022-EVJVQA competition.

Visual Question Answering (VQA) is a task that requires computers to give correct answers to input questions based on images. Humans solve this task with ease, but it remains a challenge for computers. The VLSP2022-EVJVQA shared task brings VQA to the multilingual domain with a newly released dataset, UIT-EVJVQA, in which the questions and answers are written in three languages: English, Vietnamese, and Japanese. We approached the challenge as a sequence-to-sequence learning task, integrating hints from pre-trained state-of-the-art VQA models and image features with a Convolutional Sequence-to-Sequence network to generate the desired answers. Our approach achieved F1 scores of up to 0.3442 on the public test set and 0.4210 on the private test set, placing 3rd in the competition.
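The abstract reports F1 scores for generated answers but does not say how F1 is computed. As a rough illustration only, the sketch below assumes a token-overlap F1 between a predicted answer and a reference answer, averaged over examples. The function names `token_f1` and `mean_f1` are hypothetical, and the lowercase whitespace tokenization is an assumption: it would not segment Japanese text properly, and the competition's official scorer may differ.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between one predicted answer and one reference answer.

    Assumption: tokens are lowercase whitespace-separated words. This is an
    illustrative choice, not the competition's documented tokenization.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()

    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def mean_f1(predictions: list[str], references: list[str]) -> float:
    """Average token-overlap F1 over paired predictions and references."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    scores = [token_f1(p, r) for p, r in zip(predictions, references)]
    return sum(scores) / len(scores)
```

For example, `token_f1("red car", "a red car")` has precision 1 and recall 2/3, which gives an F1 of 0.8. A score like this gives partial credit for answers that are close to the reference, which suits free-form generated answers better than exact match does.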

Code Implementations: 1 repo
