CV, Oct 27, 2020

MMFT-BERT: Multimodal Fusion Transformer with BERT Encodings for Visual Question Answering

arXiv:2010.14095v11007 citations
Originality: Incremental advance
AI Analysis

The paper addresses VQA for multimodal AI applications and offers incremental improvements in fusion techniques.

The paper tackles Visual Question Answering (VQA) by proposing MMFT-BERT, a method that processes video and text modalities with BERT encodings and a transformer-based fusion, achieving state-of-the-art results on the TVQA dataset.

We present MMFT-BERT (MultiModal Fusion Transformer with BERT encodings) to solve Visual Question Answering (VQA), ensuring both individual and combined processing of multiple input modalities. Our approach processes multimodal data (video and text) by encoding each modality individually with BERT and then fusing the encodings with a novel transformer-based fusion method. Our method decomposes the different modality sources into separate BERT instances with similar architectures but different weights. This achieves SOTA results on the TVQA dataset. Additionally, we provide TVQA-Visual, an isolated diagnostic subset of TVQA that strictly requires knowledge of the visual (V) modality, based on a human annotator's judgment. This set of questions helps us study the model's behavior and the challenges TVQA poses that prevent the achievement of superhuman performance. Extensive experiments show the effectiveness and superiority of our method.
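
The fusion idea in the abstract can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes each modality has already been reduced to one fixed-size vector by its own encoder (random stand-ins here, not real BERT outputs), and that a hypothetical fusion token attends over those vectors with single-head scaled dot-product attention. The names `fuse`, `fuse_token`, `Wq`, `Wk`, `Wv` and the dimension 8 are all illustrative choices, not taken from the paper.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def random_matrix(rng, rows, cols):
    """Small random projection matrix (stand-in for learned weights)."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def fuse(modality_vectors, fuse_token, Wq, Wk, Wv):
    """Attend from the fusion token over [fuse_token] + modality_vectors.

    Returns the attended vector at the fusion-token position and the
    attention weights (one per sequence position).
    """
    seq = [fuse_token] + modality_vectors
    q = matvec(Wq, fuse_token)
    keys = [matvec(Wk, x) for x in seq]
    values = [matvec(Wv, x) for x in seq]
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    fused = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(d)]
    return fused, weights


rng = random.Random(0)
dim = 8  # illustrative size; real BERT encodings are much larger

# Assumption: stand-ins for the per-modality encoder outputs.
text_vec = [rng.gauss(0, 1) for _ in range(dim)]
video_vec = [rng.gauss(0, 1) for _ in range(dim)]
fuse_token = [rng.gauss(0, 1) for _ in range(dim)]

Wq, Wk, Wv = (random_matrix(rng, dim, dim) for _ in range(3))
fused, weights = fuse([text_vec, video_vec], fuse_token, Wq, Wk, Wv)
print(len(fused), [round(w, 3) for w in weights])
```

In a trained model the projections and the fusion token would be learned, and the fused vector would feed an answer classifier. Here they are random, so the output only demonstrates shapes and the normalization of the attention weights.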

Code Implementations: 1 repo
