CV | CL | LG | Apr 3, 2021

MMBERT: Multimodal BERT Pretraining for Improved Medical VQA

arXiv:2104.01394v1 | 193 citations | Has Code
Originality: Incremental advance
AI Analysis

This paper addresses two obstacles facing researchers and practitioners in medical AI: the cost of annotating medical images and the gap between medical and general-domain images. The contribution is incremental, since it adapts existing self-supervised pretraining techniques to the medical domain.

The paper tackles visual question answering (VQA) on medical images by proposing MMBERT, a multimodal BERT pretraining method that learns richer semantic representations through masked language modeling with image features. It achieves new state-of-the-art performance on the VQA-Med 2019 and VQA-RAD datasets.

Images in the medical domain are fundamentally different from general-domain images. Consequently, it is infeasible to directly employ general-domain Visual Question Answering (VQA) models in the medical domain. Additionally, medical image annotation is a costly and time-consuming process. To overcome these limitations, we propose a solution inspired by self-supervised pretraining of Transformer-style architectures for NLP, Vision and Language tasks. Our method involves learning richer medical image and text semantic representations using Masked Language Modeling (MLM) with image features as the pretext task on a large medical image+caption dataset. The proposed solution achieves new state-of-the-art performance on two VQA datasets for radiology images -- VQA-Med 2019 and VQA-RAD, outperforming even the ensemble models of previous best solutions. Moreover, our solution provides attention maps which help in model interpretability. The code is available at https://github.com/VirajBagal/MMBERT
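The pretext task named in the abstract, Masked Language Modeling with image features, can be illustrated with a minimal sketch of how one training example might be assembled. This is not the authors' implementation: the function name `build_mlm_example`, the 15% masking rate, the uniform random choice of which tokens to mask, and the always-replace-with-`[MASK]` rule are all assumptions made for illustration. The idea shown is only that image features and caption tokens share one input sequence, and the model is asked to recover the hidden caption tokens.

```python
import random

MASK = "[MASK]"   # placeholder token (assumed name)
IGNORE = None     # label for positions that do not contribute to the loss


def build_mlm_example(image_features, caption_tokens, mask_prob=0.15, rng=None):
    """Assemble one (inputs, labels) pair for MLM with image features.

    inputs: image feature vectors followed by caption tokens, where some
            caption tokens are replaced by MASK.
    labels: the original token at each masked position, IGNORE elsewhere.
    """
    if rng is None:
        rng = random.Random()

    # Image features go in unmasked; they serve as context only.
    inputs = list(image_features)
    labels = [IGNORE] * len(image_features)

    for token in caption_tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)      # hide the token from the model
            labels.append(token)     # ...and ask the model to predict it
        else:
            inputs.append(token)
            labels.append(IGNORE)
    return inputs, labels


if __name__ == "__main__":
    # Hypothetical data: two image feature vectors and a short caption.
    feats = [[0.1, 0.2], [0.3, 0.4]]
    caption = ["chest", "x-ray", "showing", "pleural", "effusion"]
    inputs, labels = build_mlm_example(feats, caption, rng=random.Random(0))
    print(inputs)
    print(labels)
```

In a real pipeline the image features would come from a visual encoder and the tokens from a subword tokenizer. The linked repository is the place to check how the authors select and replace masked tokens.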

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
