CV | Sep 3, 2020

A Comparison of Pre-trained Vision-and-Language Models for Multimodal Representation Learning across Medical Images and Reports

arXiv:2009.01523v1 | 19.178 citations | Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses the problem of enhancing multimodal representation learning for medical vision-and-language tasks, such as classification and retrieval, but it is incremental as it applies existing models to a specific domain without introducing new methods.

The study compared four pre-trained vision-and-language models for learning multimodal representations from medical images and reports, finding that they improved performance in thoracic findings classification compared to a CNN-RNN baseline, with specific gains demonstrated through ablation studies and attention visualizations.

Joint image-text embedding extracted from medical images and associated contextual reports is the bedrock for most biomedical vision-and-language (V+L) tasks, including medical visual question answering, clinical image-text retrieval, and clinical report auto-generation. In this study, we adopt four pre-trained V+L models (LXMERT, VisualBERT, UNITER, and PixelBERT) to learn multimodal representations from MIMIC-CXR radiographs and associated reports. The extrinsic evaluation on the OpenI dataset shows that, in comparison to the pioneering CNN-RNN model, the joint embedding learned by pre-trained V+L models demonstrates performance improvement in the thoracic findings classification task. We conduct an ablation study to analyze the contribution of certain model components and validate the advantage of joint embedding over text-only embedding. We also visualize attention maps to illustrate the attention mechanism of V+L models.
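
To make the joint versus text-only comparison in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes the report and the radiograph have already been encoded into fixed-length vectors, and it fuses them by plain concatenation, which is a deliberate simplification: the V+L models named above fuse modalities through transformer attention. The finding names, vector sizes, and helper functions (`multilabel_head`, `make_head`) are hypothetical and chosen only for illustration, and the weights are random rather than trained.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def multilabel_head(embedding, weights, biases):
    """Return one independent probability per finding (multi-label, so no softmax)."""
    return [
        sigmoid(sum(w * e for w, e in zip(row, embedding)) + b)
        for row, b in zip(weights, biases)
    ]


def make_head(num_labels, dim, rng):
    """Create an untrained linear head; a real head would be fit on labeled reports."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_labels)]
    biases = [0.0] * num_labels
    return weights, biases


rng = random.Random(0)

# Hypothetical subset of thoracic findings, for illustration only.
FINDINGS = ["atelectasis", "cardiomegaly", "edema"]

# Random stand-ins for encoder outputs (assumption: 8 dimensions per modality).
text_vec = [rng.gauss(0, 1) for _ in range(8)]
image_vec = [rng.gauss(0, 1) for _ in range(8)]

# Simplification: concatenation stands in for the learned joint embedding.
joint_vec = text_vec + image_vec
text_only_vec = text_vec

joint_w, joint_b = make_head(len(FINDINGS), len(joint_vec), rng)
text_w, text_b = make_head(len(FINDINGS), len(text_only_vec), rng)

joint_probs = multilabel_head(joint_vec, joint_w, joint_b)
text_probs = multilabel_head(text_only_vec, text_w, text_b)

for name, pj, pt in zip(FINDINGS, joint_probs, text_probs):
    print(f"{name}: joint={pj:.3f} text-only={pt:.3f}")
```

In an ablation of this shape, both heads would be trained and scored on the same labeled set, so that any difference in classification performance can be attributed to the added image information.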
