IR, CL, CV, MM · Jan 24, 2024

SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval

arXiv:2401.13478v2 · 44 citations · Has Code · ACL
Originality: Synthesis-oriented
AI Analysis

This work addresses a gap in evaluating MMIR on scientific charts and tables, but its contribution is incremental: it applies existing methods to new data rather than introducing new ones.

The authors address the lack of benchmarks for multi-modal information retrieval in the scientific domain by building SciMMIR, a dataset of 530K curated image-text pairs drawn from figures and tables in papers. They evaluate models such as CLIP and BLIP on it and report insights on pre-training and fine-tuning.

Multi-modal information retrieval (MMIR) is a rapidly evolving field, where significant progress, particularly in image-text pairing, has been made through advanced representation learning and cross-modality alignment research. However, current benchmarks for evaluating MMIR performance in image-text pairing within the scientific domain show a notable gap, where chart and table images described in scholarly language usually do not play a significant role. To bridge this gap, we develop a specialised scientific MMIR (SciMMIR) benchmark by leveraging open-access paper collections to extract data relevant to the scientific domain. This benchmark comprises 530K meticulously curated image-text pairs, extracted from figures and tables with detailed captions in scientific documents. We further annotate the image-text pairs with two-level subset-subcategory hierarchy annotations to facilitate a more comprehensive evaluation of the baselines. We conducted zero-shot and fine-tuning evaluations on prominent multi-modal image-captioning and visual language models, such as CLIP and BLIP. Our analysis offers critical insights for MMIR in the scientific domain, including the impact of pre-training and fine-tuning settings and the influence of the visual and textual encoders. All our data and checkpoints are publicly available at https://github.com/Wusiwei0410/SciMMIR.
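To make the evaluation setting more concrete, the following is a minimal sketch of how an image-text retrieval benchmark is often scored. It is an illustration only, not the authors' evaluation code. The abstract does not name the metrics used, so recall@k and mean reciprocal rank are assumed here as common choices. The function names and the toy similarity matrix are hypothetical; in practice the scores would come from a model such as CLIP or BLIP.

```python
from typing import List, Sequence


def rank_of_match(scores: Sequence[float], target: int) -> int:
    """Return the 1-based rank of scores[target] among all candidates.

    Ties are resolved optimistically: only strictly higher scores
    push the true match down the ranking.
    """
    target_score = scores[target]
    return 1 + sum(1 for s in scores if s > target_score)


def recall_at_k(similarity: List[List[float]], k: int) -> float:
    """Fraction of queries whose true match ranks within the top k.

    Assumes similarity[i][j] scores query i against candidate j and
    that the true match for query i is candidate i.
    """
    hits = sum(
        1 for i, row in enumerate(similarity) if rank_of_match(row, i) <= k
    )
    return hits / len(similarity)


def mean_reciprocal_rank(similarity: List[List[float]]) -> float:
    """Average of 1 / rank of the true match over all queries."""
    total = sum(1.0 / rank_of_match(row, i) for i, row in enumerate(similarity))
    return total / len(similarity)


# Hypothetical scores: row i is caption i scored against images 0..2.
toy = [
    [0.9, 0.1, 0.3],  # true match (index 0) ranks 1st
    [0.4, 0.2, 0.8],  # true match (index 1) ranks 3rd
    [0.5, 0.1, 0.7],  # true match (index 2) ranks 1st
]

if __name__ == "__main__":
    print(f"recall@1 = {recall_at_k(toy, 1):.3f}")
    print(f"recall@3 = {recall_at_k(toy, 3):.3f}")
    print(f"MRR      = {mean_reciprocal_rank(toy):.3f}")
```

The same functions apply in both retrieval directions: transposing the matrix turns caption-to-image scoring into image-to-caption scoring.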

Code Implementations: 1 repo