CV · Sep 24, 2025

A Simple Data Augmentation Strategy for Text-in-Image Scientific VQA

arXiv:2509.20119v1 · 2 citations · h-index: 16 · Proceedings of the 9th Widening NLP Workshop
Originality: Incremental advance
AI Analysis

This work addresses data scarcity for researchers in scientific VQA, but it is incremental as it builds on existing paradigms like EXAMS-V.

The paper tackles scientific visual question answering in the text-in-image format, where training data is scarce. Fine-tuning a small multilingual multimodal model on a mix of synthetic data and EXAMS-V yields notable gains across 13 languages.

Scientific visual question answering poses significant challenges for vision-language models due to the complexity of scientific figures and their multimodal context. Traditional approaches treat the figure and accompanying text (e.g., questions and answer options) as separate inputs. EXAMS-V introduced a new paradigm by embedding both visual and textual content into a single image. However, even state-of-the-art proprietary models perform poorly on this setup in zero-shot settings, underscoring the need for task-specific fine-tuning. To address the scarcity of training data in this "text-in-image" format, we synthesize a new dataset by converting existing separate image-text pairs into unified images. Fine-tuning a small multilingual multimodal model on a mix of our synthetic data and EXAMS-V yields notable gains across 13 languages, demonstrating strong average improvements and cross-lingual transfer.
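
The conversion step described in the abstract, turning a separate figure and its question text into one unified image, can be sketched roughly as follows. This is a minimal illustration using Pillow, not the authors' pipeline: the function name `render_text_in_image`, the layout (figure on top, question and lettered options below), and all size parameters are assumptions made for the example.

```python
import textwrap

from PIL import Image, ImageDraw, ImageFont


def render_text_in_image(figure, question, options, width=640,
                         margin=16, line_height=16, wrap_chars=80):
    """Compose a figure, its question, and its answer options into one image.

    Hypothetical helper: the layout and parameters are illustrative only.
    """
    font = ImageFont.load_default()

    # Wrap the question and each lettered option into drawable lines.
    lines = textwrap.wrap(question, wrap_chars)
    for label, option in zip("ABCDEFGH", options):
        lines.extend(textwrap.wrap(f"{label}. {option}", wrap_chars))

    # Scale the figure to the canvas width, preserving its aspect ratio.
    target_w = width - 2 * margin
    target_h = max(1, round(figure.height * target_w / figure.width))
    resized = figure.convert("RGB").resize((target_w, target_h))

    # White canvas tall enough for the figure, the text, and three margins.
    height = target_h + len(lines) * line_height + 3 * margin
    canvas = Image.new("RGB", (width, height), "white")
    canvas.paste(resized, (margin, margin))

    # Draw the text block underneath the figure, one line at a time.
    draw = ImageDraw.Draw(canvas)
    y = target_h + 2 * margin
    for line in lines:
        draw.text((margin, y), line, fill="black", font=font)
        y += line_height
    return canvas


if __name__ == "__main__":
    # Stand-in figure; a real pipeline would load an existing VQA image.
    fig = Image.new("RGB", (320, 200), "blue")
    unified = render_text_in_image(
        fig,
        "Which curve shows exponential growth?",
        ["Curve 1", "Curve 2", "Curve 3", "Curve 4"],
    )
    unified.save("unified_example.png")
```

Pillow's default font covers only a limited character set, so a multilingual version of this sketch would likely need `ImageFont.truetype` with a font that supports the target scripts.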

