CV | Sep 20, 2025

Enhancing Scientific Visual Question Answering via Vision-Caption aware Supervised Fine-Tuning

Janak Kapuriya, Anwar Shaikh, Arnav Goel, Medha Hira, Apoorv Singh, Jay Saraf, Sanjana, Vaibhav Nauriyal, Avinash Anand, Zhengkui Wang, Rajiv Ratn Shah

arXiv:2509.16628v1 | 10.23 citations | h-index: 44 | Proceedings of the 2nd International Workshop on Large Vision - Language Model Learning and Applications

Originality: Incremental advance

AI Analysis

This work addresses scientific VQA for educational contexts, particularly in low-resource languages. The advance is incremental because it builds on existing fine-tuning methods.

The authors tackled scientific visual question answering by introducing VCASFT, a learning paradigm that uses image captions as zero-shot prompts to fine-tune smaller vision-language models, achieving significant performance improvements on ScienceQA and a new Hindi dataset (HiSciVQA).

In this study, we introduce Vision-Caption aware Supervised Fine-Tuning (VCASFT), a novel learning paradigm designed to enhance the performance of smaller Vision Language Models (VLMs) on scientific visual question answering (VQA) tasks. VCASFT leverages image captions as zero-shot prompts alongside question-answer pairs and instruction-tunes models to yield significant performance improvements. To comprehensively evaluate VCASFT, we benchmark it on ScienceQA, which consists of questions across diverse languages, subjects, and fields, demonstrating its adaptability and effectiveness in a variety of educational contexts. Additionally, to further demonstrate the effectiveness of this technique on low-resource languages, we developed HiSciVQA, a dataset comprising 2,245 high-quality, hand-annotated Hindi multimodal Q&A pairs. This dataset addresses the critical need for low-resource language Q&A datasets and serves as a foundation for testing VCASFT. We also introduce a novel LLM-based evaluation scheme to evaluate VLMs on HiSciVQA, which offers deeper insights into model effectiveness, surpassing traditional n-gram matching accuracy metrics. We are committed to advancing the field by open-sourcing all code files and the HiSciVQA dataset for the research community.
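The abstract describes pairing an image caption with each question-answer pair before instruction tuning. As a rough illustration only, the sketch below shows one way such a caption-aware training record could be assembled. The prompt template, the `VQAExample` class, and the `build_sft_record` function are hypothetical and are not taken from the paper; the caption is assumed to have been generated beforehand by some captioning model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VQAExample:
    """One multiple-choice science question (hypothetical schema)."""
    question: str
    choices: List[str]
    answer_index: int
    caption: str  # assumed to come from a separate captioning step


def build_sft_record(example: VQAExample) -> dict:
    """Return a prompt/target pair with the caption placed before the question.

    The template is an assumption made for illustration; the paper's actual
    prompt format may differ.
    """
    letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    # Render the answer options as "(A) ...", "(B) ...", one per line.
    options = "\n".join(
        f"({letters[i]}) {choice}" for i, choice in enumerate(example.choices)
    )
    prompt = (
        f"Image caption: {example.caption}\n"
        f"Question: {example.question}\n"
        f"Options:\n{options}\n"
        "Answer:"
    )
    gold = example.choices[example.answer_index]
    target = f"({letters[example.answer_index]}) {gold}"
    return {"prompt": prompt, "target": target}
```

In a fine-tuning pipeline, records like these would typically be fed to the model together with the image itself, so the caption acts as extra textual context and does not replace the visual input.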
