CV | AI | Sep 25, 2025

Instruction-tuned Self-Questioning Framework for Multimodal Reasoning

arXiv:2509.21251v1 | h-index: 3
Originality: Incremental advance
AI Analysis

This work addresses challenges in multimodal reasoning for AI systems, offering a more interpretable and effective approach, though it appears incremental because it builds on existing instruction-tuning and self-questioning techniques.

The paper tackles the problem of multi-step reasoning in vision-language understanding by proposing SQ-InstructBLIP, a framework that iteratively generates image-aware sub-questions and sub-answers to improve accuracy on VQA tasks, showing enhanced performance compared to prior methods.

The field of vision-language understanding has been actively researched in recent years, thanks to the development of Large Language Models (LLMs). However, it still struggles with problems that require multi-step reasoning, even for very simple questions. Recent studies adopt LLMs to tackle this problem by iteratively generating sub-questions and answers. However, this approach has two disadvantages: 1) LLMs that cannot read visual information have no access to the fine-grained visual content of images, and 2) black-box LLMs have inaccessible internal mechanisms and are difficult to reproduce. To solve these problems, we propose SQ (Self-Questioning)-InstructBLIP, which improves inference performance by iteratively generating image-aware, informative sub-questions and sub-answers. SQ-InstructBLIP consists of a Questioner, an Answerer, and a Reasoner that share the same architecture. The Questioner and Answerer generate sub-questions and sub-answers that help infer the answer to the main question, and the Reasoner performs reasoning on the main question while considering the generated sub-question information. Our experiments show that SQ-InstructBLIP, which uses the generated sub-questions as additional information when solving the VQA task, reasons more accurately than previous works.
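The Questioner, Answerer, and Reasoner loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `self_questioning_vqa`, the `num_rounds` parameter, the callable signatures, and the toy stand-in models are all hypothetical, and the abstract does not say how many sub-question rounds are used or how the generated context is passed to each component.

```python
from typing import Any, Callable, List, Tuple

# Hypothetical role signatures; the real components are vision-language
# models that share one architecture, which this sketch does not model.
SubQA = Tuple[str, str]
QuestionerFn = Callable[[Any, str, List[SubQA]], str]
AnswererFn = Callable[[Any, str], str]
ReasonerFn = Callable[[Any, str, List[SubQA]], str]


def self_questioning_vqa(
    image: Any,
    main_question: str,
    questioner: QuestionerFn,
    answerer: AnswererFn,
    reasoner: ReasonerFn,
    num_rounds: int = 3,  # assumed value, not taken from the paper
) -> Tuple[str, List[SubQA]]:
    """Iteratively build sub-question/sub-answer pairs, then answer."""
    history: List[SubQA] = []
    for _ in range(num_rounds):
        # The Questioner sees the image, the main question, and prior pairs.
        sub_question = questioner(image, main_question, history)
        # The Answerer answers the sub-question while looking at the image.
        sub_answer = answerer(image, sub_question)
        history.append((sub_question, sub_answer))
    # The Reasoner answers the main question using the collected pairs.
    final_answer = reasoner(image, main_question, history)
    return final_answer, history


# Toy stand-ins so the sketch runs without any model weights.
def toy_questioner(image: Any, main_question: str, history: List[SubQA]) -> str:
    return f"sub-question {len(history) + 1} about: {main_question}"


def toy_answerer(image: Any, sub_question: str) -> str:
    return f"answer to [{sub_question}]"


def toy_reasoner(image: Any, main_question: str, history: List[SubQA]) -> str:
    return f"final answer using {len(history)} sub-QA pairs"


if __name__ == "__main__":
    answer, pairs = self_questioning_vqa(
        None, "What is the man holding?",
        toy_questioner, toy_answerer, toy_reasoner, num_rounds=2,
    )
    print(answer)
    for q, a in pairs:
        print(q, "->", a)
```

Passing the three roles in as callables keeps the control flow separate from any particular model, so the same loop could in principle wrap one shared backbone prompted three different ways.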

