Visually Interpretable Subtask Reasoning for Visual Question Answering
This work addresses the need for more interpretable and accurate reasoning in AI systems for visual question answering. Its contribution is incremental, since it builds on existing subtask decomposition methods.
The paper aims to improve interpretability and reasoning accuracy in multimodal large language models on complex visual question answering. It introduces VISTAR, a subtask-driven training framework that generates step-by-step rationales and yields consistent accuracy improvements on two benchmarks.
Answering complex visual questions like "Which red furniture can be used for sitting?" requires multi-step reasoning, including object recognition, attribute filtering, and relational understanding. Recent work improves interpretability in multimodal large language models (MLLMs) by decomposing tasks into subtask programs, but these methods are computationally expensive and less accurate due to poor adaptation to target data. To address this, we introduce VISTAR (Visually Interpretable Subtask-Aware Reasoning Model), a subtask-driven training framework that enhances both interpretability and reasoning by generating textual and visual explanations within MLLMs. Instead of relying on external models, VISTAR fine-tunes MLLMs to produce structured Subtask-of-Thought rationales (step-by-step reasoning sequences). Experiments on two benchmarks show that VISTAR consistently improves reasoning accuracy while maintaining interpretability. Our code and dataset will be available at https://github.com/ChengJade/VISTAR.
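To make the idea of multi-step reasoning concrete, the example question can be decomposed into an ordered list of subtasks whose intermediate results form a readable rationale. The Python sketch below is illustrative only and is not VISTAR's rationale format or training procedure: the `SceneObject` type, the hand-written `SCENE` annotations, and the `answer_with_rationale` function are all hypothetical, and the scene is a symbolic stand-in for what a model would have to recognize in an image.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneObject:
    """Hypothetical symbolic description of one object in an image."""
    name: str
    color: str
    category: str
    affordances: frozenset


# Toy scene annotations, invented purely for illustration.
SCENE = [
    SceneObject("chair", "red", "furniture", frozenset({"sit"})),
    SceneObject("sofa", "blue", "furniture", frozenset({"sit", "lie"})),
    SceneObject("table", "red", "furniture", frozenset({"place"})),
    SceneObject("apple", "red", "food", frozenset({"eat"})),
]


def answer_with_rationale(scene):
    """Answer "Which red furniture can be used for sitting?" step by step.

    Returns the final answer plus a list of (subtask, intermediate result)
    pairs, so every reasoning step can be inspected.
    """
    rationale = []

    # Step 1: object recognition, keep only furniture.
    furniture = [o for o in scene if o.category == "furniture"]
    rationale.append(("select furniture", [o.name for o in furniture]))

    # Step 2: attribute filtering, keep only red objects.
    red = [o for o in furniture if o.color == "red"]
    rationale.append(("filter color=red", [o.name for o in red]))

    # Step 3: relational/functional check, keep objects one can sit on.
    sittable = [o for o in red if "sit" in o.affordances]
    rationale.append(("filter affordance=sit", [o.name for o in sittable]))

    return [o.name for o in sittable], rationale


answer, steps = answer_with_rationale(SCENE)
for step, result in steps:
    print(f"{step}: {result}")
print("answer:", answer)
```

Each entry in `steps` pairs a subtask with its intermediate result, which is the kind of step-by-step trace that makes a final answer auditable. In the paper's setting such rationales are generated by the fine-tuned MLLM itself rather than computed from hand-written annotations.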