CV | Nov 29, 2023

Towards Top-Down Reasoning: An Explainable Multi-Agent Approach for Visual Question Answering

arXiv:2311.17331v4 | 11 citations | h-index: 4
Originality: Incremental advance
AI Analysis

This work addresses the need for more explainable and effective VQA systems by integrating LLM knowledge to overcome limitations in existing VLM methods, though it is incremental as it builds on prior multi-agent and reasoning approaches.

The paper tackles the problem of improving Vision Language Models (VLMs) for Visual Question Answering (VQA) by introducing a multi-agent collaboration framework that leverages Large Language Models (LLMs) to enhance VLM capabilities through top-down reasoning, achieving superior performance and interpretability in zero-shot settings without extra training cost.

Recently, several methods have been proposed to strengthen the inference capabilities of Vision Language Models (VLMs) so that they can tackle Visual Question Answering (VQA) tasks independently, rather than serving only as aids to Large Language Models (LLMs). However, these methods ignore the rich common-sense knowledge inside the given VQA image sampled from the real world. Thus, they cannot fully exploit the powerful VLM to achieve optimal performance on the given VQA question. To overcome this limitation, and inspired by the human top-down reasoning process, i.e., systematically exploring relevant issues to derive a comprehensive answer, this work introduces a novel, explainable multi-agent collaboration framework that leverages the expansive knowledge of LLMs to enhance the capabilities of the VLMs themselves. Specifically, our framework comprises three agents, i.e., Responder, Seeker, and Integrator, which collaboratively answer the given VQA question by seeking its relevant issues and generating the final answer in a top-down reasoning process. The VLM-based Responder agent generates answer candidates for the question and responds to other relevant issues. The Seeker agent, primarily based on an LLM, identifies issues relevant to the question to inform the Responder agent and constructs a Multi-View Knowledge Base (MVKB) for the given visual scene by leveraging the built-in world knowledge of the LLM. The Integrator agent combines knowledge from the Seeker agent and the Responder agent to produce the final VQA answer. Extensive evaluations on diverse VQA datasets with a variety of VLMs demonstrate the superior performance and interpretability of our framework over the baseline method in the zero-shot setting, without extra training cost.
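The division of labour among the three agents can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `VLM` and `LLM` callable types, the method names, the prompt wording, and the flat-list form of the knowledge base are all assumptions made for illustration, and the toy models return canned strings so that the control flow can be run without any real model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical interfaces (not from the paper): any callables of these shapes work.
VLM = Callable[[str, str], str]  # (image reference, question) -> answer text
LLM = Callable[[str], str]       # prompt -> completion text


@dataclass
class Responder:
    """VLM-based agent: answers the question and any relevant issues."""
    vlm: VLM

    def candidates(self, image: str, question: str) -> List[str]:
        # Assumption: one VLM call yields one candidate; a real system may sample several.
        return [self.vlm(image, question)]

    def respond(self, image: str, issues: List[str]) -> Dict[str, str]:
        return {issue: self.vlm(image, issue) for issue in issues}


@dataclass
class Seeker:
    """LLM-based agent: proposes relevant issues and assembles the knowledge base."""
    llm: LLM

    def relevant_issues(self, question: str) -> List[str]:
        reply = self.llm(f"List issues relevant to: {question}")
        return [line.strip() for line in reply.splitlines() if line.strip()]

    def build_mvkb(self, responses: Dict[str, str]) -> List[str]:
        # Assumption: the knowledge base is a flat list of "issue -> answer" facts.
        return [f"{issue} -> {answer}" for issue, answer in responses.items()]


@dataclass
class Integrator:
    """Combines the candidates and the knowledge base into a final answer."""
    llm: LLM

    def integrate(self, question: str, candidates: List[str], mvkb: List[str]) -> str:
        prompt = (
            f"Question: {question}\n"
            f"Candidates: {candidates}\n"
            f"Knowledge: {mvkb}\n"
            "Final answer:"
        )
        return self.llm(prompt).strip()


def answer_vqa(image: str, question: str, responder: Responder,
               seeker: Seeker, integrator: Integrator) -> Dict[str, object]:
    candidates = responder.candidates(image, question)
    issues = seeker.relevant_issues(question)
    mvkb = seeker.build_mvkb(responder.respond(image, issues))
    final = integrator.integrate(question, candidates, mvkb)
    # Returning the intermediate state keeps the reasoning trace inspectable.
    return {"answer": final, "candidates": candidates, "mvkb": mvkb}


# Toy stand-ins for real models (canned outputs, for illustration only).
def toy_vlm(image: str, question: str) -> str:
    canned = {"What is the person holding?": "an umbrella", "Is it raining?": "yes"}
    return canned.get(question, "unknown")


def toy_llm(prompt: str) -> str:
    if prompt.startswith("List issues"):
        return "Is it raining?\nIs the person outdoors?"
    return " an umbrella"


result = answer_vqa("street.jpg", "What is the person holding?",
                    Responder(toy_vlm), Seeker(toy_llm), Integrator(toy_llm))
print(result)
```

Returning the candidates and the knowledge base alongside the final answer mirrors the explainability claim: each intermediate question and response stays available for inspection.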

