A Neurosymbolic Agent System for Compositional Visual Reasoning
This addresses a key bottleneck in multi-modal AI for applications requiring complex visual understanding, though it is an incremental improvement over existing neuro-symbolic methods.
The paper tackles the challenge of compositional visual reasoning in vision-language models by introducing VLAgent, a neuro-symbolic agent system that uses a two-stage reasoning process with error detection and verification, achieving superior performance on six benchmarks compared to state-of-the-art models.
The advancement in large language models (LLMs) and large vision models has fueled the rapid progress in multi-modal vision-language reasoning capabilities. However, existing vision-language models (VLMs) remain challenged by compositional visual reasoning. This paper presents VLAgent, a neuro-symbolic approach to developing a Vision-Language Agent system for efficient compositional visual reasoning with three novel features. First, VLAgent develops an interpretable visualization-enhanced two-stage neuro-symbolic reasoning system. The first stage is managed by a front-end engine that generates a structured visual reasoning plan (symbolic program script) for each compositional visual reasoning task by utilizing a pre-trained LLM powered with few-shot chain-of-thought in-context learning. The second stage is managed by a high-performance back-end engine. It transforms the planning script into executable code based on visual input (image or video) and the combination of neural models and symbolic functions and then performs a sequence of actions for the compositional visual reason task. Second, to ensure and enhance the quality of mapping the logic plan to a sequence of executable instructions, VLAgent introduces the SS-parser, which examines the syntax and semantic correctness of the planning script, detects and repairs the logic errors found in the LLM-generated logic plan before generating the executable program. Third, VLAgent introduces the execution verifier in critical reasoning steps to validate and refine its compositional reasoning results in a stepwise manner, for example, ensemble methods for critical visual reasoning and caption analysis for low-confidence compositional reasoning. Extensive experiments on six visual benchmarks compared to a dozen SoTA visual reasoning models show that VLAgent outperforms existing representative approaches to compositional visual reasoning.