LG, AI, RO | Mar 19, 2025

VIPER: Visual Perception and Explainable Reasoning for Sequential Decision-Making

arXiv:2503.15108v3 | 7 citations | h-index: 17
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of multimodal planning for AI agents, though the advance is incremental because it builds on existing VLM and LLM technologies.

The paper tackles visual instruction-based planning by introducing VIPER, a framework that integrates VLM-based perception with LLM-based reasoning. On the ALFWorld benchmark, it significantly outperforms state-of-the-art visual instruction-based planners while narrowing the gap with text-based oracles.

While Large Language Models (LLMs) excel at reasoning on text and Vision-Language Models (VLMs) are highly effective for visual perception, applying these models to visual instruction-based planning remains a largely open problem. In this paper, we introduce VIPER, a novel framework for multimodal instruction-based planning that integrates VLM-based perception with LLM-based reasoning. Our approach uses a modular pipeline where a frozen VLM generates textual descriptions of image observations, which are then processed by an LLM policy to predict actions based on the task goal. We fine-tune the reasoning module using behavioral cloning and reinforcement learning, improving our agent's decision-making capabilities. Experiments on the ALFWorld benchmark show that VIPER significantly outperforms state-of-the-art visual instruction-based planners while narrowing the gap with purely text-based oracles. By leveraging text as an intermediate representation, VIPER also enhances explainability, paving the way for a fine-grained analysis of perception and reasoning components.
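
The modular pipeline described in the abstract (a perception module turns each image into text, and a policy chooses an action from that text and the task goal) can be illustrated with a minimal sketch. This is not the authors' implementation: `ModularAgent`, `toy_describe`, `toy_policy`, and the prompt layout are hypothetical names and choices made for illustration, and the two toy functions are simple stand-ins for a frozen VLM and a fine-tuned LLM.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical interfaces: a real system would wrap a frozen VLM and an LLM here.
Describe = Callable[[bytes], str]   # image observation -> textual description
Policy = Callable[[str], str]       # text prompt -> next action


@dataclass
class Step:
    description: str  # what the perception module reported
    action: str       # what the reasoning module chose


@dataclass
class ModularAgent:
    describe: Describe
    policy: Policy
    goal: str
    trace: List[Step] = field(default_factory=list)

    def build_prompt(self, description: str) -> str:
        # The prompt layout is an assumption, not the format used in the paper.
        lines = [f"Goal: {self.goal}"]
        for step in self.trace:
            lines.append(f"obs: {step.description}")
            lines.append(f"act: {step.action}")
        lines.append(f"obs: {description}")
        lines.append("act:")
        return "\n".join(lines)

    def act(self, image: bytes) -> str:
        description = self.describe(image)            # perception: image -> text
        prompt = self.build_prompt(description)
        action = self.policy(prompt).strip()          # reasoning: text -> action
        self.trace.append(Step(description, action))  # text trace for inspection
        return action


def toy_describe(image: bytes) -> str:
    # Stand-in for a frozen VLM captioner.
    if image == b"frame-with-mug":
        return "You see a mug on the countertop."
    return "You see an empty countertop."


def toy_policy(prompt: str) -> str:
    # Stand-in for the LLM policy: looks only at the latest observation.
    latest_obs = prompt.rsplit("obs: ", 1)[-1]
    return "take mug from countertop" if "mug" in latest_obs else "look"


agent = ModularAgent(toy_describe, toy_policy, goal="put a mug in the sink")
first_action = agent.act(b"frame-empty")
second_action = agent.act(b"frame-with-mug")
```

Because every step stores both the description and the chosen action as text, a failure can be traced to either the perception stage (a wrong description) or the reasoning stage (a wrong action given a correct description), which is the kind of analysis the abstract attributes to the text intermediate representation.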

