AI · May 6

PRISM: Perception Reasoning Interleaved for Sequential Decision Making

arXiv:2605.05407 · 55.5 · h-index: 2
AI Analysis

This work addresses the perception-reasoning gap in multimodal embodied agents, offering a fully automatic framework that improves task-driven scene understanding.

PRISM introduces a closed-loop interaction between a VLM and an LLM to improve perception for embodied agents, achieving state-of-the-art results on ALFWorld and R2R benchmarks.

Scaling LLM-based embodied agents from text-only environments to complex multimodal settings remains a major challenge. Recent work identifies a perception-reasoning-decision gap in standalone Vision-Language Models (VLMs), which often overlook task-critical information. In this paper, we introduce PRISM, a framework that tightly couples perception (VLM) and decision (LLM) through a dynamic question-answer (DQA) pipeline. Instead of passively accepting the VLM's description, the LLM critiques it, probes the VLM with goal-oriented questions, and synthesizes a compact image description. This closed-loop interaction yields a sharp, task-driven understanding of the scene. We evaluate PRISM on the ALFWorld and Room-to-Room (R2R) benchmarks. We show that: (1) PRISM significantly outperforms state-of-the-art image-based models, (2) our interactive goal-oriented perception pipeline yields systematic and substantial gains, and (3) PRISM is fully automatic, eliminating the need for handcrafted questions or answers.
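The critique, probe, and synthesize loop described in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the function name `dqa_perceive`, the `Perception` container, the prompt wording, the `DONE` stop signal, the `max_questions` cap, and the stub models are all assumptions introduced here, and a real system would wrap actual VLM and LLM clients behind the two callables.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical interfaces: a real VLM/LLM client would be wrapped to fit these.
VLM = Callable[[str, str], str]   # (image reference, prompt) -> text answer
LLM = Callable[[str], str]        # prompt -> text completion


@dataclass
class Perception:
    description: str  # compact, goal-focused scene description
    transcript: List[Tuple[str, str]] = field(default_factory=list)  # (question, answer) pairs


def format_history(transcript: List[Tuple[str, str]]) -> str:
    """Render the question-answer pairs collected so far as plain text."""
    return "\n".join(f"Q: {q}\nA: {a}" for q, a in transcript)


def dqa_perceive(image: str, goal: str, vlm: VLM, llm: LLM,
                 max_questions: int = 3) -> Perception:
    # Step 1: the VLM gives an initial, goal-agnostic caption.
    caption = vlm(image, "Describe the scene.")
    transcript: List[Tuple[str, str]] = []

    # Step 2: the LLM critiques what it has and probes the VLM with
    # goal-oriented questions until it is satisfied or the cap is hit.
    for _ in range(max_questions):
        question = llm(
            f"CRITIQUE\nGoal: {goal}\nCaption: {caption}\n"
            f"{format_history(transcript)}\n"
            "Ask one question about missing task-critical detail, or reply DONE."
        ).strip()
        if question.upper() == "DONE":
            break
        transcript.append((question, vlm(image, question)))

    # Step 3: the LLM synthesizes a compact description from caption + answers.
    description = llm(
        f"SYNTHESIZE\nGoal: {goal}\nCaption: {caption}\n"
        f"{format_history(transcript)}\n"
        "Write a compact, task-focused description."
    ).strip()
    return Perception(description, transcript)


# Stub models (assumptions, for demonstration only).
def stub_vlm(image: str, prompt: str) -> str:
    if prompt == "Describe the scene.":
        return "A kitchen with a counter and a fridge."
    return "Yes, a mug is on the counter."


def stub_llm(prompt: str) -> str:
    if prompt.startswith("SYNTHESIZE"):
        return "Kitchen; mug on the counter."
    # Ask one question, then stop once an answer is in the history.
    return "DONE" if "Q:" in prompt else "Is there a mug visible?"


if __name__ == "__main__":
    result = dqa_perceive("frame_0.png", "put a mug in the fridge", stub_vlm, stub_llm)
    print(result.description)
    print(result.transcript)
```

The cap on questions is a design choice of this sketch, added so the loop terminates even if the LLM never signals that it is done; the abstract does not say how PRISM decides when to stop probing.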
