AI | May 6

PRISM: Perception Reasoning Interleaved for Sequential Decision Making

Mohamed Salim Aissi, Clemence Grislain, Clement Romac, Laure Soulier, Mohamed Chetouani, Olivier Sigaud, Nicolas Thome

arXiv:2605.05407 | 55.5 | h-index: 2

AI Analysis

This work addresses the perception-reasoning gap in multimodal embodied agents, offering a fully automatic framework that improves task-driven scene understanding.

PRISM introduces a closed-loop interaction between a VLM and an LLM to improve perception for embodied agents, achieving state-of-the-art results on ALFWorld and R2R benchmarks.

Scaling LLM-based embodied agents from text-only environments to complex multimodal settings remains a major challenge. Recent work identifies a perception-reasoning-decision gap in standalone Vision-Language Models (VLMs), which often overlook task-critical information. In this paper, we introduce PRISM, a framework that tightly couples perception (VLM) and decision (LLM) through a dynamic question-answer (DQA) pipeline. Instead of passively accepting the VLM's description, the LLM critiques it, probes the VLM with goal-oriented questions, and synthesizes a compact image description. This closed-loop interaction yields a sharp, task-driven understanding of the scene. We evaluate PRISM on the ALFWorld and Room-to-Room (R2R) benchmarks. We show that: (1) PRISM significantly outperforms state-of-the-art image-based models, (2) our interactive goal-oriented perception pipeline yields systematic and substantial gains, and (3) PRISM is fully automatic, eliminating the need for handcrafted questions or answers.
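The critique, probe, and synthesize steps described in the abstract can be sketched as a small control loop. This is a minimal illustration, not the authors' implementation: the abstract does not give prompts, output formats, or the number of question rounds, so the function `dqa_perceive`, the `VLM` and `LLM` callable types, the prompt strings, the `max_questions` cap, and the single-round structure are all assumptions made for this sketch. The stub models exist only so that the example runs without a model backend.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical interfaces (not from the paper): plain callables standing in
# for real model clients.
VLM = Callable[[str, str], str]  # (image_ref, prompt) -> text answer
LLM = Callable[[str], str]       # prompt -> text completion


@dataclass
class Perception:
    description: str
    qa_pairs: List[Tuple[str, str]] = field(default_factory=list)


def dqa_perceive(image_ref: str, goal: str, vlm: VLM, llm: LLM,
                 max_questions: int = 3) -> Perception:
    # 1. The VLM gives an initial, goal-agnostic description.
    draft = vlm(image_ref, "Describe the scene.")

    # 2. The LLM critiques the draft and writes goal-oriented questions,
    #    one per line (this output format is an assumption of the sketch).
    raw = llm(f"CRITIQUE\nGoal: {goal}\nDescription: {draft}\n"
              "List the questions needed to act on the goal, one per line.")
    questions = [q.strip() for q in raw.splitlines() if q.strip()]
    questions = questions[:max_questions]

    # 3. The VLM answers each question about the same image.
    qa_pairs = [(q, vlm(image_ref, q)) for q in questions]

    # 4. The LLM synthesizes a compact, task-driven description.
    qa_text = "\n".join(f"Q: {q}\nA: {a}" for q, a in qa_pairs)
    description = llm(f"SYNTHESIZE\nGoal: {goal}\nDescription: {draft}\n"
                      f"{qa_text}\nWrite a compact description.")
    return Perception(description=description, qa_pairs=qa_pairs)


# Stub models so the sketch runs without any model backend.
def fake_vlm(image_ref: str, prompt: str) -> str:
    answers = {
        "Describe the scene.": "A kitchen with a table and a drawer.",
        "Is the mug on the table?": "yes",
        "Is the drawer open?": "no",
    }
    return answers.get(prompt, "unknown")


def fake_llm(prompt: str) -> str:
    if prompt.startswith("CRITIQUE"):
        return "Is the mug on the table?\nIs the drawer open?\n"
    return "Mug on table; drawer closed."


if __name__ == "__main__":
    result = dqa_perceive("frame_0.png", "put the mug in the drawer",
                          fake_vlm, fake_llm)
    print(result.qa_pairs)
    print(result.description)
```

Passing the models in as callables keeps the loop independent of any particular VLM or LLM client. In an agent, the returned description would replace the raw VLM caption as the observation handed to the decision-making LLM.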
