RO · May 29

Notes-to-Self: Scratchpad Augmented VLAs for Memory Dependent Manipulation Tasks

arXiv:2602.21013 · 91.8 · h-index: 14
AI Analysis

This work tackles memory limitations in VLAs for robotic manipulation; overcoming them is crucial for developing more capable and generalizable robotic agents.

This paper addresses the limitation of stateless Vision-Language-Action (VLA) models in handling non-Markovian, memory-dependent manipulation tasks. By integrating a language scratchpad, the authors enable VLAs to store task-specific information and track plans, leading to significant improvements in generalization on ClevrSkills, MemoryBench, and a real-world pick-and-place task.

Many dexterous manipulation tasks are non-Markovian in nature, yet little attention has been paid to this fact in the recent upsurge of the vision-language-action (VLA) paradigm. Although existing VLAs succeed in bringing internet-scale semantic understanding to robotics, they are primarily "stateless" and struggle with memory-dependent, long-horizon tasks. In this work, we explore a way to impart both spatial and temporal memory to a VLA by incorporating a language scratchpad. The scratchpad makes it possible to memorize task-specific information, such as object positions, and it allows the model to keep track of a plan and progress towards subgoals within that plan. We evaluate this approach on a split of memory-dependent tasks from the ClevrSkills environment, on MemoryBench, as well as on a challenging real-world pick-and-place task. We show that incorporating a language scratchpad significantly improves generalization on these tasks for both non-recurrent and recurrent models.
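
The scratchpad idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names ScratchpadAgent and toy_policy, the dictionary observation format, and the "cube_at=" note syntax are all hypothetical, and a hand-written function stands in for the VLA. It only shows the general pattern, in which a policy with no hidden state writes text notes at each step and reads them back at the next one.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical interface: (observation, instruction, notes) -> (action, new notes).
Policy = Callable[[dict, str, str], Tuple[str, str]]


@dataclass
class ScratchpadAgent:
    """Wraps a stateless policy so its own text notes are fed back each step."""
    policy: Policy
    scratchpad: str = ""
    history: List[str] = field(default_factory=list)

    def step(self, obs: dict, instruction: str) -> str:
        # The policy sees only the current observation plus the notes,
        # so the notes are the sole carrier of information across steps.
        action, new_notes = self.policy(obs, instruction, self.scratchpad)
        self.scratchpad = new_notes
        self.history.append(new_notes)
        return action


def toy_policy(obs: dict, instruction: str, notes: str) -> Tuple[str, str]:
    """Toy stand-in for a VLA: note where the cube was, then return to it."""
    if "cube_pos" in obs:
        # Cube is visible: write its position into the notes.
        return "look_away", f"cube_at={obs['cube_pos']}"
    if notes.startswith("cube_at="):
        # Cube is out of view: recover its position from the notes.
        return f"move_to {notes.split('=', 1)[1]}", notes
    # No cube in view and nothing remembered: keep searching.
    return "search", notes


agent = ScratchpadAgent(toy_policy)
first = agent.step({"cube_pos": (0.2, 0.5)}, "return to the cube")
second = agent.step({}, "return to the cube")
```

In this toy run the second observation is empty, so the second action can only be chosen correctly because the position was written to the scratchpad on the first step; the same policy called without notes falls back to searching.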

