RO, LG | Mar 4

MEM: Multi-Scale Embodied Memory for Vision Language Action Models

MIT
arXiv:2603.03596v1 | 10 citations | h-index: 44
Originality: Highly original
AI Analysis

This work addresses long-horizon robotic control for tasks that require remembering past events at multiple levels of granularity, a requirement that arises in complex multi-stage real-world tasks.

The authors introduce a multi-scale embodied memory architecture (MEM) that combines video-based short-horizon memory with text-based long-horizon memory. The resulting robot policies can perform multi-stage tasks that span up to fifteen minutes, such as cleaning up a kitchen or preparing a grilled cheese sandwich.

Conventionally, memory in end-to-end robotic learning involves inputting a sequence of past observations into the learned policy. However, in complex multi-stage real-world tasks, the robot's memory must represent past events at multiple levels of granularity: from long-term memory that captures abstracted semantic concepts (e.g., a robot cooking dinner should remember which stages of the recipe are already done) to short-term memory that captures recent events and compensates for occlusions (e.g., a robot remembering the object it wants to pick up once its arm occludes it). In this work, our main insight is that an effective memory architecture for long-horizon robotic control should combine multiple modalities to capture these different levels of abstraction. We introduce Multi-Scale Embodied Memory (MEM), an approach for mixed-modal long-horizon memory in robot policies. MEM combines video-based short-horizon memory, compressed via a video encoder, with text-based long-horizon memory. Together, they enable robot policies to perform tasks that span up to fifteen minutes, like cleaning up a kitchen, or preparing a grilled cheese sandwich. Additionally, we find that memory enables MEM policies to intelligently adapt manipulation strategies in-context.
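
To make the two memory scales in the abstract concrete, here is a minimal toy sketch, not the authors' implementation. It assumes frames are plain lists of floats, uses elementwise averaging as a stand-in for the video encoder, and stores long-horizon memory as short text notes. The class `MultiScaleMemory`, the helper `mean_pool`, and the `window` parameter are all hypothetical names introduced for illustration.

```python
from collections import deque
from typing import Deque, Dict, List, Sequence, Union

Frame = Sequence[float]  # assumption: a frame is a flat list of floats


def mean_pool(frames: Sequence[Frame]) -> List[float]:
    """Toy stand-in for a video encoder: average the frames elementwise."""
    if not frames:
        return []
    n = len(frames)
    return [sum(values) / n for values in zip(*frames)]


class MultiScaleMemory:
    """Hypothetical two-scale memory: recent frames plus a text log."""

    def __init__(self, window: int = 4) -> None:
        # Short-horizon memory: only the last `window` frames are kept.
        self.frames: Deque[Frame] = deque(maxlen=window)
        # Long-horizon memory: abstract notes, e.g. completed recipe stages.
        self.notes: List[str] = []

    def observe(self, frame: Frame) -> None:
        """Add a new observation; the oldest frame is dropped when full."""
        self.frames.append(frame)

    def note(self, text: str) -> None:
        """Record a semantic event that should persist across the task."""
        self.notes.append(text)

    def context(self) -> Dict[str, Union[List[float], str]]:
        """Build the mixed-modal context a policy would condition on."""
        return {
            "video": mean_pool(list(self.frames)),
            "text": "; ".join(self.notes),
        }


if __name__ == "__main__":
    memory = MultiScaleMemory(window=2)
    for frame in ([0.0, 0.0], [2.0, 4.0], [4.0, 8.0]):
        memory.observe(frame)
    memory.note("bread toasted")
    memory.note("cheese added")
    print(memory.context())
```

The point of the sketch is the separation of concerns: the frame buffer forgets quickly but keeps visual detail (useful under occlusion), while the text log is compact enough to persist for the whole task.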

