CV | LG | Dec 15, 2025

Do-Undo: Generating and Reversing Physical Actions in Vision-Language Models

arXiv:2512.13609v1
h-index: 10
Originality: Incremental advance
AI Analysis

This paper addresses a critical gap in physical reasoning for vision-language models, with applications in embodied AI and robotics. It appears incremental, however, because it builds on prior object-level edit work.

The paper argues that vision-language models lack an understanding of physically plausible scene transformations. To address this, it introduces the Do-Undo task and benchmark, which requires a model to simulate a physical action and then reverse it. Experiments show that current models struggle with this reversal, which the authors present as evidence of the task's importance for embodied AI and robotics.

We introduce the Do-Undo task and benchmark to address a critical gap in vision-language models: understanding and generating physically plausible scene transformations driven by real-world actions. Unlike prior work focused on object-level edits, Do-Undo requires models to simulate the outcome of a physical action and then accurately reverse it, reflecting true cause-and-effect in the visual world. We curate a large-scale dataset of reversible actions from real-world videos and design a training strategy enforcing consistency for robust action grounding. Our experiments reveal that current models struggle with physical reversibility, underscoring the importance of this task for embodied AI, robotics, and physics-aware generative modeling. Do-Undo establishes an intuitive testbed for evaluating and advancing physical reasoning in multimodal systems.
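The do-then-undo idea in the abstract can be illustrated with a toy round-trip check. This is only a minimal sketch under stated assumptions, not the paper's metric or training objective: images are nested lists of floats, the "actions" are hypothetical column shifts, and `do_undo_error`, `shift_right`, and `shift_left` are illustrative names that do not come from the paper.

```python
from typing import Callable, List

# Toy grayscale image as nested lists of floats (an assumption for illustration).
Image = List[List[float]]
Action = Callable[[Image], Image]


def mean_squared_error(a: Image, b: Image) -> float:
    """Average squared pixel difference between two same-sized images."""
    total = 0.0
    count = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            total += (pa - pb) ** 2
            count += 1
    return total / count


def do_undo_error(image: Image, do: Action, undo: Action) -> float:
    """Apply `do`, then `undo`, and measure how far the result is from the start.

    A perfectly reversible pair of actions yields an error of 0.0.
    """
    after_action = do(image)
    restored = undo(after_action)
    return mean_squared_error(image, restored)


# Hypothetical stand-ins for a physical action and its reverse:
# move every pixel one column to the right (wrapping around), and back.
def shift_right(img: Image) -> Image:
    return [row[-1:] + row[:-1] for row in img]


def shift_left(img: Image) -> Image:
    return [row[1:] + row[:1] for row in img]


scene: Image = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
reversible_error = do_undo_error(scene, shift_right, shift_left)
broken_error = do_undo_error(scene, shift_right, shift_right)
```

Here `reversible_error` is zero because `shift_left` exactly inverts `shift_right`, while `broken_error` is positive because the second action does not restore the scene. A real evaluation would replace the toy shifts with a model's generated images and would likely use a perceptual rather than pixel-wise distance.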

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
