CV, Nov 21, 2025

Counterfactual World Models via Digital Twin-conditioned Video Diffusion

Yiqing Shen, Aiza Maksutova, Chenjia Li, Mathias Unberath

arXiv:2511.17481v1 3.6

Originality: Highly original

AI Analysis

This work addresses the need for comprehensive evaluation of physical AI behavior under varying conditions and offers a novel approach to counterfactual reasoning in video simulation.

The paper tackles the problem of enabling world models to answer counterfactual queries, such as predicting visual sequences under hypothetical modifications to scene properties. It introduces CWMDT, a framework that uses digital twins and large language models to condition video diffusion models, and reports state-of-the-art performance on two benchmarks.

World models learn to predict the temporal evolution of visual observations given a control signal, potentially enabling agents to reason about environments through forward simulation. Because of the focus on forward simulation, current world models generate predictions based on factual observations. For many emerging applications, such as comprehensive evaluations of physical AI behavior under varying conditions, the ability of world models to answer counterfactual queries, such as "what would happen if this object was removed?", is of increasing importance. We formalize counterfactual world models that additionally take interventions as explicit inputs, predicting temporal sequences under hypothetical modifications to observed scene properties. Traditional world models operate directly on entangled pixel-space representations where object properties and relationships cannot be selectively modified. This modeling choice prevents targeted interventions on specific scene properties. We introduce CWMDT, a framework to overcome those limitations, turning standard video diffusion models into effective counterfactual world models. First, CWMDT constructs digital twins of observed scenes to explicitly encode objects and their relationships, represented as structured text. Second, CWMDT applies large language models to reason over these representations and predict how a counterfactual intervention propagates through time to alter the observed scene. Third, CWMDT conditions a video diffusion model with the modified representation to generate counterfactual visual sequences. Evaluations on two benchmarks show that the CWMDT approach achieves state-of-the-art performance, suggesting that alternative representations of videos, such as the digital twins considered here, offer powerful control signals for video forward simulation-based world models.
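The three stages described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the class names (`SceneObject`, `Relation`, `DigitalTwin`), the text format produced by `to_text`, and the helper functions are all hypothetical, and the large language model and the video diffusion model are stood in for by plain callables (`reason` and `generate`) that the caller supplies.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class SceneObject:
    """One object in the scene (hypothetical schema)."""
    name: str
    attributes: Tuple[str, ...] = ()


@dataclass(frozen=True)
class Relation:
    """A directed relation between two named objects (hypothetical schema)."""
    subject: str
    predicate: str
    obj: str


@dataclass(frozen=True)
class DigitalTwin:
    """Structured scene description that can be rendered as text."""
    objects: Tuple[SceneObject, ...]
    relations: Tuple[Relation, ...]

    def to_text(self) -> str:
        # Render objects first, then relations, one per line.
        lines = []
        for o in self.objects:
            attrs = ", ".join(o.attributes) if o.attributes else "no attributes"
            lines.append(f"object {o.name}: {attrs}")
        for r in self.relations:
            lines.append(f"relation: {r.subject} {r.predicate} {r.obj}")
        return "\n".join(lines)


def remove_object(twin: DigitalTwin, name: str) -> DigitalTwin:
    """Example intervention: drop an object and every relation touching it."""
    return DigitalTwin(
        objects=tuple(o for o in twin.objects if o.name != name),
        relations=tuple(
            r for r in twin.relations if name not in (r.subject, r.obj)
        ),
    )


def counterfactual_rollout(
    twin: DigitalTwin,
    intervene: Callable[[DigitalTwin], DigitalTwin],
    reason: Callable[[DigitalTwin, int], List[DigitalTwin]],
    generate: Callable[[str], object],
    horizon: int,
) -> list:
    """Apply an intervention, predict future twins, then condition a generator.

    `reason` stands in for the LLM that propagates the intervention through
    time; `generate` stands in for the text-conditioned video diffusion model.
    Both are assumptions supplied by the caller.
    """
    edited = intervene(twin)
    future_twins = reason(edited, horizon)
    return [generate(t.to_text()) for t in future_twins]
```

With trivial stand-ins (a `reason` that repeats the edited twin for each step and a `generate` that returns its conditioning text), removing a cup from a cup-on-table scene yields a conditioning string that no longer mentions the cup or its relation to the table. The point of the structured representation is that the edit is a targeted operation on named objects and relations rather than on pixels.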
