CV · AI · LG · Sep 10, 2025

World Modeling with Probabilistic Structure Integration

arXiv:2509.09737v1 · 5 citations · h-index: 7
Originality: Incremental advance
AI Analysis

This work addresses the challenge of building flexible, interpretable world models for video understanding. Its approach integrates probabilistic modeling with structure extraction, though the advance appears incremental because it builds on existing probabilistic and causal inference methods.

The authors address the problem of learning controllable, promptable world models from video data. They introduce Probabilistic Structure Integration (PSI), a three-step cycle that builds a probabilistic model, extracts low-dimensional structures via causal inference, and integrates those structures back into training. A model trained on 1.4 trillion tokens of internet video is reported to yield state-of-the-art optical flow, self-supervised depth, and object segmentation.

We present Probabilistic Structure Integration (PSI), a system for learning richly controllable and flexibly promptable world models from data. PSI consists of a three-step cycle. The first step, Probabilistic prediction, involves building a probabilistic graphical model Psi of the data, in the form of a random-access autoregressive sequence model. Psi supports a complete set of learned conditional distributions describing the dependence of any variables in the data on any other set of variables. In step 2, Structure extraction, we show how to extract underlying low-dimensional properties in the data, corresponding to a diverse set of meaningful "intermediate structures", in a zero-shot fashion via causal inference on Psi. Step 3, Integration, completes the cycle by converting these structures into new token types that are then continually mixed back into the training diet as conditioning signals and prediction targets. Each such cycle augments the capabilities of Psi, both allowing it to model the underlying data better, and creating new control handles -- akin to an LLM-like universal prompting language. We train an instance of Psi on 1.4 trillion tokens of internet video data; we use it to perform a variety of useful video prediction and understanding inferences; we extract state-of-the-art optical flow, self-supervised depth and object segmentation; and we use these structures to support a full cycle of predictive improvements.
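The three-step cycle described above can be illustrated with a deliberately tiny sketch. This is not the authors' implementation: `ToyPsi`, `extract_flow`, the `cue` variable, and the 1-D positions `x0`/`x1` are all hypothetical names introduced here. A count table stands in for the random-access autoregressive sequence model, and a plain conditional read-out stands in for the causal-inference step, whose actual procedure the abstract does not specify.

```python
from collections import Counter


class ToyPsi:
    """Count-based stand-in for Psi (hypothetical): P(target | any observed subset)."""

    def __init__(self, records):
        self.records = list(records)

    def conditional(self, target, given):
        # Keep records that agree with every observed variable, then normalise counts.
        matches = [r for r in self.records
                   if all(r.get(name) == value for name, value in given.items())]
        counts = Counter(r[target] for r in matches)
        total = sum(counts.values())
        return {value: n / total for value, n in counts.items()}


def extract_flow(psi, record):
    """Zero-shot read-out (assumed): displacement of Psi's most likely next position."""
    dist = psi.conditional("x1", {"x0": record["x0"], "cue": record["cue"]})
    predicted_x1 = max(dist, key=dist.get)
    return predicted_x1 - record["x0"]


# Toy "video": an object at x0 moves right if cue == "a", left if cue == "b".
records = [{"x0": x0, "cue": cue, "x1": x0 + (1 if cue == "a" else -1)}
           for x0 in (1, 2, 3) for cue in ("a", "b")]

# Step 1, probabilistic prediction: query any variable given any other variables.
psi = ToyPsi(records)
ambiguous = psi.conditional("x1", {"x0": 2})

# Step 2, structure extraction: flow is read out of Psi; no flow labels were provided.
flows = [extract_flow(psi, r) for r in records]

# Step 3, integration: flow becomes a new variable that can be conditioned on or predicted.
augmented = [dict(r, flow=f) for r, f in zip(records, flows)]
psi2 = ToyPsi(augmented)
controlled = psi2.conditional("x1", {"x0": 2, "flow": -1})

print(ambiguous)
print(controlled)
```

In this toy, knowing only `x0 = 2` leaves the next position split evenly between 1 and 3, whereas prompting the retrained model with `flow = -1` pins it to 1. That mirrors, in miniature, the abstract's claim that each cycle adds a new control handle.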
