CV · May 22

Learning a Particle Dynamics Model with Real-world Videos

arXiv:2605.23845 · 70.5
AI Analysis

This work addresses the sim-to-real gap in learned physics simulators, a key bottleneck for applying world models to real-world scenarios, by enabling training directly from real-world videos.

The paper introduces a framework to train neural object dynamics models directly from unlabeled real-world videos, using a particle-based model within a Gaussian splatting framework that predicts position and rotation changes. The model is trained via rendering supervision, eliminating the need for particle-level labels, and is evaluated on a new dataset of ~500 videos.

Data-driven learning approaches for physics simulation, sometimes referred to as world models, have emerged as promising alternatives to traditional physics simulators due to their differentiable nature. Prior work has demonstrated impressive results in predicting the motions of rigid and non-rigid objects in complex scenes involving multiple interacting bodies. However, these models are typically trained in simulated environments because obtaining perfect state information such as complete scene point clouds and point correspondences over time is challenging in real-world settings. This reliance on synthetic data can limit their applicability when the sim-to-real gap is large. In this work, we aim to overcome these limitations by introducing a novel framework for training neural object dynamics models directly from unlabeled real-world videos. Specifically, we propose to learn a particle-based dynamics model compatible with a Gaussian splatting framework, which operates on dense particles derived from Gaussians (i.e., particles with scales and rotations) and predicts their position and rotation changes over time. The model is trained via rendering supervision, enabling learning from real-world videos without requiring particle-level labeled states. Our model operates directly on dense Gaussians without relying on heuristic subsampling anchor points. To enable this study, we also present a real-world dataset consisting of about 500 videos capturing diverse object interactions.
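The training signal described in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the `Particle` class, the `apply_update`, `render`, and `photometric_loss` helpers, the isotropic scale, and the orthographic toy renderer are all hypothetical stand-ins chosen for illustration. It only shows the general shape of the idea: each particle carries a position, a rotation, and a scale; a dynamics model outputs per-particle position and rotation changes; and the loss is computed on rendered images rather than on particle-level labels.

```python
import math
from dataclasses import dataclass


@dataclass
class Particle:
    pos: tuple    # (x, y, z) center of the Gaussian
    quat: tuple   # (w, x, y, z) unit quaternion for orientation
    scale: float  # isotropic scale (assumption; real Gaussians have per-axis scales)


def quat_mul(a, b):
    """Hamilton product of two quaternions in (w, x, y, z) order."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (aw * bw - ax * bx - ay * by - az * bz,
            aw * bx + ax * bw + ay * bz - az * by,
            aw * by - ax * bz + ay * bw + az * bx,
            aw * bz + ax * by - ay * bx + az * bw)


def normalize(q):
    n = math.sqrt(sum(c * c for c in q))
    return tuple(c / n for c in q)


def apply_update(p, d_pos, d_quat):
    """Apply a predicted position change and rotation change to one particle."""
    new_pos = tuple(a + b for a, b in zip(p.pos, d_pos))
    new_quat = normalize(quat_mul(normalize(d_quat), p.quat))
    return Particle(new_pos, new_quat, p.scale)


def render(particles, size=8):
    """Toy orthographic splat onto a size x size grid.

    Drops z and ignores rotation because the footprint is isotropic here;
    a real Gaussian splatting renderer uses both.
    """
    img = [[0.0] * size for _ in range(size)]
    for p in particles:
        for r in range(size):
            for c in range(size):
                d2 = (c - p.pos[0]) ** 2 + (r - p.pos[1]) ** 2
                img[r][c] += math.exp(-d2 / (2 * p.scale ** 2))
    return img


def photometric_loss(a, b):
    """Mean squared error between two rendered images."""
    n = len(a) * len(a[0])
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / n


# The "video frame" to match: a rendering of the particle at its true next position.
identity = (1.0, 0.0, 0.0, 0.0)
target = render([Particle((4.0, 2.0, 0.0), identity, 1.0)])

# Current state, plus a hand-written stand-in for the network's prediction:
# move 2 units along x and rotate 90 degrees about the z axis.
particles = [Particle((2.0, 2.0, 0.0), identity, 1.0)]
half = math.sqrt(0.5)
moved = [apply_update(p, (2.0, 0.0, 0.0), (half, 0.0, 0.0, half)) for p in particles]

loss_before = photometric_loss(render(particles), target)
loss_after = photometric_loss(render(moved), target)
```

In an actual training setup the per-particle changes would come from a learned network and the renderer would be differentiable, so the image loss could be backpropagated to the network weights. The sketch only evaluates the loss for a fixed, hand-picked update.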

