CV · AI · Dec 4, 2025

PhyVLLM: Physics-Guided Video Language Model with Motion-Appearance Disentanglement

arXiv:2512.04532v1 · 3 citations · h-index: 28
Originality: Highly original
AI Analysis

The paper addresses the limited ability of Video LLMs to understand physical dynamics, a capability that matters for applications such as robotics and autonomous systems. Its contribution is a novel integration of physics modeling into video-language tasks.

Video LLMs fail in scenarios that require an understanding of physical dynamics. The paper tackles this by proposing PhyVLLM, a physics-guided framework that disentangles motion from appearance and models motion dynamics with a Neural ODE. The authors report that it significantly outperforms state-of-the-art models on both physical reasoning and general video understanding tasks.

Video Large Language Models (Video LLMs) have shown impressive performance across a wide range of video-language tasks. However, they often fail in scenarios requiring a deeper understanding of physical dynamics. This limitation primarily arises from their reliance on appearance-based matching. Incorporating physical motion modeling is crucial for deeper video understanding, but it presents three key challenges: (1) motion signals are often entangled with appearance variations, making it difficult to extract clean physical cues; (2) effective motion modeling requires not only continuous-time motion representations but also the ability to capture physical dynamics; and (3) collecting accurate annotations for physical attributes is costly and often impractical. To address these issues, we propose PhyVLLM, a physics-guided video-language framework that explicitly incorporates physical motion into Video LLMs. Specifically, PhyVLLM disentangles visual appearance and object motion through a dual-branch encoder. To model physical dynamics over time, we incorporate a Neural Ordinary Differential Equation (Neural ODE) module, which generates differentiable representations of physical dynamics. The resulting motion-aware representations are projected into the token space of a pretrained LLM, enabling physics reasoning without compromising the model's original multimodal capabilities. To circumvent the need for explicit physical labels, PhyVLLM models the continuous evolution of object motion in a self-supervised manner. Experimental results demonstrate that PhyVLLM significantly outperforms state-of-the-art Video LLMs on both physical reasoning and general video understanding tasks, highlighting the advantages of incorporating explicit physical modeling.
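
To make the Neural ODE idea in the abstract concrete, here is a minimal sketch of the forward pass only: a small network defines dz/dt for a motion latent z, and a numerical solver integrates it to arbitrary (possibly irregular) timestamps. This is not the authors' implementation. The names `make_mlp`, `vector_field`, `rk4_step`, and `odeint` are hypothetical, as are the latent size, the random untrained weights, and the choice of a fixed-step RK4 solver; the abstract does not specify any of these.

```python
import math
import random


def make_mlp(dim, hidden, seed=0):
    """Random (untrained) weights for a two-layer tanh MLP; illustrative only."""
    rng = random.Random(seed)
    # The first layer takes the latent z plus the scalar time t, hence dim + 1 inputs.
    w1 = [[rng.gauss(0, 0.5) for _ in range(dim + 1)] for _ in range(hidden)]
    w2 = [[rng.gauss(0, 0.5) for _ in range(hidden)] for _ in range(dim)]
    return w1, w2


def vector_field(params, z, t):
    """Learned dynamics f(z, t) = dz/dt, here a tiny MLP."""
    w1, w2 = params
    x = list(z) + [t]
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w1]
    return [sum(w * hi for w, hi in zip(row, h)) for row in w2]


def rk4_step(f, z, t, dt):
    """One classical fourth-order Runge-Kutta step for dz/dt = f(z, t)."""
    def shifted(k, scale):
        return [zi + scale * ki for zi, ki in zip(z, k)]

    k1 = f(z, t)
    k2 = f(shifted(k1, dt / 2), t + dt / 2)
    k3 = f(shifted(k2, dt / 2), t + dt / 2)
    k4 = f(shifted(k3, dt), t + dt)
    return [zi + dt / 6 * (a + 2 * b + 2 * c + d)
            for zi, a, b, c, d in zip(z, k1, k2, k3, k4)]


def odeint(f, z0, times, substeps=4):
    """Integrate from times[0] through each later timestamp; returns one state per timestamp."""
    z = list(z0)
    trajectory = [list(z)]
    for t0, t1 in zip(times, times[1:]):
        dt = (t1 - t0) / substeps
        t = t0
        for _ in range(substeps):
            z = rk4_step(f, z, t, dt)
            t += dt
        trajectory.append(list(z))
    return trajectory


# Assumed setup: a 4-dimensional motion latent queried at irregular frame times.
params = make_mlp(dim=4, hidden=8)
z0 = [0.1, -0.2, 0.3, 0.0]
times = [0.0, 0.25, 0.5, 1.0]
trajectory = odeint(lambda z, t: vector_field(params, z, t), z0, times)
```

Because the state is defined at any real-valued time, the same latent can be queried between or beyond observed frames. A self-supervised objective could, for instance, compare each predicted state with the motion latent encoded from the frame at that timestamp, which needs no physical labels; whether PhyVLLM uses this particular loss is not stated in the abstract.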

