CV · Feb 5

Exploring Physical Intelligence Emergence via Omni-Modal Architecture and Physical Data Engine

arXiv:2602.07064v1 · h-index: 27
Originality: Incremental advance
AI Analysis

This work aims to improve physical understanding in omni-modal models, a significant problem for AI systems that must interact robustly with the physical world. Its contribution is an incremental advance in data generation and model architecture.

This paper addresses the brittleness of physical understanding in omni-modal models by introducing OmniFysics, a compact model that unifies understanding across images, audio, video, and text and integrates speech and image generation. It relies on a physical data engine with two components: FysicsAny, which generates physics-grounded instruction-image supervision, and FysicsOmniCap, which distills web videos into high-fidelity video-instruction pairs. The authors report improved results on physics-oriented evaluations.

Physical understanding remains brittle in omni-modal models because key physical attributes are visually ambiguous and sparsely represented in web-scale data. We present OmniFysics, a compact omni-modal model that unifies understanding across images, audio, video, and text, with integrated speech and image generation. To inject explicit physical knowledge, we build a physical data engine with two components. FysicsAny produces physics-grounded instruction-image supervision by mapping salient objects to verified physical attributes through hierarchical retrieval over a curated prototype database, followed by physics-law-constrained verification and caption rewriting. FysicsOmniCap distills web videos via audio-visual consistency filtering to generate high-fidelity video-instruction pairs emphasizing cross-modal physical cues. We train OmniFysics with staged multimodal alignment and instruction tuning, adopt latent-space flow matching for text-to-image generation, and use an intent router to activate generation only when needed. Experiments show competitive performance on standard multimodal benchmarks and improved results on physics-oriented evaluations.
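
The FysicsAny steps named in the abstract (hierarchical retrieval over a prototype database, physics-law-constrained verification, caption rewriting) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the `Prototype` record, the four-entry database, the token-overlap matching, and the choice of mass = density × volume as the physical law are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prototype:
    """Hypothetical prototype record; the field names are assumptions."""
    category: str          # coarse level of the hierarchy, e.g. "metal"
    name: str              # fine level, e.g. "steel ball"
    density_kg_m3: float
    volume_m3: float
    mass_kg: float


# Toy stand-in for a curated prototype database (values are illustrative).
PROTOTYPES = [
    Prototype("metal", "steel ball", 7850.0, 1.0e-4, 0.785),
    Prototype("metal", "aluminium can", 2700.0, 1.5e-5, 0.0405),
    Prototype("wood", "oak block", 750.0, 1.0e-3, 0.75),
    # Deliberately inconsistent entry: 160 * 1e-3 = 0.16 kg, not 0.5 kg.
    Prototype("wood", "balsa block", 160.0, 1.0e-3, 0.5),
]


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap between the lowercase word sets of two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def retrieve(category: str, description: str, database=PROTOTYPES):
    """Two-stage retrieval: filter by coarse category, then pick the best name match."""
    candidates = [p for p in database if p.category == category]
    if not candidates:
        return None
    return max(candidates, key=lambda p: token_overlap(description, p.name))


def obeys_mass_law(p: Prototype, rel_tol: float = 0.05) -> bool:
    """Check the stored mass against density * volume within a relative tolerance."""
    expected = p.density_kg_m3 * p.volume_m3
    return abs(p.mass_kg - expected) <= rel_tol * expected


def physics_caption(category: str, description: str, base_caption: str) -> str:
    """Append a physical attribute to the caption only if the prototype passes verification."""
    proto = retrieve(category, description)
    if proto is None or not obeys_mass_law(proto):
        return base_caption  # leave the caption unchanged when nothing verified is available
    return (
        f"{base_caption} The {proto.name} has a density of about "
        f"{proto.density_kg_m3:.0f} kg/m^3."
    )


if __name__ == "__main__":
    print(physics_caption("metal", "shiny steel ball", "A ball rolls down a ramp."))
    print(physics_caption("wood", "light balsa block", "A block sits on a table."))
```

In this sketch the steel ball passes verification and its caption gains a density statement, while the balsa entry fails the mass check and its caption is returned unchanged. Falling back to the original caption rather than emitting an unverified attribute is a design choice of the sketch, not something the abstract specifies.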

