SLAP: The Semantic Least Action Principle for Variational Video-Language Modeling
This work addresses the problem of maintaining semantic consistency over long video horizons in Large Video-Language Models.
This paper addresses the temporal gap in Large Video-Language Models (LVLMs) caused by sparse frame sampling, which leaves models blind to critical causal transitions. Existing remedies based on generative hallucination or autoregressive extrapolation suffer from object vanishing and energetic instability over long horizons. The authors propose the Semantic Least Action Principle (SLAP), which models the latent video trajectory as a path on a Riemannian manifold governed by a Semantic Lagrangian and enforces object persistence without pixel-level rendering.
In the era of Large Video-Language Models (LVLMs), the computational necessity of sparse frame sampling creates a fundamental ``temporal gap'', rendering models blind to critical causal transitions. Existing solutions relying on generative hallucination (e.g., latent diffusion) or autoregressive extrapolation often fail to maintain semantic consistency over long horizons, suffering from object vanishing and energetic instability. We propose a paradigm shift from probabilistic generation to variational mechanics with the \textbf{Semantic Least Action Principle (SLAP)}. Drawing a rigorous isomorphism between classical mechanics and semantic dynamics, we model the latent video trajectory as a path on a Riemannian manifold governed by a Semantic Lagrangian. By formulating the interpolation task as a Boundary Value Problem (BVP) solved via the discrete Euler-Lagrange equations, SLAP naturally enforces object persistence without pixel-level rendering. Extensive experiments show the effectiveness of our proposed SLAP.
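To make the boundary value formulation concrete, the following is a minimal sketch of the standard discrete Euler-Lagrange construction. The abstract does not state the form of the Semantic Lagrangian, so the symbols here are illustrative assumptions: latent states $z_0, \dots, z_N$ with $z_0$ and $z_N$ fixed by two sampled keyframes, a step size $h$, a constant metric $G$ (a simplification, since a Riemannian metric generally depends on $z$), and a generic potential $V$.

\begin{align}
S_d(z_0, \dots, z_N) &= \sum_{k=0}^{N-1} L_d(z_k, z_{k+1}), \\
D_2 L_d(z_{k-1}, z_k) + D_1 L_d(z_k, z_{k+1}) &= 0, \qquad k = 1, \dots, N-1,
\end{align}

where $D_1$ and $D_2$ denote derivatives with respect to the first and second argument. Under the assumed discretization
\begin{equation}
L_d(z_k, z_{k+1}) = h \left[ \frac{1}{2} \left( \frac{z_{k+1} - z_k}{h} \right)^{\top} G \left( \frac{z_{k+1} - z_k}{h} \right) - V(z_k) \right],
\end{equation}
the stationarity condition reduces to
\begin{equation}
G \, \frac{z_{k+1} - 2 z_k + z_{k-1}}{h^2} = -\nabla V(z_k), \qquad k = 1, \dots, N-1.
\end{equation}
In the special case $V \equiv 0$ this gives $z_{k+1} - 2 z_k + z_{k-1} = 0$, whose solution with the two endpoints fixed is the straight-line interpolation $z_k = z_0 + \tfrac{k}{N}(z_N - z_0)$. In this simplified setting, any departure of the intermediate states from linear interpolation therefore comes from the potential term; with a state-dependent metric it would also come from the curvature of the manifold.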