CV · Oct 1, 2025

From Seeing to Predicting: A Vision-Language Framework for Trajectory Forecasting and Controlled Video Generation

arXiv:2510.00806v1 · 1 citation · h-index: 8
Originality: Incremental advance
AI Analysis

The paper addresses physically unrealistic motion in generated video, which matters for simulation and media applications. The advance is incremental because it builds on existing trajectory-prediction and video-generation methods.

The paper tackles physically inconsistent motion in video generation with TrajVLM-Gen, a two-stage framework: a Vision Language Model first predicts coarse, physics-aware motion trajectories, which then guide video generation through attention-based mechanisms for fine-grained motion refinement. It reports FVD scores of 545 on UCF-101 and 539 on MSR-VTT.

Current video generation models produce physically inconsistent motion that violates real-world dynamics. We propose TrajVLM-Gen, a two-stage framework for physics-aware image-to-video generation. First, we employ a Vision Language Model to predict coarse-grained motion trajectories that maintain consistency with real-world physics. Second, these trajectories guide video generation through attention-based mechanisms for fine-grained motion refinement. We build a trajectory prediction dataset based on video tracking data with realistic motion patterns. Experiments on UCF-101 and MSR-VTT demonstrate that TrajVLM-Gen outperforms existing methods, achieving competitive FVD scores of 545 on UCF-101 and 539 on MSR-VTT.
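For readers unfamiliar with the FVD numbers quoted above: Fréchet Video Distance compares the statistics of real and generated videos in a learned feature space, and lower is better. The sketch below shows only the Fréchet-distance step, assuming per-video feature vectors have already been extracted (the standard metric uses a pretrained I3D network, which is omitted here). The function name `frechet_distance` and the random stand-in features are illustrative assumptions, not the paper's evaluation code.

```python
import numpy as np


def frechet_distance(feats_real, feats_gen):
    """Fréchet distance between Gaussians fitted to two feature sets.

    Both inputs have shape (num_videos, feature_dim). In FVD the features
    would come from a pretrained video network; here they are just arrays.
    """
    mu_r = feats_real.mean(axis=0)
    mu_g = feats_gen.mean(axis=0)
    cov_r = np.atleast_2d(np.cov(feats_real, rowvar=False))
    cov_g = np.atleast_2d(np.cov(feats_gen, rowvar=False))

    # Tr((cov_r cov_g)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_r @ cov_g, which are real and non-negative for
    # positive semi-definite inputs (clip guards against round-off).
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * trace_sqrt)


# Stand-in features (assumption): random vectors instead of real embeddings.
rng = np.random.default_rng(0)
real = rng.normal(size=(200, 4))
generated = real + 3.0  # same spread, mean shifted by 3 in each dimension

same_score = frechet_distance(real, real)
shifted_score = frechet_distance(real, generated)
```

With identical feature sets the distance is zero up to round-off. A pure mean shift leaves the covariance terms cancelling, so the score reduces to the squared distance between the means.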

