CV | Oct 23, 2025

Evaluating Video Models as Simulators of Multi-Person Pedestrian Trajectories

arXiv:2510.20182v1 | h-index: 1
Originality: Incremental advance
AI Analysis

This paper addresses the need to verify whether multi-agent dynamics in generated videos are plausible, which matters for applications in simulation and AI. It is an incremental advance that targets a specific gap in existing benchmarks: scenes with multiple interacting people.

The paper evaluates video generation models as simulators of multi-person pedestrian trajectories. It proposes an evaluation protocol for text-to-video and image-to-video models and finds that leading models have learned effective priors for plausible multi-agent behavior, but still exhibit failure modes such as people merging or disappearing.

Large-scale video generation models have demonstrated high visual realism in diverse contexts, spurring interest in their potential as general-purpose world simulators. Existing benchmarks focus on individual subjects rather than scenes with multiple interacting people. However, the plausibility of multi-agent dynamics in generated videos remains unverified. We propose a rigorous evaluation protocol to benchmark text-to-video (T2V) and image-to-video (I2V) models as implicit simulators of pedestrian dynamics. For I2V, we leverage start frames from established datasets to enable comparison with a ground truth video dataset. For T2V, we develop a prompt suite to explore diverse pedestrian densities and interactions. A key component is a method to reconstruct 2D bird's-eye view trajectories from pixel-space without known camera parameters. Our analysis reveals that leading models have learned surprisingly effective priors for plausible multi-agent behavior. However, failure modes like merging and disappearing people highlight areas for future improvement.
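To make the bird's-eye view idea concrete, here is a minimal sketch of projecting pixel-space foot positions onto a ground plane with a planar homography. This is not the paper's method: the abstract says the reconstruction works without known camera parameters, whereas this sketch assumes a 3x3 homography is already available. The function name `pixels_to_ground`, the matrix `H_example`, and the sample track are all hypothetical and chosen only for illustration.

```python
import numpy as np


def pixels_to_ground(points_px, H):
    """Map N x 2 pixel points to ground-plane coordinates using a 3x3 homography H.

    H is assumed to be given here; the paper reconstructs trajectories
    without known camera parameters, which this sketch does not attempt.
    """
    pts = np.asarray(points_px, dtype=float)
    # Append a column of ones to move to homogeneous coordinates.
    homog = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = homog @ np.asarray(H, dtype=float).T
    # Perspective divide brings the points back to 2D.
    return mapped[:, :2] / mapped[:, 2:3]


# Hypothetical homography: a pure scale of 0.01 (pixels to metres).
H_example = np.diag([0.01, 0.01, 1.0])

# Hypothetical pixel-space track of one pedestrian's foot position.
track_px = [(100.0, 400.0), (120.0, 390.0), (140.0, 380.0)]
track_bev = pixels_to_ground(track_px, H_example)
```

Applying the same mapping to every tracked person in every frame would yield one 2D trajectory per pedestrian, which is the kind of representation that can then be compared against ground-truth trajectory data.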

