CV · AI · Mar 3

ShareVerse: Multi-Agent Consistent Video Generation for Shared World Modeling

arXiv:2603.02697v1 · 2 citations · h-index: 4
Originality: Highly original
AI Analysis

Existing video generation frameworks lack support for multi-agent interaction. This work fills that gap with a shared world model aimed at applications such as autonomous driving and robotics.

ShareVerse tackles multi-agent shared world modeling in video generation: it keeps the generated world consistent across agents and accurately perceives the positions of dynamic agents. The framework supports 49-frame large-scale video generation.

This paper presents ShareVerse, a video generation framework that enables multi-agent shared world modeling, addressing the gap in existing works, which lack support for constructing a unified shared world with multi-agent interaction. ShareVerse leverages the generation capability of large video models and integrates three key innovations: 1) We build a dataset for large-scale multi-agent interactive world modeling on the CARLA simulation platform, featuring diverse scenes, weather conditions, and interactive trajectories with paired multi-view videos (front/rear/left/right views per agent) and camera data. 2) We propose a spatial concatenation strategy for the four-view videos of each independent agent to model a broader environment and to ensure internal multi-view geometric consistency. 3) We integrate cross-agent attention blocks into the pretrained video model; these enable interactive transmission of spatial-temporal information across agents, guaranteeing shared world consistency in overlapping regions and reasonable generation in non-overlapping regions. ShareVerse, which supports 49-frame large-scale video generation, accurately perceives the positions of dynamic agents and achieves consistent shared world modeling.
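The spatial concatenation and cross-agent attention ideas in the abstract can be illustrated with a small NumPy sketch. This is not the paper's implementation: the side-by-side layout along the width axis, the view ordering, the tiny tensor sizes, and the single-head attention without learned projections are all assumptions made for illustration, and the names `concat_views` and `cross_agent_attention` are hypothetical.

```python
import numpy as np

# Assumed view order; the abstract names the four views but not their layout.
VIEWS = ("front", "rear", "left", "right")


def concat_views(views):
    """Join one agent's four view clips side by side along the width axis.

    views: dict mapping view name -> array of shape (T, H, W, C).
    Returns an array of shape (T, H, 4 * W, C).
    """
    return np.concatenate([views[name] for name in VIEWS], axis=2)


def cross_agent_attention(tokens_a, tokens_b):
    """Single-head attention in which agent A's tokens query agent B's tokens.

    tokens_a: (Na, D), tokens_b: (Nb, D). Returns (Na, D).
    Hypothetical simplification: queries, keys and values are the raw tokens
    (no learned projections), and the result is added back as a residual.
    """
    d = tokens_a.shape[-1]
    scores = tokens_a @ tokens_b.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over B's tokens
    return tokens_a + weights @ tokens_b


# Toy data: 49 frames per clip, tiny spatial size so the example runs quickly.
rng = np.random.default_rng(0)
T, H, W, C = 49, 2, 3, 3
agent_a = {name: rng.standard_normal((T, H, W, C)) for name in VIEWS}
agent_b = {name: rng.standard_normal((T, H, W, C)) for name in VIEWS}

wide_a = concat_views(agent_a)  # shape (49, 2, 12, 3)
wide_b = concat_views(agent_b)

# Flatten each agent's concatenated clip into a sequence of C-dim tokens.
tokens_a = wide_a.reshape(-1, C)
tokens_b = wide_b.reshape(-1, C)
fused_a = cross_agent_attention(tokens_a, tokens_b)
```

Concatenating the views into one wide frame lets a single video model see all four directions of an agent at once, while the attention step is where information from the other agent enters. A real model would apply learned projections and operate on latent tokens rather than raw pixels.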
