CV · AI · LG · Apr 2

ActionParty: Multi-Subject Action Binding in Generative Video Games

arXiv:2604.02330 · 93.91 citations
AI Analysis

ActionParty addresses the single-agent limitation of current generative video game world models. For developers and researchers, it represents a novel method rather than an incremental improvement.

The paper tackles action binding in video diffusion models, which struggle to associate specific actions with the correct subject when several are present. It proposes ActionParty, a multi-subject world model that the authors describe as the first video world model able to control up to seven players simultaneously across 46 environments, and it reports significant improvements in action-following accuracy and identity consistency.

Recent advances in video diffusion have enabled the development of "world models" capable of simulating interactive environments. However, these models are largely restricted to single-agent settings, failing to control multiple agents simultaneously in a scene. In this work, we tackle a fundamental issue of action binding in existing video diffusion models, which struggle to associate specific actions with their corresponding subjects. For this purpose, we propose ActionParty, an action-controllable multi-subject world model for generative video games. It introduces subject state tokens, i.e., latent variables that persistently capture the state of each subject in the scene. By jointly modeling state tokens and video latents with a spatial biasing mechanism, we disentangle global video frame rendering from individual action-controlled subject updates. We evaluate ActionParty on the Melting Pot benchmark, demonstrating the first video world model capable of controlling up to seven players simultaneously across 46 diverse environments. Our results show significant improvements in action-following accuracy and identity consistency, while enabling robust autoregressive tracking of subjects through complex interactions.
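The abstract does not spell out how the spatial biasing mechanism is implemented. As a rough illustration only, the sketch below shows one way per-subject state tokens could attend to a grid of video latents with an additive, distance-based attention bias. The Gaussian form of the bias, the `sigma` parameter, and the function names `spatial_bias` and `biased_cross_attention` are all assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def spatial_bias(subject_xy, grid_h, grid_w, sigma=1.5):
    """Hypothetical additive attention bias of shape (subjects, patches).

    The bias is 0 at a subject's own grid position and grows more negative
    with squared distance (an assumed Gaussian-style falloff).
    """
    ys, xs = np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij")
    patch_xy = np.stack([xs.ravel(), ys.ravel()], axis=-1).astype(float)  # (P, 2)
    d2 = ((subject_xy[:, None, :] - patch_xy[None, :, :]) ** 2).sum(-1)   # (S, P)
    return -d2 / (2.0 * sigma ** 2)

def biased_cross_attention(state_tokens, video_latents, bias):
    """Each state token reads from the video latents; the bias favours nearby patches."""
    dim = state_tokens.shape[-1]
    logits = state_tokens @ video_latents.T / np.sqrt(dim) + bias  # (S, P)
    logits = logits - logits.max(axis=-1, keepdims=True)           # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)        # softmax over patches
    return weights @ video_latents, weights

# Toy setup: a 4x4 latent grid and two subjects at opposite corners.
rng = np.random.default_rng(0)
grid_h, grid_w, dim = 4, 4, 8
video_latents = rng.normal(size=(grid_h * grid_w, dim))  # one latent per patch
state_tokens = rng.normal(size=(2, dim))                 # one token per subject
subject_xy = np.array([[0.0, 0.0], [3.0, 3.0]])          # (x, y) grid positions

bias = spatial_bias(subject_xy, grid_h, grid_w)
updated_tokens, weights = biased_cross_attention(state_tokens, video_latents, bias)
```

In this toy version each subject's token is updated mostly from patches near its own position, which is one plausible reading of how a spatial bias could keep per-subject updates separate from global frame rendering. The actual model may differ substantially.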
