CV · May 21, 2025

Interspatial Attention for Efficient 4D Human Video Generation

arXiv:2505.15800v2 · 8 citations · h-index: 27 · Has Code
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of high-quality human video generation for applications such as virtual reality and entertainment, and represents an incremental advance over existing methods.

The paper tackles the problem of generating photorealistic, controllable videos of digital humans by introducing an interspatial attention mechanism, achieving state-of-the-art performance with improved motion consistency and identity preservation.

Generating photorealistic videos of digital humans in a controllable manner is crucial for a plethora of applications. Existing approaches either build on methods that employ template-based 3D representations or on emerging video generation models, but suffer from poor quality or limited consistency and identity preservation when generating individual or multiple digital humans. In this paper, we introduce a new interspatial attention (ISA) mechanism as a scalable building block for modern diffusion transformer (DiT)-based video generation models. ISA is a new type of cross attention that uses relative positional encodings tailored for the generation of human videos. Leveraging a custom-developed video variational autoencoder, we train a latent ISA-based diffusion model on a large corpus of video data. Our model achieves state-of-the-art performance for 4D human video synthesis, demonstrating remarkable motion consistency and identity preservation while providing precise control of the camera and body poses. Our code and model are publicly released at https://dsaurus.github.io/isa4d/.
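The abstract describes ISA only as a form of cross attention with relative positional encodings and does not give its formulation. As a rough illustration of that general idea, and not the paper's actual ISA layer, the sketch below adds a bias that depends only on the offset between query and key positions to ordinary single-head cross attention. The function name `relative_cross_attention`, the use of 3D positions, and the distance-based bias are all assumptions made for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def relative_cross_attention(queries, keys, values, q_pos, k_pos, scale=1.0):
    """Single-head cross attention with a relative positional bias.

    Hypothetical illustration only; not the ISA layer from the paper.
    queries: (Nq, d)  e.g. video latent tokens
    keys:    (Nk, d)  e.g. conditioning tokens (such as body-pose tokens)
    values:  (Nk, dv)
    q_pos:   (Nq, 3)  assumed 3D position attached to each query token
    k_pos:   (Nk, 3)  assumed 3D position attached to each key token
    """
    d = queries.shape[-1]
    # Standard scaled dot-product logits, shape (Nq, Nk).
    logits = queries @ keys.T / np.sqrt(d)
    # Pairwise offsets between query and key positions, shape (Nq, Nk, 3).
    rel = q_pos[:, None, :] - k_pos[None, :, :]
    # Assumed bias: nearby tokens get larger logits. It depends only on the
    # offset, so it is unchanged when both position sets are shifted together.
    bias = -scale * np.linalg.norm(rel, axis=-1)
    weights = softmax(logits + bias, axis=-1)
    return weights @ values, weights


# Toy usage with random tokens: 6 query tokens attend to 4 conditioning tokens.
rng = np.random.default_rng(0)
q = rng.normal(size=(6, 8))
k = rng.normal(size=(4, 8))
v = rng.normal(size=(4, 8))
q_pos = rng.normal(size=(6, 3))
k_pos = rng.normal(size=(4, 3))
out, attn = relative_cross_attention(q, k, v, q_pos, k_pos)
print(out.shape)  # (6, 8)
```

Because the bias is a function of position differences rather than absolute positions, the attention weights in this sketch are invariant to a common translation of the query and key positions. That invariance is the usual motivation for relative encodings in general; whether and how ISA exploits it is not stated in the abstract.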

