CV · Mar 16

AnyCrowd: Instance-Isolated Identity-Pose Binding for Arbitrary Multi-Character Animation

arXiv:2603.15415 · 83.8 · h-index: 17
AI Analysis

This work addresses scalable multi-character animation, an underexplored problem in video generation and computer graphics. It is an incremental advance built on three specific components: an instance-isolated latent representation, a three-stage decoupled attention scheme, and an adaptive gated fusion module.

The paper tackles two failure modes of multi-character animation: identity entanglement and identity-pose mis-binding. It proposes AnyCrowd, a Diffusion Transformer-based framework that scales to an arbitrary number of characters and reports improved controllability and consistency in generated videos.

Controllable character animation has advanced rapidly in recent years, yet multi-character animation remains underexplored. As the number of characters grows, multi-character reference encoding becomes more susceptible to latent identity entanglement, resulting in identity bleeding and reduced controllability. Moreover, learning precise and spatio-temporally consistent correspondences between reference identities and driving pose sequences becomes increasingly challenging, often leading to identity-pose mis-binding and inconsistency in generated videos. To address these challenges, we propose AnyCrowd, a Diffusion Transformer (DiT)-based video generation framework capable of scaling to an arbitrary number of characters. Specifically, we first introduce an Instance-Isolated Latent Representation (IILR), which encodes character instances independently prior to DiT processing to prevent latent identity entanglement. Building on this disentangled representation, we further propose Tri-Stage Decoupled Attention (TSDA) to bind identities to driving poses by decomposing self-attention into: (i) instance-aware foreground attention, (ii) background-centric interaction, and (iii) global foreground-background coordination. Furthermore, to mitigate token ambiguity in overlapping regions, an Adaptive Gated Fusion (AGF) module is integrated within TSDA to predict identity-aware weights, effectively fusing competing token groups into identity-consistent representations...
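The abstract does not give implementation details, so the following is only a minimal NumPy sketch of how a three-stage attention decomposition and a gated fusion step could look. Everything in it is an assumption made for illustration, not the authors' code: the token labeling (0 for background, k > 0 for character k), the boolean mask construction, the self-attention fallback on the diagonal, and the names `stage_masks`, `masked_attention`, `gated_fusion`, and `w_gate`.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_attention(q, k, v, mask):
    # Scaled dot-product attention; mask[i, j] is True where
    # query token i is allowed to attend to key token j.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -1e9)
    return softmax(scores, axis=-1) @ v

def stage_masks(instance_ids):
    # instance_ids: (N,) ints, 0 = background, k > 0 = character k
    # (hypothetical labeling, assumed for this sketch).
    ids = np.asarray(instance_ids)
    fg = ids > 0
    eye = np.eye(len(ids), dtype=bool)
    # Stage (i): foreground tokens attend only within their own instance.
    same_instance = (ids[:, None] == ids[None, :]) & fg[:, None] & fg[None, :]
    # Stage (ii): background tokens attend only to background tokens.
    background = ~fg[:, None] & ~fg[None, :]
    # The diagonal is kept so every row has at least one allowed key.
    stage1 = same_instance | eye
    stage2 = background | eye
    # Stage (iii): unrestricted foreground-background coordination.
    stage3 = np.ones((len(ids), len(ids)), dtype=bool)
    return stage1, stage2, stage3

def gated_fusion(candidates, w_gate):
    # candidates: (G, N, D) competing token groups for the same N positions.
    # w_gate: (D,) hypothetical learned projection producing gate logits.
    logits = candidates @ w_gate          # (G, N)
    weights = softmax(logits, axis=0)     # normalized across the G groups
    fused = (weights[..., None] * candidates).sum(axis=0)  # (N, D)
    return fused, weights
```

Under these assumptions, a token that is the only member of its instance attends solely to itself in stage (i), and the fusion weights at each position sum to one across the competing groups. How the actual model orders the stages, predicts the gates, or handles overlapping regions is not specified in the abstract.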

