CV · Feb 3

ConsisDrive: Identity-Preserving Driving World Models for Video Generation by Instance Mask

arXiv:2602.03213v1 · 1 citation · h-index: 2
AI Analysis

ConsisDrive addresses identity consistency in generated driving video, giving autonomous driving systems a cost-effective source of realistic training data. The contribution appears incremental, since it builds on existing driving world models.

The paper tackles identity drift in driving world models for video generation, where the same object changes appearance across frames. It introduces ConsisDrive, which adds instance-masked attention and an instance-masked loss, and reports state-of-the-art generation quality along with improvements in downstream tasks on the nuScenes dataset.

Autonomous driving relies on robust models trained on large-scale, high-quality multi-view driving videos. Although world models provide a cost-effective solution for generating realistic driving data, they often suffer from identity drift, where the same object changes its appearance or category across frames due to the absence of instance-level temporal constraints. We introduce ConsisDrive, an identity-preserving driving world model designed to enforce temporal consistency at the instance level. Our framework incorporates two key components: (1) Instance-Masked Attention, which applies instance identity masks and trajectory masks within attention blocks to ensure that visual tokens interact only with their corresponding instance features across spatial and temporal dimensions, thereby preserving object identity consistency; and (2) Instance-Masked Loss, which adaptively emphasizes foreground regions with probabilistic instance masking, reducing background noise while maintaining overall scene fidelity. By integrating these mechanisms, ConsisDrive achieves state-of-the-art driving video generation quality and demonstrates significant improvements in downstream autonomous driving tasks on the nuScenes dataset. Our project page is https://shanpoyang654.github.io/ConsisDrive/page.html.
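
The two components described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the use of -1 as a background id, the per-token instance ids standing in for the identity and trajectory masks, and the particular form of the probabilistic loss (restrict the squared error to foreground pixels with probability `p_mask`) are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np


def instance_attention_mask(token_ids, feature_ids):
    """Boolean mask of shape (num_tokens, num_features).

    Entry [i, j] is True when visual token i and instance feature j carry
    the same instance id, so token i may attend to feature j.
    """
    token_ids = np.asarray(token_ids)
    feature_ids = np.asarray(feature_ids)
    return token_ids[:, None] == feature_ids[None, :]


def masked_cross_attention(queries, keys, values, mask):
    """Scaled dot-product attention restricted to the pairs allowed by `mask`."""
    scale = 1.0 / np.sqrt(queries.shape[-1])
    scores = (queries @ keys.T) * scale
    scores = np.where(mask, scores, -np.inf)
    # Tokens with no allowed key (e.g. background) get a zero output, not NaN.
    has_key = mask.any(axis=-1, keepdims=True)
    scores = np.where(has_key, scores, 0.0)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores) * mask
    weights = weights / np.maximum(weights.sum(axis=-1, keepdims=True), 1e-12)
    return weights @ values


def instance_masked_loss(pred, target, fg_mask, p_mask=0.5, rng=None):
    """Squared error averaged over foreground pixels with probability p_mask,
    and over the whole frame otherwise (assumed form, see lead-in)."""
    rng = np.random.default_rng() if rng is None else rng
    sq_err = (np.asarray(pred) - np.asarray(target)) ** 2
    fg = np.asarray(fg_mask, dtype=bool)
    if fg.any() and rng.random() < p_mask:
        return float(sq_err[fg].mean())
    return float(sq_err.mean())


# Two frames of three tokens each, flattened over time; -1 marks background.
# Instance 0 and instance 1 swap positions between the frames, so the ids
# follow each object's trajectory rather than a fixed image location.
token_ids = [0, 1, -1, 1, 0, -1]
feature_ids = [0, 1]

demo_rng = np.random.default_rng(0)
queries = demo_rng.normal(size=(6, 4))
keys = demo_rng.normal(size=(2, 4))
values = np.array([[1.0, 0.0], [0.0, 1.0]])  # one value vector per instance

mask = instance_attention_mask(token_ids, feature_ids)
out = masked_cross_attention(queries, keys, values, mask)

# Toy 2x2 frame with a single foreground pixel carrying all of the error.
pred = np.zeros((2, 2))
target = np.array([[1.0, 0.0], [0.0, 0.0]])
fg_mask = np.array([[1, 0], [0, 0]])
loss_fg = instance_masked_loss(pred, target, fg_mask, p_mask=1.0)
loss_all = instance_masked_loss(pred, target, fg_mask, p_mask=0.0)
```

In this toy setup each instance token can only see its own instance feature, so its output is exactly that feature's value vector in both frames regardless of the query and key contents, while background tokens receive zeros. The loss example shows the intended emphasis: the foreground-only average is larger than the full-frame average because the background pixels no longer dilute the error.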
