CVNov 27, 2023Code
OccWorld: Learning a 3D Occupancy World Model for Autonomous DrivingWenzhao Zheng, Weiliang Chen, Yuanhui Huang et al. · tsinghua
Understanding how the 3D scene evolves is vital for making decisions in autonomous driving. Most existing methods achieve this by predicting the movements of object boxes, which cannot capture more fine-grained scene information. In this paper, we explore a new framework of learning a world model, OccWorld, in the 3D Occupancy space to simultaneously predict the movement of the ego car and the evolution of the surrounding scenes. We propose to learn a world model based on 3D occupancy rather than 3D bounding boxes and segmentation maps for three reasons: 1) expressiveness. 3D occupancy can describe the more fine-grained 3D structure of the scene; 2) efficiency. 3D occupancy is more economical to obtain (e.g., from sparse LiDAR points). 3) versatility. 3D occupancy can adapt to both vision and LiDAR. To facilitate the modeling of the world evolution, we learn a reconstruction-based scene tokenizer on the 3D occupancy to obtain discrete scene tokens to describe the surrounding scenes. We then adopt a GPT-like spatial-temporal generative transformer to generate subsequent scene and ego tokens to decode the future occupancy and ego trajectory. Extensive experiments on the widely used nuScenes benchmark demonstrate the ability of OccWorld to effectively model the evolution of the driving scenes. OccWorld also produces competitive planning results without using instance and map supervision. Code: https://github.com/wzzheng/OccWorld.
CVAug 22, 2024
DreamCinema: Cinematic Transfer with Free Camera and 3D CharacterWeiliang Chen, Fangfu Liu, Diankun Wu et al.
We are living in a flourishing era of digital media, where everyone has the potential to become a personal filmmaker. Current research on video generation suggests a promising avenue for controllable film creation in pixel space using Diffusion models. However, the reliance on overly verbose prompts and insufficient focus on cinematic elements (e.g., camera movement) results in videos that lack cinematic quality. Furthermore, the absence of 3D modeling often leads to failures in video generation, such as inconsistent character models at different frames, ultimately hindering the immersive experience for viewers. In this paper, we propose a new framework for film creation, Dream-Cinema, which is designed for user-friendly, 3D space-based film creation with generative models. Specifically, we decompose 3D film creation into four key elements: 3D character, driven motion, camera movement, and environment. We extract the latter three elements from user-specified film shots and generate the 3D character using a generative model based on a provided image. To seamlessly recombine these elements and ensure smooth film creation, we propose structure-guided character animation, shape-aware camera movement optimization, and environment-aware generative refinement. Extensive experiments demonstrate the effectiveness of our method in generating high-quality films with free camera and 3D characters.
CVJul 18, 2022
The Brain-Inspired Decoder for Natural Visual Image ReconstructionWenyi Li, Shengjie Zheng, Yufan Liao et al.
Decoding images from brain activity has been a challenge. Owing to the development of deep learning, there are available tools to solve this problem. The decoded image, which aims to map neural spike trains to low-level visual features and high-level semantic information space. Recently, there are a few studies of decoding from spike trains, however, these studies pay less attention to the foundations of neuroscience and there are few studies that merged receptive field into visual image reconstruction. In this paper, we propose a deep learning neural network architecture with biological properties to reconstruct visual image from spike trains. As far as we know, we implemented a method that integrated receptive field property matrix into loss function at the first time. Our model is an end-to-end decoder from neural spike trains to images. We not only merged Gabor filter into auto-encoder which used to generate images but also proposed a loss function with receptive field properties. We evaluated our decoder on two datasets which contain macaque primary visual cortex neural spikes and salamander retina ganglion cells (RGCs) spikes. Our results show that our method can effectively combine receptive field features to reconstruct images, providing a new approach to visual reconstruction based on neural information.
CVMar 19
Measuring 3D Spatial Geometric Consistency in Dynamic Generated VideosWeijia Dou, Wenzhao Zheng, Weiliang Chen et al.
Recent generative models can produce high-fidelity videos, yet they often exhibit 3D spatial geometric inconsistencies. Existing evaluation methods fail to accurately characterize these inconsistencies: fidelity-centric metrics like FVD are insensitive to geometric distortions, while consistency-focused benchmarks often penalize valid foreground dynamics. To address this gap, we introduce SGC, a metric for evaluating 3D \textbf{S}patial \textbf{G}eometric \textbf{C}onsistency in dynamically generated videos. We quantify geometric consistency by measuring the divergence among multiple camera poses estimated from distinct local regions. Our approach first separates static from dynamic regions, then partitions the static background into spatially coherent sub-regions. We predict depth for each pixel, estimate a local camera pose for each subregion, and compute the divergence among these poses to quantify geometric consistency. Experiments on real and generative videos demonstrate that SGC robustly quantifies geometric inconsistencies, effectively identifying critical failures missed by existing metrics.
LGJul 28, 2024
Reputation-Driven Asynchronous Federated Learning for Enhanced Trajectory Prediction with BlockchainWeiliang Chen, Li Jia, Yang Zhou et al.
Federated learning combined with blockchain empowers secure data sharing in autonomous driving applications. Nevertheless, with the increasing granularity and complexity of vehicle-generated data, the lack of data quality audits raises concerns about multi-party mistrust in trajectory prediction tasks. In response, this paper proposes an asynchronous federated learning data sharing method based on an interpretable reputation quantization mechanism utilizing graph neural network tools. Data providers share data structures under differential privacy constraints to ensure security while reducing redundant data. We implement deep reinforcement learning to categorize vehicles by reputation level, which optimizes the aggregation efficiency of federated learning. Experimental results demonstrate that the proposed data sharing scheme not only reinforces the security of the trajectory prediction task but also enhances prediction accuracy.
CVJul 6, 2023
Attentive Graph Enhanced Region Representation LearningWeiliang Chen, Qianqian Ren, Jinbao Li
Representing urban regions accurately and comprehensively is essential for various urban planning and analysis tasks. Recently, with the expansion of the city, modeling long-range spatial dependencies with multiple data sources plays an important role in urban region representation. In this paper, we propose the Attentive Graph Enhanced Region Representation Learning (ATGRL) model, which aims to capture comprehensive dependencies from multiple graphs and learn rich semantic representations of urban regions. Specifically, we propose a graph-enhanced learning module to construct regional graphs by incorporating mobility flow patterns, point of interests (POIs) functions, and check-in semantics with noise filtering. Then, we present a multi-graph aggregation module to capture both local and global spatial dependencies between regions by integrating information from multiple graphs. In addition, we design a dual-stage fusion module to facilitate information sharing between different views and efficiently fuse multi-view representations for urban region embedding using an improved linear attention mechanism. Finally, extensive experiments on real-world datasets for three downstream tasks demonstrate the superior performance of our model compared to state-of-the-art methods.
CVJun 12, 2025
SpectralAR: Spectral Autoregressive Visual GenerationYuanhui Huang, Weiliang Chen, Wenzhao Zheng et al.
Autoregressive visual generation has garnered increasing attention due to its scalability and compatibility with other modalities compared with diffusion models. Most existing methods construct visual sequences as spatial patches for autoregressive generation. However, image patches are inherently parallel, contradicting the causal nature of autoregressive modeling. To address this, we propose a Spectral AutoRegressive (SpectralAR) visual generation framework, which realizes causality for visual sequences from the spectral perspective. Specifically, we first transform an image into ordered spectral tokens with Nested Spectral Tokenization, representing lower to higher frequency components. We then perform autoregressive generation in a coarse-to-fine manner with the sequences of spectral tokens. By considering different levels of detail in images, our SpectralAR achieves both sequence causality and token efficiency without bells and whistles. We conduct extensive experiments on ImageNet-1K for image reconstruction and autoregressive generation, and SpectralAR achieves 3.02 gFID with only 64 tokens and 310M parameters. Project page: https://huang-yh.github.io/spectralar/.
CVJun 12, 2025
SceneCompleter: Dense 3D Scene Completion for Generative Novel View SynthesisWeiliang Chen, Jiayi Bi, Yuanhui Huang et al.
Generative models have gained significant attention in novel view synthesis (NVS) by alleviating the reliance on dense multi-view captures. However, existing methods typically fall into a conventional paradigm, where generative models first complete missing areas in 2D, followed by 3D recovery techniques to reconstruct the scene, which often results in overly smooth surfaces and distorted geometry, as generative models struggle to infer 3D structure solely from RGB data. In this paper, we propose SceneCompleter, a novel framework that achieves 3D-consistent generative novel view synthesis through dense 3D scene completion. SceneCompleter achieves both visual coherence and 3D-consistent generative scene completion through two key components: (1) a geometry-appearance dual-stream diffusion model that jointly synthesizes novel views in RGBD space; (2) a scene embedder that encodes a more holistic scene understanding from the reference image. By effectively fusing structural and textural information, our method demonstrates superior coherence and plausibility in generative novel view synthesis across diverse datasets. Project Page: https://chen-wl20.github.io/SceneCompleter
CVJan 19
Moaw: Unleashing Motion Awareness for Video Diffusion ModelsTianqi Zhang, Ziyi Wang, Wenzhao Zheng et al.
Video diffusion models, trained on large-scale datasets, naturally capture correspondences of shared features across frames. Recent works have exploited this property for tasks such as optical flow prediction and tracking in a zero-shot setting. Motivated by these findings, we investigate whether supervised training can more fully harness the tracking capability of video diffusion models. To this end, we propose Moaw, a framework that unleashes motion awareness for video diffusion models and leverages it to facilitate motion transfer. Specifically, we train a diffusion model for motion perception, shifting its modality from image-to-video generation to video-to-dense-tracking. We then construct a motion-labeled dataset to identify features that encode the strongest motion information, and inject them into a structurally identical video generation model. Owing to the homogeneity between the two networks, these features can be naturally adapted in a zero-shot manner, enabling motion transfer without additional adapters. Our work provides a new paradigm for bridging generative modeling and motion understanding, paving the way for more unified and controllable video learning frameworks.
CVOct 16, 2025
Terra: Explorable Native 3D World Model with Point LatentsYuanhui Huang, Weiliang Chen, Wenzhao Zheng et al.
World models have garnered increasing attention for comprehensive modeling of the real world. However, most existing methods still rely on pixel-aligned representations as the basis for world evolution, neglecting the inherent 3D nature of the physical world. This could undermine the 3D consistency and diminish the modeling efficiency of world models. In this paper, we present Terra, a native 3D world model that represents and generates explorable environments in an intrinsic 3D latent space. Specifically, we propose a novel point-to-Gaussian variational autoencoder (P2G-VAE) that encodes 3D inputs into a latent point representation, which is subsequently decoded as 3D Gaussian primitives to jointly model geometry and appearance. We then introduce a sparse point flow matching network (SPFlow) for generating the latent point representation, which simultaneously denoises the positions and features of the point latents. Our Terra enables exact multi-view consistency with native 3D representation and architecture, and supports flexible rendering from any viewpoint with only a single generation process. Furthermore, Terra achieves explorable world modeling through progressive generation in the point latent space. We conduct extensive experiments on the challenging indoor scenes from ScanNet v2. Terra achieves state-of-the-art performance in both reconstruction and generation with high 3D consistency.
CVJun 12, 2025
GenWorld: Towards Detecting AI-generated Real-world Simulation VideosWeiliang Chen, Wenzhao Zheng, Yu Zheng et al.
The flourishing of video generation technologies has endangered the credibility of real-world information and intensified the demand for AI-generated video detectors. Despite some progress, the lack of high-quality real-world datasets hinders the development of trustworthy detectors. In this paper, we propose GenWorld, a large-scale, high-quality, and real-world simulation dataset for AI-generated video detection. GenWorld features the following characteristics: (1) Real-world Simulation: GenWorld focuses on videos that replicate real-world scenarios, which have a significant impact due to their realism and potential influence; (2) High Quality: GenWorld employs multiple state-of-the-art video generation models to provide realistic and high-quality forged videos; (3) Cross-prompt Diversity: GenWorld includes videos generated from diverse generators and various prompt modalities (e.g., text, image, video), offering the potential to learn more generalizable forensic features. We analyze existing methods and find they fail to detect high-quality videos generated by world models (i.e., Cosmos), revealing potential drawbacks of ignoring real-world clues. To address this, we propose a simple yet effective model, SpannDetector, to leverage multi-view consistency as a strong criterion for real-world AI-generated video detection. Experiments show that our method achieves superior results, highlighting a promising direction for explainable AI-generated video detection based on physical plausibility. We believe that GenWorld will advance the field of AI-generated video detection. Project Page: https://chen-wl20.github.io/GenWorld
CVMar 14, 2024
Make-Your-3D: Fast and Consistent Subject-Driven 3D Content GenerationFangfu Liu, Hanyang Wang, Weiliang Chen et al.
Recent years have witnessed the strong power of 3D generation models, which offer a new level of creative flexibility by allowing users to guide the 3D content generation process through a single image or natural language. However, it remains challenging for existing 3D generation methods to create subject-driven 3D content across diverse prompts. In this paper, we introduce a novel 3D customization method, dubbed Make-Your-3D that can personalize high-fidelity and consistent 3D content from only a single image of a subject with text description within 5 minutes. Our key insight is to harmonize the distributions of a multi-view diffusion model and an identity-specific 2D generative model, aligning them with the distribution of the desired 3D subject. Specifically, we design a co-evolution framework to reduce the variance of distributions, where each model undergoes a process of learning from the other through identity-aware optimization and subject-prior optimization, respectively. Extensive experiments demonstrate that our method can produce high-quality, consistent, and subject-specific 3D content with text-driven modifications that are unseen in subject image.
CVFeb 2, 2024
Enhanced Urban Region Profiling with Adversarial Self-Supervised Learning for Robust Forecasting and SecurityWeiliang Chen, Qianqian Ren, Yong Liu et al.
Urban region profiling plays a crucial role in forecasting and decision-making in the context of dynamic and noisy urban environments. Existing methods often struggle with issues such as noise, data incompleteness, and security vulnerabilities. This paper proposes a novel framework, Enhanced Urban Region Profiling with Adversarial Self-Supervised Learning (EUPAS), to address these challenges. By combining adversarial contrastive learning with both supervised and self-supervised objectives, EUPAS ensures robust performance across various forecasting tasks such as crime prediction, check-in prediction, and land use classification. To enhance model resilience against adversarial attacks and noisy data, we incorporate several key components, including perturbation augmentation, trickster generator, and deviation copy generator. These innovations effectively improve the robustness of the embeddings, making EUPAS capable of handling the complexities and noise inherent in urban data. Experimental results show that EUPAS significantly outperforms state-of-the-art methods across multiple tasks, achieving improvements in prediction accuracy of up to 10.8%. Notably, our model excels in adversarial attack tests, demonstrating its resilience in real-world, security-sensitive applications. This work makes a substantial contribution to the field of urban analytics by offering a more robust and secure approach to forecasting and profiling urban regions. It addresses key challenges in secure, data-driven modeling, providing a stronger foundation for future urban analytics and decision-making applications.
CVMay 11, 2023
Intuitive Surgical SurgToolLoc Challenge Results: 2022-2023Aneeq Zia, Max Berniker, Rogerio Garcia Nespolo et al.
Robotic assisted (RA) surgery promises to transform surgical intervention. Intuitive Surgical is committed to fostering these changes and the machine learning models and algorithms that will enable them. With these goals in mind we have invited the surgical data science community to participate in a yearly competition hosted through the Medical Imaging Computing and Computer Assisted Interventions (MICCAI) conference. With varying changes from year to year, we have challenged the community to solve difficult machine learning problems in the context of advanced RA applications. Here we document the results of these challenges, focusing on surgical tool localization (SurgToolLoc). The publicly released dataset that accompanies these challenges is detailed in a separate paper arXiv:2501.09209 [1].