63.5CVJun 2
PatchScene: Patch-based Voxel Diffusion for Large-Scale Scene CompletionQingdong Xu, Jiajun Zhu, Shilin Zhu et al.
We propose PatchScene, a novel diffusion-based framework for large-scale LiDAR scene completion. Unlike existing methods that rely on global latent representations or dense voxel grids, PatchScene adopts a patch-based voxel diffusion paradigm that explicitly generates fine-grained geometry within localized 3D regions. To ensure coherent reconstruction at both spatial and temporal scales, we introduce a confidence-guided spatio-temporal fusion mechanism that integrates overlapping patches and adjacent frames in a unified generative process. Furthermore, we design an Annular-Flow diffusion strategy that leverages the radial density pattern of LiDAR scans to progressively propagate high-fidelity information from near-range to far-range regions, enabling spatially unbounded scene completion. Extensive experiments on the SemanticKITTI benchmark demonstrate that PatchScene achieves state-of-the-art performance across all standard metrics, surpassing previous approaches in both geometric accuracy and temporal consistency. Remarkably, the model trained on 20 m LiDAR ranges generalizes effectively to 50 m scenes without retraining, highlighting its strong scalability and generalization capability for real-world autonomous driving applications.
CVApr 24, 2025
The Fourth Monocular Depth Estimation ChallengeAnton Obukhov, Matteo Poggi, Fabio Tosi et al.
This paper presents the results of the fourth edition of the Monocular Depth Estimation Challenge (MDEC), which focuses on zero-shot generalization to the SYNS-Patches benchmark, a dataset featuring challenging environments in both natural and indoor settings. In this edition, we revised the evaluation protocol to use least-squares alignment with two degrees of freedom to support disparity and affine-invariant predictions. We also revised the baselines and included popular off-the-shelf methods: Depth Anything v2 and Marigold. The challenge received a total of 24 submissions that outperformed the baselines on the test set; 10 of these included a report describing their approach, with most leading methods relying on affine-invariant predictions. The challenge winners improved the 3D F-Score over the previous edition's best result, raising it from 22.58% to 23.05%.
CVNov 6, 2024
Act in Collusion: A Persistent Distributed Multi-Target Backdoor in Federated LearningTao Liu, Wu Yang, Chen Xu et al.
Federated learning, a novel paradigm designed to protect data privacy, is vulnerable to backdoor attacks due to its distributed nature. Current research often designs attacks based on a single attacker with a single backdoor, overlooking more realistic and complex threats in federated learning. We propose a more practical threat model for federated learning: the distributed multi-target backdoor. In this model, multiple attackers control different clients, embedding various triggers and targeting different classes, collaboratively implanting backdoors into the global model via central aggregation. Empirical validation shows that existing methods struggle to maintain the effectiveness of multiple backdoors in the global model. Our key insight is that similar backdoor triggers cause parameter conflicts and injecting new backdoors disrupts gradient directions, significantly weakening some backdoors performance. To solve this, we propose a Distributed Multi-Target Backdoor Attack (DMBA), ensuring efficiency and persistence of backdoors from different malicious clients. To avoid parameter conflicts, we design a multi-channel dispersed frequency trigger strategy to maximize trigger differences. To mitigate gradient interference, we introduce backdoor replay in local training to neutralize conflicting gradients. Extensive validation shows that 30 rounds after the attack, Attack Success Rates of three different backdoors from various clients remain above 93%. The code will be made publicly available after the review period.
CVAug 4, 2025
VDEGaussian: Video Diffusion Enhanced 4D Gaussian Splatting for Dynamic Urban Scenes ModelingYuru Xiao, Zihan Lin, Chao Lu et al.
Dynamic urban scene modeling is a rapidly evolving area with broad applications. While current approaches leveraging neural radiance fields or Gaussian Splatting have achieved fine-grained reconstruction and high-fidelity novel view synthesis, they still face significant limitations. These often stem from a dependence on pre-calibrated object tracks or difficulties in accurately modeling fast-moving objects from undersampled capture, particularly due to challenges in handling temporal discontinuities. To overcome these issues, we propose a novel video diffusion-enhanced 4D Gaussian Splatting framework. Our key insight is to distill robust, temporally consistent priors from a test-time adapted video diffusion model. To ensure precise pose alignment and effective integration of this denoised content, we introduce two core innovations: a joint timestamp optimization strategy that refines interpolated frame poses, and an uncertainty distillation method that adaptively extracts target content while preserving well-reconstructed regions. Extensive experiments demonstrate that our method significantly enhances dynamic modeling, especially for fast-moving objects, achieving an approximate PSNR gain of 2 dB for novel view synthesis over baseline approaches.
AIJul 29, 2025
MultiEditor: Controllable Multimodal Object Editing for Driving Scenarios Using 3D Gaussian Splatting PriorsShouyi Lu, Zihan Lin, Chao Lu et al.
Autonomous driving systems rely heavily on multimodal perception data to understand complex environments. However, the long-tailed distribution of real-world data hinders generalization, especially for rare but safety-critical vehicle categories. To address this challenge, we propose MultiEditor, a dual-branch latent diffusion framework designed to edit images and LiDAR point clouds in driving scenarios jointly. At the core of our approach is introducing 3D Gaussian Splatting (3DGS) as a structural and appearance prior for target objects. Leveraging this prior, we design a multi-level appearance control mechanism--comprising pixel-level pasting, semantic-level guidance, and multi-branch refinement--to achieve high-fidelity reconstruction across modalities. We further propose a depth-guided deformable cross-modality condition module that adaptively enables mutual guidance between modalities using 3DGS-rendered depth, significantly enhancing cross-modality consistency. Extensive experiments demonstrate that MultiEditor achieves superior performance in visual and geometric fidelity, editing controllability, and cross-modality consistency. Furthermore, generating rare-category vehicle data with MultiEditor substantially enhances the detection accuracy of perception models on underrepresented classes.