CV · Oct 9, 2023 · Code
HarmonicNeRF: Geometry-Informed Synthetic View Augmentation for 3D Scene Reconstruction in Driving Scenarios
Xiaochao Pan, Jiawei Yao, Hongrui Kou et al.
In the realm of autonomous driving, achieving precise 3D reconstruction of the driving environment is critical for ensuring safety and effective navigation. Neural Radiance Fields (NeRF) have shown promise in creating highly detailed and accurate models of complex environments. However, the application of NeRF in autonomous driving scenarios encounters several challenges, primarily due to the sparsity of viewpoints inherent in camera trajectories and the constraints on data collection in unbounded outdoor scenes, which typically occur along predetermined paths. This limitation not only reduces the available scene information but also poses significant challenges for NeRF training, as the sparse and path-distributed observational data leads to under-representation of the scene's geometry. In this paper, we introduce HarmonicNeRF, a novel approach for outdoor self-supervised monocular scene reconstruction. HarmonicNeRF capitalizes on the strengths of NeRF and enhances surface reconstruction accuracy by augmenting the input space with geometry-informed synthetic views. This is achieved through the application of spherical harmonics to generate novel radiance values, taking into careful consideration the color observations from the limited available real-world views. Additionally, our method incorporates proxy geometry to effectively manage occlusion, generating radiance pseudo-labels that circumvent the limitations of traditional image-warping techniques, which often fail in sparse data conditions typical of autonomous driving environments. Extensive experiments conducted on the KITTI, Argoverse, and NuScenes datasets demonstrate our approach establishes new benchmarks in synthesizing novel depth views and reconstructing scenes, significantly outperforming existing methods. Project page: https://github.com/Jiawei-Yao0812/HarmonicNeRF
CV · Nov 28, 2023
DepthSSC: Monocular 3D Semantic Scene Completion via Depth-Spatial Alignment and Voxel Adaptation
Jiawei Yao, Jusheng Zhang, Xiaochao Pan et al.
The task of 3D semantic scene completion using monocular cameras is gaining significant attention in the field of autonomous driving. This task aims to predict the occupancy status and semantic label of each voxel in a 3D scene from partial image inputs. Despite numerous existing methods, many still predict object shapes inaccurately and misclassify object boundaries. To address these issues, we propose DepthSSC, an advanced method for semantic scene completion using only monocular cameras. DepthSSC integrates the Spatial Transformation Graph Fusion (ST-GF) module with Geometric-Aware Voxelization (GAV), enabling dynamic adjustment of voxel resolution to accommodate the geometric complexity of 3D space. This ensures precise alignment between spatial and depth information, effectively mitigating issues such as object boundary distortion and incorrect depth perception found in previous methods. Evaluations on the SemanticKITTI and SSCBench-KITTI-360 datasets demonstrate that DepthSSC not only captures intricate 3D structural details effectively but also achieves state-of-the-art performance.
CV · Sep 27, 2025 · Code
C3-OWD: A Curriculum Cross-modal Contrastive Learning Framework for Open-World Detection
Siheng Wang, Zhengdao Li, Yanshu Li et al.
Object detection has advanced significantly in the closed-set setting, but real-world deployment remains limited by two challenges: poor generalization to unseen categories and insufficient robustness under adverse conditions. Prior research has explored these issues separately: visible-infrared detection improves robustness but lacks generalization, while open-world detection leverages vision-language alignment strategies for category diversity but struggles under extreme environments. This trade-off leaves robustness and diversity difficult to achieve simultaneously. To mitigate these issues, we propose C3-OWD, a curriculum cross-modal contrastive learning framework that unifies both strengths. Stage 1 enhances robustness by pretraining with RGBT data, while Stage 2 improves generalization via vision-language alignment. To prevent catastrophic forgetting between the two stages, we introduce an Exponential Moving Average (EMA) mechanism that theoretically guarantees preservation of pre-stage performance with bounded parameter lag and function consistency. Experiments on FLIR, OV-COCO, and OV-LVIS demonstrate the effectiveness of our approach: C3-OWD achieves $80.1$ AP$^{50}$ on FLIR, $48.6$ AP$^{50}_{\text{Novel}}$ on OV-COCO, and $35.7$ mAP$_r$ on OV-LVIS, establishing competitive performance across both robustness and diversity evaluations. Code available at: https://github.com/justin-herry/C3-OWD.git.
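The abstract does not spell out the EMA rule used by C3-OWD. As a rough illustration only, the standard exponential-moving-average parameter update can be sketched as below; the function name `ema_update`, the decay value, and the toy scalar "parameters" are all assumptions made for this example, not details from the paper.

```python
def ema_update(ema_params, new_params, decay=0.99):
    """Standard EMA step: decay * ema + (1 - decay) * new, per named parameter.

    Generic textbook update, not the paper's exact mechanism.
    """
    return {
        name: decay * ema_params[name] + (1.0 - decay) * new_params[name]
        for name in ema_params
    }


# Toy example: hypothetical Stage 1 weights and the weights Stage 2 drifts to.
stage1 = {"w": 1.0, "b": 0.0}
stage2 = {"w": 3.0, "b": 2.0}

ema = dict(stage1)  # start the averaged copy at the Stage 1 solution
for _ in range(10):
    ema = ema_update(ema, stage2, decay=0.9)

# The averaged weights lag behind Stage 2 but stay between the two solutions.
print(ema)
```

With a decay close to 1 the averaged copy moves slowly away from the Stage 1 solution, which is the intuition behind using such an average to limit forgetting.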
LG · Mar 6
How to Achieve Prototypical Birth and Death for OOD Detection?
Ningkang Peng, Qianfeng Yu, Xiaoqian Peng et al.
Out-of-Distribution (OOD) detection is crucial for the secure deployment of machine learning models, and prototype-based learning methods are among the mainstream strategies for achieving OOD detection. Existing prototype-based learning methods generally rely on a fixed number of prototypes. This static assumption fails to adapt to the inherent complexity differences across various categories. Currently, there is still a lack of a mechanism that can adaptively adjust the number of prototypes based on data complexity. Inspired by the processes of cell birth and death in biology, we propose a novel method named PID (Prototype bIrth and Death) to adaptively adjust the prototype count based on data complexity. This method relies on two dynamic mechanisms during the training process: prototype birth and prototype death. The birth mechanism instantiates new prototypes in data regions with insufficient representation by identifying the overload level of existing prototypes, thereby meticulously capturing intra-class substructures. Conversely, the death mechanism reinforces the decision boundary by pruning prototypes with ambiguous class boundaries through evaluating their discriminability. Through birth and death, the number of prototypes can be dynamically adjusted according to the data complexity, leading to the learning of more compact and better-separated In-Distribution (ID) embeddings, which significantly enhances the capability to detect OOD samples. Experiments demonstrate that our dynamic method, PID, significantly outperforms existing methods on benchmarks such as CIFAR-100, achieving State-of-the-Art (SOTA) performance, especially on the FPR95 metric.
LG · Dec 22, 2025
From Points to Coalitions: Hierarchical Contrastive Shapley Values for Prioritizing Data Samples
Canran Xiao, Jiabao Dou, Zhiming Lin et al.
How should we quantify the value of each training example when datasets are large, heterogeneous, and geometrically structured? Classical Data-Shapley answers in principle, but its $O(n!)$ complexity and point-wise perspective are ill-suited to modern scales. We propose Hierarchical Contrastive Data Valuation (HCDV), a three-stage framework that (i) learns a contrastive, geometry-preserving representation, (ii) organizes the data into a balanced coarse-to-fine hierarchy of clusters, and (iii) assigns Shapley-style payoffs to coalitions via local Monte-Carlo games whose budgets are propagated downward. HCDV collapses the factorial burden to $O(T \sum_{l} K_{l}) = O(T K_{\max} \log n)$, rewards examples that sharpen decision boundaries, and regularizes outliers through curvature-based smoothness. We prove that HCDV approximately satisfies the four Shapley axioms with surplus loss $O(\eta \log n)$, enjoys sub-Gaussian coalition deviation $\tilde{O}(1/\sqrt{T})$, and incurs at most $k \epsilon_{\infty}$ regret for top-$k$ selection. Experiments on four benchmarks (tabular, vision, streaming, and a 45M-sample CTR task) plus the OpenDataVal suite show that HCDV lifts accuracy by up to +5 pp, slashes valuation time by up to $100\times$, and directly supports tasks such as augmentation filtering, low-latency streaming updates, and fair marketplace payouts.
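For readers unfamiliar with Monte-Carlo Shapley estimation, the sketch below shows the generic permutation-sampling estimator applied to a handful of coalitions. It is not HCDV itself: the hierarchy, the contrastive representation, and the budget propagation are omitted, and the function `monte_carlo_shapley`, the cluster names, and the additive toy utility are assumptions chosen so the result is easy to check by hand.

```python
import random


def monte_carlo_shapley(players, utility, num_permutations=200, seed=0):
    """Estimate Shapley values by averaging marginal gains over random orderings."""
    rng = random.Random(seed)
    values = {p: 0.0 for p in players}
    for _ in range(num_permutations):
        order = list(players)
        rng.shuffle(order)
        coalition = set()
        prev = utility(coalition)
        for p in order:
            coalition.add(p)
            cur = utility(coalition)
            values[p] += cur - prev  # marginal contribution of p in this ordering
            prev = cur
    return {p: v / num_permutations for p, v in values.items()}


# Hypothetical toy game: three clusters with an additive utility, so each
# cluster's exact Shapley value equals its own weight.
weights = {"cluster_a": 1.0, "cluster_b": 2.0, "cluster_c": 3.0}


def toy_utility(coalition):
    return sum(weights[c] for c in coalition)


values = monte_carlo_shapley(list(weights), toy_utility, num_permutations=50)
print(values)
```

Each sampled permutation costs one utility evaluation per player, so running the game over a few clusters rather than over every individual point is what keeps the number of evaluations manageable.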
CL · Dec 30, 2025
CEC-Zero: Zero-Supervision Character Error Correction with Self-Generated Rewards
Zhiming Lin, Kai Zhao, Sophie Zhang et al.
Large-scale Chinese spelling correction (CSC) remains critical for real-world text processing, yet existing LLMs and supervised methods lack robustness to novel errors and rely on costly annotations. We introduce CEC-Zero, a zero-supervision reinforcement learning framework that addresses this by enabling LLMs to correct their own mistakes. CEC-Zero synthesizes errorful inputs from clean text, computes cluster-consensus rewards via semantic similarity and candidate agreement, and optimizes the policy with PPO. It outperforms supervised baselines by 10--13 F$_1$ points and strong LLM fine-tunes by 5--8 points across 9 benchmarks, with theoretical guarantees of unbiased rewards and convergence. CEC-Zero establishes a label-free paradigm for robust, scalable CSC, unlocking LLM potential in noisy text pipelines.
71.3 · LG · Mar 29
Prototype-Aligned Federated Soft-Prompts for Continual Web Personalization
Canran Xiao, Liwei Hou
Continual web personalization is essential for engagement, yet real-world non-stationarity and privacy constraints make it hard to adapt quickly without forgetting long-term preferences. We target this gap by seeking a privacy-conscious, parameter-efficient interface that controls stability-plasticity at the user/session level while tying user memory to a shared semantic prior. We propose ProtoFed-SP, a prompt-based framework that injects dual-timescale soft prompts into a frozen backbone: a fast, sparse short-term prompt tracks session intent, while a slow long-term prompt is anchored to a small server-side prototype library that is continually refreshed via differentially private federated aggregation. Queries are routed to Top-M prototypes to compose a personalized prompt. Across eight benchmarks, ProtoFed-SP improves NDCG@10 by +2.9% and HR@10 by +2.0% over the strongest baselines, with notable gains on Amazon-Books (+5.0% NDCG vs. INFER), H&M (+2.5% vs. Dual-LoRA), and Taobao (+2.2% vs. FedRAP). It also lowers forgetting (AF) and Steps-to-95% and preserves accuracy under practical DP budgets. Our contribution is a unifying, privacy-aware prompting interface with prototype anchoring that delivers robust continual personalization and offers a transparent, controllable mechanism to balance stability and plasticity in deployment.
CV · Nov 30, 2025
Affordance-First Decomposition for Continual Learning in Video-Language Understanding
Mengzhu Xu, Hanzhi Liu, Ningkang Peng et al.
Continual learning for video--language understanding is increasingly important as models face non-stationary data, domains, and query styles, yet prevailing solutions blur what should stay stable versus what should adapt, rely on static routing/capacity, or require replaying past videos. We aim to explicitly specify where stability lives and where plasticity should be focused under realistic memory and privacy constraints. We introduce Affordance-First Decomposition (AFD): videos are mapped to slowly varying affordance tokens that form a shared, time-aligned substrate, while a lightweight, query-routed, conflict-aware scheduler concentrates adaptation and grows capacity only when needed. The substrate is stabilized via weak alignment and teacher consistency, and training uses question-only replay. AFD achieves state-of-the-art across protocols: 51.6% average accuracy with -1.8% forgetting on domain-incremental VideoQA, ViLCo R@1@0.5 of 29.6% (MQ) and 20.7% (NLQ) with 18.4% stAP@0.25 (VQ), and 39.5% accuracy with -1.6% forgetting on time-incremental iVQA. Overall, AFD offers an explicit, interpretable split between a stable interaction-centered substrate and targeted adaptation.
LG · Oct 20, 2025
Curiosity Meets Cooperation: A Game-Theoretic Approach to Long-Tail Multi-Label Learning
Canran Xiao, Chuangxin Zhao, Zong Ke et al.
Long-tail imbalance is endemic to multi-label learning: a few head labels dominate the gradient signal, while the many rare labels that matter in practice are silently ignored. We tackle this problem by casting the task as a cooperative potential game. In our Curiosity-Driven Game-Theoretic Multi-Label Learning (CD-GTMLL) framework, the label space is split among several cooperating players that share a global accuracy payoff yet earn additional curiosity rewards that rise with label rarity and inter-player disagreement. These curiosity bonuses inject gradient on under-represented tags without hand-tuned class weights. We prove that gradient best-response updates ascend a differentiable potential and converge to tail-aware stationary points that tighten a lower bound on the expected Rare-F1. Extensive experiments on conventional benchmarks and three extreme-scale datasets show consistent state-of-the-art gains, delivering up to +4.3% Rare-F1 and +1.6% P@3 over the strongest baselines, while ablations reveal emergent division of labour and faster consensus on rare classes. CD-GTMLL thus offers a principled, scalable route to long-tail robustness in multi-label prediction.
RO · Jul 26, 2025
DOA: A Degeneracy Optimization Agent with Adaptive Pose Compensation Capability based on Deep Reinforcement Learning
Yanbin Li, Canran Xiao, Hongyang He et al.
Particle filter-based 2D-SLAM is widely used in indoor localization tasks due to its efficiency. However, indoor environments such as long straight corridors can cause severe degeneracy problems in SLAM. In this paper, we use Proximal Policy Optimization (PPO) to train an adaptive degeneracy optimization agent (DOA) to address the degeneracy problem. We propose a systematic methodology to address three critical challenges in traditional supervised learning frameworks: (1) data acquisition bottlenecks for degenerate datasets, (2) inherent quality deterioration of training samples, and (3) ambiguity in annotation protocol design. We design a specialized reward function to guide the agent in developing perception capabilities for degenerate environments. Using the output degeneracy factor as a reference weight, the agent can dynamically adjust the contribution of different sensors to pose optimization. Specifically, the observation distribution is shifted towards the motion model distribution, with the step size determined by a linear interpolation formula related to the degeneracy factor. In addition, we employ a transfer learning module to endow the agent with generalization capabilities across different environments and to address the inefficiency of training in degenerate environments. Finally, we conduct ablation studies to demonstrate the rationality of our model design and the role of transfer learning. We also compare the proposed DOA with SOTA methods to show its superior degeneracy detection and optimization capabilities across various environments.
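The abstract does not give the interpolation formula. One plausible reading, offered purely as an assumption, is that the mean of the observation-based pose estimate is moved toward the motion-model mean by a fraction equal to the degeneracy factor. The function `compensate_pose`, the `(x, y, theta)` layout, and the clamping to `[0, 1]` are all hypothetical.

```python
def compensate_pose(obs_mean, motion_mean, degeneracy):
    """Linearly interpolate from the observation pose toward the motion-model pose.

    degeneracy = 0.0 trusts the observation fully; 1.0 trusts the motion model.
    Heading is interpolated naively, ignoring angle wraparound (sketch only).
    """
    d = min(max(degeneracy, 0.0), 1.0)  # clamp the factor to [0, 1]
    return tuple((1.0 - d) * o + d * m for o, m in zip(obs_mean, motion_mean))


# Hypothetical poses as (x, y, theta).
scan_pose = (1.0, 2.0, 0.0)      # scan-matching estimate, unreliable in a corridor
odometry_pose = (3.0, 2.0, 0.4)  # motion-model prediction

fused = compensate_pose(scan_pose, odometry_pose, degeneracy=0.5)
print(fused)
```

In a long corridor a high degeneracy factor would pull the estimate toward odometry along the poorly constrained direction, while a well-constrained scene leaves the scan-based estimate essentially unchanged.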
LG · Aug 24, 2025
ReviBranch: Deep Reinforcement Learning for Branch-and-Bound with Revived Trajectories
Dou Jiabao, Nie Jiayi, Yihang Cheng et al.
The branch-and-bound (B&B) algorithm is the main solver for Mixed Integer Linear Programs (MILPs), where the selection of the branching variable is essential to computational efficiency. However, traditional branching heuristics often fail to generalize across heterogeneous problem instances, while existing learning-based methods have their own limits: imitation learning (IL) depends on the quality of expert demonstrations, and reinforcement learning (RL) struggles with sparse rewards and dynamic state representations. To address these issues, we propose ReviBranch, a novel deep RL framework that constructs revived trajectories by reviving explicit historical correspondences between branching decisions and their corresponding graph states along search-tree paths. During training, ReviBranch enables agents to learn from complete structural evolution and temporal dependencies within the branching process. Additionally, we introduce an importance-weighted reward redistribution mechanism that transforms sparse terminal rewards into dense stepwise feedback, addressing the sparse reward challenge. Extensive experiments on different MILP benchmarks demonstrate that ReviBranch outperforms state-of-the-art RL methods, reducing B&B nodes by 4.0% and LP iterations by 2.2% on large-scale instances. The results highlight the robustness and generalizability of ReviBranch across heterogeneous MILP problem classes.
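How ReviBranch computes its importance weights is not stated in the abstract. The sketch below only illustrates the general idea of spreading one terminal reward over the steps of a trajectory in proportion to given weights; the function `redistribute_reward`, the uniform fallback, and the example numbers are assumptions.

```python
def redistribute_reward(terminal_reward, importance):
    """Split a terminal reward across steps in proportion to importance weights.

    The per-step rewards always sum to the terminal reward. If all weights are
    zero, fall back to a uniform split (an assumption of this sketch).
    """
    if not importance:
        return []
    total = sum(importance)
    if total <= 0:
        return [terminal_reward / len(importance)] * len(importance)
    return [terminal_reward * w / total for w in importance]


# Hypothetical episode: three branching decisions and a terminal reward of -10
# (for instance, a penalty tied to the final tree size).
step_rewards = redistribute_reward(-10.0, [1.0, 3.0, 1.0])
print(step_rewards)
```

Because the dense rewards sum to the original terminal reward, the total return of the episode is unchanged; only the credit assigned to each decision differs.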
CV · Nov 21, 2025
Where Culture Fades: Revealing the Cultural Gap in Text-to-Image Generation
Chuancheng Shi, Shangze Li, Shiming Guo et al.
Multilingual text-to-image (T2I) models have advanced rapidly in terms of visual realism and semantic alignment, and are now widely used. Yet outputs vary across cultural contexts: because language carries cultural connotations, images synthesized from multilingual prompts should preserve cross-lingual cultural consistency. We conduct a comprehensive analysis showing that current T2I models often produce culturally neutral or English-biased results under multilingual prompts. Analyses of two representative models indicate that the issue stems not from missing cultural knowledge but from insufficient activation of culture-related representations. We propose a probing method that localizes culture-sensitive signals to a small set of neurons in a few fixed layers. Guided by this finding, we introduce two complementary alignment strategies: (1) inference-time cultural activation that amplifies the identified neurons without fine-tuning the backbone; and (2) layer-targeted cultural enhancement that updates only culturally relevant layers. Experiments on our CultureBench demonstrate consistent improvements over strong baselines in cultural consistency while preserving fidelity and diversity.