Zhuolin He

CV
h-index: 8
Papers: 5
Citations: 4
Novelty: 51%
AI Score: 47

5 Papers

CV | Mar 1 | Code
Vision-Language Feature Alignment for Road Anomaly Segmentation

Zhuolin He, Jiacheng Tang, Jian Pu et al.

Safe autonomous systems in complex environments require robust road anomaly segmentation to identify unknown obstacles. However, existing approaches often rely on pixel-level statistics to determine whether a region appears anomalous. This reliance leads to high false-positive rates on semantically normal background regions such as sky or vegetation, and poor recall of true out-of-distribution (OOD) instances, thereby posing safety risks for robotic perception and decision-making. To address these challenges, we propose VL-Anomaly, a vision-language anomaly segmentation framework that incorporates semantic priors from pre-trained Vision-Language Models (VLMs). Specifically, we design a prompt learning-driven alignment module that adapts Mask2Former's visual features to CLIP text embeddings of known categories, effectively suppressing spurious anomaly responses in background regions. At inference time, we further introduce a multi-source inference strategy that integrates text-guided similarity, CLIP-based image-text similarity, and detector confidence, enabling more reliable anomaly prediction by leveraging complementary information sources. Extensive experiments demonstrate that VL-Anomaly achieves state-of-the-art performance on benchmark datasets including RoadAnomaly, SMIYC, and Fishyscapes. Code is released at https://github.com/NickHezhuolin/VL-aligner-Road-anomaly-segment.
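
The multi-source inference idea can be illustrated with a minimal sketch. The abstract does not state how the three sources are combined, so the weighted-mean rule, the function name `fuse_anomaly_score`, the default weights, and the assumption that every source is normalized to [0, 1] are all hypothetical and are not taken from the VL-Anomaly code.

```python
from typing import Sequence


def fuse_anomaly_score(
    text_sim: float,
    clip_sim: float,
    det_conf: float,
    weights: Sequence[float] = (1.0, 1.0, 1.0),
) -> float:
    """Combine three 'known class' signals into one anomaly score in [0, 1].

    Hypothetical fusion rule (not the paper's): a weighted mean of the
    complements, so stronger evidence for a known class from any source
    lowers the anomaly score.
    """
    sources = (text_sim, clip_sim, det_conf)
    if len(weights) != len(sources):
        raise ValueError("expected one weight per source")
    if any(not 0.0 <= s <= 1.0 for s in sources):
        raise ValueError("every source must lie in [0, 1]")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * (1.0 - s) for w, s in zip(weights, sources)) / total


# A background pixel that all sources recognize vs. a pixel none of them do.
sky_pixel = fuse_anomaly_score(text_sim=0.9, clip_sim=0.8, det_conf=0.95)
obstacle_pixel = fuse_anomaly_score(text_sim=0.2, clip_sim=0.3, det_conf=0.4)
```

Under this rule a pixel gets a high anomaly score only when little evidence from any source ties it to a known category, which is the kind of complementary behavior the abstract describes.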

CV | Mar 22, 2025 | Code
Multi-modality Anomaly Segmentation on the Road

Heng Gao, Zhuolin He, Shoumeng Qiu et al.

Semantic segmentation allows autonomous vehicles to understand their surroundings comprehensively. However, it is also crucial for the model to detect obstacles that may jeopardize the safety of autonomous driving systems. Based on our experiments, we find that current uni-modal anomaly segmentation frameworks tend to produce high anomaly scores for non-anomalous regions in images. Motivated by this empirical finding, we develop a multi-modal uncertainty-based anomaly segmentation framework, named MMRAS+, for autonomous driving systems. MMRAS+ effectively reduces the high anomaly outputs of non-anomalous classes by introducing a text modality through the CLIP text encoder. MMRAS+ is the first multi-modal anomaly segmentation solution for autonomous driving. Moreover, we develop an ensemble module to further boost the anomaly segmentation performance. Experiments on the RoadAnomaly, SMIYC, and Fishyscapes validation datasets demonstrate the superior performance of our method. The code is available at https://github.com/HengGao12/MMRAS_plus.

LG | Feb 22, 2025 | Code
Detecting OOD Samples via Optimal Transport Scoring Function

Heng Gao, Zhuolin He, Jian Pu

To deploy machine learning models in the real world, researchers have proposed many out-of-distribution (OOD) detection algorithms to help models identify unknown samples during the inference phase and prevent them from making untrustworthy predictions. Unlike methods that rely on extra data for outlier exposure training, post hoc methods detect OOD samples with scoring functions, which are model-agnostic and require no additional training. However, previous post hoc methods may fail to capture the geometric cues embedded in network representations. Thus, in this study, we propose a novel scoring function based on optimal transport theory, named OTOD, for OOD detection. We utilize information from features, logits, and the softmax probability space to calculate the OOD score for each test sample. Our experiments show that combining these sources of information improves the performance of OTOD. Experiments on the CIFAR-10 and CIFAR-100 benchmarks demonstrate the superior performance of our method. Notably, OTOD outperforms the state-of-the-art method GEN by 7.19% in the mean FPR@95 on the CIFAR-10 benchmark using ResNet-18 as the backbone, and by 12.51% in the mean FPR@95 using WideResNet-28 as the backbone. In addition, we provide theoretical guarantees for OTOD. The code is available at https://github.com/HengGao12/OTOD.
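
The reported gains are stated in FPR@95, the false-positive rate on OOD samples when the threshold is set so that about 95% of in-distribution (ID) samples are accepted. The sketch below illustrates that metric only; it is not the OTOD scoring function. The function name `fpr_at_95_tpr` and the convention that a higher score means "more in-distribution" are assumptions made for this example.

```python
from typing import Sequence


def fpr_at_95_tpr(id_scores: Sequence[float], ood_scores: Sequence[float]) -> float:
    """Fraction of OOD samples accepted as ID at the 95% ID-acceptance threshold.

    Assumed convention: a higher score means 'more in-distribution'.
    """
    if not id_scores or not ood_scores:
        raise ValueError("both score lists must be non-empty")
    ranked = sorted(id_scores, reverse=True)
    # Smallest k whose top-k ID scores cover at least 95% of the ID samples
    # (integer ceiling of 0.95 * n, avoiding floating-point rounding).
    k = (95 * len(ranked) + 99) // 100
    threshold = ranked[k - 1]
    # OOD samples scoring at or above the threshold are false positives.
    false_positives = sum(1 for s in ood_scores if s >= threshold)
    return false_positives / len(ood_scores)


# Toy scores: 20 ID samples and 4 OOD samples.
id_scores = [float(i) for i in range(1, 21)]
ood_scores = [0.5, 1.5, 2.5, 10.0]
rate = fpr_at_95_tpr(id_scores, ood_scores)
```

A lower value is better: it means fewer unknown samples slip through while most known samples are still accepted.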

49.8 | CV | Mar 19
CausalVAD: De-confounding End-to-End Autonomous Driving via Causal Intervention

Jiacheng Tang, Zhiyuan Zhou, Zhuolin He et al.

Planning-oriented end-to-end driving models show great promise, yet they fundamentally learn statistical correlations instead of true causal relationships. This vulnerability leads to causal confusion, where models exploit dataset biases as shortcuts, critically harming their reliability and safety in complex scenarios. To address this, we introduce CausalVAD, a de-confounding training framework that leverages causal intervention. At its core, we design the sparse causal intervention scheme (SCIS), a lightweight, plug-and-play module that instantiates backdoor adjustment theory in neural networks. SCIS constructs a dictionary of prototypes representing latent driving contexts. It then uses this dictionary to intervene on the model's sparse vectorized queries. This step removes spurious associations induced by confounders from the representations used by downstream tasks. Extensive experiments on benchmarks such as nuScenes show that CausalVAD achieves state-of-the-art planning accuracy and safety. Furthermore, our method demonstrates superior robustness against both data bias and noisy scenarios configured to induce causal confusion.
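
Backdoor adjustment replaces P(Y | X) with P(Y | do(X)) = sum over z of P(Y | X, z) P(z), where z ranges over confounder values. A common way to approximate this inside a network is to attend over a dictionary of confounder prototypes, weighted by their priors. The sketch below follows that generic pattern and is not the SCIS module: the function name `intervene`, the scaled dot-product attention, the log-prior term, and the residual update are all assumptions for illustration.

```python
import math
from typing import List, Sequence


def intervene(
    query: Sequence[float],
    prototypes: Sequence[Sequence[float]],
    priors: Sequence[float],
) -> List[float]:
    """Add a prior-weighted, attention-pooled context vector to one query.

    Attention weights are proportional to P(z) * exp(q . z / sqrt(d)), a
    hypothetical stand-in for the sum over confounders in backdoor adjustment.
    """
    if len(prototypes) != len(priors) or not prototypes:
        raise ValueError("need one positive prior per prototype")
    if any(p <= 0 for p in priors):
        raise ValueError("priors must be positive")
    d = len(query)
    logits = [
        sum(q * z for q, z in zip(query, proto)) / math.sqrt(d) + math.log(prior)
        for proto, prior in zip(prototypes, priors)
    ]
    # Numerically stable softmax over the prototype dictionary.
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    # Expected prototype under the attention weights.
    context = [
        sum(w * proto[i] for w, proto in zip(weights, prototypes))
        for i in range(d)
    ]
    # Residual update: the query keeps its content and gains the pooled context.
    return [q + c for q, c in zip(query, context)]


adjusted = intervene([1.0, 0.0], [[0.0, 2.0]], [1.0])
```

With a single prototype the attention weight is 1, so the query is simply shifted by that prototype; with several prototypes the shift is their prior-weighted average, which is what makes the update depend on the whole dictionary of contexts rather than on one context.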

CV | Jun 25, 2024
Towards Camera Open-set 3D Object Detection for Autonomous Driving Scenarios

Zhuolin He, Xinrun Li, Jiacheng Tang et al.

Conventional camera-based 3D object detectors in autonomous driving are limited to recognizing a predefined set of objects, which poses a safety risk when encountering novel or unseen objects in real-world scenarios. To address this limitation, we present OS-Det3D, a two-stage training framework designed for camera-based open-set 3D object detection. In the first stage, our proposed 3D object discovery network (ODN3D) uses geometric cues from LiDAR point clouds to generate class-agnostic 3D object proposals, each of which is assigned a 3D objectness score. This approach lets the network discover objects beyond the known categories and thus detect unfamiliar objects. However, due to the absence of class constraints, ODN3D-generated proposals may include noisy data, particularly in cluttered or dynamic scenes. To mitigate this issue, we introduce a joint selection (JS) module in the second stage. The JS module uses both camera bird's eye view (BEV) feature responses and 3D objectness scores to filter out low-quality proposals, yielding high-quality pseudo ground truth for unknown objects. OS-Det3D significantly enhances the ability of camera 3D detectors to discover and identify unknown objects while also improving the performance on known objects, as demonstrated through extensive experiments on the nuScenes and KITTI datasets.
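
The joint selection step can be pictured as filtering proposals by two signals at once. The abstract does not give the selection rule, so the sketch below is an assumption: the `Proposal` fields, the product of the two scores, the `min_score` threshold, and the `top_k` cut-off are all hypothetical and are not taken from OS-Det3D.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Proposal:
    box_id: int
    objectness: float    # 3D objectness from the discovery stage, assumed in [0, 1]
    bev_response: float  # camera BEV feature response at the box, assumed in [0, 1]


def joint_select(
    proposals: Sequence[Proposal], top_k: int = 2, min_score: float = 0.25
) -> List[Proposal]:
    """Keep the top_k proposals whose joint score reaches min_score.

    Hypothetical joint score: the product of the two signals, so a proposal
    must be supported by both the LiDAR-derived objectness and the camera
    BEV features to survive.
    """
    scored = [(p.objectness * p.bev_response, p) for p in proposals]
    kept = [(s, p) for s, p in scored if s >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in kept[:top_k]]


candidates = [
    Proposal(box_id=0, objectness=0.9, bev_response=0.8),
    Proposal(box_id=1, objectness=0.9, bev_response=0.1),  # strong in LiDAR only
    Proposal(box_id=2, objectness=0.5, bev_response=0.6),
    Proposal(box_id=3, objectness=0.7, bev_response=0.7),
]
pseudo_labels = joint_select(candidates)
```

In this toy run the proposal with high objectness but weak BEV response is dropped, which mirrors the stated goal of removing noisy proposals before they become pseudo ground truth.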