CVApr 3, 2023Code
3D Semantic Segmentation in the Wild: Learning Generalized Models for Adverse-Condition Point CloudsAoran Xiao, Jiaxing Huang, Weihao Xuan et al.
Robust point cloud parsing under all-weather conditions is crucial to level-5 autonomy in autonomous driving. However, how to learn a universal 3D semantic segmentation (3DSS) model is largely neglected as most existing benchmarks are dominated by point clouds captured under normal weather. We introduce SemanticSTF, an adverse-weather point cloud dataset that provides dense point-level annotations and allows to study 3DSS under various adverse weather conditions. We study all-weather 3DSS modeling under two setups: 1) domain adaptive 3DSS that adapts from normal-weather data to adverse-weather data; 2) domain generalizable 3DSS that learns all-weather 3DSS models from normal-weather data. Our studies reveal the challenge while existing 3DSS methods encounter adverse-weather data, showing the great value of SemanticSTF in steering the future endeavor along this very meaningful research direction. In addition, we design a domain randomization technique that alternatively randomizes the geometry styles of point clouds and aggregates their embeddings, ultimately leading to a generalizable model that can improve 3DSS under various adverse weather effectively. The SemanticSTF and related codes are available at \url{https://github.com/xiaoaoran/SemanticSTF}.
CVMar 21, 2023Code
SVCNet: Scribble-based Video Colorization Network with Temporal AggregationYuzhi Zhao, Lai-Man Po, Kangcheng Liu et al.
In this paper, we propose a scribble-based video colorization network with temporal aggregation called SVCNet. It can colorize monochrome videos based on different user-given color scribbles. It addresses three common issues in the scribble-based video colorization area: colorization vividness, temporal consistency, and color bleeding. To improve the colorization quality and strengthen the temporal consistency, we adopt two sequential sub-networks in SVCNet for precise colorization and temporal smoothing, respectively. The first stage includes a pyramid feature encoder to incorporate color scribbles with a grayscale frame, and a semantic feature encoder to extract semantics. The second stage finetunes the output from the first stage by aggregating the information of neighboring colorized frames (as short-range connections) and the first colorized frame (as a long-range connection). To alleviate the color bleeding artifacts, we learn video colorization and segmentation simultaneously. Furthermore, we set the majority of operations on a fixed small image resolution and use a Super-resolution Module at the tail of SVCNet to recover original sizes. It allows the SVCNet to fit different image resolutions at the inference. Finally, we evaluate the proposed SVCNet on DAVIS and Videvo benchmarks. The experimental results demonstrate that SVCNet produces both higher-quality and more temporally consistent videos than other well-known video colorization approaches. The codes and models can be found at https://github.com/zhaoyuzhi/SVCNet.
ROFeb 13, 2023Code
Learning-Based Defect Recognitions for Autonomous UAV InspectionsKangcheng Liu
Automatic crack detection and segmentation play a significant role in the whole system of unmanned aerial vehicle inspections. In this paper, we have implemented a deep learning framework for crack detection based on classical network architectures including Alexnet, VGG, and Resnet. Moreover, inspired by the feature pyramid network architecture, a hierarchical convolutional neural network (CNN) deep learning framework which is efficient in crack segmentation is also proposed, and its performance of it is compared with other state-of-the-art network architecture. We have summarized the existing crack detection and segmentation datasets and established the largest existing benchmark dataset on the internet for crack detection and segmentation, which is open-sourced for the research community. Our feature pyramid crack segmentation network is tested on the benchmark dataset and gives satisfactory segmentation results. A framework for automatic unmanned aerial vehicle inspections is also proposed and will be established for the crack inspection tasks of various concrete structures. All our self-established datasets and codes are open-sourced at: https://github.com/KangchengLiu/Crack-Detection-and-Segmentation-Dataset-for-UAV-Inspection
CVSep 6, 2022
Transformer-CNN Cohort: Semi-supervised Semantic Segmentation by the Best of Both StudentsXu Zheng, Yunhao Luo, Chong Fu et al.
The popular methods for semi-supervised semantic segmentation mostly adopt a unitary network model using convolutional neural networks (CNNs) and enforce consistency of the model's predictions over perturbations applied to the inputs or model. However, such a learning paradigm suffers from two critical limitations: a) learning the discriminative features for the unlabeled data; b) learning both global and local information from the whole image. In this paper, we propose a novel Semi-supervised Learning (SSL) approach, called Transformer-CNN Cohort (TCC), that consists of two students with one based on the vision transformer (ViT) and the other based on the CNN. Our method subtly incorporates the multi-level consistency regularization on the predictions and the heterogeneous feature spaces via pseudo-labeling for the unlabeled data. First, as the inputs of the ViT student are image patches, the feature maps extracted encode crucial class-wise statistics. To this end, we propose class-aware feature consistency distillation (CFCD) that first leverages the outputs of each student as the pseudo labels and generates class-aware feature (CF) maps for knowledge transfer between the two students. Second, as the ViT student has more uniform representations for all layers, we propose consistency-aware cross distillation (CCD) to transfer knowledge between the pixel-wise predictions from the cohort. We validate the TCC framework on Cityscapes and Pascal VOC 2012 datasets, which outperforms existing SSL methods by a large margin.
CVMar 11, 2023Code
Generalized 3D Self-supervised Learning Framework via Prompted Foreground-Aware Feature ContrastKangcheng Liu, Xinhu Zheng, Chaoqun Wang et al.
Contrastive learning has recently demonstrated great potential for unsupervised pre-training in 3D scene understanding tasks. However, most existing work randomly selects point features as anchors while building contrast, leading to a clear bias toward background points that often dominate in 3D scenes. Also, object awareness and foreground-to-background discrimination are neglected, making contrastive learning less effective. To tackle these issues, we propose a general foreground-aware feature contrast FAC++ framework to learn more effective point cloud representations in pre-training. FAC++ consists of two novel contrast designs to construct more effective and informative contrast pairs. The first is building positive pairs within the same foreground segment where points tend to have the same semantics. The second is that we prevent over-discrimination between 3D segments/objects and encourage grouped foreground-to-background distinctions at the segment level with adaptive feature learning in a Siamese correspondence network, which adaptively learns feature correlations within and across point cloud views effectively. Moreover, we have designed the foreground-prompted regional sampling to enhance more balanced foreground-aware learning, which is termed FAC++. Visualization with point activation maps shows that our contrast pairs capture clear correspondences among foreground regions during pre-training. Quantitative experiments also show that FAC++ achieves superior knowledge transfer and data efficiency in various downstream 3D semantic segmentation, instance segmentation as well as object detection tasks. All codes, data, and models are available at: https://github.com/KangchengLiu/FAC_Foreground_Aware_Contrast
CVNov 28, 2022
Semi-Supervised Confidence-Level-based Contrastive Discrimination for Class-Imbalanced Semantic SegmentationKangcheng Liu
To overcome the data-hungry challenge, we have proposed a semi-supervised contrastive learning framework for the task of class-imbalanced semantic segmentation. First and foremost, to make the model operate in a semi-supervised manner, we proposed the confidence-level-based contrastive learning to achieve instance discrimination in an explicit manner, and make the low-confidence low-quality features align with the high-confidence counterparts. Moreover, to tackle the problem of class imbalance in crack segmentation and road components extraction, we proposed the data imbalance loss to replace the traditional cross entropy loss in pixel-level semantic segmentation. Finally, we have also proposed an effective multi-stage fusion network architecture to improve semantic segmentation performance. Extensive experiments on the real industrial crack segmentation and the road segmentation demonstrate the superior effectiveness of the proposed framework. Our proposed method can provide satisfactory segmentation results with even merely 3.5% labeled data.
CVNov 14, 2025Code
VP-Bench: A Comprehensive Benchmark for Visual Prompting in Multimodal Large Language ModelsMingjie Xu, Jinpeng Chen, Yuzhi Zhao et al.
Multimodal large language models (MLLMs) have enabled a wide range of advanced vision-language applications, including fine-grained object recognition and contextual understanding. When querying specific regions or objects in an image, human users naturally use "visual prompts" (VPs), such as bounding boxes, to provide reference. However, no existing benchmark systematically evaluates the ability of MLLMs to interpret such VPs. This gap leaves it unclear whether current MLLMs can effectively recognize VPs, an intuitive prompting method for humans, and use them to solve problems. To address this limitation, we introduce VP-Bench, a benchmark for assessing MLLMs' capability in VP perception and utilization. VP-Bench employs a two-stage evaluation framework: Stage 1 examines models' ability to perceive VPs in natural scenes, using 30k visualized prompts spanning eight shapes and 355 attribute combinations. Stage 2 investigates the impact of VPs on downstream tasks, measuring their effectiveness in real-world problem-solving scenarios. Using VP-Bench, we evaluate 28 MLLMs, including proprietary systems (e.g., GPT-4o) and open-source models (e.g., InternVL3 and Qwen2.5-VL), and provide a comprehensive analysis of factors that affect VP understanding, such as variations in VP attributes, question arrangement, and model scale. VP-Bench establishes a new reference framework for studying how MLLMs comprehend and resolve grounded referring questions.
CLSep 15, 2025Code
AesBiasBench: Evaluating Bias and Alignment in Multimodal Language Models for Personalized Image Aesthetic AssessmentKun Li, Lai-Man Po, Hongzheng Yang et al.
Multimodal Large Language Models (MLLMs) are increasingly applied in Personalized Image Aesthetic Assessment (PIAA) as a scalable alternative to expert evaluations. However, their predictions may reflect subtle biases influenced by demographic factors such as gender, age, and education. In this work, we propose AesBiasBench, a benchmark designed to evaluate MLLMs along two complementary dimensions: (1) stereotype bias, quantified by measuring variations in aesthetic evaluations across demographic groups; and (2) alignment between model outputs and genuine human aesthetic preferences. Our benchmark covers three subtasks (Aesthetic Perception, Assessment, Empathy) and introduces structured metrics (IFD, NRD, AAS) to assess both bias and alignment. We evaluate 19 MLLMs, including proprietary models (e.g., GPT-4o, Claude-3.5-Sonnet) and open-source models (e.g., InternVL-2.5, Qwen2.5-VL). Results indicate that smaller models exhibit stronger stereotype biases, whereas larger models align more closely with human preferences. Incorporating identity information often exacerbates bias, particularly in emotional judgments. These findings underscore the importance of identity-aware evaluation frameworks in subjective vision-language tasks.
CVApr 6, 2024
Salient Sparse Visual Odometry With Pose-Only SupervisionSiyu Chen, Kangcheng Liu, Chen Wang et al.
Visual Odometry (VO) is vital for the navigation of autonomous systems, providing accurate position and orientation estimates at reasonable costs. While traditional VO methods excel in some conditions, they struggle with challenges like variable lighting and motion blur. Deep learning-based VO, though more adaptable, can face generalization problems in new environments. Addressing these drawbacks, this paper presents a novel hybrid visual odometry (VO) framework that leverages pose-only supervision, offering a balanced solution between robustness and the need for extensive labeling. We propose two cost-effective and innovative designs: a self-supervised homographic pre-training for enhancing optical flow learning from pose-only labels and a random patch-based salient point detection strategy for more accurate optical flow patch extraction. These designs eliminate the need for dense optical flow labels for training and significantly improve the generalization capability of the system in diverse and challenging environments. Our pose-only supervised method achieves competitive performance on standard datasets and greater robustness and generalization ability in extreme and unseen scenarios, even compared to dense optical flow-supervised state-of-the-art methods.
CVDec 3, 2023
A Review and A Robust Framework of Data-Efficient 3D Scene Parsing with Traditional/Learned 3D DescriptorsKangcheng Liu
Existing state-of-the-art 3D point cloud understanding methods merely perform well in a fully supervised manner. To the best of our knowledge, there exists no unified framework that simultaneously solves the downstream high-level understanding tasks including both segmentation and detection, especially when labels are extremely limited. This work presents a general and simple framework to tackle point cloud understanding when labels are limited. The first contribution is that we have done extensive methodology comparisons of traditional and learned 3D descriptors for the task of weakly supervised 3D scene understanding, and validated that our adapted traditional PFH-based 3D descriptors show excellent generalization ability across different domains. The second contribution is that we proposed a learning-based region merging strategy based on the affinity provided by both the traditional/learned 3D descriptors and learned semantics. The merging process takes both low-level geometric and high-level semantic feature correlations into consideration. Experimental results demonstrate that our framework has the best performance among the three most important weakly supervised point clouds understanding tasks including semantic segmentation, instance segmentation, and object detection even when very limited number of points are labeled. Our method, termed Region Merging 3D (RM3D), has superior performance on ScanNet data-efficient learning online benchmarks and other four large-scale 3D understanding benchmarks under various experimental settings, outperforming current arts by a margin for various 3D understanding tasks without complicated learning strategies such as active learning.
CVDec 3, 2023
A Data-efficient Framework for Robotics Large-scale LiDAR Scene ParsingKangcheng Liu
Existing state-of-the-art 3D point clouds understanding methods only perform well in a fully supervised manner. To the best of our knowledge, there exists no unified framework which simultaneously solves the downstream high-level understanding tasks, especially when labels are extremely limited. This work presents a general and simple framework to tackle point clouds understanding when labels are limited. We propose a novel unsupervised region expansion based clustering method for generating clusters. More importantly, we innovatively propose to learn to merge the over-divided clusters based on the local low-level geometric property similarities and the learned high-level feature similarities supervised by weak labels. Hence, the true weak labels guide pseudo labels merging taking both geometric and semantic feature correlations into consideration. Finally, the self-supervised reconstruction and data augmentation optimization modules are proposed to guide the propagation of labels among semantically similar points within a scene. Experimental Results demonstrate that our framework has the best performance among the three most important weakly supervised point clouds understanding tasks including semantic segmentation, instance segmentation, and object detection even when limited points are labeled, under the data-efficient settings for the large-scale 3D semantic scene parsing. The developed techniques have postentials to be applied to downstream tasks for better representations in robotic manipulation and robotic autonomous navigation. Codes and models are publicly available at: https://github.com/KangchengLiu.
CVDec 17, 2020
FG-Net: Fast Large-Scale LiDAR Point Clouds Understanding Network Leveraging Correlated Feature Mining and Geometric-Aware ModellingKangcheng Liu, Zhi Gao, Feng Lin et al.
This work presents FG-Net, a general deep learning framework for large-scale point clouds understanding without voxelizations, which achieves accurate and real-time performance with a single NVIDIA GTX 1080 GPU. First, a novel noise and outlier filtering method is designed to facilitate subsequent high-level tasks. For effective understanding purpose, we propose a deep convolutional neural network leveraging correlated feature mining and deformable convolution based geometric-aware modelling, in which the local feature relationships and geometric patterns can be fully exploited. For the efficiency issue, we put forward an inverse density sampling operation and a feature pyramid based residual learning strategy to save the computational cost and memory consumption respectively. Extensive experiments on real-world challenging datasets demonstrated that our approaches outperform state-of-the-art approaches in terms of accuracy and efficiency. Moreover, weakly supervised transfer learning is also conducted to demonstrate the generalization capacity of our method.