CV · Nov 28, 2022 · Code
SuperFusion: Multilevel LiDAR-Camera Fusion for Long-Range HD Map Generation
Hao Dong, Weihao Gu, Xianjing Zhang et al. · tencent-ai
High-definition (HD) semantic map generation of the environment is an essential component of autonomous driving. Existing methods have achieved good performance in this task by fusing different sensor modalities, such as LiDAR and camera. However, current works are based on raw data or network feature-level fusion and only consider short-range HD map generation, limiting their deployment to realistic autonomous driving applications. In this paper, we focus on the task of building the HD maps in both short ranges, i.e., within 30 m, and also predicting long-range HD maps up to 90 m, which is required by downstream path planning and control tasks to improve the smoothness and safety of autonomous driving. To this end, we propose a novel network named SuperFusion, exploiting the fusion of LiDAR and camera data at multiple levels. We use LiDAR depth to improve image depth estimation and use image features to guide long-range LiDAR feature prediction. We benchmark our SuperFusion on the nuScenes dataset and a self-recorded dataset and show that it outperforms the state-of-the-art baseline methods with large margins on all intervals. Additionally, we apply the generated HD map to a downstream path planning task, demonstrating that the long-range HD maps predicted by our method can lead to better path planning for autonomous vehicles. Our code has been released at https://github.com/haomo-ai/SuperFusion.
CV · Feb 21, 2023 · Code
Lightweight Real-time Semantic Segmentation Network with Efficient Transformer and CNN
Guoan Xu, Juncheng Li, Guangwei Gao et al.
In the past decade, convolutional neural networks (CNNs) have shown prominence for semantic segmentation. Although CNN models have very impressive performance, their ability to capture global representations is still insufficient, which leads to suboptimal results. Recently, Transformers achieved huge success in NLP tasks, demonstrating their advantages in modeling long-range dependencies. Transformers have also attracted tremendous attention from computer vision researchers, who reformulate image processing tasks as sequence-to-sequence prediction, but at the cost of deteriorated local feature details. In this work, we propose a lightweight real-time semantic segmentation network called LETNet. LETNet combines a U-shaped CNN with a Transformer effectively in a capsule embedding style to compensate for their respective deficiencies. Meanwhile, the elaborately designed Lightweight Dilated Bottleneck (LDB) module and Feature Enhancement (FE) module both have a positive impact on training from scratch. Extensive experiments performed on challenging datasets demonstrate that LETNet achieves a superior balance between accuracy and efficiency. Specifically, it contains only 0.95M parameters and 13.6G FLOPs but yields 72.8% mIoU at 120 FPS on the Cityscapes test set and 70.5% mIoU at 250 FPS on the CamVid test dataset using a single RTX 3090 GPU. The source code will be available at https://github.com/IVIPLab/LETNet.
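The mIoU figures quoted above are mean intersection-over-union scores. As a rough illustration of how that metric is usually computed from a confusion matrix, here is a minimal sketch; the function name, the toy matrix, and the choice to skip classes that never occur are assumptions for illustration, not details taken from the LETNet evaluation code.

```python
def mean_iou(confusion):
    """Mean intersection-over-union from a square confusion matrix.

    confusion[i][j] counts pixels whose true class is i and predicted class is j.
    Classes absent from both labels and predictions are skipped (an assumption;
    benchmarks differ in how they treat such classes).
    """
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                       # true class c, predicted otherwise
        fp = sum(confusion[r][c] for r in range(n)) - tp  # predicted c, true class differs
        union = tp + fp + fn
        if union > 0:
            ious.append(tp / union)
    return sum(ious) / len(ious) if ious else 0.0


# Hypothetical 3-class toy example; class 2 never appears and is skipped.
toy_confusion = [
    [50, 10, 0],
    [5, 30, 0],
    [0, 0, 0],
]
score = mean_iou(toy_confusion)
```

Per-class IoU is TP / (TP + FP + FN), so a model can have high pixel accuracy and still score poorly on rare classes.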
CV · Mar 7, 2023 · Code
InsMOS: Instance-Aware Moving Object Segmentation in LiDAR Data
Neng Wang, Chenghao Shi, Ruibin Guo et al.
Identifying moving objects is a crucial capability for autonomous navigation, consistent map generation, and future trajectory prediction of objects. In this paper, we propose a novel network that addresses the challenge of segmenting moving objects in 3D LiDAR scans. Our approach not only predicts point-wise moving labels but also detects instance information of main traffic participants. Such a design helps determine which instances are actually moving and which ones are temporarily static in the current scene. Our method takes a sequence of point clouds as input and quantizes them into 4D voxels. We use 4D sparse convolutions to extract motion features from the 4D voxels and inject them into the current scan. Then, we extract spatio-temporal features from the current scan for instance detection and feature fusion. Finally, we design an upsample fusion module to output point-wise labels by fusing the spatio-temporal features and predicted instance information. We evaluated our approach on the LiDAR-MOS benchmark based on SemanticKITTI and achieved better moving object segmentation performance compared to state-of-the-art methods, demonstrating the effectiveness of our approach in integrating instance information for moving object segmentation. Furthermore, our method shows superior performance on the Apollo dataset with a model pre-trained on SemanticKITTI, indicating that our method generalizes well to different scenes. The code and pre-trained models of our method will be released at https://github.com/nubot-nudt/InsMOS.
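The step of turning a sequence of point clouds into 4D voxels can be pictured with a small sketch. It assumes the scans are already expressed in a common coordinate frame and uses a hypothetical voxel size; the function name and the sparse dictionary layout are illustrative choices, not the InsMOS implementation.

```python
import math
from collections import defaultdict


def voxelize_4d(scans, voxel_size=0.5):
    """Group points from a time-ordered list of scans into sparse 4D voxels.

    scans: list of scans, each a list of (x, y, z) tuples in a shared frame.
    Returns {(ix, iy, iz, t): point_count}, where t is the scan index
    (0 = oldest). Only occupied voxels are stored.
    """
    voxels = defaultdict(int)
    for t, scan in enumerate(scans):
        for x, y, z in scan:
            key = (
                math.floor(x / voxel_size),
                math.floor(y / voxel_size),
                math.floor(z / voxel_size),
                t,  # time acts as the fourth voxel coordinate
            )
            voxels[key] += 1
    return dict(voxels)


# Hypothetical two-scan sequence.
example_scans = [
    [(0.1, 0.1, 0.1), (0.2, 0.3, 0.4)],
    [(0.1, 0.1, 0.1), (-0.1, 0.0, 0.0)],
]
occupied = voxelize_4d(example_scans)
```

Storing only occupied cells is what makes sparse convolution practical here, since most of a 4D grid over a driving scene is empty.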
RO · Sep 15, 2023 · Code
Fast and Accurate Deep Loop Closing and Relocalization for Reliable LiDAR SLAM
Chenghao Shi, Xieyuanli Chen, Junhao Xiao et al.
Loop closing and relocalization are crucial techniques to establish reliable and robust long-term SLAM by addressing pose estimation drift and degeneration. This article begins by formulating loop closing and relocalization within a unified framework. Then, we propose a novel multi-head network, LCR-Net, to tackle both tasks effectively. It exploits novel feature extraction and a pose-aware attention mechanism to precisely estimate similarities and 6-DoF poses between pairs of LiDAR scans. Finally, we integrate our LCR-Net into a SLAM system and achieve robust and accurate online LiDAR SLAM in outdoor driving environments. We thoroughly evaluate our LCR-Net using multiple datasets through three setups derived from loop closing and relocalization: candidate retrieval, closed-loop point cloud registration, and continuous relocalization. The results demonstrate that LCR-Net excels in all three tasks, surpassing the state-of-the-art methods and exhibiting a remarkable generalization ability. Notably, our LCR-Net outperforms baseline methods without using a time-consuming robust pose estimator, rendering it suitable for online SLAM applications. To the best of our knowledge, the integration of LCR-Net yields the first LiDAR SLAM with the capability of deep loop closing and relocalization. The implementation of our methods will be made open-source.
RO · Mar 8, 2023 · Code
ElC-OIS: Ellipsoidal Clustering for Open-World Instance Segmentation on LiDAR Data
Wenbang Deng, Kaihong Huang, Qinghua Yu et al.
Open-world Instance Segmentation (OIS) is a challenging task that aims to accurately segment every object instance appearing in the current observation, regardless of whether these instances have been labeled in the training set. This is important for safety-critical applications such as robust autonomous navigation. In this paper, we present a flexible and effective OIS framework for LiDAR point clouds that can accurately segment both known and unknown instances (i.e., seen and unseen instance categories during training). It first identifies points belonging to known classes and removes the background by leveraging closed-set panoptic segmentation networks. Then, we propose a novel ellipsoidal clustering method that is better adapted to the characteristics of LiDAR scans and allows precise segmentation of unknown instances. Furthermore, a diffuse searching method is proposed to handle the common over-segmentation problem present in the known instances. With the combination of these techniques, we are able to achieve accurate segmentation for both known and unknown instances. We evaluated our method on the SemanticKITTI open-world LiDAR instance segmentation dataset. The experimental results suggest that it outperforms current state-of-the-art methods, especially with a 10.0% improvement in association quality. The source code of our method will be publicly available at https://github.com/nubot-nudt/ElC-OIS.
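The abstract does not spell out the ellipsoidal clustering rule, so the following is only a generic sketch of the idea of clustering with an ellipsoidal rather than spherical neighborhood: two points are linked when their axis-scaled offset falls inside a unit ellipsoid, and clusters are grown by breadth-first search. The function name, the semi-axis values, and the brute-force neighbor search are all assumptions for illustration and should not be read as the ElC-OIS algorithm.

```python
from collections import deque


def ellipsoidal_clusters(points, axes=(1.0, 1.0, 0.5)):
    """Label points by region growing with an ellipsoidal neighborhood.

    points: list of (x, y, z) tuples.
    axes: hypothetical semi-axes (a, b, c) of the neighborhood ellipsoid.
    Returns a list of integer cluster labels, one per point.
    """
    a, b, c = axes

    def near(p, q):
        # Inside the ellipsoid centered at p when the scaled squared distance is <= 1.
        return (
            ((p[0] - q[0]) / a) ** 2
            + ((p[1] - q[1]) / b) ** 2
            + ((p[2] - q[2]) / c) ** 2
        ) <= 1.0

    labels = [-1] * len(points)
    current = 0
    for seed in range(len(points)):
        if labels[seed] != -1:
            continue
        labels[seed] = current
        queue = deque([seed])
        while queue:
            i = queue.popleft()
            # Brute-force O(n^2) search; a real system would use a spatial index.
            for j in range(len(points)):
                if labels[j] == -1 and near(points[i], points[j]):
                    labels[j] = current
                    queue.append(j)
        current += 1
    return labels


# Hypothetical points: three along x, one offset in z, one far away.
example_points = [(0, 0, 0), (0.8, 0, 0), (1.6, 0, 0), (0, 0, 0.8), (5, 5, 0)]
example_labels = ellipsoidal_clusters(example_points)
```

With the shorter vertical semi-axis, the point 0.8 m above the first one is not merged even though a point 0.8 m away horizontally is, which is the kind of anisotropy an ellipsoidal neighborhood provides.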
CV · May 24, 2022
Thunder: Thumbnail based Fast Lightweight Image Denoising Network
Yifeng Zhou, Xing Xu, Shuaicheng Liu et al.
To achieve promising results on removing noise from real-world images, most existing denoising networks are formulated with complex network structures, making them impractical for deployment. Some attempts focused on reducing the number of filters and feature channels but suffered from large performance loss, so a more practical and lightweight denoising network with fast inference speed is in high demand. To this end, a Thumbnail based Denoising Network, dubbed Thunder, is proposed and implemented as a lightweight structure for fast restoration without compromising the denoising capabilities. Specifically, the Thunder model contains two newly established modules: (1) a wavelet-based Thumbnail Subspace Encoder (TSE), which can leverage sub-band correlation to provide an approximate thumbnail based on the low-frequency feature; (2) a Subspace Projection based Refine Module (SPR), which can progressively restore the details of the thumbnail based on the subspace projection approach. Extensive experiments have been carried out on two real-world denoising benchmarks, demonstrating that the proposed Thunder outperforms the existing lightweight models and achieves competitive performance on PSNR and SSIM when compared with the complex designs.
CV · Mar 31, 2023
RDMNet: Reliable Dense Matching Based Point Cloud Registration for Autonomous Driving
Chenghao Shi, Xieyuanli Chen, Huimin Lu et al.
Point cloud registration is an important task in robotics and autonomous driving to estimate the ego-motion of the vehicle. Recent advances following the coarse-to-fine manner show promising potential in point cloud registration. However, existing methods rely on good superpoint correspondences, which are hard to obtain reliably and efficiently, thus resulting in less robust and accurate point cloud registration. In this paper, we propose a novel network, named RDMNet, to find dense point correspondences coarse-to-fine and improve final pose estimation based on such reliable correspondences. Our RDMNet uses a devised 3D-RoFormer mechanism to first extract distinctive superpoints and generate reliable superpoint matches between two point clouds. The proposed 3D-RoFormer fuses 3D position information into the transformer network, efficiently exploiting point clouds' contextual and geometric information to generate robust superpoint correspondences. RDMNet then propagates the sparse superpoint matches to dense point matches using the neighborhood information for accurate point cloud registration. We extensively evaluate our method on multiple datasets from different environments. The experimental results demonstrate that our method outperforms existing state-of-the-art approaches on all tested datasets with a strong generalization ability.
59.5 · RO · May 27
SPRINT: Efficient Spectral Priors for Humanoid Athletic Sprints
Yantong Wei, Kaihong Huang, Hainan Pan et al.
The pursuit of humanoid athletic sprints is hindered by a scarcity of humanoid-viable kinematic reference data and the inability of existing frameworks to maintain stability during sprints. To overcome these limitations, we introduce SPRINT, a novel framework driven by efficient, frequency-adaptive spectral priors. By characterizing the fundamental periodicity of human locomotion in the frequency domain using a reference library of five discrete motion sequences, these priors generate kinematically feasible joint trajectories across a broad velocity spectrum, successfully extrapolating to speeds that exceed the reference distribution. Guided by these pretrained priors, the SPRINT policy achieves zero-shot sim-to-real transfer in field experiments on the Unitree G1 platform, reaching a peak sprinting velocity of 6 m/s and demonstrating seamless gait transitions while preserving biomimetic naturalness. Ultimately, this work establishes frequency-adaptive spectral priors as a highly data-efficient foundation for humanoid athletic sprints. The project page is available at https://anonymous.4open.science/w/SPRINT-138A/.
70.0 · CV · Mar 13 · Code
IGASA: Integrated Geometry-Aware and Skip-Attention Modules for Enhanced Point Cloud Registration
Dongxu Zhang, Jihua Zhu, Shiqi Li et al.
Point cloud registration (PCR) is a fundamental task in 3D vision and provides essential support for applications such as autonomous driving, robotics, and environmental modeling. Despite its widespread use, existing methods often fail when facing real-world challenges like heavy noise, significant occlusions, and large-scale transformations. These limitations frequently result in compromised registration accuracy and insufficient robustness in complex environments. In this paper, we propose IGASA as a novel registration framework constructed upon a Hierarchical Pyramid Architecture (HPA) designed for robust multi-scale feature extraction and fusion. The framework integrates two pivotal components consisting of the Hierarchical Cross-Layer Attention (HCLA) module and the Iterative Geometry-Aware Refinement (IGAR) module. The HCLA module utilizes skip attention mechanisms to align multi-resolution features and enhance local geometric consistency. Simultaneously, the IGAR module is designed for the fine matching phase by leveraging reliable correspondences established during coarse matching. This synergistic integration within the architecture allows IGASA to adapt effectively to diverse point cloud structures and intricate transformations. We evaluate the performance of IGASA on four widely recognized benchmark datasets including 3D(Lo)Match, KITTI, and nuScenes. Our extensive experiments consistently demonstrate that IGASA significantly surpasses state-of-the-art methods and achieves notable improvements in registration accuracy. This work provides a robust foundation for advancing point cloud registration techniques while offering valuable insights for practical 3D vision applications. The code for IGASA is available at https://github.com/DongXu-Zhang/IGASA.
CV · Sep 1, 2024
Attention-Guided Multi-scale Interaction Network for Face Super-Resolution
Xujie Wan, Wenjie Li, Guangwei Gao et al.
Recently, CNN and Transformer hybrid networks have demonstrated excellent performance in face super-resolution (FSR) tasks. Since hybrid networks contain numerous features at different scales, how to fuse these multiscale features and promote their complementarity is crucial for enhancing FSR. However, existing hybrid network-based FSR methods ignore this and simply combine the Transformer and CNN. To address this issue, we propose an attention-guided multiscale interaction network (AMINet), which incorporates local and global feature interactions, as well as encoder-decoder phase feature interactions. Specifically, we propose a Local and Global Feature Interaction Module (LGFI) to promote the fusion of global features and the local features extracted from different receptive fields by our Residual Depth Feature Extraction Module (RDFE). Additionally, we propose a Selective Kernel Attention Fusion Module (SKAF) to adaptively select fusions of different features within the LGFI and encoder-decoder phases. This design allows multiscale features to flow freely within modules and between the encoder and decoder, which promotes the complementarity of features at different scales to enhance FSR. Comprehensive experiments confirm that our method consistently performs well with less computational consumption and faster inference.
CL · Jul 24, 2024
Towards Transfer Unlearning: Empirical Evidence of Cross-Domain Bias Mitigation
Huimin Lu, Masaru Isonuma, Junichiro Mori et al.
Large language models (LLMs) often inherit biases from vast amounts of training corpora. Traditional debiasing methods, while effective to some extent, do not completely eliminate memorized biases and toxicity in LLMs. In this paper, we study an unlearning-based approach to debiasing in LLMs by performing gradient ascent on hate speech against minority groups, i.e., minimizing the likelihood of biased or toxic content. Specifically, we propose a mask language modeling unlearning technique, which unlearns the harmful part of the text. This method enables LLMs to selectively forget and disassociate from biased and harmful content. Experimental results demonstrate the effectiveness of our approach in diminishing bias while maintaining the language modeling abilities. Surprisingly, the results also unveil an unexpected potential for cross-domain transfer unlearning: debiasing in one bias form (e.g. gender) may contribute to mitigating others (e.g. race and religion).
39.2 · SP · Apr 3
A General Framework for Generative Self-supervised Learning in Non-invasive Estimation of Physiological Parameters Using Photoplethysmography
Zexing Zhang, Huimin Lu, Songzhe Ma et al.
Aligning physiological parameter labels with large-scale photoplethysmographic (PPG) data for deep learning is challenging and resource-intensive. While self-supervised representation learning (SSRL) can handle limited annotated data, the challenge lies in learning robust shared representations from vast unlabeled data and integrating contextual cues to learn distinctive representations. To alleviate these challenges, a generative SSRL framework TS2TC is proposed to utilize the temporal, spectrogram, and temporal-spectrogram mixed domains to explore and incorporate the unique features of PPG for universal and noninvasive physiological parameter estimation. A pretext task named Cross-Temporal Fusion Generative Anchor (CTFGA) is designed, modeling temporal dependencies and reconstructing independent segments at a coarse level to provide robust global feature extraction and local contextual representation. The framework includes sub-signals from PPG with diverse frequency scales and order derivatives reflecting hemodynamics to facilitate learning shared representations at varying semantic levels. Secondly, a cognitive-inspired dual-process transfer (DPT) strategy is formulated, consisting of prior-dependent autonomous processes and posterior observation reasoning processes, to leverage the independent and integrated advantages of shared and specific representations. TS2TC introduces a bilinear temporal-spectrogram fusion method in the mixed domain, aligning latent representations from different domains and establishing fine-grained contextual interactions across multiple sources of information. Extensive experiments on physiological parameter estimation tasks showed that the joint performance of CTFGA and DPT outperforms standard generative learning significantly. TS2TC achieved an average 2.49% improvement in RMSE over state-of-the-art estimation methods with only 10% training data.
CV · Mar 16, 2025 · Code
ResLPR: A LiDAR Data Restoration Network and Benchmark for Robust Place Recognition Against Weather Corruptions
Wenqing Kuang, Xiongwei Zhao, Yehui Shen et al.
LiDAR-based place recognition (LPR) is a key component for autonomous driving, and its resilience to environmental corruption is critical for safety in high-stakes applications. While state-of-the-art (SOTA) LPR methods perform well in clean weather, they still struggle with weather-induced corruption commonly encountered in driving scenarios. To tackle this, we propose ResLPRNet, a novel LiDAR data restoration network that largely enhances LPR performance under adverse weather by restoring corrupted LiDAR scans using a wavelet transform-based network. ResLPRNet is efficient, lightweight and can be integrated plug-and-play with pretrained LPR models without substantial additional computational cost. Given the lack of LPR datasets under adverse weather, we introduce ResLPR, a novel benchmark that examines SOTA LPR methods under a wide range of LiDAR distortions induced by severe snow, fog, and rain conditions. Experiments on our proposed WeatherKITTI and WeatherNCLT datasets demonstrate the resilience and notable gains achieved by using our restoration method with multiple LPR approaches in challenging weather scenarios. Our code and benchmark are publicly available here: https://github.com/nubot-nudt/ResLPR.
RO · Mar 14, 2025 · Code
BEVDiffLoc: End-to-End LiDAR Global Localization in BEV View based on Diffusion Model
Ziyue Wang, Chenghao Shi, Neng Wang et al.
Localization is one of the core parts of modern robotics. Classic localization methods typically follow the retrieve-then-register paradigm, achieving remarkable success. Recently, the emergence of end-to-end localization approaches has offered distinct advantages, including a streamlined system architecture and the elimination of the need to store extensive map data. Although these methods have demonstrated promising results, current end-to-end localization approaches still face limitations in robustness and accuracy. Bird's-Eye-View (BEV) image is one of the most widely adopted data representations in autonomous driving. It significantly reduces data complexity while preserving spatial structure and scale consistency, making it an ideal representation for localization tasks. However, research on BEV-based end-to-end localization remains notably insufficient. To fill this gap, we propose BEVDiffLoc, a novel framework that formulates LiDAR localization as a conditional generation of poses. Leveraging the properties of BEV, we first introduce a specific data augmentation method to significantly enhance the diversity of input data. Then, the Maximum Feature Aggregation Module and Vision Transformer are employed to learn robust features while maintaining robustness against significant rotational view variations. Finally, we incorporate a diffusion model that iteratively refines the learned features to recover the absolute pose. Extensive experiments on the Oxford Radar RobotCar and NCLT datasets demonstrate that BEVDiffLoc outperforms the baseline methods. Our code is available at https://github.com/nubot-nudt/BEVDiffLoc.
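To make the Bird's-Eye-View representation mentioned above concrete, here is a minimal sketch of rasterizing a LiDAR scan into a top-down count grid. The grid extent, resolution, and function name are hypothetical, and this simple occupancy count is only one common way to build a BEV image; it is not claimed to be the BEVDiffLoc preprocessing.

```python
def points_to_bev(points, x_range=(-10.0, 10.0), y_range=(-10.0, 10.0), resolution=1.0):
    """Rasterize (x, y, z) points into a 2D grid of point counts.

    Rows index x bins and columns index y bins; height (z) is discarded.
    Points outside the half-open ranges [min, max) are ignored.
    """
    rows = int(round((x_range[1] - x_range[0]) / resolution))
    cols = int(round((y_range[1] - y_range[0]) / resolution))
    grid = [[0] * cols for _ in range(rows)]
    for x, y, _z in points:
        if x_range[0] <= x < x_range[1] and y_range[0] <= y < y_range[1]:
            # Clamp to guard against floating-point rounding at the upper edge.
            r = min(int((x - x_range[0]) / resolution), rows - 1)
            c = min(int((y - y_range[0]) / resolution), cols - 1)
            grid[r][c] += 1
    return grid


# Hypothetical scan: four points inside the grid, one outside.
example_scan = [(-10.0, -10.0, 0.0), (0.0, 0.0, 1.0), (0.5, 0.5, 2.0), (9.99, 9.99, 0.0), (10.0, 0.0, 0.0)]
bev = points_to_bev(example_scan)
```

Because every cell covers a fixed metric area, distances in the grid stay proportional to distances on the ground, which is the scale consistency the abstract refers to.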
CV · Mar 14, 2025 · Code
A Novel Decomposed Feature-Oriented Framework for Open-Set Semantic Segmentation on LiDAR Data
Wenbang Deng, Xieyuanli Chen, Qinghua Yu et al.
Semantic segmentation is a key technique that enables mobile robots to understand and navigate surrounding environments autonomously. However, most existing works focus on segmenting known objects and overlook the identification of unknown classes, which are common in real-world applications. In this paper, we propose a feature-oriented framework for open-set semantic segmentation on LiDAR data, capable of identifying unknown objects while retaining the ability to classify known ones. We design a decomposed dual-decoder network to simultaneously perform closed-set semantic segmentation and generate distinctive features for unknown objects. The network is trained with multi-objective loss functions to capture the characteristics of known and unknown objects. Using the extracted features, we introduce an anomaly detection mechanism to identify unknown objects. By integrating the results of closed-set semantic segmentation and anomaly detection, we achieve effective feature-driven LiDAR open-set semantic segmentation. Evaluations on both SemanticKITTI and nuScenes datasets demonstrate that our proposed framework significantly outperforms state-of-the-art methods. The source code will be made publicly available at https://github.com/nubot-nudt/DOSS.
CV · Apr 2, 2024 · Code
TSCM: A Teacher-Student Model for Vision Place Recognition Using Cross-Metric Knowledge Distillation
Yehui Shen, Mingmin Liu, Huimin Lu et al.
Visual place recognition (VPR) plays a pivotal role in autonomous exploration and navigation of mobile robots within complex outdoor environments. While cost-effective and easily deployed, camera sensors are sensitive to lighting and weather changes, and even slight image alterations can greatly affect VPR efficiency and precision. Existing methods overcome this by exploiting powerful yet large networks, leading to significant consumption of computational resources. In this paper, we propose a high-performance teacher and lightweight student distillation framework called TSCM. It exploits our devised cross-metric knowledge distillation to narrow the performance gap between the teacher and student models, maintaining superior performance while enabling minimal computational load during deployment. We conduct comprehensive evaluations on large-scale datasets, namely Pittsburgh30k and Pittsburgh250k. Experimental results demonstrate the superiority of our method over baseline models in terms of recognition accuracy and model parameter efficiency. Moreover, our ablation studies show that the proposed knowledge distillation technique surpasses other counterparts. The code of our method has been released at https://github.com/nubot-nudt/TSCM.
CV · Mar 17, 2025 · Code
Efficient Multimodal 3D Object Detector via Instance-Level Contrastive Distillation
Zhuoqun Su, Huimin Lu, Shuaifeng Jiao et al.
Multimodal 3D object detectors leverage the strengths of both geometry-aware LiDAR point clouds and semantically rich RGB images to enhance detection performance. However, the inherent heterogeneity between these modalities, including unbalanced convergence and modal misalignment, poses significant challenges. Meanwhile, the large size of the detection-oriented feature also constrains existing fusion strategies to capture long-range dependencies for the 3D detection tasks. In this work, we introduce a fast yet effective multimodal 3D object detector, incorporating our proposed Instance-level Contrastive Distillation (ICD) framework and Cross Linear Attention Fusion Module (CLFM). ICD aligns instance-level image features with LiDAR representations through object-aware contrastive distillation, ensuring fine-grained cross-modal consistency. Meanwhile, CLFM presents an efficient and scalable fusion strategy that enhances cross-modal global interactions within sizable multimodal BEV features. Extensive experiments on the KITTI and nuScenes 3D object detection benchmarks demonstrate the effectiveness of our methods. Notably, our 3D object detector outperforms state-of-the-art (SOTA) methods while achieving superior efficiency. The implementation of our method has been released as open-source at: https://github.com/nubot-nudt/ICD-Fusion.
RO · Mar 14, 2025 · Code
Image-Goal Navigation Using Refined Feature Guidance and Scene Graph Enhancement
Zhicheng Feng, Xieyuanli Chen, Chenghao Shi et al.
In this paper, we introduce a novel image-goal navigation approach, named RFSG. Our focus lies in leveraging the fine-grained connections between goals, observations, and the environment within limited image data, all the while keeping the navigation architecture simple and lightweight. To this end, we propose the spatial-channel attention mechanism, enabling the network to learn the importance of multi-dimensional features to fuse the goal and observation features. In addition, a self-distillation mechanism is incorporated to further enhance the feature representation capabilities. Given that the navigation task needs surrounding environmental information for more efficient navigation, we propose an image scene graph to establish feature associations at both the image and object levels, effectively encoding the surrounding scene information. Cross-scene performance validation was conducted on the Gibson and HM3D datasets, and the proposed method achieved state-of-the-art results among mainstream methods, with a speed of up to 53.5 frames per second on an RTX3080. This contributes to the realization of end-to-end image-goal navigation in real-world scenarios. The implementation and model of our method have been released at: https://github.com/nubot-nudt/RFSG.
CV · Jun 24, 2024 · Code
SegNet4D: Efficient Instance-Aware 4D Semantic Segmentation for LiDAR Point Cloud
Neng Wang, Ruibin Guo, Chenghao Shi et al.
4D LiDAR semantic segmentation, also referred to as multi-scan semantic segmentation, plays a crucial role in enhancing the environmental understanding capabilities of autonomous vehicles or robots. It classifies the semantic category of each LiDAR measurement point and detects whether it is dynamic, a critical ability for tasks like obstacle avoidance and autonomous navigation. Existing approaches often rely on computationally heavy 4D convolutions or recursive networks, which result in poor real-time performance, making them unsuitable for online robotics and autonomous driving applications. In this paper, we introduce SegNet4D, a novel real-time 4D semantic segmentation network offering both efficiency and strong semantic understanding. SegNet4D addresses 4D segmentation as two tasks: single-scan semantic segmentation and moving object segmentation, each tackled by a separate network head. Both results are combined in a motion-semantic fusion module to achieve comprehensive 4D segmentation. Additionally, instance information is extracted from the current scan and exploited for instance-wise segmentation consistency. Our approach surpasses state-of-the-art in both multi-scan semantic segmentation and moving object segmentation while offering greater efficiency, enabling real-time operation. Besides, its effectiveness and efficiency have also been validated on a real-world unmanned ground platform. Our code will be released at https://github.com/nubot-nudt/SegNet4D.
CV · Mar 27, 2025 · Code
UGNA-VPR: A Novel Training Paradigm for Visual Place Recognition Based on Uncertainty-Guided NeRF Augmentation
Yehui Shen, Lei Zhang, Qingqiu Li et al.
Visual place recognition (VPR) is crucial for robots to identify previously visited locations, playing an important role in autonomous navigation in both indoor and outdoor environments. However, most existing VPR datasets are limited to single-viewpoint scenarios, leading to reduced recognition accuracy, particularly in multi-directional driving or feature-sparse scenes. Moreover, obtaining additional data to mitigate these limitations is often expensive. This paper introduces a novel training paradigm to improve the performance of existing VPR networks by enhancing multi-view diversity within current datasets through uncertainty estimation and NeRF-based data augmentation. Specifically, we initially train NeRF using the existing VPR dataset. Then, our devised self-supervised uncertainty estimation network identifies places with high uncertainty. The poses of these uncertain places are input into NeRF to generate new synthetic observations for further training of VPR networks. Additionally, we propose an improved storage method for efficient organization of augmented and original training data. We conducted extensive experiments on three datasets and tested three different VPR backbone networks. The results demonstrate that our proposed training paradigm significantly improves VPR performance by fully utilizing existing data, outperforming other training approaches. We further validated the effectiveness of our approach on self-recorded indoor and outdoor datasets, consistently demonstrating superior results. Our dataset and code have been released at https://github.com/nubot-nudt/UGNA-VPR.
CV · Mar 14, 2025 · Code
LuSeg: Efficient Negative and Positive Obstacles Segmentation via Contrast-Driven Multi-Modal Feature Fusion on the Lunar
Shuaifeng Jiao, Zhiwen Zeng, Zhuoqun Su et al.
As lunar exploration missions grow increasingly complex, ensuring safe and autonomous rover-based surface exploration has become one of the key challenges in lunar exploration tasks. In this work, we have developed a lunar surface simulation system called the Lunar Exploration Simulator System (LESS) and the LunarSeg dataset, which provides RGB-D data for lunar obstacle segmentation that includes both positive and negative obstacles. Additionally, we propose a novel two-stage segmentation network called LuSeg. Through contrastive learning, it enforces semantic consistency between the RGB encoder from Stage I and the depth encoder from Stage II. Experimental results on our proposed LunarSeg dataset and an additional public real-world NPO road obstacle dataset demonstrate that LuSeg achieves state-of-the-art segmentation performance for both positive and negative obstacles while maintaining a high inference speed of approximately 57 Hz. We have released the implementation of our LESS system, the LunarSeg dataset, and the code of LuSeg at: https://github.com/nubot-nudt/LuSeg.
CVMar 2, 2025Code
Delving into Out-of-Distribution Detection with Medical Vision-Language ModelsLie Ju, Sijin Zhou, Yukun Zhou et al.
Recent advances in medical vision-language models (VLMs) demonstrate impressive performance in image classification tasks, driven by their strong zero-shot generalization capabilities. However, given the high variability and complexity inherent in medical imaging data, the ability of these models to detect out-of-distribution (OOD) data in this domain remains underexplored. In this work, we conduct the first systematic investigation into the OOD detection potential of medical VLMs. We evaluate state-of-the-art VLM-based OOD detection methods across a diverse set of medical VLMs, including both general-purpose and domain-specific models. To accurately reflect real-world challenges, we introduce a cross-modality evaluation pipeline for benchmarking full-spectrum OOD detection, rigorously assessing model robustness against both semantic shifts and covariate shifts. Furthermore, we propose a novel hierarchical prompt-based method that significantly enhances OOD detection performance. Extensive experiments are conducted to validate the effectiveness of our approach. The code is available at https://github.com/PyJulie/Medical-VLMs-OOD-Detection.
CVDec 16, 2021Code
Feature Distillation Interaction Weighting Network for Lightweight Image Super-ResolutionGuangwei Gao, Wenjie Li, Juncheng Li et al.
Convolutional neural network based single-image super-resolution (SISR) has made great progress in recent years. However, it is difficult to apply these methods to real-world scenarios due to their computational and memory cost. Meanwhile, how to take full advantage of the intermediate features under the constraints of limited parameters and calculations is also a huge challenge. To alleviate these issues, we propose a lightweight yet efficient Feature Distillation Interaction Weighted Network (FDIWN). Specifically, FDIWN utilizes a series of specially designed Feature Shuffle Weighted Groups (FSWG) as the backbone, and several novel mutual Wide-residual Distillation Interaction Blocks (WDIB) form an FSWG. In addition, Wide Identical Residual Weighting (WIRW) units and Wide Convolutional Residual Weighting (WCRW) units are introduced into WDIB for better feature distillation. Moreover, a Wide-Residual Distillation Connection (WRDC) framework and a Self-Calibration Fusion (SCF) unit are proposed to let features at different scales interact more flexibly and efficiently. Extensive experiments show that our FDIWN strikes a better balance between model performance and efficiency than other models. The code is available at https://github.com/IVIPLab/FDIWN.
CVSep 2, 2021Code
FBSNet: A Fast Bilateral Symmetrical Network for Real-Time Semantic SegmentationGuangwei Gao, Guoan Xu, Juncheng Li et al.
Real-time semantic segmentation, which can be visually understood as the pixel-level classification task on the input image, currently has broad application prospects, especially in the fast-developing fields of autonomous driving and drone navigation. However, the huge computational burden and redundant parameters are still obstacles to its technological development. In this paper, we propose a Fast Bilateral Symmetrical Network (FBSNet) to alleviate the above challenges. Specifically, FBSNet employs a symmetrical encoder-decoder structure with two branches, a semantic information branch and a spatial detail branch. The Semantic Information Branch (SIB) is the main branch with semantic architecture to acquire the contextual information of the input image and meanwhile acquire a sufficient receptive field. The Spatial Detail Branch (SDB), in contrast, is a shallow and simple network used to establish local dependencies of each pixel for preserving details, which is essential for restoring the original resolution during the decoding phase. Meanwhile, a Feature Aggregation Module (FAM) is designed to effectively combine the output of these two branches. Experimental results on Cityscapes and CamVid show that the proposed FBSNet can strike a good balance between accuracy and efficiency. Specifically, it obtains 70.9% and 68.9% mIoU along with inference speeds of 90 fps and 120 fps on these two test datasets, respectively, with only 0.62 million parameters on a single RTX 2080Ti GPU. The code is available at https://github.com/IVIPLab/FBSNet.
CVMar 14
DCP-CLIP:A Coarse-to-Fine Framework for Open-Vocabulary Semantic Segmentation with Dual InteractionJing Wang, Huimin Shi, Quan Zhou et al.
Recent years have witnessed remarkable development in open-vocabulary semantic segmentation (OVSS) using visual-language foundation models, yet existing methods still suffer from the following fundamental challenges: (1) insufficient cross-modal communication between textual and visual spaces, and (2) significant computational costs from interactions with a massive number of categories. To address these issues, this paper describes a novel coarse-to-fine framework, called DCP-CLIP, for OVSS. Unlike prior efforts that mainly relied on pre-established category content and the inherent spatial-class interaction capability of CLIP, we dynamically construct category-relevant textual features and explicitly model dual interactions between spatial image features and textual class semantics. Specifically, we first leverage CLIP's open-vocabulary recognition capability to identify semantic categories relevant to the image context, upon which we dynamically generate corresponding textual features to serve as initial textual guidance. Subsequently, we conduct a coarse segmentation by cross-modally integrating semantic information from textual guidance into the visual representations and achieve refined segmentation by integrating spatially enriched features from the encoder to recover fine-grained details and enhance spatial resolution. Finally, we leverage spatial information from the segmentation side to refine category predictions for each mask, facilitating more precise semantic labeling. Experiments on multiple OVSS benchmarks demonstrate that DCP-CLIP outperforms existing methods by delivering both higher accuracy and greater efficiency.
CVFeb 26
Exploring the AI Obedience: Why is Generating a Pure Color Image Harder than CyberPunk?Hongyu Li, Kuan Liu, Yuan Chen et al.
Recent advances in generative AI have demonstrated remarkable ability to produce high-quality content. However, these models often exhibit a "Paradox of Simplicity": while they can render intricate landscapes, they often fail at simple, deterministic tasks. To address this, we formalize Obedience as the ability to align with instructions and establish a hierarchical grading system ranging from basic semantic alignment to pixel-level systemic precision, which provides a unified paradigm for incorporating and categorizing existing literature. Then, we conduct case studies to identify common obedience gaps, revealing how generative priors often override logical constraints. To evaluate high-level obedience, we present VIOLIN (VIsual Obedience Level-4 EvaluatIoN), the first benchmark focused on pure color generation across six variants. Extensive experiments on SOTA models reveal fundamental obedience limitations and further exploratory insights. By establishing this framework, we aim to draw more attention to AI Obedience and encourage deeper exploration to bridge this gap.
CVApr 9, 2024
Diffusion-Based Point Cloud Super-Resolution for mmWave Radar DataKai Luan, Chenghao Shi, Neng Wang et al.
The millimeter-wave radar sensor maintains stable performance under adverse environmental conditions, making it a promising solution for all-weather perception tasks, such as outdoor mobile robotics. However, radar point clouds are relatively sparse and contain many ghost points, which greatly limits the development of mmWave radar technology. In this paper, we propose a novel point cloud super-resolution approach for 3D mmWave radar data, named Radar-diffusion. Our approach employs a diffusion model defined by mean-reverting stochastic differential equations (SDEs). Using our proposed new objective function with supervision from corresponding LiDAR point clouds, our approach efficiently handles radar ghost points and enhances the sparse mmWave radar point clouds to dense LiDAR-like point clouds. We evaluate our approach on two different datasets, and the experimental results show that our method outperforms the state-of-the-art baseline methods in 3D radar super-resolution tasks. Furthermore, we demonstrate that our enhanced radar point clouds can be used for downstream radar point-based registration tasks.
CLApr 29, 2025
UniDetox: Universal Detoxification of Large Language Models via Dataset DistillationHuimin Lu, Masaru Isonuma, Junichiro Mori et al.
We present UniDetox, a universally applicable method designed to mitigate toxicity across various large language models (LLMs). Previous detoxification methods are typically model-specific, addressing only individual models or model families, and require careful hyperparameter tuning due to the trade-off between detoxification efficacy and language modeling performance. In contrast, UniDetox provides a detoxification technique that can be universally applied to a wide range of LLMs without the need for separate model-specific tuning. Specifically, we propose a novel and efficient dataset distillation technique for detoxification using contrastive decoding. This approach distills detoxifying representations in the form of synthetic text data, enabling universal detoxification of any LLM through fine-tuning with the distilled text. Our experiments demonstrate that the detoxifying text distilled from GPT-2 can effectively detoxify larger models, including OPT, Falcon, and LLaMA-2. Furthermore, UniDetox eliminates the need for separate hyperparameter tuning for each model, as a single hyperparameter configuration can be seamlessly applied across different models. Additionally, analysis of the detoxifying text reveals a reduction in politically biased content, providing insights into the attributes necessary for effective detoxification of LLMs.
ROSep 10, 2025
Grasp Like Humans: Learning Generalizable Multi-Fingered Grasping from Human Proprioceptive Sensorimotor IntegrationCe Guo, Xieyuanli Chen, Zhiwen Zeng et al.
Tactile and kinesthetic perceptions are crucial for human dexterous manipulation, enabling reliable grasping of objects via proprioceptive sensorimotor integration. For robotic hands, even though acquiring such tactile and kinesthetic feedback is feasible, establishing a direct mapping from this sensory feedback to motor actions remains challenging. In this paper, we propose a novel glove-mediated tactile-kinematic perception-prediction framework for grasp skill transfer from human intuitive and natural operation to robotic execution based on imitation learning, and its effectiveness is validated through generalized grasping tasks, including those involving deformable objects. Firstly, we integrate a data glove to capture tactile and kinesthetic data at the joint level. The glove is adaptable for both human and robotic hands, allowing data collection from natural human hand demonstrations across different scenarios. It ensures consistency in the raw data format, enabling evaluation of grasping for both human and robotic hands. Secondly, we establish a unified representation of multi-modal inputs based on graph structures with polar coordinates. We explicitly integrate the morphological differences into the designed representation, enhancing the compatibility across different demonstrators and robotic hands. Furthermore, we introduce the Tactile-Kinesthetic Spatio-Temporal Graph Networks (TK-STGN), which leverage multidimensional subgraph convolutions and attention-based LSTM layers to extract spatio-temporal features from graph inputs to predict node-based states for each hand joint. These predictions are then mapped to final commands through a force-position hybrid mapping.
CVAug 23, 2025
Probabilistic Temporal Masked Attention for Cross-view Online Action DetectionLiping Xie, Yang Tan, Shicheng Jing et al.
As a critical task in video sequence classification within computer vision, Online Action Detection (OAD) has garnered significant attention. The sensitivity of mainstream OAD models to varying video viewpoints often hampers their generalization when confronted with unseen sources. To address this limitation, we propose a novel Probabilistic Temporal Masked Attention (PTMA) model, which leverages probabilistic modeling to derive latent compressed representations of video frames in a cross-view setting. The PTMA model incorporates a GRU-based temporal masked attention (TMA) cell, which leverages these representations to effectively query the input video sequence, thereby enhancing information interaction and facilitating autoregressive frame-level video analysis. Additionally, multi-view information can be integrated into the probabilistic modeling to facilitate the extraction of view-invariant features. Experiments conducted under three evaluation protocols: cross-subject (cs), cross-view (cv), and cross-subject-view (csv) show that PTMA achieves state-of-the-art performance on the DAHLIA, IKEA ASM, and Breakfast datasets.
ROMar 8
GeoLoco: Leveraging 3D Geometric Priors from Visual Foundation Model for Robust RGB-Only Humanoid LocomotionYufei Liu, Xieyuanli Chen, Hainan Pan et al.
The prevailing paradigm of perceptive humanoid locomotion relies heavily on active depth sensors. However, this depth-centric approach fundamentally discards the rich semantic and dense appearance cues of the visual world, severing low-level control from the high-level reasoning essential for general embodied intelligence. While monocular RGB offers a ubiquitous, information-dense alternative, end-to-end reinforcement learning from raw 2D pixels suffers from extreme sample inefficiency and catastrophic sim-to-real collapse due to the inherent loss of geometric scale. To break this deadlock, we propose GeoLoco, a purely RGB-driven locomotion framework that conceptualizes monocular images as high-dimensional 3D latent representations by harnessing the powerful geometric priors of a frozen, scale-aware Visual Foundation Model (VFM). Rather than naive feature concatenation, we design a proprioceptive-query multi-head cross-attention mechanism that dynamically attends to task-critical topological features conditioned on the robot's real-time gait phase. Crucially, to prevent the policy from overfitting to superficial textures, we introduce a dual-head auxiliary learning scheme. This explicit regularization forces the high-dimensional latent space to strictly align with the physical terrain geometry, ensuring robust zero-shot sim-to-real transfer. Trained exclusively in simulation, GeoLoco achieves robust zero-shot transfer to the Unitree G1 humanoid and successfully negotiates challenging terrains.
CVMar 25, 2021
JDSR-GAN: Constructing An Efficient Joint Learning Network for Masked Face Super-ResolutionGuangwei Gao, Lei Tang, Fei Wu et al.
With the growing importance of preventing the spread of COVID-19, face images obtained in most video surveillance scenarios are both low-resolution and occluded by masks. However, most previous face super-resolution solutions cannot handle both tasks in one model. In this work, we treat the mask occlusion as image noise and construct a joint and collaborative learning network, called JDSR-GAN, for the masked face super-resolution task. Given a low-quality face image with the mask as input, the role of the generator, composed of a denoising module and a super-resolution module, is to acquire a high-quality high-resolution face image. The discriminator utilizes some carefully designed loss functions to ensure the quality of the recovered face images. Moreover, we incorporate identity information and an attention mechanism into our network for feasible correlated feature expression and informative feature learning. By jointly performing denoising and face super-resolution, the two tasks can complement each other and attain promising performance. Extensive qualitative and quantitative results show the superiority of our proposed JDSR-GAN over some comparable methods that perform the two tasks separately.
IVMar 24, 2021
Lightweight Image Super-Resolution with Multi-scale Feature Interaction NetworkZhengxue Wang, Guangwei Gao, Juncheng Li et al.
Recently, single image super-resolution (SISR) approaches with deep and complex convolutional neural network structures have achieved promising performance. However, those methods improve performance at the cost of higher memory consumption, which makes them difficult to apply on mobile devices with limited storage and computing resources. To solve this problem, we present a lightweight multi-scale feature interaction network (MSFIN). For lightweight SISR, MSFIN expands the receptive field and adequately exploits the informative features of the low-resolution observed images from various scales and interactive connections. In addition, we design a lightweight recurrent residual channel attention block (RRCAB) so that the network can benefit from the channel attention mechanism while being sufficiently lightweight. Extensive experiments on some benchmarks have confirmed that our proposed MSFIN can achieve performance comparable to state-of-the-art methods with a more lightweight model.
CVMar 20, 2021
3DMNDT:3D multi-view registration method based on the normal distributions transformJihua Zhu, Di Wang, Jiaxi Mu et al.
The normal distributions transform (NDT) is an effective paradigm for point set registration. This method was originally designed for pair-wise registration and faces great challenges when applied to multi-view registration. Under the NDT framework, this paper proposes a novel multi-view registration method, named 3D multi-view registration based on the normal distributions transform (3DMNDT), which integrates K-means clustering and a Lie algebra solver to achieve multi-view registration. More specifically, the multi-view registration is cast into the problem of maximum likelihood estimation. Then, the K-means algorithm is utilized to divide all data points into different clusters, where a normal distribution is computed to locally model the probability of measuring a data point in each cluster. Subsequently, the registration problem is formulated by the NDT-based likelihood function. To maximize this likelihood function, the Lie algebra solver is developed to sequentially optimize each rigid transformation. The proposed method alternately implements data point clustering, NDT computing, and likelihood maximization until the desired registration results are obtained. Experimental results on benchmark data sets illustrate that the proposed method can achieve state-of-the-art performance for multi-view registration.
ROFeb 26, 2021
Robot Navigation in a Crowd by Integrating Deep Reinforcement Learning and Online PlanningZhiqian Zhou, Pengming Zhu, Zhiwen Zeng et al.
It is still an open and challenging problem for mobile robots to navigate along time-efficient and collision-free paths in a crowd. The main challenge comes from the complex and sophisticated interaction mechanism, which requires the robot to understand the crowd and perform proactive and foresighted behaviors. Deep reinforcement learning is a promising solution to this problem. However, most previous learning methods incur a tremendous computational burden. To address these problems, we propose a graph-based deep reinforcement learning method, SG-DQN, that (i) introduces a social attention mechanism to extract an efficient graph representation for the crowd-robot state; (ii) directly evaluates the coarse Q-values of the raw state with a learned dueling deep Q network (DQN); and then (iii) refines the coarse Q-values via online planning on possible future trajectories. The experimental results indicate that our model can help the robot better understand the crowd and achieve a high success rate of more than 0.99 in the crowd navigation task. Compared against previous state-of-the-art algorithms, our algorithm achieves an equivalent, if not better, performance while requiring less than half of the computational cost.
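For readers unfamiliar with the "dueling" part of step (ii), the following is a minimal sketch of the standard dueling aggregation, in which a state value and per-action advantages are combined into Q-values. The function name `dueling_q_values` and the toy numbers are illustrative assumptions; the abstract does not say whether SG-DQN uses exactly this mean-subtracted form.

```python
import numpy as np

def dueling_q_values(value, advantages):
    """Combine a scalar state value V(s) with per-action advantages A(s, a).

    Uses the common mean-subtracted form Q(s, a) = V(s) + A(s, a) - mean_a A(s, a).
    This is an assumed, generic formulation, not taken from the SG-DQN paper.
    """
    advantages = np.asarray(advantages, dtype=float)
    return value + advantages - advantages.mean()

# Toy example: three candidate actions with hypothetical advantages.
q = dueling_q_values(2.0, [1.0, 0.0, -1.0])
best_action = int(np.argmax(q))
```

Subtracting the mean advantage keeps the value and advantage streams identifiable; the online-planning refinement in step (iii) is not covered by this sketch.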
CVJun 15, 2020
ORD: Object Relationship Discovery for Visual Dialogue GenerationZiwei Wang, Zi Huang, Yadan Luo et al.
With the rapid advancement of image captioning and visual question answering at single-round level, the question of how to generate multi-round dialogue about visual content has not yet been well explored. Existing visual dialogue methods encode the image into a fixed feature vector directly, concatenated with the question and history embeddings to predict the response. Some recent methods tackle the co-reference resolution problem using co-attention mechanism to cross-refer relevant elements from the image, history, and the target question. However, it remains challenging to reason about visual relationships, since the fine-grained object-level information is omitted before co-attentive reasoning. In this paper, we propose an object relationship discovery (ORD) framework to preserve the object interactions for visual dialogue generation. Specifically, a hierarchical graph convolutional network (HierGCN) is proposed to retain the object nodes and neighbour relationships locally, and then refines the object-object connections globally to obtain the final graph embeddings. A graph attention is further incorporated to dynamically attend to this graph-structured representation at the response reasoning stage. Extensive experiments have proved that the proposed method can significantly improve the quality of dialogue by utilising the contextual information of visual relationships. The model achieves superior performance over the state-of-the-art methods on the Visual Dialog dataset, increasing MRR from 0.6222 to 0.6447, and recall@1 from 48.48% to 51.22%.
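The two metrics quoted above can be computed from the rank that the model assigns to the ground-truth answer in each dialogue round. The sketch below uses the usual definitions of mean reciprocal rank and recall@k; the function names and the sample ranks are illustrative assumptions, not values from the paper.

```python
def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all queries; ranks are 1-based."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def recall_at_k(ranks, k):
    """Fraction of queries whose ground-truth answer is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Hypothetical ranks of the correct answer for three dialogue rounds.
ranks = [1, 2, 4]
mrr = mean_reciprocal_rank(ranks)
r_at_1 = recall_at_k(ranks, 1)
```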
CVApr 21, 2020
Robust Motion Averaging under Maximum Correntropy CriterionJihua Zhu, Jie Hu, Huimin Lu et al.
Recently, the motion averaging method has been introduced as an effective means to solve the multi-view registration problem. This method aims to recover global motions from a set of relative motions, where the original method is sensitive to outliers due to using the Frobenius norm error in the optimization. Accordingly, this paper proposes a novel robust motion averaging method based on the maximum correntropy criterion (MCC). Specifically, the correntropy measure is used instead of utilizing Frobenius norm error to improve the robustness of motion averaging against outliers. According to the half-quadratic technique, the correntropy measure based optimization problem can be solved by the alternating minimization procedure, which includes operations of weight assignment and weighted motion averaging. Further, we design a selection strategy of adaptive kernel width to take advantage of correntropy. Experimental results on benchmark data sets illustrate that the new method has superior performance on accuracy and robustness for multi-view registration.
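The alternating procedure of weight assignment and weighted averaging can be illustrated on a much simpler problem. The sketch below is a deliberate simplification: it averages scalars rather than rigid motions, uses a fixed kernel width `sigma` instead of the adaptive selection strategy, and starts from the median, all of which are assumptions made for illustration. The Gaussian weight is the one that the half-quadratic technique yields for a correntropy objective.

```python
import numpy as np

def mcc_weighted_mean(samples, sigma, iters=20):
    """Robust mean under the maximum correntropy criterion (scalar toy version).

    Alternates between (1) assigning Gaussian-kernel weights to residuals and
    (2) recomputing a weighted average, as in half-quadratic optimization.
    """
    samples = np.asarray(samples, dtype=float)
    estimate = np.median(samples)  # assumed initialization
    for _ in range(iters):
        residuals = samples - estimate
        # Large residuals (outliers) receive weights close to zero.
        weights = np.exp(-residuals ** 2 / (2.0 * sigma ** 2))
        estimate = np.sum(weights * samples) / np.sum(weights)
    return float(estimate)

# Five consistent measurements and one gross outlier.
data = [1.0, 1.1, 0.9, 1.05, 0.95, 50.0]
robust = mcc_weighted_mean(data, sigma=0.5)
plain = float(np.mean(data))
```

In this toy case the plain mean is pulled far toward the outlier, while the correntropy-weighted estimate stays near the consistent measurements.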
LGApr 7, 2020
Generalized Label Enhancement with Sample CorrelationsQinghai Zheng, Jihua Zhu, Haoyu Tang et al.
Recently, label distribution learning (LDL) has drawn much attention in machine learning, where an LDL model is learned from labeled instances. Different from single-label and multi-label annotations, label distributions describe the instance by multiple labels with different intensities and accommodate more general scenes. Since most existing machine learning datasets merely provide logical labels, label distributions are unavailable in many real-world applications. To handle this problem, we propose two novel label enhancement methods, i.e., Label Enhancement with Sample Correlations (LESC) and generalized Label Enhancement with Sample Correlations (gLESC). More specifically, LESC employs a low-rank representation of samples in the feature space, and gLESC leverages a tensor multi-rank minimization to further investigate the sample correlations in both the feature space and label space. Benefitting from the sample correlations, the proposed methods can boost the performance of label enhancement. Extensive experiments on 14 benchmark datasets demonstrate the effectiveness and superiority of our methods.
CVMar 24, 2020
Synergic Adversarial Label Learning for Grading Retinal Diseases via Knowledge Distillation and Multi-task LearningLie Ju, Xin Wang, Xin Zhao et al.
The need for comprehensive and automated screening methods for retinal image classification has long been recognized. Images annotated by well-qualified doctors are very expensive, and only a limited amount of data is available for various retinal diseases such as age-related macular degeneration (AMD) and diabetic retinopathy (DR). Some studies show that AMD and DR share some common features, such as hemorrhagic points and exudation, but most classification algorithms train the disease models independently. Inspired by knowledge distillation, where additional monitoring signals from various sources are beneficial for training a robust model with much less data, we propose a method called synergic adversarial label learning (SALL), which leverages relevant retinal disease labels in both semantic and feature space as additional signals and trains the model in a collaborative manner. Our experiments on the DR and AMD fundus image classification tasks demonstrate that the proposed method can significantly improve the accuracy of the model for grading diseases. In addition, we conduct additional experiments to show the effectiveness of SALL from the aspects of reliability and interpretability in the context of medical imaging applications.
CVMar 21, 2020
BiCANet: Bi-directional Contextual Aggregating Network for Image Semantic SegmentationQuan Zhou, Dechun Cong, Bin Kang et al.
Exploring contextual information in convolutional neural networks (CNNs) has gained substantial attention in recent years for semantic segmentation. This paper introduces a Bi-directional Contextual Aggregating Network, called BiCANet, for semantic segmentation. Unlike previous approaches that encode context in feature space, BiCANet aggregates contextual cues from a categorical perspective and mainly consists of three parts: a contextual condensed projection block (CCPB), a bi-directional context interaction block (BCIB), and a multi-scale contextual fusion block (MCFB). More specifically, CCPB learns a category-based mapping through a split-transform-merge architecture, which condenses contextual cues with different receptive fields from intermediate layers. BCIB, on the other hand, employs dense skipped-connections to enhance the class-level context exchange. Finally, MCFB integrates multi-scale contextual cues by investigating short- and long-ranged spatial dependencies. To evaluate BiCANet, we have conducted extensive experiments on three semantic segmentation datasets: PASCAL VOC 2012, Cityscapes, and ADE20K. The experimental results demonstrate that BiCANet outperforms recent state-of-the-art networks without any post-processing techniques. In particular, BiCANet achieves mIoU scores of 86.7%, 82.4% and 38.66% on the PASCAL VOC 2012, Cityscapes and ADE20K test sets, respectively.
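For reference, the mIoU scores reported above follow the usual definition: per-class intersection over union, averaged over classes. The sketch below computes it from a confusion matrix; the function name `mean_iou` and the 2-class matrix are illustrative assumptions, and benchmark-specific details such as ignored labels are not handled.

```python
import numpy as np

def mean_iou(confusion):
    """Mean intersection-over-union from a square confusion matrix.

    confusion[i, j] counts pixels of true class i predicted as class j.
    Classes that never appear in either labels or predictions are skipped.
    """
    confusion = np.asarray(confusion, dtype=float)
    true_positive = np.diag(confusion)
    # union = predicted-as-class + labeled-as-class - intersection
    union = confusion.sum(axis=0) + confusion.sum(axis=1) - true_positive
    valid = union > 0
    return float(np.mean(true_positive[valid] / union[valid]))

# Hypothetical 2-class confusion matrix.
score = mean_iou([[3, 1],
                  [1, 5]])
```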
CVFeb 18, 2020
Registration of multi-view point sets under the perspective of expectation-maximizationJihua Zhu, Jing Zhang, Huimin Lu et al.
Registration of multi-view point sets is a prerequisite for 3D model reconstruction. To solve this problem, most previous approaches either partially explore available information or blindly utilize unnecessary information to align each point set, which may lead to undesired results or introduce extra computational complexity. To this end, this paper considers the multi-view registration problem as a maximum likelihood estimation problem and proposes a novel multi-view registration approach under the perspective of Expectation-Maximization (EM). The basic idea of our approach is that different data points are generated by the same number of Gaussian mixture models (GMMs). For each data point in one point set, its nearest neighbors can be searched from other well-aligned point sets. Then, we can suppose this data point is generated by a special GMM, in which each nearest neighbor is associated with one Gaussian distribution. Based on this assumption, it is reasonable to define the likelihood function including all rigid transformations, which need to be estimated for multi-view registration. Subsequently, the EM algorithm is utilized to maximize the likelihood function so as to estimate all rigid transformations. Finally, the proposed approach is tested on several benchmark data sets and compared with some state-of-the-art algorithms. Experimental results illustrate its superior performance in accuracy, robustness, and efficiency for the registration of multi-view point sets.
CVApr 5, 2019
Snap and Find: Deep Discrete Cross-domain Garment Image RetrievalYadan Luo, Ziwei Wang, Zi Huang et al.
With the increasing number of online stores, there is a pressing need for intelligent search systems to understand the item photos snapped by customers and search against large-scale product databases to find their desired items. However, it is challenging for conventional retrieval systems to match up the item photos captured by customers and the ones officially released by stores, especially for garment images. To bridge the customer- and store-provided garment photos, existing studies have widely exploited clothing attributes (e.g., black) and landmarks (e.g., collar) to learn a common embedding space for garment representations. Unfortunately, they omit the sequential correlation of attributes and consume a large amount of human labor to label the landmarks. In this paper, we propose a deep multi-task cross-domain hashing method termed DMCH, in which cross-domain embedding and sequential attribute learning are modeled simultaneously. Sequential attribute learning not only provides the semantic guidance for embedding, but also generates rich attention on discriminative local details (e.g., black buttons) of clothing items without requiring extra landmark labels. This leads to promising performance and a 306× boost in efficiency compared with state-of-the-art models, as demonstrated through rigorous experiments on two public fashion datasets.
CYJan 9, 2019
CONet: A Cognitive Ocean NetworkHuimin Lu, Dong Wang, Yujie Li et al.
The scientific and technological revolution of the Internet of Things has begun in the area of oceanography. Historically, humans have observed the ocean from an external viewpoint in order to study it. In recent years, however, changes have occurred in the ocean, and laboratories have been built on the seafloor. Approximately 70.8% of the Earth's surface is covered by oceans and rivers. The Ocean of Things is expected to be important for disaster prevention, ocean-resource exploration, and underwater environmental monitoring. Unlike traditional wireless sensor networks, the Ocean Network has its own unique features, such as low reliability and narrow bandwidth. These features will be great challenges for the Ocean Network. Furthermore, the integration of the Ocean Network with artificial intelligence has become a topic of increasing interest for oceanology researchers. The Cognitive Ocean Network (CONet) will become the mainstream of future ocean science and engineering developments. In this article, we define the CONet. The contributions of the paper are as follows: (1) a CONet architecture is proposed and described in detail; (2) important and useful demonstration applications of the CONet are proposed; and (3) future trends in CONet research are presented.
ROMay 14, 2018
Distributed Circumnavigation Control with Dynamic Spacings for a Heterogeneous Multi-robot SystemWeijia Yao, Sha Luo, Huimin Lu et al.
Circumnavigation control is useful in real-world applications such as entrapping a hostile target. In this paper, we consider a heterogeneous multi-robot system where robots have different physical properties, such as maximum movement speeds. Instead of equal spacings, dynamic spacings that depend on the robots' properties, which are termed utilities in this paper, are more desirable in scenarios such as target entrapment. A distributed circumnavigation control algorithm based on utilities is proposed that lets any number of mobile robots, starting from random 3D positions, circumnavigate a target. The dynamic spacings vary with the robots' utilities. Each robot can only obtain the angular positions and utilities of its two neighbouring robots, so the control law is distributed. Theoretical analysis and experimental results are provided to prove the stability and effectiveness of the proposed control algorithm.
CVSep 25, 2017
Multi-level Chaotic Maps for 3D Textured Model EncryptionXin Jin, Shuyun Zhu, Le Wu et al.
With the rapid progress of Virtual Reality and Augmented Reality technologies, 3D content is becoming the next widespread medium in many applications. Thus, the protection of 3D models is of primary importance. Encryption of 3D models is essential to maintain confidentiality. Previous work on the encryption of 3D surface models often considers the point clouds, the meshes, and the textures individually. In this work, we present a multi-level chaotic map model for 3D textured model encryption, motivated by the different contributions that vertices (point cloud), polygons, and textures make to recognizing cipher 3D models. For vertices, which make the main contribution to recognition, we use a high-level 3D Lu chaotic map to encrypt them. For polygons and textures, which make relatively smaller contributions to recognition, we use the 2D Arnold's cat map and the 1D logistic map to encrypt them, respectively. The experimental results show that our method achieves performance similar to that of a method that uses the same high-level chaotic map for point clouds, polygons, and textures, while taking less time. In addition, our method can resist more types of attack, such as statistical, brute-force, and correlation attacks.
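The two lower-level maps named in this abstract are standard constructions, and a minimal Python sketch may help make them concrete. It is not the authors' encryption scheme: the function names, the key values `x0` and `r`, the burn-in length, and the byte quantisation are all illustrative assumptions, and the 3D Lu map used for vertices is not shown. The sketch only demonstrates that an Arnold's cat map permutation and a logistic-map keystream are both reversible.

```python
def arnold_cat_map(grid, iterations=1):
    """Scramble a square 2D list: the cell at (x, y) moves to (x + y, x + 2y) mod n."""
    n = len(grid)
    for _ in range(iterations):
        out = [[None] * n for _ in range(n)]
        for x in range(n):
            for y in range(n):
                out[(x + y) % n][(x + 2 * y) % n] = grid[x][y]
        grid = out
    return grid


def inverse_arnold_cat_map(grid, iterations=1):
    """Undo arnold_cat_map by reading each cell back from its scrambled position."""
    n = len(grid)
    for _ in range(iterations):
        out = [[None] * n for _ in range(n)]
        for x in range(n):
            for y in range(n):
                out[x][y] = grid[(x + y) % n][(x + 2 * y) % n]
        grid = out
    return grid


def logistic_keystream(x0, r, length, burn_in=100):
    """Iterate x <- r * x * (1 - x) and quantise each state to one byte.

    x0, r, and burn_in act as the (hypothetical) key of this sketch.
    """
    x = x0
    for _ in range(burn_in):  # discard the initial transient
        x = r * x * (1 - x)
    stream = []
    for _ in range(length):
        x = r * x * (1 - x)
        stream.append(int(x * 256) % 256)
    return stream


def xor_with_logistic(data, x0=0.3141, r=3.99):
    """XOR bytes with the logistic keystream; applying it twice restores the input."""
    keystream = logistic_keystream(x0, r, len(data))
    return bytes(b ^ k for b, k in zip(data, keystream))


# Usage: scramble a 4x4 index grid, then recover it.
grid = [[4 * i + j for j in range(4)] for i in range(4)]
scrambled = arnold_cat_map(grid, iterations=3)
restored = inverse_arnold_cat_map(scrambled, iterations=3)
```

The cat map only permutes positions, so it hides layout but not values. The logistic keystream changes values but is not a vetted cipher. Neither should be read as a security recommendation.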
AIJun 14, 2017
Simultaneous merging multiple grid maps using the robust motion averagingZutao Jiang, Jihua Zhu, Yaochen Li et al.
Mapping in GPS-denied environments is an important and challenging task in the field of robotics. In a large environment, mapping can be significantly accelerated by multiple robots exploring different parts of the environment. Accordingly, a key problem is how to integrate the local maps built by different robots into a single global map. In this paper, we propose an approach for simultaneously merging multiple grid maps by robust motion averaging. The main idea of this approach is to recover all global motions for map merging from a set of relative motions. It first adopts a pair-wise map merging method to estimate relative motions for grid map pairs. To obtain as many reliable relative motions as possible, a graph-based sampling scheme is used to efficiently remove the unreliable relative motions obtained from pair-wise map merging. Subsequently, accurate global motions can be recovered from the set of reliable relative motions by motion averaging. Experimental results on real robot data sets demonstrate that the proposed approach can merge multiple grid maps simultaneously with good performance.
CVJun 4, 2017
Brain Intelligence: Go Beyond Artificial IntelligenceHuimin Lu, Yujie Li, Min Chen et al.
Artificial intelligence (AI) is an important technology that supports daily social life and economic activities. It contributes greatly to the sustainable growth of Japan's economy and solves various social problems. In recent years, AI has attracted attention as a key to growth in developed economies such as Europe and the United States and in developing countries such as China and India. The attention has been focused mainly on developing new artificial intelligence information communication technology (ICT) and robot technology (RT). Although recently developed AI technology certainly excels at extracting certain patterns, it has many limitations. Most ICT models are overly dependent on big data, lack a self-idea function, and are complicated. In this paper, rather than merely developing next-generation artificial intelligence technology, we aim to develop a new concept of general-purpose intelligence cognition technology called Beyond AI. Specifically, we plan to develop an intelligent learning model called Brain Intelligence (BI) that generates new ideas about events without having experienced them by using artificial life with an imagine function. We will also conduct demonstrations of the developed BI learning model on automatic driving, precision medical care, and industrial robots.
CVJun 1, 2017
An Effective Approach for Point Clouds Registration Based on the Hard and Soft AssignmentsCongcong Jin, Jihua Zhu, Yaochen Li et al.
For the registration of partially overlapping point clouds, this paper proposes an effective approach based on both hard and soft assignments. Given two initially posed clouds, it first establishes the forward correspondence for each point in the data shape and calculates the value of a binary variable, which indicates whether this point correspondence is located in the overlapping area. Then, it establishes the bilateral correspondence and computes bidirectional distances for each point in the overlapping area. Based on the ratio of bidirectional distances, an exponential function is used to calculate a probability value, which indicates the reliability of the point correspondence. Subsequently, the values of both the hard and soft assignments are embedded into the proposed objective function for the registration of partially overlapping point clouds, and a novel variant of the ICP algorithm is proposed to obtain the optimal rigid transformation. The proposed approach can achieve good registration of point clouds even when their overlap percentage is low. Experimental results on public data sets illustrate its superiority over previous approaches in accuracy and robustness.
CVApr 28, 2017
Effective scaling registration approach by imposing the emphasis on the scale factorMinmin Xu, Siyu Xu, Jihua Zhu et al.
This paper proposes an effective approach for the scaling registration of $m$-D point sets. Unlike rigid registration, scaling registration cannot be formulated as the common least-squares function because the scale factor makes the problem ill-posed. Therefore, this paper designs a novel objective function for the scaling registration problem. This objective function takes the form of a rational fraction, where the numerator is the least-squares error and the denominator is the square of the scale factor. By placing emphasis on the scale factor, the ill-posed problem can be avoided in scaling registration. The new objective function can then be solved by the proposed scaling iterative closest point (ICP) algorithm, which obtains the optimal scaling transformation. For practical applications, the scaling ICP algorithm is further extended to align partially overlapping point sets. Finally, the proposed approach is tested on public data sets and applied to merging grid maps of different resolutions. Experimental results demonstrate its superiority over previous approaches in efficiency and robustness.
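One plausible reading of the objective described in this abstract can be written out as follows. The notation is an assumption and is not taken from the paper: $\{p_i\}_{i=1}^{N}$ is the data point set, $q_{c(i)}$ is the model point currently matched to $p_i$, and $s$, $R$, and $t$ are the scale factor, rotation, and translation.

$$
\min_{s,\,R,\,t,\,c}\; \frac{\sum_{i=1}^{N} \left\| s R\, p_i + t - q_{c(i)} \right\|_2^2}{s^2}
$$

Under this reading, the plain least-squares numerator alone can be driven towards zero by letting $s \to 0$ and placing $t$ on a single model point, since every shrunken data point then matches that same point. Dividing by $s^2$ penalises such a collapse, which is one way to understand the emphasis on the scale factor.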
CVFeb 13, 2017
Underwater Optical Image Processing: A Comprehensive ReviewHuimin Lu, Yujie Li, Yudong Zhang et al.
Underwater cameras are widely used to observe the sea floor. They are usually included in autonomous underwater vehicles, unmanned underwater vehicles, and in situ ocean sensor networks. Although they are important sensors for monitoring underwater scenes, recent underwater camera sensors still have many issues. Because of the way light propagates in water and the biological activity at the sea floor, the acquired underwater images often suffer from scattering and large amounts of noise. Over the last five years, many methods have been proposed to overcome traditional underwater imaging problems. This paper aims to review the state-of-the-art techniques in underwater image processing by highlighting the contributions and challenges presented in over 40 papers. We present an overview of various underwater image processing approaches, such as underwater image descattering, underwater image color restoration, and underwater image quality assessment. Finally, we summarize the future trends and challenges in designing and processing underwater imaging sensors.