CV · Sep 25, 2023
AsymFormer: Asymmetrical Cross-Modal Representation Learning for Mobile Platform Real-Time RGB-D Semantic Segmentation
Siqi Du, Weixi Wang, Renzhong Guo et al.
Understanding indoor scenes is crucial for urban studies. Considering the dynamic nature of indoor environments, effective semantic segmentation requires both real-time operation and high accuracy. To address this, we propose AsymFormer, a novel network that improves real-time semantic segmentation accuracy using RGB-D multi-modal information without substantially increasing network complexity. AsymFormer uses an asymmetrical backbone for multimodal feature extraction, reducing redundant parameters by optimizing computational resource distribution. To fuse asymmetric multimodal features, a Local Attention-Guided Feature Selection (LAFS) module selectively fuses features from different modalities by leveraging their dependencies. Subsequently, a Cross-Modal Attention-Guided Feature Correlation Embedding (CMA) module is introduced to further extract cross-modal representations. AsymFormer achieves competitive results, with 54.1% mIoU on NYUv2 and 49.1% mIoU on SUNRGBD. Notably, AsymFormer achieves an inference speed of 65 FPS (79 FPS after implementing mixed precision quantization) on RTX3090, demonstrating that it can strike a balance between high accuracy and efficiency.
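The mIoU figures quoted above are mean intersection-over-union scores. As a minimal sketch of how that metric is usually computed from a confusion matrix (the function name `mean_iou` and the toy counts are illustrative assumptions, not taken from the paper or its code):

```python
def mean_iou(confusion):
    """Mean IoU from a square confusion matrix.

    confusion[i][j] = number of pixels whose true class is i and
    whose predicted class is j.
    """
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                        # true c, predicted other
        fp = sum(confusion[r][c] for r in range(n)) - tp   # predicted c, true other
        denom = tp + fp + fn
        if denom > 0:  # skip classes absent from both prediction and ground truth
            ious.append(tp / denom)
    return sum(ious) / len(ious)


# Toy 2-class confusion matrix (made-up counts, not results from the paper).
toy = [[8, 2],
       [1, 9]]
miou = mean_iou(toy)
```

For the toy matrix, the per-class IoUs are 8/11 and 9/12, and `miou` is their average.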
CV · Jul 21, 2023
Building3D: An Urban-Scale Dataset and Benchmarks for Learning Roof Structures from Point Clouds
Ruisheng Wang, Shangfeng Huang, Hongxin Yang
Urban modeling from LiDAR point clouds is an important topic in computer vision, computer graphics, photogrammetry and remote sensing. 3D city models have found a wide range of applications in smart cities, autonomous navigation, urban planning and mapping. However, existing datasets for 3D modeling mainly focus on common objects such as furniture or cars. The lack of building datasets has become a major obstacle to applying deep learning technology to specific domains such as urban modeling. In this paper, we present an urban-scale dataset consisting of more than 160 thousand buildings along with corresponding point clouds, mesh and wire-frame models, covering 16 cities in Estonia over about 998 km². We extensively evaluate the performance of state-of-the-art algorithms, including handcrafted and deep feature based methods. Experimental results indicate that Building3D poses the challenges of high intra-class variance, data imbalance and large-scale noise. Building3D is the first and largest urban-scale building modeling benchmark, allowing a comparison of supervised and self-supervised learning methods. We believe that Building3D will facilitate future research on urban modeling, aerial path planning, mesh simplification, and semantic/part segmentation.
CV · Apr 24 · Code
Region Matters: Efficient and Reliable Region-Aware Visual Place Recognition
Shunpeng Chen, Yukun Song, Changwei Wang et al.
Visual Place Recognition (VPR) determines a query image's geographic location by matching it against geotagged databases. However, existing methods struggle with perceptual aliasing caused by irrelevant regions and inefficient re-ranking due to rigid candidate scheduling. To address these issues, we introduce FoL++, a method combining robust discriminative region modeling with adaptive re-ranking. Specifically, we propose a Reliability Estimation Branch to generate spatial reliability maps that explicitly model occlusion resistance. This representation is further optimized by two spatial alignment losses (SAL and SCEL) to effectively align features and highlight salient regions. For weakly supervised learning without manual annotations, a pseudo-correspondence strategy generates dense local feature supervision directly from aggregation clusters. Our Adaptive Candidate Scheduler dynamically resizes candidate pools based on global similarity. By weighting local matches by reliability and adaptively fusing global and local evidence, FoL++ surpasses traditional independent matching systems. Extensive experiments across seven benchmarks demonstrate that FoL++ achieves state-of-the-art performance with a lightweight memory footprint, improving inference speed by 40% over FoL. Code and models will be released (and merged with FoL) at https://github.com/chenshunpeng/FoL.
CV · Nov 18, 2023
PBWR: Parametric Building Wireframe Reconstruction from Aerial LiDAR Point Clouds
Shangfeng Huang, Ruisheng Wang, Bo Guo et al.
In this paper, we present an end-to-end 3D building wireframe reconstruction method to regress edges directly from aerial LiDAR point clouds. Our method, named Parametric Building Wireframe Reconstruction (PBWR), takes aerial LiDAR point clouds and initial edge entities as input, and fully uses the self-attention mechanism of transformers to regress edge parameters without any intermediate steps such as corner prediction. We propose an edge non-maximum suppression (E-NMS) module based on edge similarity to remove redundant edges. Additionally, a dedicated edge loss function is used to guide PBWR in regressing edge parameters, where the simple use of an edge distance loss isn't suitable. In our experiments, we demonstrate state-of-the-art results on the Building3D dataset, achieving an improvement of approximately 36% in edge accuracy on the entry-level dataset and around 42% on the Tallinn dataset.
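The idea of suppressing redundant edges by similarity can be illustrated with a generic greedy non-maximum suppression loop. This is only a sketch under stated assumptions: the abstract does not define the edge similarity measure, so the mean endpoint distance used here, the `threshold` parameter, and the names `edge_distance` and `edge_nms` are hypothetical stand-ins rather than the PBWR module itself.

```python
import math


def edge_distance(e1, e2):
    """Mean endpoint distance between two 3D edges, ignoring endpoint order.

    Assumed similarity proxy for illustration; smaller means more similar.
    """
    (a1, b1), (a2, b2) = e1, e2
    same = math.dist(a1, a2) + math.dist(b1, b2)
    flipped = math.dist(a1, b2) + math.dist(b1, a2)
    return min(same, flipped) / 2.0


def edge_nms(edges, scores, threshold):
    """Greedy NMS: visit edges by descending score, keep an edge only if it
    is farther than `threshold` from every edge already kept."""
    order = sorted(range(len(edges)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(edge_distance(edges[i], edges[j]) > threshold for j in kept):
            kept.append(i)
    return kept


# Toy input (made up): edge 1 is a near-duplicate of edge 0 with reversed endpoints.
edges = [
    ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0)),
    ((1.0, 0.0, 0.05), (0.0, 0.0, 0.05)),
    ((0.0, 0.0, 0.0), (0.0, 1.0, 0.0)),
]
scores = [0.9, 0.8, 0.7]
kept = edge_nms(edges, scores, threshold=0.1)
```

In the toy input, the lower-scoring near-duplicate is dropped and the two distinct edges survive.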
CV · Oct 15, 2025 · Code
Complementary Information Guided Occupancy Prediction via Multi-Level Representation Fusion
Rongtao Xu, Jinzhou Lin, Jialei Zhou et al.
Camera-based occupancy prediction is a mainstream approach for 3D perception in autonomous driving, aiming to infer complete 3D scene geometry and semantics from 2D images. Most existing methods focus on improving performance through structural modifications, such as lightweight backbones and complex cascaded frameworks, with good yet limited performance. Few studies explore the perspective of representation fusion, leaving the rich diversity of features in 2D images underutilized. Motivated by this, we propose CIGOcc, a two-stage occupancy prediction framework based on multi-level representation fusion. CIGOcc extracts segmentation, graphics, and depth features from an input image and introduces a deformable multi-level fusion mechanism to fuse these three multi-level features. Additionally, CIGOcc incorporates knowledge distilled from SAM to further enhance prediction accuracy. Without increasing training costs, CIGOcc achieves state-of-the-art performance on the SemanticKITTI benchmark. The code is provided in the supplementary material and will be released at https://github.com/VitaLemonTea1/CIGOcc
CV · Aug 17, 2021 · Code
DRB-GAN: A Dynamic ResBlock Generative Adversarial Network for Artistic Style Transfer
Wenju Xu, Chengjiang Long, Ruisheng Wang et al.
The paper proposes a Dynamic ResBlock Generative Adversarial Network (DRB-GAN) for artistic style transfer. The style code is modeled as the shared parameters for Dynamic ResBlocks connecting both the style encoding network and the style transfer network. In the style encoding network, a style class-aware attention mechanism is used to attend to the style feature representation for generating the style codes. In the style transfer network, multiple Dynamic ResBlocks are designed to integrate the style code and the extracted CNN semantic feature, whose output is then fed into the spatial window Layer-Instance Normalization (SW-LIN) decoder, which enables high-quality synthetic images with artistic style transfer. Moreover, the style collection conditional discriminator is designed to equip our DRB-GAN model with abilities for both arbitrary style transfer and collection style transfer during the training stage. For both arbitrary style transfer and collection style transfer, extensive experiments demonstrate that our proposed DRB-GAN outperforms state-of-the-art methods in terms of visual quality and efficiency. Our source code is available at https://github.com/xuwenju123/DRB-GAN.
LG · Sep 6, 2023
A Multimodal Learning Framework for Comprehensive 3D Mineral Prospectivity Modeling with Jointly Learned Structure-Fluid Relationships
Yang Zheng, Hao Deng, Ruisheng Wang et al.
This study presents a novel multimodal fusion model for three-dimensional mineral prospectivity mapping (3D MPM), effectively integrating structural and fluid information through a deep network architecture. Leveraging Convolutional Neural Networks (CNN) and Multilayer Perceptrons (MLP), the model employs canonical correlation analysis (CCA) to align and fuse multimodal features. Rigorous evaluation on the Jiaojia gold deposit dataset demonstrates the model's superior performance in distinguishing ore-bearing instances and predicting mineral prospectivity, outperforming other models in result analyses. Ablation studies further reveal the benefits of joint feature utilization and CCA incorporation. This research not only advances mineral prospectivity modeling but also highlights the pivotal role of data integration and feature alignment for enhanced exploration decision-making.
CV · Mar 4
LeafInst - Unified Instance Segmentation Network for Fine-Grained Forestry Leaf Phenotype Analysis: A New UAV based Benchmark
Taige Luo, Junru Xie, Chenyang Fan et al.
Intelligent forest tree breeding has advanced plant phenotyping, yet existing research largely focuses on large-leaf agricultural crops, with limited attention to fine-grained leaf analysis of sapling trees in open-field environments. Natural scenes introduce challenges including scale variation, illumination changes, and irregular leaf morphology. To address these issues, we collected UAV RGB imagery of field-grown saplings and constructed the Poplar-leaf dataset, containing 1,202 branches and 19,876 pixel-level annotated leaf instances. To our knowledge, this is the first instance segmentation dataset specifically designed for forestry leaves in open-field conditions. We propose LeafInst, a novel segmentation framework tailored for irregular and multi-scale leaf structures. The model integrates an Asymptotic Feature Pyramid Network (AFPN) for multi-scale perception, a Dynamic Asymmetric Spatial Perception (DASP) module for irregular shape modeling, and a dual-residual Dynamic Anomalous Regression Head (DARH) with Top-down Concatenation decoder Feature Fusion (TCFU) to improve detection and segmentation performance. On Poplar-leaf, LeafInst achieves 68.4 mAP, outperforming YOLOv11 by 7.1 percent and MaskDINO by 6.5 percent. On the public PhenoBench benchmark, it reaches 52.7 box mAP, exceeding MaskDINO by 3.4 percent. Additional experiments demonstrate strong generalization and practical utility for large-scale leaf phenotyping.
CV · Nov 9, 2025
BuildingWorld: A Structured 3D Building Dataset for Urban Foundation Models
Shangfeng Huang, Ruisheng Wang, Xin Wang
As digital twins become central to the transformation of modern cities, accurate and structured 3D building models emerge as a key enabler of high-fidelity, updatable urban representations. These models underpin diverse applications including energy modeling, urban planning, autonomous navigation, and real-time reasoning. Despite recent advances in 3D urban modeling, most learning-based models are trained on building datasets with limited architectural diversity, which significantly undermines their generalizability across heterogeneous urban environments. To address this limitation, we present BuildingWorld, a comprehensive and structured 3D building dataset designed to bridge the gap in stylistic diversity. It encompasses buildings from geographically and architecturally diverse regions -- including North America, Europe, Asia, Africa, and Oceania -- offering a globally representative dataset for urban-scale foundation modeling and analysis. Specifically, BuildingWorld provides about five million LOD2 building models collected from diverse sources, accompanied by real and simulated airborne LiDAR point clouds. This enables comprehensive research on 3D building reconstruction, detection and segmentation. Cyber City, a virtual city model, is introduced to enable the generation of unlimited training data with customized and structurally diverse point cloud distributions. Furthermore, we provide standardized evaluation metrics tailored for building reconstruction, aiming to facilitate the training, evaluation, and comparison of large-scale vision models and foundation models in structured 3D urban environments.
CV · Apr 3, 2024
APC2Mesh: Bridging the gap from occluded building façades to full 3D models
Perpetual Hope Akwensi, Akshay Bharadwaj, Ruisheng Wang
The benefits of having digital twins of urban buildings are numerous. However, a major difficulty encountered in their creation from airborne LiDAR point clouds is the effective means of accurately reconstructing significant occlusions amidst point density variations and noise. To bridge the noise/sparsity/occlusion gap and generate high fidelity 3D building models, we propose APC2Mesh which integrates point completion into a 3D reconstruction pipeline, enabling the learning of dense geometrically accurate representation of buildings. Specifically, we leveraged complete points generated from occluded ones as input to a linearized skip attention-based deformation network for 3D mesh reconstruction. In our experiments, conducted on 3 different scenes, we demonstrate that: (1) APC2Mesh delivers comparatively superior results, indicating its efficacy in handling the challenges of occluded airborne building points of diverse styles and complexities. (2) The combination of point completion with typical deep learning-based 3D point cloud reconstruction methods offers a direct and effective solution for reconstructing significantly occluded airborne building points. As such, this neural integration holds promise for advancing the creation of digital twins for urban buildings with greater accuracy and fidelity.
CV · Jun 27, 2025
SDRNET: Stacked Deep Residual Network for Accurate Semantic Segmentation of Fine-Resolution Remotely Sensed Images
Naftaly Wambugu, Ruisheng Wang, Bo Guo et al.
Land cover maps generated from semantic segmentation of high-resolution remotely sensed images have drawn much attention in the photogrammetry and remote sensing research community. Currently, massive fine-resolution remotely sensed (FRRS) images acquired by improving sensing and imaging technologies have become available. However, accurate semantic segmentation of such FRRS images is greatly affected by substantial class disparities, the invisibility of key ground objects due to occlusion, and object size variation. Despite the extraordinary potential of deep convolutional neural networks (DCNNs) in image feature learning and representation, extracting sufficient features from FRRS images for accurate semantic segmentation is still challenging. These challenges demand that deep learning models learn robust features and generate sufficient feature descriptors. Specifically, learning multi-contextual features to guarantee adequate coverage of varied object sizes from the ground scene and harnessing global-local contexts to overcome class disparities challenge even profound networks. Deeper networks lose significant spatial detail through gradual downsampling, resulting in poor segmentation results and coarse boundaries. This article presents a stacked deep residual network (SDRNet) for semantic segmentation from FRRS images. The proposed framework uses two stacked encoder-decoder networks to harness long-range semantics yet preserve spatial information, and dilated residual blocks (DRB) between each encoder and decoder network to capture sufficient global dependencies, thus improving segmentation performance. Our experimental results obtained using the ISPRS Vaihingen and Potsdam datasets demonstrate that SDRNet performs effectively and competitively against current DCNNs in semantic segmentation.
CV · Mar 6, 2017
An optimal hierarchical clustering approach to segmentation of mobile LiDAR point clouds
Sheng Xu, Ruisheng Wang, Han Zheng
This paper proposes a hierarchical clustering approach for the segmentation of mobile LiDAR point clouds. We perform the hierarchical clustering on unorganized point clouds based on a proximity matrix. The dissimilarity measure in the proximity matrix is calculated from the Euclidean distances between clusters and the difference of normal vectors at given points. The main contribution of this paper is that we optimize the combination of clusters in the hierarchical clustering. The combination is obtained by matching a bipartite graph, and optimized by solving the minimum-cost perfect matching. Results show that the proposed optimal hierarchical clustering (OHC) automatically segments multiple individual objects and outperforms the state-of-the-art LiDAR point cloud segmentation approaches.
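To make the matching step concrete, here is a small sketch of a minimum-cost perfect matching between two equally sized sets of clusters. Several things are assumptions made for illustration only: the way the two sides of the bipartite graph are formed, the additive combination of centroid distance and normal difference (with a hypothetical `normal_weight`), and the brute-force search over permutations, which is only practical for a handful of clusters and stands in for a proper solver such as the Hungarian algorithm.

```python
import math
from itertools import permutations


def dissimilarity(c1, c2, normal_weight=1.0):
    """Assumed measure: centroid distance plus a penalty for differing normals.

    Normals are assumed to be unit vectors; the penalty is 0 when they are
    parallel (in either direction) and 1 when they are perpendicular.
    """
    d = math.dist(c1["centroid"], c2["centroid"])
    cos = sum(a * b for a, b in zip(c1["normal"], c2["normal"]))
    return d + normal_weight * (1.0 - abs(cos))


def min_cost_perfect_matching(left, right):
    """Brute-force minimum-cost perfect matching between two equal-size sets."""
    n = len(left)
    cost = [[dissimilarity(l, r) for r in right] for l in left]
    best_cost, best_perm = math.inf, None
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_cost:
            best_cost, best_perm = total, perm
    return list(enumerate(best_perm)), best_cost


# Toy clusters (made up): each left cluster has one obvious partner on the right.
left = [
    {"centroid": (0.0, 0.0, 0.0), "normal": (0.0, 0.0, 1.0)},
    {"centroid": (5.0, 0.0, 0.0), "normal": (1.0, 0.0, 0.0)},
]
right = [
    {"centroid": (5.2, 0.0, 0.0), "normal": (1.0, 0.0, 0.0)},
    {"centroid": (0.1, 0.0, 0.0), "normal": (0.0, 0.0, 1.0)},
]
pairs, total_cost = min_cost_perfect_matching(left, right)
```

Each returned pair `(i, j)` means left cluster `i` would be combined with right cluster `j`; in a hierarchical scheme this pairing would be repeated level by level.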
CV · Oct 15, 2016
Road Curb Extraction from Mobile LiDAR Point Clouds
Sheng Xu, Ruisheng Wang, Han Zheng
Automatic extraction of road curbs from uneven, unorganized, noisy and massive 3D point clouds is a challenging task. Existing methods often project 3D point clouds onto 2D planes to extract curbs. However, the projection causes a loss of 3D information, which degrades detection performance. This paper presents a robust, accurate and efficient method to extract road curbs from 3D mobile LiDAR point clouds. Our method consists of two steps: 1) extracting the candidate points of curbs based on the proposed novel energy function and 2) refining the candidate points using the proposed least cost path model. We evaluated our method on mobile LiDAR point clouds of a large-scale residential area (16.7GB, 300 million points) and an urban area (1.07GB, 20 million points). Results indicate that the proposed method is superior to the state-of-the-art methods in terms of robustness, accuracy and efficiency. The proposed curb extraction method achieved a completeness of 78.62% and a correctness of 83.29%. These experiments demonstrate that the proposed method is a promising solution for extracting road curbs from mobile LiDAR point clouds.
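Completeness and correctness are reported above without definitions. Assuming the definitions commonly used in extraction benchmarks (completeness as the share of reference curb that was detected, correctness as the share of detections that match the reference), they can be sketched as follows. The counts in the example are made up and are not the paper's data.

```python
def completeness(tp, fn):
    """Fraction of the reference that was extracted: TP / (TP + FN)."""
    return tp / (tp + fn)


def correctness(tp, fp):
    """Fraction of the extraction that matches the reference: TP / (TP + FP)."""
    return tp / (tp + fp)


# Hypothetical counts for illustration only.
tp, fn, fp = 80, 20, 10
comp = completeness(tp, fn)
corr = correctness(tp, fp)
```

Under these definitions, completeness corresponds to recall and correctness to precision.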
HC · May 20, 2014
Perceiving Motion Cues Inspired by Microsoft Kinect Sensor on Game Experiencing
Jiawei Xu, Shigang Yue, Ruisheng Wang et al.
This paper proposes a novel method that replaces the traditional mouse controller with the Microsoft Kinect Sensor for human-machine interaction. From human hand gestures and movements, the Kinect Sensor can accurately recognize the participant's intention and transmit the command to a desktop or laptop. In addition, the current trend in the HCI market is to give customers more freedom and a richer experience by involving human cognitive factors more deeply. The Kinect sensor continuously receives motion cues reflecting the user's intention and feeds back the reaction during the experiments. A comparison of accuracy between hand movement and the mouse cursor demonstrates the efficiency of the proposed method. In addition, the experimental results on hit rate in the games Fruit Ninja and Shape Touching demonstrate the real-time ability of the proposed framework. The performance evaluation builds a promising foundation for further applications in the field of human-machine interaction. The contribution of this work is the expansion of hand gesture perception and an early formulation on Mac iPad.
RO · May 13, 2014
A Cognitive Model for Humanoid Robot Navigation and Mapping using Alderbaran NAO
Jiawei Xu, Ruisheng Wang, Shigang Yue et al.
The aim of this work is to build a cognitive model for a humanoid robot; in particular, we are interested in navigation and mapping on the humanoid robot. The agent used is the Alderbaran NAO robot. The framework is applied to the integration of AI, computer vision, and signal processing problems. Our model can be divided into two parts: cognitive mapping and perception. Cognitive mapping is assumed to have three parts, represented as a network of ASRs, an MFIS, and a hierarchy of Place Representations. Perception, on the other hand, is the traditional computer vision problem of image sensing, feature extraction and tracking of objects of interest. The aims of our project can be summarized as follows. First, the robot should know where it is. Second, we would like to test the theory that this is how humans map their environment. The humanoid robot emulates human visual search by integrating the visual mechanism with computer vision techniques.