Yifan Zuo

CV
h-index13
16papers
161citations
Novelty54%
AI Score54

16 Papers

CVDec 8, 2022Code
EPCL: Frozen CLIP Transformer is An Efficient Point Cloud Encoder

Xiaoshui Huang, Zhou Huang, Sheng Li et al.

The pretrain-finetune paradigm has achieved great success in NLP and 2D image fields because of the high-quality representation ability and transferability of their pretrained models. However, pretraining such a strong model is difficult in the 3D point cloud field due to the limited amount of point cloud sequences. This paper introduces \textbf{E}fficient \textbf{P}oint \textbf{C}loud \textbf{L}earning (EPCL), an effective and efficient point cloud learner for directly training high-quality point cloud models with a frozen CLIP transformer. Our EPCL connects the 2D and 3D modalities by semantically aligning the image features and point cloud features without paired 2D-3D data. Specifically, the input point cloud is divided into a series of local patches, which are converted to token embeddings by the designed point cloud tokenizer. These token embeddings are concatenated with a task token and fed into the frozen CLIP transformer to learn point cloud representation. The intuition is that the proposed point cloud tokenizer projects the input point cloud into a unified token space that is similar to the 2D images. Comprehensive experiments on 3D detection, semantic segmentation, classification and few-shot learning demonstrate that the CLIP transformer can serve as an efficient point cloud encoder and our method achieves promising performance on both indoor and outdoor benchmarks. In particular, performance gains brought by our EPCL are $\textbf{19.7}$ AP$_{50}$ on ScanNet V2 detection, $\textbf{4.4}$ mIoU on S3DIS segmentation and $\textbf{1.2}$ mIoU on SemanticKITTI segmentation compared to contemporary pretrained models. Code is available at \url{https://github.com/XiaoshuiHuang/EPCL}.

CVNov 1, 2022Code
GMF: General Multimodal Fusion Framework for Correspondence Outlier Rejection

Xiaoshui Huang, Wentao Qu, Yifan Zuo et al.

Rejecting correspondence outliers enables to boost the correspondence quality, which is a critical step in achieving high point cloud registration accuracy. The current state-of-the-art correspondence outlier rejection methods only utilize the structure features of the correspondences. However, texture information is critical to reject the correspondence outliers in our human vision system. In this paper, we propose General Multimodal Fusion (GMF) to learn to reject the correspondence outliers by leveraging both the structure and texture information. Specifically, two cross-attention-based fusion layers are proposed to fuse the texture information from paired images and structure information from point correspondences. Moreover, we propose a convolutional position encoding layer to enhance the difference between Tokens and enable the encoding feature pay attention to neighbor information. Our position encoding layer will make the cross-attention operation integrate both local and global information. Experiments on multiple datasets(3DMatch, 3DLoMatch, KITTI) and recent state-of-the-art models (3DRegNet, DGR, PointDSC) prove that our GMF achieves wide generalization ability and consistently improves the point cloud registration accuracy. Furthermore, several ablation studies demonstrate the robustness of the proposed GMF on different loss functions, lighting conditions and noises.The code is available at https://github.com/XiaoshuiHuang/GMF.

IVNov 26, 2022
CFNet: Conditional Filter Learning with Dynamic Noise Estimation for Real Image Denoising

Yifan Zuo, Jiacheng Xie, Yuming Fang et al.

A mainstream type of the state of the arts (SOTAs) based on convolutional neural network (CNN) for real image denoising contains two sub-problems, i.e., noise estimation and non-blind denoising. This paper considers real noise approximated by heteroscedastic Gaussian/Poisson Gaussian distributions with in-camera signal processing pipelines. The related works always exploit the estimated noise prior via channel-wise concatenation followed by a convolutional layer with spatially sharing kernels. Due to the variable modes of noise strength and frequency details of all feature positions, this design cannot adaptively tune the corresponding denoising patterns. To address this problem, we propose a novel conditional filter in which the optimal kernels for different feature positions can be adaptively inferred by local features from the image and the noise map. Also, we bring the thought that alternatively performs noise estimation and non-blind denoising into CNN structure, which continuously updates noise prior to guide the iterative feature denoising. In addition, according to the property of heteroscedastic Gaussian distribution, a novel affine transform block is designed to predict the stationary noise component and the signal-dependent noise component. Compared with SOTAs, extensive experiments are conducted on five synthetic datasets and three real datasets, which shows the improvement of the proposed CFNet.

CVAug 21, 2024Code
GSTran: Joint Geometric and Semantic Coherence for Point Cloud Segmentation

Abiao Li, Chenlei Lv, Guofeng Mei et al.

Learning meaningful local and global information remains a challenge in point cloud segmentation tasks. When utilizing local information, prior studies indiscriminately aggregates neighbor information from different classes to update query points, potentially compromising the distinctive feature of query points. In parallel, inaccurate modeling of long-distance contextual dependencies when utilizing global information can also impact model performance. To address these issues, we propose GSTran, a novel transformer network tailored for the segmentation task. The proposed network mainly consists of two principal components: a local geometric transformer and a global semantic transformer. In the local geometric transformer module, we explicitly calculate the geometric disparity within the local region. This enables amplifying the affinity with geometrically similar neighbor points while suppressing the association with other neighbors. In the global semantic transformer module, we design a multi-head voting strategy. This strategy evaluates semantic similarity across the entire spatial range, facilitating the precise capture of contextual dependencies. Experiments on ShapeNetPart and S3DIS benchmarks demonstrate the effectiveness of the proposed method, showing its superiority over other algorithms. The code is available at https://github.com/LAB123-tech/GSTran.

GNDec 2, 2025Code
PanFoMa: A Lightweight Foundation Model and Benchmark for Pan-Cancer

Xiaoshui Huang, Tianlin Zhu, Yifan Zuo et al.

Single-cell RNA sequencing (scRNA-seq) is essential for decoding tumor heterogeneity. However, pan-cancer research still faces two key challenges: learning discriminative and efficient single-cell representations, and establishing a comprehensive evaluation benchmark. In this paper, we introduce PanFoMa, a lightweight hybrid neural network that combines the strengths of Transformers and state-space models to achieve a balance between performance and efficiency. PanFoMa consists of a front-end local-context encoder with shared self-attention layers to capture complex, order-independent gene interactions; and a back-end global sequential feature decoder that efficiently integrates global context using a linear-time state-space model. This modular design preserves the expressive power of Transformers while leveraging the scalability of Mamba to enable transcriptome modeling, effectively capturing both local and global regulatory signals. To enable robust evaluation, we also construct a large-scale pan-cancer single-cell benchmark, PanFoMaBench, containing over 3.5 million high-quality cells across 33 cancer subtypes, curated through a rigorous preprocessing pipeline. Experimental results show that PanFoMa outperforms state-of-the-art models on our pan-cancer benchmark (+4.0\%) and across multiple public tasks, including cell type annotation (+7.4\%), batch integration (+4.0\%) and multi-omics integration (+3.1\%). The code is available at https://github.com/Xiaoshui-Huang/PanFoMa.

CVSep 8, 2024
CD-NGP: A Fast Scalable Continual Representation for Dynamic Scenes

Zhenhuan Liu, Shuai Liu, Zhiwei Ning et al.

Novel view synthesis (NVS) in dynamic scenes faces persistent challenges in memory consumption, model complexity, training efficiency, and rendering quality. Offline methods offer high fidelity but suffer from high memory usage and limited scalability, while online approaches often trade quality for speed and compactness. We propose Continual Dynamic Neural Graphics Primitives (CD-NGP), a continual learning framework that reduces memory overhead and enhances scalability through parameter reuse. To avoid feature interference in dynamic scenes and improve rendering quality, our method combines spatial and temporal hash encodings, which compactly represent scene structures and motion patterns. We also introduce a new dataset comprising multi-view, long-duration ($>1200$ frames) videos with both rigid and non-rigid motion, which is not found in existing benchmarks. CD-NGP is evaluated on public datasets and our long video dataset, demonstrating superior scalability and reconstruction quality. It significantly reduces training memory usage to <14GB and requires only 0.4MB/frame in streaming bandwidth on DyNeRF -- substantially lower than most online baselines.

IVJan 20, 2025Code
Subjective and Objective Quality Assessment of Non-Uniformly Distorted Omnidirectional Images

Jiebin Yan, Jiale Rao, Xuelin Liu et al.

Omnidirectional image quality assessment (OIQA) has been one of the hot topics in IQA with the continuous development of VR techniques, and achieved much success in the past few years. However, most studies devote themselves to the uniform distortion issue, i.e., all regions of an omnidirectional image are perturbed by the ``same amount'' of noise, while ignoring the non-uniform distortion issue, i.e., partial regions undergo ``different amount'' of perturbation with the other regions in the same omnidirectional image. Additionally, nearly all OIQA models are verified on the platforms containing a limited number of samples, which largely increases the over-fitting risk and therefore impedes the development of OIQA. To alleviate these issues, we elaborately explore this topic from both subjective and objective perspectives. Specifically, we construct a large OIQA database containing 10,320 non-uniformly distorted omnidirectional images, each of which is generated by considering quality impairments on one or two camera len(s). Then we meticulously conduct psychophysical experiments and delve into the influence of both holistic and individual factors (i.e., distortion range and viewing condition) on omnidirectional image quality. Furthermore, we propose a perception-guided OIQA model for non-uniform distortion by adaptively simulating users' viewing behavior. Experimental results demonstrate that the proposed model outperforms state-of-the-art methods. The source code is available at https://github.com/RJL2000/OIQAND.

IVFeb 26, 2025Code
Max360IQ: Blind Omnidirectional Image Quality Assessment with Multi-axis Attention

Jiebin Yan, Ziwen Tan, Yuming Fang et al.

Omnidirectional image, also called 360-degree image, is able to capture the entire 360-degree scene, thereby providing more realistic immersive feelings for users than general 2D image and stereoscopic image. Meanwhile, this feature brings great challenges to measuring the perceptual quality of omnidirectional images, which is closely related to users' quality of experience, especially when the omnidirectional images suffer from non-uniform distortion. In this paper, we propose a novel and effective blind omnidirectional image quality assessment (BOIQA) model with multi-axis attention (Max360IQ), which can proficiently measure not only the quality of uniformly distorted omnidirectional images but also the quality of non-uniformly distorted omnidirectional images. Specifically, the proposed Max360IQ is mainly composed of a backbone with stacked multi-axis attention modules for capturing both global and local spatial interactions of extracted viewports, a multi-scale feature integration (MSFI) module to fuse multi-scale features and a quality regression module with deep semantic guidance for predicting the quality of omnidirectional images. Experimental results demonstrate that the proposed Max360IQ outperforms the state-of-the-art Assessor360 by 3.6\% in terms of SRCC on the JUFE database with non-uniform distortion, and gains improvement of 0.4\% and 0.8\% in terms of SRCC on the OIQA and CVIQ databases, respectively. The source code is available at https://github.com/WenJuing/Max360IQ.

CVMar 27, 2025Code
Recurrent Feature Mining and Keypoint Mixup Padding for Category-Agnostic Pose Estimation

Junjie Chen, Weilong Chen, Yifan Zuo et al.

Category-agnostic pose estimation aims to locate keypoints on query images according to a few annotated support images for arbitrary novel classes. Existing methods generally extract support features via heatmap pooling, and obtain interacted features from support and query via cross-attention. Hence, these works neglect to mine fine-grained and structure-aware (FGSA) features from both support and query images, which are crucial for pixel-level keypoint localization. To this end, we propose a novel yet concise framework, which recurrently mines FGSA features from both support and query images. Specifically, we design a FGSA mining module based on deformable attention mechanism. On the one hand, we mine fine-grained features by applying deformable attention head over multi-scale feature maps. On the other hand, we mine structure-aware features by offsetting the reference points of keypoints to their linked keypoints. By means of above module, we recurrently mine FGSA features from support and query images, and thus obtain better support features and query estimations. In addition, we propose to use mixup keypoints to pad various classes to a unified keypoint number, which could provide richer supervision than the zero padding used in existing works. We conduct extensive experiments and in-depth studies on large-scale MP-100 dataset, and outperform SOTA method dramatically (+3.2\%PCK@0.05). Code is avaiable at https://github.com/chenbys/FMMP.

CVJul 7, 2025Code
PointGAC: Geometric-Aware Codebook for Masked Point Cloud Modeling

Abiao Li, Chenlei Lv, Yuming Fang et al.

Most masked point cloud modeling (MPM) methods follow a regression paradigm to reconstruct the coordinate or feature of masked regions. However, they tend to over-constrain the model to learn the details of the masked region, resulting in failure to capture generalized features. To address this limitation, we propose \textbf{\textit{PointGAC}}, a novel clustering-based MPM method that aims to align the feature distribution of masked regions. Specially, it features an online codebook-guided teacher-student framework. Firstly, it presents a geometry-aware partitioning strategy to extract initial patches. Then, the teacher model updates a codebook via online k-means based on features extracted from the complete patches. This procedure facilitates codebook vectors to become cluster centers. Afterward, we assigns the unmasked features to their corresponding cluster centers, and the student model aligns the assignment for the reconstructed masked features. This strategy focuses on identifying the cluster centers to which the masked features belong, enabling the model to learn more generalized feature representations. Benefiting from a proposed codebook maintenance mechanism, codebook vectors are actively updated, which further increases the efficiency of semantic feature learning. Experiments validate the effectiveness of the proposed method on various downstream tasks. Code is available at https://github.com/LAB123-tech/PointGAC

CVJan 14, 2025
PSReg: Prior-guided Sparse Mixture of Experts for Point Cloud Registration

Xiaoshui Huang, Zhou Huang, Yifan Zuo et al.

The discriminative feature is crucial for point cloud registration. Recent methods improve the feature discriminative by distinguishing between non-overlapping and overlapping region points. However, they still face challenges in distinguishing the ambiguous structures in the overlapping regions. Therefore, the ambiguous features they extracted resulted in a significant number of outlier matches from overlapping regions. To solve this problem, we propose a prior-guided SMoE-based registration method to improve the feature distinctiveness by dispatching the potential correspondences to the same experts. Specifically, we propose a prior-guided SMoE module by fusing prior overlap and potential correspondence embeddings for routing, assigning tokens to the most suitable experts for processing. In addition, we propose a registration framework by a specific combination of Transformer layer and prior-guided SMoE module. The proposed method not only pays attention to the importance of locating the overlapping areas of point clouds, but also commits to finding more accurate correspondences in overlapping areas. Our extensive experiments demonstrate the effectiveness of our method, achieving state-of-the-art registration recall (95.7\%/79.3\%) on the 3DMatch/3DLoMatch benchmark. Moreover, we also test the performance on ModelNet40 and demonstrate excellent performance.

CVAug 18, 2025
CMF-IoU: Multi-Stage Cross-Modal Fusion 3D Object Detection with IoU Joint Prediction

Zhiwei Ning, Zhaojiang Liu, Xuanang Gao et al.

Multi-modal methods based on camera and LiDAR sensors have garnered significant attention in the field of 3D detection. However, many prevalent works focus on single or partial stage fusion, leading to insufficient feature extraction and suboptimal performance. In this paper, we introduce a multi-stage cross-modal fusion 3D detection framework, termed CMF-IOU, to effectively address the challenge of aligning 3D spatial and 2D semantic information. Specifically, we first project the pixel information into 3D space via a depth completion network to get the pseudo points, which unifies the representation of the LiDAR and camera information. Then, a bilateral cross-view enhancement 3D backbone is designed to encode LiDAR points and pseudo points. The first sparse-to-distant (S2D) branch utilizes an encoder-decoder structure to reinforce the representation of sparse LiDAR points. The second residual view consistency (ResVC) branch is proposed to mitigate the influence of inaccurate pseudo points via both the 3D and 2D convolution processes. Subsequently, we introduce an iterative voxel-point aware fine grained pooling module, which captures the spatial information from LiDAR points and textural information from pseudo points in the proposal refinement stage. To achieve more precise refinement during iteration, an intersection over union (IoU) joint prediction branch integrated with a novel proposals generation technique is designed to preserve the bounding boxes with both high IoU and classification scores. Extensive experiments show the superior performance of our method on the KITTI, nuScenes and Waymo datasets.

CVSep 8, 2025
Cross3DReg: Towards a Large-scale Real-world Cross-source Point Cloud Registration Benchmark

Zongyi Xu, Zhongpeng Lang, Yilong Chen et al.

Cross-source point cloud registration, which aims to align point cloud data from different sensors, is a fundamental task in 3D vision. However, compared to the same-source point cloud registration, cross-source registration faces two core challenges: the lack of publicly available large-scale real-world datasets for training the deep registration models, and the inherent differences in point clouds captured by multiple sensors. The diverse patterns induced by the sensors pose great challenges in robust and accurate point cloud feature extraction and matching, which negatively influence the registration accuracy. To advance research in this field, we construct Cross3DReg, the currently largest and real-world multi-modal cross-source point cloud registration dataset, which is collected by a rotating mechanical lidar and a hybrid semi-solid-state lidar, respectively. Moreover, we design an overlap-based cross-source registration framework, which utilizes unaligned images to predict the overlapping region between source and target point clouds, effectively filtering out redundant points in the irrelevant regions and significantly mitigating the interference caused by noise in non-overlapping areas. Then, a visual-geometric attention guided matching module is proposed to enhance the consistency of cross-source point cloud features by fusing image and geometric information to establish reliable correspondences and ultimately achieve accurate and robust registration. Extensive experiments show that our method achieves state-of-the-art registration performance. Our framework reduces the relative rotation error (RRE) and relative translation error (RTE) by $63.2\%$ and $40.2\%$, respectively, and improves the registration recall (RR) by $5.4\%$, which validates its effectiveness in achieving accurate cross-source registration.

CVMar 5, 2025
Computational Analysis of Degradation Modeling in Blind Panoramic Image Quality Assessment

Jiebin Yan, Ziwen Tan, Jiale Rao et al.

Blind panoramic image quality assessment (BPIQA) has recently brought new challenge to the visual quality community, due to the complex interaction between immersive content and human behavior. Although many efforts have been made to advance BPIQA from both conducting psychophysical experiments and designing performance-driven objective algorithms, \textit{limited content} and \textit{few samples} in those closed sets inevitably would result in shaky conclusions, thereby hindering the development of BPIQA, we refer to it as the \textit{easy-database} issue. In this paper, we present a sufficient computational analysis of degradation modeling in BPIQA to thoroughly explore the \textit{easy-database issue}, where we carefully design three types of experiments via investigating the gap between BPIQA and blind image quality assessment (BIQA), the necessity of specific design in BPIQA models, and the generalization ability of BPIQA models. From extensive experiments, we find that easy databases narrow the gap between the performance of BPIQA and BIQA models, which is unconducive to the development of BPIQA. And the easy databases make the BPIQA models be closed to saturation, therefore the effectiveness of the associated specific designs can not be well verified. Besides, the BPIQA models trained on our recently proposed databases with complicated degradation show better generalization ability. Thus, we believe that much more efforts are highly desired to put into BPIQA from both subjective viewpoint and objective viewpoint.

CVNov 23, 2021
GenReg: Deep Generative Method for Fast Point Cloud Registration

Xiaoshui Huang, Zongyi Xu, Guofeng Mei et al.

Accurate and efficient point cloud registration is a challenge because the noise and a large number of points impact the correspondence search. This challenge is still a remaining research problem since most of the existing methods rely on correspondence search. To solve this challenge, we propose a new data-driven registration algorithm by investigating deep generative neural networks to point cloud registration. Given two point clouds, the motivation is to generate the aligned point clouds directly, which is very useful in many applications like 3D matching and search. We design an end-to-end generative neural network for aligned point clouds generation to achieve this motivation, containing three novel components. Firstly, a point multi-perception layer (MLP) mixer (PointMixer) network is proposed to efficiently maintain both the global and local structure information at multiple levels from the self point clouds. Secondly, a feature interaction module is proposed to fuse information from cross point clouds. Thirdly, a parallel and differential sample consensus method is proposed to calculate the transformation matrix of the input point clouds based on the generated registration results. The proposed generative neural network is trained in a GAN framework by maintaining the data distribution and structure similarity. The experiments on both ModelNet40 and 7Scene datasets demonstrate that the proposed algorithm achieves state-of-the-art accuracy and efficiency. Notably, our method reduces $2\times$ in registration error (CD) and $12\times$ running time compared to the state-of-the-art correspondence-based algorithm.

CVNov 18, 2021
IMFNet: Interpretable Multimodal Fusion for Point Cloud Registration

Xiaoshui Huang, Wentao Qu, Yifan Zuo et al.

The existing state-of-the-art point descriptor relies on structure information only, which omit the texture information. However, texture information is crucial for our humans to distinguish a scene part. Moreover, the current learning-based point descriptors are all black boxes which are unclear how the original points contribute to the final descriptor. In this paper, we propose a new multimodal fusion method to generate a point cloud registration descriptor by considering both structure and texture information. Specifically, a novel attention-fusion module is designed to extract the weighted texture information for the descriptor extraction. In addition, we propose an interpretable module to explain the original points in contributing to the final descriptor. We use the descriptor element as the loss to backpropagate to the target layer and consider the gradient as the significance of this point to the final descriptor. This paper moves one step further to explainable deep learning in the registration task. Comprehensive experiments on 3DMatch, 3DLoMatch and KITTI demonstrate that the multimodal fusion descriptor achieves state-of-the-art accuracy and improve the descriptor's distinctiveness. We also demonstrate that our interpretable module in explaining the registration descriptor extraction.