Zhiyang Liu

CV
h-index15
13papers
342citations
Novelty42%
AI Score52

13 Papers

CVJul 5, 2023
The KiTS21 Challenge: Automatic segmentation of kidneys, renal tumors, and renal cysts in corticomedullary-phase CT

Nicholas Heller, Fabian Isensee, Dasha Trofimova et al.

This paper presents the challenge report for the 2021 Kidney and Kidney Tumor Segmentation Challenge (KiTS21) held in conjunction with the 2021 international conference on Medical Image Computing and Computer Assisted Interventions (MICCAI). KiTS21 is a sequel to its first edition in 2019, and it features a variety of innovations in how the challenge was designed, in addition to a larger dataset. A novel annotation method was used to collect three separate annotations for each region of interest, and these annotations were performed in a fully transparent setting using a web-based annotation tool. Further, the KiTS21 test set was collected from an outside institution, challenging participants to develop methods that generalize well to new populations. Nonetheless, the top-performing teams achieved a significant improvement over the state of the art set in 2019, and this performance is shown to inch ever closer to human-level performance. An in-depth meta-analysis is presented describing which methods were used and how they faired on the leaderboard, as well as the characteristics of which cases generally saw good performance, and which did not. Overall KiTS21 facilitated a significant advancement in the state of the art in kidney tumor segmentation, and provides useful insights that are applicable to the field of semantic segmentation as a whole.

CVSep 5, 2023Code
DR-Pose: A Two-stage Deformation-and-Registration Pipeline for Category-level 6D Object Pose Estimation

Lei Zhou, Zhiyang Liu, Runze Gan et al.

Category-level object pose estimation involves estimating the 6D pose and the 3D metric size of objects from predetermined categories. While recent approaches take categorical shape prior information as reference to improve pose estimation accuracy, the single-stage network design and training manner lead to sub-optimal performance since there are two distinct tasks in the pipeline. In this paper, the advantage of two-stage pipeline over single-stage design is discussed. To this end, we propose a two-stage deformation-and registration pipeline called DR-Pose, which consists of completion-aided deformation stage and scaled registration stage. The first stage uses a point cloud completion method to generate unseen parts of target object, guiding subsequent deformation on the shape prior. In the second stage, a novel registration network is designed to extract pose-sensitive features and predict the representation of object partial point cloud in canonical space based on the deformation results from the first stage. DR-Pose produces superior results to the state-of-the-art shape prior-based methods on both CAMERA25 and REAL275 benchmarks. Codes are available at https://github.com/Zray26/DR-Pose.git.

73.5CVMay 29
NTR: Neural Token Reconstruction for Scene Token Bottleneck in End-to-End Driving

Jiahui Li, Jiawei Sun, Zixiang Ren et al.

Recent perception-free end-to-end (E2E) autonomous driving methods bypass explicit perception outputs by compressing dense image patch tokens into compact scene tokens for downstream trajectory generation and scoring. While these scene tokens form a compact visual bottleneck for the planner, they receive supervision solely from the planning objective, providing limited constraints on the encoded visual information. To address this limitation, we introduce Neural Token Reconstruction (NTR), a representation learning framework to directly constrain the compact scene-token bottleneck in perception-free driving. NTR introduces a self-distillation masked latent reconstruction objective that reconstructs masked patch-level latent features using only compact scene tokens as reconstruction memory. This forces reconstruction gradients to pass exclusively through the scene-token bottleneck, encouraging scene tokens to preserve richer and less redundant visual representations for planning. We further introduce semantic priors derived from foundation-model annotations as a weak semantic interface biasing reconstruction targets toward driving-related structures without introducing explicit perception heads. All auxiliary reconstruction components are removed at inference time, leaving the deployed planner unchanged. NTR achieves state-of-the-art performance on three public autonomous driving benchmarks, including 8.0461 RFS on Waymo E2E and 94.1 PDMS / 90.9 EPDMS on NavSim1&2. The learned scene tokens exhibit lower pairwise redundancy and higher effective rank, indicating that effective bottleneck supervision improves both compact visual representation learning and planning performance.

ROAug 18, 2025Code
Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey

Rui Shao, Wei Li, Lingsen Zhang et al.

Robotic manipulation, a key frontier in robotics and embodied AI, requires precise motor control and multimodal understanding, yet traditional rule-based methods fail to scale or generalize in unstructured, novel environments. In recent years, Vision-Language-Action (VLA) models, built upon Large Vision-Language Models (VLMs) pretrained on vast image-text datasets, have emerged as a transformative paradigm. This survey provides the first systematic, taxonomy-oriented review of large VLM-based VLA models for robotic manipulation. We begin by clearly defining large VLM-based VLA models and delineating two principal architectural paradigms: (1) monolithic models, encompassing single-system and dual-system designs with differing levels of integration; and (2) hierarchical models, which explicitly decouple planning from execution via interpretable intermediate representations. Building on this foundation, we present an in-depth examination of large VLM-based VLA models: (1) integration with advanced domains, including reinforcement learning, training-free optimization, learning from human videos, and world model integration; (2) synthesis of distinctive characteristics, consolidating architectural traits, operational strengths, and the datasets and benchmarks that support their development; (3) identification of promising directions, including memory mechanisms, 4D perception, efficient adaptation, multi-agent cooperation, and other emerging capabilities. This survey consolidates recent advances to resolve inconsistencies in existing taxonomies, mitigate research fragmentation, and fill a critical gap through the systematic integration of studies at the intersection of large VLMs and robotic manipulation. We provide a regularly updated project page to document ongoing progress: https://github.com/JiuTian-VL/Large-VLM-based-VLA-for-Robotic-Manipulation

CVMar 30, 2022Code
PEGG-Net: Pixel-Wise Efficient Grasp Generation in Complex Scenes

Haozhe Wang, Zhiyang Liu, Lei Zhou et al.

Vision-based grasp estimation is an essential part of robotic manipulation tasks in the real world. Existing planar grasp estimation algorithms have been demonstrated to work well in relatively simple scenes. But when it comes to complex scenes, such as cluttered scenes with messy backgrounds and moving objects, the algorithms from previous works are prone to generate inaccurate and unstable grasping contact points. In this work, we first study the existing planar grasp estimation algorithms and analyze the related challenges in complex scenes. Secondly, we design a Pixel-wise Efficient Grasp Generation Network (PEGG-Net) to tackle the problem of grasping in complex scenes. PEGG-Net can achieve improved state-of-the-art performance on the Cornell dataset (98.9%) and second-best performance on the Jacquard dataset (93.8%), outperforming other existing algorithms without the introduction of complex structures. Thirdly, PEGG-Net could operate in a closed-loop manner for added robustness in dynamic environments using position-based visual servoing (PBVS). Finally, we conduct real-world experiments on static, dynamic, and cluttered objects in different complex scenes. The results show that our proposed network achieves a high success rate in grasping irregular objects, household objects, and workshop tools. To benefit the community, our trained model and supplementary materials are available at https://github.com/HZWang96/PEGG-Net.

14.2CVApr 30
3D Reconstruction Techniques in the Manufacturing Domain: Applications, Research Opportunities and Use Cases

Chialoon Cheng, Kaijun liu, Zhiyang Liu et al.

This comprehensive review examines the evolution and the current state of the art in three-dimensional (3D) reconstruction techniques in manufacturing applications. The analysis covers both traditional approaches and emerging deep learning methods, showing a critical research gap in unified 3d reconstruction frameworks. Through systematic review of 106 recent publications, we classify reconstruction techniques into three primary categories: data acquisition, point cloud generation, post-processing and applications. Non-contact methods, particularly structured light scanning and stereo vision, have shown significant adoption in manufacturing, with 47% of surveyed applications focusing on quality inspection. The integration of deep learning has enhanced reconstruction accuracy and processing speed, particularly in feature extraction and matching. Key applications span design and development (13%), machining (8%), process (17%), assembly (22%), and quality inspection (40%). While current technologies achieve sub-millimeter accuracy in controlled environments, challenges persist in handling reflective surfaces and dynamic environments. Our findings indicate a trend toward hybrid systems combining multiple sensor types and processing methods to overcome individual limitations. This survey provides a structured framework for understanding current capabilities and future directions in manufacturing-focused 3D reconstruction.

CVApr 4, 2024
You Only Scan Once: A Dynamic Scene Reconstruction Pipeline for 6-DoF Robotic Grasping of Novel Objects

Lei Zhou, Haozhe Wang, Zhengshen Zhang et al.

In the realm of robotic grasping, achieving accurate and reliable interactions with the environment is a pivotal challenge. Traditional methods of grasp planning methods utilizing partial point clouds derived from depth image often suffer from reduced scene understanding due to occlusion, ultimately impeding their grasping accuracy. Furthermore, scene reconstruction methods have primarily relied upon static techniques, which are susceptible to environment change during manipulation process limits their efficacy in real-time grasping tasks. To address these limitations, this paper introduces a novel two-stage pipeline for dynamic scene reconstruction. In the first stage, our approach takes scene scanning as input to register each target object with mesh reconstruction and novel object pose tracking. In the second stage, pose tracking is still performed to provide object poses in real-time, enabling our approach to transform the reconstructed object point clouds back into the scene. Unlike conventional methodologies, which rely on static scene snapshots, our method continuously captures the evolving scene geometry, resulting in a comprehensive and up-to-date point cloud representation. By circumventing the constraints posed by occlusion, our method enhances the overall grasp planning process and empowers state-of-the-art 6-DoF robotic grasping algorithms to exhibit markedly improved accuracy.

73.1ROApr 7
Precise Aggressive Aerial Maneuvers with Sensorimotor Policies

Tianyue Wu, Guangtong Xu, Zihan Wang et al.

Precise aggressive maneuvers with lightweight onboard sensors remains a key bottleneck in fully exploiting the maneuverability of drones. Such maneuvers are critical for expanding the systems' accessible area by navigating through narrow openings in the environment. Among the most relevant problems, a representative one is aggressive traversal through narrow gaps with quadrotors under SE(3) constraints, which require the quadrotors to leverage a momentary tilted attitude and the asymmetry of the airframe to navigate through gaps. In this paper, we achieve such maneuvers by developing sensorimotor policies directly mapping onboard vision and proprioception into low-level control commands. The policies are trained using reinforcement learning (RL) with end-to-end policy distillation in simulation. We mitigate the fundamental hardness of model-free RL's exploration on the restricted solution space with an initialization strategy leveraging trajectories generated by a model-based planner. Careful sim-to-real design allows the policy to control a quadrotor through narrow gaps with low clearances and high repeatability. For instance, the proposed method enables a quadrotor to navigate a rectangular gap at a 5 cm clearance, tilted at up to 90-degree orientation, without knowledge of the gap's position or orientation. Without training on dynamic gaps, the policy can reactively servo the quadrotor to traverse through a moving gap. The proposed method is also validated by training and deploying policies on challenging tracks of narrow gaps placed closely. The flexibility of the policy learning method is demonstrated by developing policies for geometrically diverse gaps, without relying on manually defined traversal poses and visual features.

CVMar 25, 2025
SeLIP: Similarity Enhanced Contrastive Language Image Pretraining for Multi-modal Head MRI

Zhiyang Liu, Dong Yang, Minghao Zhang et al.

Despite that deep learning (DL) methods have presented tremendous potential in many medical image analysis tasks, the practical applications of medical DL models are limited due to the lack of enough data samples with manual annotations. By noting that the clinical radiology examinations are associated with radiology reports that describe the images, we propose to develop a foundation model for multi-model head MRI by using contrastive learning on the images and the corresponding radiology findings. In particular, a contrastive learning framework is proposed, where a mixed syntax and semantic similarity matching metric is integrated to reduce the thirst of extreme large dataset in conventional contrastive learning framework. Our proposed similarity enhanced contrastive language image pretraining (SeLIP) is able to effectively extract more useful features. Experiments revealed that our proposed SeLIP performs well in many downstream tasks including image-text retrieval task, classification task, and image segmentation, which highlights the importance of considering the similarities among texts describing different images in developing medical image foundation models.

IVAug 3, 2020
Automated Segmentation of Brain Gray Matter Nuclei on Quantitative Susceptibility Mapping Using Deep Convolutional Neural Network

Chao Chai, Pengchong Qiao, Bin Zhao et al.

Abnormal iron accumulation in the brain subcortical nuclei has been reported to be correlated to various neurodegenerative diseases, which can be measured through the magnetic susceptibility from the quantitative susceptibility mapping (QSM). To quantitively measure the magnetic susceptibility, the nuclei should be accurately segmented, which is a tedious task for clinicians. In this paper, we proposed a double-branch residual-structured U-Net (DB-ResUNet) based on 3D convolutional neural network (CNN) to automatically segment such brain gray matter nuclei. To better tradeoff between segmentation accuracy and the memory efficiency, the proposed DB-ResUNet fed image patches with high resolution and the patches with low resolution but larger field of view into the local and global branches, respectively. Experimental results revealed that by jointly using QSM and T$_\text{1}$ weighted imaging (T$_\text{1}$WI) as inputs, the proposed method was able to achieve better segmentation accuracy over its single-branch counterpart, as well as the conventional atlas-based method and the classical 3D-UNet structure. The susceptibility values and the volumes were also measured, which indicated that the measurements from the proposed DB-ResUNet are able to present high correlation with values from the manually annotated regions of interest.

ITFeb 28, 2020
Fast and Accurate Optical Fiber Channel Modeling Using Generative Adversarial Network

Hang Yang, Zekun Niu, Shilin Xiao et al.

In this work, a new data-driven fiber channel modeling method, generative adversarial network (GAN) is investigated to learn the distribution of fiber channel transfer function. Our investigation focuses on joint channel effects of attenuation, chromic dispersion, self-phase modulation (SPM), and amplified spontaneous emission (ASE) noise. To achieve the success of GAN for channel modeling, we modify the loss function, design the condition vector of input and address the mode collapse for the long-haul transmission. The effective architecture, parameters, and training skills of GAN are also displayed in the paper. The results show that the proposed method can learn the accurate transfer function of the fiber channel. The transmission distance of modeling can be up to 1000 km and can be extended to arbitrary distance theoretically. Moreover, GAN shows robust generalization abilities under different optical launch powers, modulation formats, and input signal distributions. Comparing the complexity of GAN with the split-step Fourier method (SSFM), the total multiplication number is only 2% of SSFM and the running time is less than 0.1 seconds for 1000-km transmission, versus 400 seconds using the SSFM under the same hardware and software conditions, which highlights the remarkable reduction in complexity of the fiber channel modeling.

IVAug 10, 2019
Automatic acute ischemic stroke lesion segmentation using semi-supervised learning

Bin Zhao, Shuxue Ding, Hong Wu et al.

Ischemic stroke is a common disease in the elderly population, which can cause long-term disability and even death. However, the time window for treatment of ischemic stroke in its acute stage is very short. To fast localize and quantitively evaluate the acute ischemic stroke (AIS) lesions, many deep-learning-based lesion segmentation methods have been proposed in the literature, where a deep convolutional neural network (CNN) was trained on hundreds of fully labeled subjects with accurate annotations of AIS lesions. Despite that high segmentation accuracy can be achieved, the accurate labels should be annotated by experienced clinicians, and it is therefore very time-consuming to obtain a large number of fully labeled subjects. In this paper, we propose a semi-supervised method to automatically segment AIS lesions in diffusion weighted images and apparent diffusion coefficient maps. By using a large number of weakly labeled subjects and a small number of fully labeled subjects, our proposed method is able to accurately detect and segment the AIS lesions. In particular, our proposed method consists of three parts: 1) a double-path classification net (DPC-Net) trained in a weakly-supervised way is used to detect the suspicious regions of AIS lesions; 2) a pixel-level K-Means clustering algorithm is used to identify the hyperintensive regions on the DWIs; and 3) a region-growing algorithm combines the outputs of the DPC-Net and the K-Means to obtain the final precise lesion segmentation. In our experiment, we use 460 weakly labeled subjects and 15 fully labeled subjects to train and fine-tune the proposed method. By evaluating on a clinical dataset with 150 fully labeled subjects, our proposed method achieves a mean dice coefficient of 0.642, and a lesion-wise F1 score of 0.822.

CVMar 5, 2018
Towards Clinical Diagnosis: Automated Stroke Lesion Segmentation on Multimodal MR Image Using Convolutional Neural Network

Zhiyang Liu, Chen Cao, Shuxue Ding et al.

The patient with ischemic stroke can benefit most from the earliest possible definitive diagnosis. While the high quality medical resources are quite scarce across the globe, an automated diagnostic tool is expected in analyzing the magnetic resonance (MR) images to provide reference in clinical diagnosis. In this paper, we propose a deep learning method to automatically segment ischemic stroke lesions from multi-modal MR images. By using atrous convolution and global convolution network, our proposed residual-structured fully convolutional network (Res-FCN) is able to capture features from large receptive fields. The network architecture is validated on a large dataset of 212 clinically acquired multi-modal MR images, which is shown to achieve a mean dice coefficient of 0.645 with a mean number of false negative lesions of 1.515. The false negatives can reach a value that close to a common medical image doctor, making it exceptive for a real clinical application.