Yihang Zhou

CV
h-index 30
19 papers
106 citations
Novelty 49%
AI Score 54

19 Papers

RO Feb 9 Code
Contact-Anchored Policies: Contact Conditioning Creates Strong Robot Utility Models

Zichen Jeff Cui, Omar Rayyan, Haritheja Etukuru et al.

The prevalent paradigm in robot learning attempts to generalize across environments, embodiments, and tasks with language prompts at runtime. A fundamental tension limits this approach: language is often too abstract to guide the concrete physical understanding required for robust manipulation. In this work, we introduce Contact-Anchored Policies (CAP), which replace language conditioning with points of physical contact in space. Simultaneously, we structure CAP as a library of modular utility models rather than a monolithic generalist policy. This factorization allows us to implement a real-to-sim iteration cycle: we build EgoGym, a lightweight simulation benchmark, to rapidly identify failure modes and refine our models and datasets prior to real-world deployment. We show that by conditioning on contact and iterating via simulation, CAP generalizes to novel environments and embodiments out of the box on three fundamental manipulation skills while using only 23 hours of demonstration data, and outperforms large, state-of-the-art VLAs in zero-shot evaluations by 56%. All model checkpoints, codebase, hardware, simulation, and datasets will be open-sourced. Project page: https://cap-policy.github.io/

96.5 BM May 15 Code
Ligand-Conditioned Discrete Diffusion for Protein Sequence-Structure Co-Design

Chen Wei, Fanding Xu, Minghao Sun et al.

Proteins perform their biological functions through three-dimensional structures encoded by amino acid sequences, and ligand-binding protein co-design requires models that generate sequence-structure compatible proteins under explicit ligand constraints. Although continuous diffusion and flow-based models support ligand-aware design in coordinate or latent spaces, existing discrete diffusion protein language models mainly operate over sequence or structure tokens without direct small-molecule conditioning. We introduce ProtLiD$^2$, a Protein Ligand-conditioned Discrete Diffusion model for protein sequence-structure co-design. ProtLiD$^2$ jointly generates amino-acid sequence and discrete structure tokens while incorporating ligand chemical and geometric information through geometry-aware cross-attention. Trained on over one million ligand-protein complexes, ProtLiD$^2$ extends masked discrete diffusion to ligand-aware functional protein design. We further propose maximum confidence-margin guided ReMask decoding, an inference-time self-correction strategy that retains confident predictions and remasks uncertain tokens. ProtLiD$^2$ improves global fold confidence over Complexa in whole-protein design, increasing TM-score from 0.672 to 0.802 and pLDDT from 64.55 to 73.00. In pocket co-design, ProtLiD$^2$ reduces active-site BB-RMSD from 3.46/3.40Å for FAIR/PocketGen to 1.97Å, and improves ligand-aware pass rates over PocketGen from 14.86% to 59.73% and from 6.08% to 23.49% under stricter docking thresholds. These results support ligand-conditioned discrete diffusion as an effective token-space framework for functional protein co-design. Code will be available at https://github.com/auroua/ProtLiD.
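To make the "retain confident predictions, remask uncertain tokens" idea concrete, here is a minimal sketch of one remasking step. It is not the authors' implementation: the margin definition (top-1 minus top-2 probability), the `MASK` id, the `keep_fraction` parameter, and the function names are all assumptions made for illustration.

```python
import math

MASK = -1  # hypothetical id standing in for the mask token


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def remask_by_margin(logits_per_position, keep_fraction):
    """Keep the most confident predictions and remask the rest.

    Confidence is taken to be the margin between the top-1 and top-2
    probabilities at each position (an assumed definition).
    """
    scored = []
    for pos, logits in enumerate(logits_per_position):
        probs = softmax(logits)
        ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
        margin = probs[ranked[0]] - probs[ranked[1]]
        scored.append((margin, pos, ranked[0]))
    n_keep = max(1, round(keep_fraction * len(scored)))
    keep = {pos for _, pos, _ in sorted(scored, reverse=True)[:n_keep]}
    # `scored` is in position order, so the output is too.
    return [tok if pos in keep else MASK for _, pos, tok in scored]


# Toy logits for four positions over a three-token vocabulary.
toy_logits = [[5.0, 0.0, 0.0], [1.0, 0.9, 0.0], [0.0, 4.0, 0.0], [0.2, 0.1, 0.0]]
print(remask_by_margin(toy_logits, keep_fraction=0.5))
```

In an iterative decoder, the remasked positions would be predicted again in the next round, typically with `keep_fraction` increasing over time.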

IV Nov 24, 2023
Joint Diffusion: Mutual Consistency-Driven Diffusion Model for PET-MRI Co-Reconstruction

Taofeng Xie, Zhuo-Xu Cui, Chen Luo et al.

Positron Emission Tomography and Magnetic Resonance Imaging (PET-MRI) systems can obtain functional and anatomical scans. PET suffers from a low signal-to-noise ratio. Meanwhile, the k-space data acquisition process in MRI is time-consuming. The study aims to accelerate MRI and enhance PET image quality. Conventional approaches involve the separate reconstruction of each modality within PET-MRI systems. However, multi-modal images carry complementary information that can contribute to image reconstruction. In this study, we propose a novel PET-MRI joint reconstruction model employing a mutual consistency-driven diffusion model, namely MC-Diffusion. MC-Diffusion learns the joint probability distribution of PET and MRI in order to exploit this complementary information. We conducted a series of comparative experiments on LPLS, Joint ISAT-net, and MC-Diffusion using the ADNI dataset. The results underscore the qualitative and quantitative improvements achieved by MC-Diffusion, surpassing the state-of-the-art method.

CV Aug 30, 2023
Physics-Informed DeepMRI: Bridging the Gap from Heat Diffusion to k-Space Interpolation

Zhuo-Xu Cui, Congcong Liu, Xiaohong Fan et al.

In the field of parallel imaging (PI), alongside image-domain regularization methods, substantial research has been dedicated to exploring $k$-space interpolation. However, the interpretability of these methods remains an unresolved issue. Furthermore, these approaches currently face acceleration limitations that are comparable to those experienced by image-domain methods. In order to enhance interpretability and overcome the acceleration limitations, this paper introduces an interpretable framework that unifies both $k$-space interpolation techniques and image-domain methods, grounded in the physical principles of heat diffusion equations. Building upon this foundational framework, a novel $k$-space interpolation method is proposed. Specifically, we model the process of high-frequency information attenuation in $k$-space as a heat diffusion equation, while the effort to reconstruct high-frequency information from low-frequency regions can be conceptualized as a reverse heat equation. However, solving the reverse heat equation poses a challenging inverse problem. To tackle this challenge, we modify the heat equation to align with the principles of magnetic resonance PI physics and employ the score-based generative method to precisely execute the modified reverse heat diffusion. Finally, experimental validation conducted on publicly available datasets demonstrates the superiority of the proposed approach over traditional $k$-space interpolation methods, deep learning-based $k$-space interpolation methods, and conventional diffusion models in terms of reconstruction accuracy, particularly in high-frequency regions.
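As background for why heat diffusion corresponds to high-frequency attenuation, the standard heat equation can be solved in the Fourier domain. This is textbook material, not the modified equation the paper proposes; the symbols $u$ (a generic image-domain signal), $\xi$ (frequency), and the Fourier convention $\hat{u}(\xi) = \int u(x)\, e^{-2\pi i x \cdot \xi}\, dx$ are assumptions chosen for illustration.

$$
\partial_t u = \Delta u
\quad\Longrightarrow\quad
\hat{u}(\xi, t) = e^{-4\pi^2 |\xi|^2 t}\, \hat{u}(\xi, 0)
$$

The Gaussian factor decays fastest at large $|\xi|$, so forward diffusion suppresses high frequencies. Running it backward multiplies by $e^{+4\pi^2 |\xi|^2 t}$, which amplifies noise without bound, which is one way to see why the reverse problem is ill-posed.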

IV Aug 8, 2024
Pediatric TSC-Related Epilepsy Classification from Clinical MR Images Using Quantum Neural Network

Ling Lin, Yihang Zhou, Zhanqi Hu et al.

Tuberous sclerosis complex (TSC) manifests as a multisystem disorder with significant neurological implications. This study addresses the critical need for robust classification models tailored to TSC in pediatric patients, introducing QResNet, a novel deep learning model seamlessly integrating conventional convolutional neural networks with quantum neural networks. The model incorporates a two-layer quantum layer (QL), comprising ZZFeatureMap and Ansatz layers, strategically designed for processing classical data within a quantum framework. A comprehensive evaluation demonstrates the superior performance of QResNet in TSC MRI image classification compared to conventional 3D-ResNet models. These compelling findings underscore the potential of quantum computing to revolutionize medical imaging and diagnostics. Remarkably, this method surpasses conventional CNNs in accuracy and Area Under the Curve (AUC) metrics with the current dataset. Future research endeavors may focus on exploring the scalability and practical implementation of quantum algorithms in real-world medical imaging scenarios.

8.5 CV Mar 13
Rooftop Wind Field Reconstruction Using Sparse Sensors: From Deterministic to Generative Learning Methods

Yihang Zhou, Chao Lin, Hideki Kikumoto et al.

Real-time rooftop wind-speed distribution is important for the safe operation of drones and urban air mobility systems, wind control systems, and rooftop utilization. However, rooftop flows show strong nonlinearity, separation, and cross-direction variability, which make flow field reconstruction from sparse sensors difficult. This study develops a learning-from-observation framework using wind-tunnel experimental data obtained by Particle Image Velocimetry (PIV) and compares Kriging interpolation with three deep learning models: UNet, Vision Transformer Autoencoder (ViTAE), and Conditional Wasserstein GAN (CWGAN). We evaluate two training strategies, single wind-direction training (SDT) and mixed wind-direction training (MDT), across sensor densities from 5 to 30, test robustness under sensor position perturbations of plus or minus 1 grid, and optimize sensor placement via Proper Orthogonal Decomposition with QR decomposition. Results show that deep learning methods can reconstruct rooftop wind fields from sparse sensor data effectively. Compared with Kriging interpolation, the deep learning models improved SSIM by up to 32.7%, FAC2 by 24.2%, and NMSE by 27.8%. Mixed wind-direction training further improved performance, with gains of up to 173.7% in SSIM, 16.7% in FAC2, and 98.3% in MG compared with single-direction training. The results also show that sensor configuration, optimization, and training strategy should be considered jointly for reliable deployment. QR-based optimization improved robustness by up to 27.8% under sensor perturbations, although with metric-dependent trade-offs. Training on experimental rather than simulated data also provides practical guidance for method selection and sensor placement in different scenarios.
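For readers unfamiliar with the metrics named above, the sketch below computes FAC2 and NMSE as they are commonly defined in wind and dispersion model evaluation. The abstract does not give its formulas, so these definitions are assumptions, and the function names and sample values are illustrative only.

```python
def fac2(observed, predicted):
    """Fraction of points with 0.5 <= predicted / observed <= 2.

    Points with a non-positive observation are skipped (an assumed convention).
    """
    pairs = [(o, p) for o, p in zip(observed, predicted) if o > 0]
    hits = sum(1 for o, p in pairs if 0.5 <= p / o <= 2.0)
    return hits / len(pairs)


def nmse(observed, predicted):
    """Normalized mean squared error: mean((o - p)^2) / (mean(o) * mean(p))."""
    n = len(observed)
    mse = sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n
    return mse / ((sum(observed) / n) * (sum(predicted) / n))


# Made-up wind speeds at four grid points, for illustration only.
obs = [1.0, 2.0, 4.0, 4.0]
pred = [1.0, 5.0, 2.0, 4.0]
print(fac2(obs, pred), nmse(obs, pred))
```

A perfect reconstruction gives FAC2 = 1 and NMSE = 0, so higher is better for the former and lower is better for the latter.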

IV Aug 5, 2023
Dynamic Dual-Graph Fusion Convolutional Network For Alzheimer's Disease Diagnosis

Fanshi Li, Zhihui Wang, Yifan Guo et al.

In this paper, a dynamic dual-graph fusion convolutional network is proposed to improve Alzheimer's disease (AD) diagnosis performance. The paper's main contributions are as follows: (a) we propose a novel dynamic GCN architecture, which is an end-to-end pipeline for the AD diagnosis task; (b) the proposed architecture can dynamically adjust the graph structure for GCN to produce better diagnosis outcomes by learning the optimal underlying latent graph; (c) we incorporate feature graph learning and dynamic graph learning, giving the useful features of subjects more weight while decreasing the weights of noisy features. Experiments indicate that our model provides flexibility and stability while achieving excellent classification results in AD diagnosis.

CV Aug 7, 2024
Joint PET-MRI Reconstruction with Diffusion Stochastic Differential Model

Taofeng Xie, Zhuoxu Cui, Congcong Liu et al.

PET suffers from a low signal-to-noise ratio, while the k-space data acquisition process in MRI is time-consuming in PET-MRI systems. We aim to accelerate MRI and improve PET image quality. This paper proposes a novel joint reconstruction model based on diffusion stochastic differential equations that learns the joint probability distribution of PET and MRI. The results underscore the qualitative and quantitative improvements our model brings to PET and MRI reconstruction, surpassing current state-of-the-art methods. Joint PET-MRI reconstruction is a challenge in PET-MRI systems. This study focuses on a relationship between the two modalities that extends beyond edges: PET is generated from MRI by learning their joint probability distribution as that relationship.

CR Sep 25, 2024
A Survey for Deep Reinforcement Learning Based Network Intrusion Detection

Wanrong Yang, Alberto Acuto, Yihang Zhou et al.

Cyber-attacks are becoming increasingly sophisticated and frequent, highlighting the importance of network intrusion detection systems. This paper explores the potential and challenges of using deep reinforcement learning (DRL) in network intrusion detection. It begins by introducing key DRL concepts and frameworks, such as deep Q-networks and actor-critic algorithms, and reviews recent research utilizing DRL for intrusion detection. The study evaluates challenges related to model training efficiency, detection of minority and unknown class attacks, feature selection, and handling unbalanced datasets. The performance of DRL models is comprehensively analyzed, showing that while DRL holds promise, many recent technologies remain underexplored. Some DRL models achieve state-of-the-art results on public datasets, occasionally outperforming traditional deep learning methods. The paper concludes with recommendations for enhancing DRL deployment and testing in real-world network scenarios, with a focus on Internet of Things intrusion detection. It discusses recent DRL architectures and suggests future policy functions for DRL-based intrusion detection. Finally, the paper proposes integrating DRL with generative methods to further improve performance, addressing current gaps and supporting more robust and adaptive network intrusion detection systems.

BM Oct 29, 2025 Code
EnzyControl: Adding Functional and Substrate-Specific Control for Enzyme Backbone Generation

Chao Song, Zhiyuan Liu, Han Huang et al.

Designing enzyme backbones with substrate-specific functionality is a critical challenge in computational protein engineering. Current generative models excel in protein design but face limitations in binding data, substrate-specific control, and flexibility for de novo enzyme backbone generation. To address this, we introduce EnzyBind, a dataset with 11,100 experimentally validated enzyme-substrate pairs specifically curated from PDBbind. Building on this, we propose EnzyControl, a method that enables functional and substrate-specific control in enzyme backbone generation. Our approach generates enzyme backbones conditioned on MSA-annotated catalytic sites and their corresponding substrates, which are automatically extracted from curated enzyme-substrate data. At the core of EnzyControl is EnzyAdapter, a lightweight, modular component integrated into a pretrained motif-scaffolding model, allowing it to become substrate-aware. A two-stage training paradigm further refines the model's ability to generate accurate and functional enzyme structures. Experiments show that our EnzyControl achieves the best performance across structural and functional metrics on EnzyBind and EnzyBench benchmarks, with particularly notable improvements of 13% in designability and 13% in catalytic efficiency compared to the baseline models. The code is released at https://github.com/Vecteur-libre/EnzyControl.

CV Feb 26, 2024
COMAE: COMprehensive Attribute Exploration for Zero-shot Hashing

Yuqi Li, Qingqing Long, Yihang Zhou et al.

Zero-shot hashing (ZSH) has shown excellent success owing to its efficiency and generalization in large-scale retrieval scenarios. While considerable success has been achieved, pressing limitations remain. Existing works ignore the locality relationships of representations and attributes, which transfer effectively between seen and unseen classes. Also, the continuous-value attributes are not fully harnessed. In response, we conduct a COMprehensive Attribute Exploration for ZSH, named COMAE, which depicts the relationships from seen classes to unseen ones through three meticulously designed explorations, i.e., point-wise, pair-wise and class-wise consistency constraints. By regressing attributes from the proposed attribute prototype network, COMAE learns the local features that are relevant to the visual attributes. Then COMAE utilizes contrastive learning to comprehensively depict the context of attributes, rather than instance-independent optimization. Finally, the class-wise constraint is designed to cohesively learn the hash code, image representation, and visual attributes more effectively. Experimental results on the popular ZSH datasets demonstrate that COMAE outperforms state-of-the-art hashing techniques, especially in scenarios with a larger number of unseen label classes.

CV Nov 16, 2022
Yield Evaluation of Citrus Fruits based on the YoloV5 compressed by Knowledge Distillation

Yuqi Li, Yuting He, Yihang Zhou et al.

In the field of planting fruit trees, pre-harvest estimation of fruit yield is important for fruit storage and price evaluation. However, considering the cost, the yield of each tree cannot be assessed by directly picking the immature fruit, which makes yield estimation a difficult task. In this paper, a fruit counting and yield assessment method based on computer vision is proposed, using citrus fruit trees as an example. First, images of single fruit trees from different angles are acquired and the number of fruits is detected using a deep Convolutional Neural Network model, YOLOv5, which is compressed using a knowledge distillation method. Then, a linear regression method is used to model yield-related features and evaluate yield. Experiments show that the proposed method can accurately count fruits and approximate the yield.
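The regression stage can be pictured with a one-feature least-squares fit from detected fruit count to yield. This is only a sketch: the abstract does not say which yield-related features are used, so the single `count` feature, the function name, and the numbers below are assumptions made up for illustration.

```python
def fit_line(x, y):
    """Ordinary least squares for y ~ slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up example: detected fruit counts per tree and measured yield in kg.
counts = [10, 20, 30]
yields_kg = [2.0, 4.0, 6.0]
slope, intercept = fit_line(counts, yields_kg)

# Predict the yield of a new tree from its detected fruit count.
print(slope * 25 + intercept)
```

In practice the counts would come from the detector's output on images of each tree, and more features could be added with a multivariate fit.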

CV Oct 31, 2024
Image Synthesis with Class-Aware Semantic Diffusion Models for Surgical Scene Segmentation

Yihang Zhou, Rebecca Towning, Zaid Awad et al.

Surgical scene segmentation is essential for enhancing surgical precision, yet it is frequently compromised by the scarcity and imbalance of available data. To address these challenges, semantic image synthesis methods based on generative adversarial networks and diffusion models have been developed. However, these models often yield non-diverse images and fail to capture small, critical tissue classes, limiting their effectiveness. In response, we propose the Class-Aware Semantic Diffusion Model (CASDM), a novel approach which utilizes segmentation maps as conditions for image synthesis to tackle data scarcity and imbalance. Novel class-aware mean squared error and class-aware self-perceptual loss functions have been defined to prioritize critical, less visible classes, thereby enhancing image quality and relevance. Furthermore, to our knowledge, we are the first to generate multi-class segmentation maps using text prompts in a novel fashion to specify their contents. These maps are then used by CASDM to generate surgical scene images, enhancing datasets for training and validating segmentation models. Our evaluation, which assesses both image quality and downstream segmentation performance, demonstrates the strong effectiveness and generalisability of CASDM in producing realistic image-map pairs, significantly advancing surgical scene segmentation across diverse and challenging datasets.

CV Mar 18, 2025
Comparative and Interpretative Analysis of CNN and Transformer Models in Predicting Wildfire Spread Using Remote Sensing Data

Yihang Zhou, Ruige Kong, Zhengsen Xu et al.

Facing the escalating threat of global wildfires, numerous computer vision techniques using remote sensing data have been applied in this area. However, the selection of deep learning methods for wildfire prediction remains uncertain due to the lack of comparative analysis in a quantitative and explainable manner, crucial for improving prevention measures and refining models. This study aims to thoroughly compare the performance, efficiency, and explainability of four prevalent deep learning architectures: Autoencoder, ResNet, UNet, and Transformer-based Swin-UNet. Employing a real-world dataset that includes nearly a decade of remote sensing data from California, U.S., these models predict the spread of wildfires for the following day. Through detailed quantitative comparison analysis, we discovered that Transformer-based Swin-UNet and UNet generally outperform Autoencoder and ResNet, particularly due to the advanced attention mechanisms in Transformer-based Swin-UNet and the efficient use of skip connections in both UNet and Transformer-based Swin-UNet, which contribute to superior predictive accuracy and model interpretability. We then applied XAI techniques to all four models, which not only enhances the clarity and trustworthiness of the models but also promotes focused improvements in wildfire prediction capabilities. The XAI analysis reveals that UNet and Transformer-based Swin-UNet are able to focus on critical features such as 'Previous Fire Mask', 'Drought', and 'Vegetation' more effectively than the other two models, while also maintaining balanced attention to the remaining features, leading to their superior performance. The insights from our thorough comparative analysis offer substantial implications for future model design and also provide guidance for model selection in different scenarios.

CV Nov 21, 2025
Flow-Guided Implicit Neural Representation for Motion-Aware Dynamic MRI Reconstruction

Baoqing Li, Yuanyuan Liu, Congcong Liu et al.

Dynamic magnetic resonance imaging (dMRI) captures temporally-resolved anatomy but is often challenged by limited sampling and motion-induced artifacts. Conventional motion-compensated reconstructions typically rely on pre-estimated optical flow, which is inaccurate under undersampling and degrades reconstruction quality. In this work, we propose a novel implicit neural representation (INR) framework that jointly models both the dynamic image sequence and its underlying motion field. Specifically, one INR is employed to parameterize the spatiotemporal image content, while another INR represents the optical flow. The two are coupled via the optical flow equation, which serves as a physics-inspired regularization, in addition to a data consistency loss that enforces agreement with k-space measurements. This joint optimization enables simultaneous recovery of temporally coherent images and motion fields without requiring prior flow estimation. Experiments on dynamic cardiac MRI datasets demonstrate that the proposed method outperforms state-of-the-art motion-compensated and deep learning approaches, achieving superior reconstruction quality, accurate motion estimation, and improved temporal fidelity. These results highlight the potential of implicit joint modeling with flow-regularized constraints for advancing dMRI reconstruction.
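The "optical flow equation" is usually understood as the brightness-constancy constraint, written below for reference. The symbols are assumptions for illustration: $I(\mathbf{x}, t)$ is the image intensity given by the first INR and $\mathbf{v}(\mathbf{x}, t)$ the flow field given by the second. The abstract does not state the exact form or weighting of the regularizer, so the squared-residual penalty shown is only one plausible choice.

$$
\frac{\partial I}{\partial t} + \mathbf{v} \cdot \nabla I = 0,
\qquad
\mathcal{L}_{\text{flow}} = \left\| \frac{\partial I}{\partial t} + \mathbf{v} \cdot \nabla I \right\|_2^2
$$

Because both $I$ and $\mathbf{v}$ are continuous functions of $(\mathbf{x}, t)$ in an INR, the derivatives in the residual can be obtained by automatic differentiation rather than finite differences.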

CV Oct 8, 2025
Self-supervised Deep Unrolled Model with Implicit Neural Representation Regularization for Accelerating MRI Reconstruction

Jingran Xu, Yuanyuan Liu, Yuanbiao Yang et al.

Magnetic resonance imaging (MRI) is a vital clinical diagnostic tool, yet its application is limited by prolonged scan times. Accelerating MRI reconstruction addresses this issue by reconstructing high-fidelity MR images from undersampled k-space measurements. In recent years, deep learning-based methods have demonstrated remarkable progress. However, most methods rely on supervised learning, which requires large amounts of fully-sampled training data that are difficult to obtain. This paper proposes a novel zero-shot self-supervised reconstruction method named UnrollINR, which enables scan-specific MRI reconstruction without external training data. UnrollINR adopts a physics-guided unrolled reconstruction architecture and introduces implicit neural representation (INR) as a regularization prior to effectively constrain the solution space. This method overcomes the local bias limitation of CNNs in traditional deep unrolled methods and avoids the instability associated with relying solely on INR's implicit regularization in highly ill-posed scenarios. Consequently, UnrollINR significantly improves MRI reconstruction performance under high acceleration rates. Experimental results show that even at a high acceleration rate of 10, UnrollINR achieves superior reconstruction performance compared to supervised and self-supervised learning methods, validating its effectiveness and superiority.

CV Oct 3, 2025
HAVIR: HierArchical Vision to Image Reconstruction using CLIP-Guided Versatile Diffusion

Shiyi Zhang, Dong Liang, Hairong Zheng et al.

The reconstruction of visual information from brain activity fosters interdisciplinary integration between neuroscience and computer vision. However, existing methods still face challenges in accurately recovering highly complex visual stimuli. This difficulty stems from the characteristics of natural scenes: low-level features exhibit heterogeneity, while high-level features show semantic entanglement due to contextual overlaps. Inspired by the hierarchical representation theory of the visual cortex, we propose the HAVIR model, which separates the visual cortex into two hierarchical regions and extracts distinct features from each. Specifically, the Structural Generator extracts structural information from spatial processing voxels and converts it into latent diffusion priors, while the Semantic Extractor converts semantic processing voxels into CLIP embeddings. These components are integrated via the Versatile Diffusion model to synthesize the final image. Experimental results demonstrate that HAVIR enhances both the structural and semantic quality of reconstructions, even in complex scenes, and outperforms existing models.

CV Oct 2, 2025
NeuroSwift: A Lightweight Cross-Subject Framework for fMRI Visual Reconstruction of Complex Scenes

Shiyi Zhang, Dong Liang, Yihang Zhou

Reconstructing visual information from brain activity via computer vision technology provides an intuitive understanding of visual neural mechanisms. Despite progress in decoding fMRI data with generative models, achieving accurate cross-subject reconstruction of visual stimuli remains challenging and computationally demanding. This difficulty arises from inter-subject variability in neural representations and the brain's abstract encoding of core semantic features in complex visual inputs. To address these challenges, we propose NeuroSwift, which integrates complementary adapters via diffusion: AutoKL for low-level features and CLIP for semantics. NeuroSwift's CLIP Adapter is trained on Stable Diffusion generated images paired with COCO captions to emulate higher visual cortex encoding. For cross-subject generalization, we pretrain on one subject and then fine-tune only 17 percent of parameters (fully connected layers) for new subjects, while freezing other components. This enables state-of-the-art performance with only one hour of training per subject on lightweight GPUs (three RTX 4090), and it outperforms existing methods.

LG May 6, 2023
Synthesizing PET images from High-field and Ultra-high-field MR images Using Joint Diffusion Attention Model

Taofeng Xie, Chentao Cao, Zhuoxu Cui et al.

MRI and PET are crucial diagnostic tools for brain diseases, as they provide complementary information on brain structure and function. However, PET scanning is costly and involves radioactive exposure, resulting in a lack of PET data. Moreover, simultaneous PET and MRI at ultra-high field is currently hardly feasible. Ultra-high-field imaging has unquestionably proven valuable in both clinical and academic settings, especially in the field of cognitive neuroimaging. These factors motivate us to propose a method for synthesizing PET from high-field MRI and ultra-high-field MRI. From a statistical perspective, the joint probability distribution (JPD) is the most direct and fundamental means of portraying the correlation between PET and MRI. This paper proposes a novel joint diffusion attention model, named JDAM, which combines the joint probability distribution with an attention strategy. JDAM has a diffusion process and a sampling process. The diffusion process gradually diffuses PET to Gaussian noise by adding Gaussian noise, while MRI remains fixed. The JPD of MRI and noise-added PET is learned in the diffusion process. The sampling process is a predictor-corrector scheme in which PET images are generated from MRI using the JPD of MRI and noise-added PET. The predictor is a reverse diffusion process and the corrector is Langevin dynamics. Experimental results on the public Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset demonstrate that the proposed method outperforms state-of-the-art CycleGAN for high-field MRI (3T MRI). Finally, synthesis of PET images from ultra-high-field MRI (5T MRI and 7T MRI) is attempted, providing a possibility for ultra-high-field PET-MRI imaging.
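For orientation, a generic Langevin-dynamics corrector step in score-based sampling has the form below. This is the standard textbook update, not necessarily the exact one used in JDAM; the symbols are assumptions for illustration, with $x$ the noisy PET image, $y$ the fixed MRI image, $p_t$ the joint density at noise level $t$, and $\epsilon$ a step size.

$$
x \leftarrow x + \epsilon\, \nabla_x \log p_t(x, y) + \sqrt{2\epsilon}\, z,
\qquad z \sim \mathcal{N}(0, I)
$$

Since $y$ is held fixed, the gradient of the joint log-density with respect to $x$ equals the gradient of the conditional $\log p_t(x \mid y)$, which is how a learned joint distribution can drive MRI-conditioned PET generation.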