Zilong Li

CV
h-index36
16papers
357citations
Novelty55%
AI Score60

16 Papers

IVApr 4, 2023Code
CoreDiff: Contextual Error-Modulated Generalized Diffusion Model for Low-Dose CT Denoising and Generalization

Qi Gao, Zilong Li, Junping Zhang et al.

Low-dose computed tomography (CT) images suffer from noise and artifacts due to photon starvation and electronic noise. Recently, some works have attempted to use diffusion models to address the over-smoothness and training instability encountered by previous deep-learning-based denoising models. However, diffusion models suffer from long inference times due to the large number of sampling steps involved. Very recently, cold diffusion model generalizes classical diffusion models and has greater flexibility. Inspired by the cold diffusion, this paper presents a novel COntextual eRror-modulated gEneralized Diffusion model for low-dose CT (LDCT) denoising, termed CoreDiff. First, CoreDiff utilizes LDCT images to displace the random Gaussian noise and employs a novel mean-preserving degradation operator to mimic the physical process of CT degradation, significantly reducing sampling steps thanks to the informative LDCT images as the starting point of the sampling process. Second, to alleviate the error accumulation problem caused by the imperfect restoration operator in the sampling process, we propose a novel ContextuaL Error-modulAted Restoration Network (CLEAR-Net), which can leverage contextual information to constrain the sampling process from structural distortion and modulate time step embedding features for better alignment with the input at the next time step. Third, to rapidly generalize to a new, unseen dose level with as few resources as possible, we devise a one-shot learning framework to make CoreDiff generalize faster and better using only a single LDCT image (un)paired with NDCT. Extensive experimental results on two datasets demonstrate that our CoreDiff outperforms competing methods in denoising and generalization performance, with a clinically acceptable inference time. Source code is made available at https://github.com/qgao21/CoreDiff.

IVJul 12, 2023Code
FreeSeed: Frequency-band-aware and Self-guided Network for Sparse-view CT Reconstruction

Chenglong Ma, Zilong Li, Junping Zhang et al.

Sparse-view computed tomography (CT) is a promising solution for expediting the scanning process and mitigating radiation exposure to patients, the reconstructed images, however, contain severe streak artifacts, compromising subsequent screening and diagnosis. Recently, deep learning-based image post-processing methods along with their dual-domain counterparts have shown promising results. However, existing methods usually produce over-smoothed images with loss of details due to (1) the difficulty in accurately modeling the artifact patterns in the image domain, and (2) the equal treatment of each pixel in the loss function. To address these issues, we concentrate on the image post-processing and propose a simple yet effective FREquency-band-awarE and SElf-guidED network, termed FreeSeed, which can effectively remove artifact and recover missing detail from the contaminated sparse-view CT images. Specifically, we first propose a frequency-band-aware artifact modeling network (FreeNet), which learns artifact-related frequency-band attention in Fourier domain for better modeling the globally distributed streak artifact on the sparse-view CT images. We then introduce a self-guided artifact refinement network (SeedNet), which leverages the predicted artifact to assist FreeNet in continuing to refine the severely corrupted details. Extensive experiments demonstrate the superior performance of FreeSeed and its dual-domain counterpart over the state-of-the-art sparse-view CT reconstruction methods. Source code is made available at https://github.com/Masaaki-75/freeseed.

IVJul 24, 2022Code
Quad-Net: Quad-domain Network for CT Metal Artifact Reduction

Zilong Li, Qi Gao, Yaping Wu et al.

Metal implants and other high-density objects in patients introduce severe streaking artifacts in CT images, compromising image quality and diagnostic performance. Although various methods were developed for CT metal artifact reduction over the past decades, including the latest dual-domain deep networks, remaining metal artifacts are still clinically challenging in many cases. Here we extend the state-of-the-art dual-domain deep network approach into a quad-domain counterpart so that all the features in the sinogram, image, and their corresponding Fourier domains are synergized to eliminate metal artifacts optimally without compromising structural subtleties. Our proposed quad-domain network for MAR, referred to as Quad-Net, takes little additional computational cost since the Fourier transform is highly efficient, and works across the four receptive fields to learn both global and local features as well as their relations. Specifically, we first design a Sinogram-Fourier Restoration Network (SFR-Net) in the sinogram domain and its Fourier space to faithfully inpaint metal-corrupted traces. Then, we couple SFR-Net with an Image-Fourier Refinement Network (IFR-Net) which takes both an image and its Fourier spectrum to improve a CT image reconstructed from the SFR-Net output using cross-domain contextual information. Quad-Net is trained on clinical datasets to minimize a composite loss function. Quad-Net does not require precise metal masks, which is of great importance in clinical practice. Our experimental results demonstrate the superiority of Quad-Net over the state-of-the-art MAR methods quantitatively, visually, and statistically. The Quad-Net code is publicly available at https://github.com/longzilicart/Quad-Net.

IVAug 16, 2023Code
Learning to Distill Global Representation for Sparse-View CT

Zilong Li, Chenglong Ma, Jie Chen et al.

Sparse-view computed tomography (CT) -- using a small number of projections for tomographic reconstruction -- enables much lower radiation dose to patients and accelerated data acquisition. The reconstructed images, however, suffer from strong artifacts, greatly limiting their diagnostic value. Current trends for sparse-view CT turn to the raw data for better information recovery. The resultant dual-domain methods, nonetheless, suffer from secondary artifacts, especially in ultra-sparse view scenarios, and their generalization to other scanners/protocols is greatly limited. A crucial question arises: have the image post-processing methods reached the limit? Our answer is not yet. In this paper, we stick to image post-processing methods due to great flexibility and propose global representation (GloRe) distillation framework for sparse-view CT, termed GloReDi. First, we propose to learn GloRe with Fourier convolution, so each element in GloRe has an image-wide receptive field. Second, unlike methods that only use the full-view images for supervision, we propose to distill GloRe from intermediate-view reconstructed images that are readily available but not explored in previous literature. The success of GloRe distillation is attributed to two key components: representation directional distillation to align the GloRe directions, and band-pass-specific contrastive distillation to gain clinically important details. Extensive experiments demonstrate the superiority of the proposed GloReDi over the state-of-the-art methods, including dual-domain ones. The source code is available at https://github.com/longzilicart/GloReDi.

CVOct 15, 2023Code
LICO: Explainable Models with Language-Image Consistency

Yiming Lei, Zilong Li, Yangyang Li et al.

Interpreting the decisions of deep learning models has been actively studied since the explosion of deep neural networks. One of the most convincing interpretation approaches is salience-based visual interpretation, such as Grad-CAM, where the generation of attention maps depends merely on categorical labels. Although existing interpretation methods can provide explainable decision clues, they often yield partial correspondence between image and saliency maps due to the limited discriminative information from one-hot labels. This paper develops a Language-Image COnsistency model for explainable image classification, termed LICO, by correlating learnable linguistic prompts with corresponding visual features in a coarse-to-fine manner. Specifically, we first establish a coarse global manifold structure alignment by minimizing the distance between the distributions of image and language features. We then achieve fine-grained saliency maps by applying optimal transport (OT) theory to assign local feature maps with class-specific prompts. Extensive experimental results on eight benchmark datasets demonstrate that the proposed LICO achieves a significant improvement in generating more explainable attention maps in conjunction with existing interpretation methods such as Grad-CAM. Remarkably, LICO improves the classification performance of existing models without introducing any computational overhead during inference. Source code is made available at https://github.com/ymLeiFDU/LICO.

LGNov 21, 2022Code
From Node Interaction to Hop Interaction: New Effective and Scalable Graph Learning Paradigm

Jie Chen, Zilong Li, Yin Zhu et al.

Existing Graph Neural Networks (GNNs) follow the message-passing mechanism that conducts information interaction among nodes iteratively. While considerable progress has been made, such node interaction paradigms still have the following limitation. First, the scalability limitation precludes the broad application of GNNs in large-scale industrial settings since the node interaction among rapidly expanding neighbors incurs high computation and memory costs. Second, the over-smoothing problem restricts the discrimination ability of nodes, i.e., node representations of different classes will converge to indistinguishable after repeated node interactions. In this work, we propose a novel hop interaction paradigm to address these limitations simultaneously. The core idea is to convert the interaction target among nodes to pre-processed multi-hop features inside each node. We design a simple yet effective HopGNN framework that can easily utilize existing GNNs to achieve hop interaction. Furthermore, we propose a multi-task learning strategy with a self-supervised learning objective to enhance HopGNN. We conduct extensive experiments on 12 benchmark datasets in a wide range of domains, scales, and smoothness of graphs. Experimental results show that our methods achieve superior performance while maintaining high scalability and efficiency. The code is at https://github.com/JC-202/HopGNN.

CVApr 17, 2023
CLIP-Lung: Textual Knowledge-Guided Lung Nodule Malignancy Prediction

Yiming Lei, Zilong Li, Yan Shen et al.

Lung nodule malignancy prediction has been enhanced by advanced deep-learning techniques and effective tricks. Nevertheless, current methods are mainly trained with cross-entropy loss using one-hot categorical labels, which results in difficulty in distinguishing those nodules with closer progression labels. Interestingly, we observe that clinical text information annotated by radiologists provides us with discriminative knowledge to identify challenging samples. Drawing on the capability of the contrastive language-image pre-training (CLIP) model to learn generalized visual representations from text annotations, in this paper, we propose CLIP-Lung, a textual knowledge-guided framework for lung nodule malignancy prediction. First, CLIP-Lung introduces both class and attribute annotations into the training of the lung nodule classifier without any additional overheads in inference. Second, we designed a channel-wise conditional prompt (CCP) module to establish consistent relationships between learnable context prompts and specific feature maps. Third, we align image features with both class and attribute features via contrastive learning, rectifying false positives and false negatives in latent space. The experimental results on the benchmark LIDC-IDRI dataset have demonstrated the superiority of CLIP-Lung, both in classification performance and interpretability of attention maps.

CVJan 15, 2023
CORE: Learning Consistent Ordinal REpresentations for Image Ordinal Estimation

Yiming Lei, Zilong Li, Yangyang Li et al.

The goal of image ordinal estimation is to estimate the ordinal label of a given image with a convolutional neural network. Existing methods are mainly based on ordinal regression and particularly focus on modeling the ordinal mapping from the feature representation of the input to the ordinal label space. However, the manifold of the resultant feature representations does not maintain the intrinsic ordinal relations of interest, which hinders the effectiveness of the image ordinal estimation. Therefore, this paper proposes learning intrinsic Consistent Ordinal REpresentations (CORE) from ordinal relations residing in groundtruth labels while encouraging the feature representations to embody the ordinal low-dimensional manifold. First, we develop an ordinal totally ordered set (toset) distribution (OTD), which can (i) model the label embeddings to inherit ordinal information and measure distances between ordered labels of samples in a neighborhood, and (ii) model the feature embeddings to infer numerical magnitude with unknown ordinal information among the features of different samples. Second, through OTD, we convert the feature representations and labels into the same embedding space for better alignment, and then compute the Kullback Leibler (KL) divergence between the ordinal labels and feature representations to endow the latent space with consistent ordinal relations. Third, we optimize the KL divergence through ordinal prototype-constrained convex programming with dual decomposition; our theoretical analysis shows that we can obtain the optimal solutions via gradient backpropagation. Extensive experimental results demonstrate that the proposed CORE can accurately construct an ordinal latent space and significantly enhance existing deep ordinal regression methods to achieve better results.

CLNov 7, 2025Code
Translation via Annotation: A Computational Study of Translating Classical Chinese into Japanese

Zilong Li, Jie Cao

Ancient people translated classical Chinese into Japanese by annotating around each character. We abstract this process as sequence tagging tasks and fit them into modern language technologies. The research of this annotation and translation system is a facing low-resource problem. We release this problem by introducing a LLM-based annotation pipeline and construct a new dataset from digitalized open-source translation data. We show that under the low-resource setting, introducing auxiliary Chinese NLP tasks has a promoting effect on the training of sequence tagging tasks. We also evaluate the performance of large language models. They achieve high scores in direct machine translation, but they are confused when being asked to annotate characters. Our method could work as a supplement of LLMs.

CVDec 8, 2023Code
Prompt-In-Prompt Learning for Universal Image Restoration

Zilong Li, Yiming Lei, Chenglong Ma et al.

Image restoration, which aims to retrieve and enhance degraded images, is fundamental across a wide range of applications. While conventional deep learning approaches have notably improved the image quality across various tasks, they still suffer from (i) the high storage cost needed for various task-specific models and (ii) the lack of interactivity and flexibility, hindering their wider application. Drawing inspiration from the pronounced success of prompts in both linguistic and visual domains, we propose novel Prompt-In-Prompt learning for universal image restoration, named PIP. First, we present two novel prompts, a degradation-aware prompt to encode high-level degradation knowledge and a basic restoration prompt to provide essential low-level information. Second, we devise a novel prompt-to-prompt interaction module to fuse these two prompts into a universal restoration prompt. Third, we introduce a selective prompt-to-feature interaction module to modulate the degradation-related feature. By doing so, the resultant PIP works as a plug-and-play module to enhance existing restoration models for universal image restoration. Extensive experimental results demonstrate the superior performance of PIP on multiple restoration tasks, including image denoising, deraining, dehazing, deblurring, and low-light enhancement. Remarkably, PIP is interpretable, flexible, efficient, and easy-to-use, showing promising potential for real-world applications. The code is available at https://github.com/longzilicart/pip_universal.

CVMay 17
GraphMAR: Geometry-Aware Graph Learning Framework for Spatially Adaptive CT Metal Artifact Reduction

Zilong Li, Chenglong Ma, Yiming Lei et al.

Computed tomography (CT) metal artifact reduction (MAR) aims to reduce the severe streaking artifacts induced by metallic implants and other high-density objects. Effective MAR generally requires both accurate artifact localization and artifact removal. Sinogram-domain methods can exploit explicit geometric cues, such as metal traces, to identify metal-corrupted measurements, while requiring raw projection data, which is often unavailable in clinical and practical scenarios. Image-domain methods are more flexible and widely applicable, yet they usually lack comparable geometric guidance, limiting their ability to localize artifacts and leading to suboptimal results. To address this limitation, we propose GraphMAR, a geometry-aware learning framework for explicit artifact identification and spatially adaptive MAR in the image domain. The key idea is to introduce graph-based geometric modeling as an image-domain analogue of sinogram metal traces. Specifically, we first construct a geometric graph from the metal mask and derive a geometric density graph that coarsely localizes artifact-prone regions according to inter-implant geometry. We then design GraphMoE, a graph-routed mixture-of-experts module that builds a polar-coordinate artifact graph in feature space and adaptively routes different experts to different spatial regions for MAR. By aligning the learned routing maps with the geometric density graph, GraphMAR provides explicit and interpretable artifact localization while enabling region-adaptive artifact reduction. Experiments on both simulated and real-world datasets demonstrate that GraphMAR achieves superior MAR performance compared with existing methods. To the best of our knowledge, this is the first work to introduce graph-based modeling for CT MAR and to enable explicit artifact identification in the image domain, improving both restoration quality and interpretability.

IVMay 25, 2025Code
MedITok: A Unified Tokenizer for Medical Image Synthesis and Interpretation

Chenglong Ma, Yuanfeng Ji, Jin Ye et al.

Advanced autoregressive models have reshaped multimodal AI. However, their transformative potential in medical imaging remains largely untapped due to the absence of a unified visual tokenizer -- one capable of capturing fine-grained visual structures for faithful image reconstruction and realistic image synthesis, as well as rich semantics for accurate diagnosis and image interpretation. To this end, we present MedITok, the first unified tokenizer tailored for medical images, encoding both low-level structural details and high-level clinical semantics within a unified latent space. To balance these competing objectives, we introduce a novel two-stage training framework: a visual representation alignment stage that cold-starts the tokenizer reconstruction learning with a visual semantic constraint, followed by a textual semantic representation alignment stage that infuses detailed clinical semantics into the latent space. Trained on the meticulously collected large-scale dataset with over 30 million medical images and 2 million image-caption pairs, MedITok achieves state-of-the-art performance on more than 30 datasets across 9 imaging modalities and 4 different tasks. By providing a unified token space for autoregressive modeling, MedITok supports a wide range of tasks in clinical diagnostics and generative healthcare applications. Model and code will be made publicly available at: https://github.com/Masaaki-75/meditok.

IVJan 26, 2025Code
Radiologist-in-the-Loop Self-Training for Generalizable CT Metal Artifact Reduction

Chenglong Ma, Zilong Li, Yuanlin Li et al.

Metal artifacts in computed tomography (CT) images can significantly degrade image quality and impede accurate diagnosis. Supervised metal artifact reduction (MAR) methods, trained using simulated datasets, often struggle to perform well on real clinical CT images due to a substantial domain gap. Although state-of-the-art semi-supervised methods use pseudo ground-truths generated by a prior network to mitigate this issue, their reliance on a fixed prior limits both the quality and quantity of these pseudo ground-truths, introducing confirmation bias and reducing clinical applicability. To address these limitations, we propose a novel Radiologist-In-the-loop SElf-training framework for MAR, termed RISE-MAR, which can integrate radiologists' feedback into the semi-supervised learning process, progressively improving the quality and quantity of pseudo ground-truths for enhanced generalization on real clinical CT images. For quality assurance, we introduce a clinical quality assessor model that emulates radiologist evaluations, effectively selecting high-quality pseudo ground-truths for semi-supervised training. For quantity assurance, our self-training framework iteratively generates additional high-quality pseudo ground-truths, expanding the clinical dataset and further improving model generalization. Extensive experimental results on multiple clinical datasets demonstrate the superior generalization performance of our RISE-MAR over state-of-the-art methods, advancing the development of MAR models for practical application. Code is available at https://github.com/Masaaki-75/rise-mar.

CVAug 24, 2025Code
FoundDiff: Foundational Diffusion Model for Generalizable Low-Dose CT Denoising

Zhihao Chen, Qi Gao, Zilong Li et al.

Low-dose computed tomography (CT) denoising is crucial for reduced radiation exposure while ensuring diagnostically acceptable image quality. Despite significant advancements driven by deep learning (DL) in recent years, existing DL-based methods, typically trained on a specific dose level and anatomical region, struggle to handle diverse noise characteristics and anatomical heterogeneity during varied scanning conditions, limiting their generalizability and robustness in clinical scenarios. In this paper, we propose FoundDiff, a foundational diffusion model for unified and generalizable LDCT denoising across various dose levels and anatomical regions. FoundDiff employs a two-stage strategy: (i) dose-anatomy perception and (ii) adaptive denoising. First, we develop a dose- and anatomy-aware contrastive language image pre-training model (DA-CLIP) to achieve robust dose and anatomy perception by leveraging specialized contrastive learning strategies to learn continuous representations that quantify ordinal dose variations and identify salient anatomical regions. Second, we design a dose- and anatomy-aware diffusion model (DA-Diff) to perform adaptive and generalizable denoising by synergistically integrating the learned dose and anatomy embeddings from DACLIP into diffusion process via a novel dose and anatomy conditional block (DACB) based on Mamba. Extensive experiments on two public LDCT datasets encompassing eight dose levels and three anatomical regions demonstrate superior denoising performance of FoundDiff over existing state-of-the-art methods and the remarkable generalization to unseen dose levels. The codes and models are available at https://github.com/hao1635/FoundDiff.

CVApr 1
PHASOR: Anatomy- and Phase-Consistent Volumetric Diffusion for CT Virtual Contrast Enhancement

Zilong Li, Dongyang Li, Chenglong Ma et al.

Contrast-enhanced computed tomography (CECT) is pivotal for highlighting tissue perfusion and vascularity, yet its clinical ubiquity is impeded by the invasive nature of contrast agents and radiation risks. While virtual contrast enhancement (VCE) offers an alternative to synthesizing CECT from non-contrast CT (NCCT), existing methods struggle with anatomical heterogeneity and spatial misalignment, leading to inconsistent enhancement patterns and incorrect details. This paper introduces PHASOR, a volumetric diffusion framework for high-fidelity CT VCE. By treating CT volumes as coherent sequences, we leverage a video diffusion model to enhance structural coherence and volumetric accuracy. To ensure anatomy-phase consistent synthesis, we introduce two complementary modules. First, anatomy-routed mixture-of-experts (AR-MoE) anchors distinct enhancement patterns to anatomical semantics, with organ-specific memory to capture salient details. Second, intensity-phase aware representation alignment (IP-REPA) highlights intricate contrast signals while mitigating the impact of imperfect spatial alignment. Extensive experiments across three datasets demonstrate that PHASOR significantly outperforms state-of-the-art methods in both synthesis quality and enhancement accuracy.

CLOct 19, 2024
Team Ryu's Submission to SIGMORPHON 2024 Shared Task on Subword Tokenization

Zilong Li

This papers presents the submission of team Ryu to the canceled SIGMORPHON 2024 shared task on subword tokenization. My submission explores whether morphological segmentation methods can be used as a part of subword tokenizers. I adopt two approaches: the statistical segmentation method Morfessor and a transformer based sequence-to-sequence (seq2seq) segmentation model in tokenizers. The prediction results show that morphological segmentation could be as effective as commonly used subword tokenizers. Additionally, I investigate how a tokenizer's vocabulary influences the performance of language models. A tokenizer with a balanced token frequency distribution tends to work better. A balanced token vocabulary can be achieved by keeping frequent words as unique tokens.