Linkai Peng

IV
h-index 89
15 papers
88 citations
Novelty 48%
AI Score 52

15 Papers

31.2 CV Apr 20 Code
Align then Refine: Text-Guided 3D Prostate Lesion Segmentation

Cuiling Sun, Linkai Peng, Adam Murphy et al.

Automated 3D segmentation of prostate lesions from biparametric MRI (bp-MRI) is essential for reliable algorithmic analysis, but achieving high precision remains challenging. Volumetric methods must combine multiple modalities while ensuring anatomical consistency, but current models struggle to integrate cross-modal information reliably. While vision-language models (VLMs) are replacing the currently used architectural designs, they still lack the fine-grained, lesion-level semantics required for effective localized guidance. To address these limitations, we propose a new multi-encoder U-Net architecture incorporating three key innovations: (1) an alignment loss that enhances foreground text-image similarity to inject lesion semantics; (2) a heatmap loss that calibrates the similarity map and suppresses spurious background activations; and (3) a final-stage, confidence-gated multi-head cross-attention refiner that performs localized boundary edits in high-confidence regions. A phase-scheduled training regime stabilizes the optimization of these components. Our method consistently outperforms prior approaches, establishing a new state-of-the-art on the PI-CAI dataset through enhanced multi-modal fusion and localized text guidance. Our code is available at https://github.com/NUBagciLab/Prostate-Lesion-Segmentation.

IV Jul 27, 2024 Code
Optimizing Synthetic Data for Enhanced Pancreatic Tumor Segmentation

Linkai Peng, Zheyuan Zhang, Gorkem Durak et al.

Pancreatic cancer remains one of the leading causes of cancer-related mortality worldwide. Precise segmentation of pancreatic tumors from medical images is a bottleneck for effective clinical decision-making. However, achieving high accuracy is often limited by the small size and availability of real patient data for training deep learning models. Recent approaches have employed synthetic data generation to augment training datasets. While promising, these methods may not yet meet the performance benchmarks required for real-world clinical use. This study critically evaluates the limitations of existing generative-AI-based frameworks for pancreatic tumor segmentation. We conduct a series of experiments to investigate the impact of synthetic tumor size and boundary definition precision on model performance. Our findings demonstrate that: (1) strategically selecting a combination of synthetic tumor sizes is crucial for optimal segmentation outcomes, and (2) generating synthetic tumors with precise boundaries significantly improves model accuracy. These insights highlight the importance of utilizing refined synthetic data augmentation for enhancing the clinical utility of segmentation models in pancreatic cancer decision making including diagnosis, prognosis, and treatment plans. Our code will be available at https://github.com/lkpengcs/SynTumorAnalyzer.

IV Mar 9, 2022
Multi-modal Brain Tumor Segmentation via Missing Modality Synthesis and Modality-level Attention Fusion

Ziqi Huang, Li Lin, Pujin Cheng et al.

Multi-modal magnetic resonance (MR) imaging provides great potential for diagnosing and analyzing brain gliomas. In clinical scenarios, common MR sequences such as T1, T2 and FLAIR can be obtained simultaneously in a single scanning process. However, acquiring contrast-enhanced modalities such as T1ce requires additional time, cost, and injection of contrast agent. As such, it is clinically meaningful to develop a method to synthesize unavailable modalities, which can also be used as additional inputs to downstream tasks (e.g., brain tumor segmentation) to enhance performance. In this work, we propose an end-to-end framework named Modality-Level Attention Fusion Network (MAF-Net), wherein we conduct patchwise contrastive learning to extract multi-modal latent features and dynamically assign attention weights to fuse different modalities. Through extensive experiments on BraTS2020, our proposed MAF-Net is found to yield superior T1ce synthesis performance (SSIM of 0.8879 and PSNR of 22.78) and accurate brain tumor segmentation (mean Dice scores of 67.9%, 41.8% and 88.0% on segmenting the tumor core, enhancing tumor and whole tumor).
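
The PSNR figure quoted above is a standard image-quality metric. As a rough illustration only, the sketch below computes PSNR between a reference and a synthesized image given as flat lists of intensities. The function name `psnr` and the `max_value` parameter are assumptions made for this example; the abstract does not state which intensity range or implementation the authors used.

```python
import math


def psnr(reference, synthesized, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length intensity lists.

    `max_value` is the largest possible intensity (assumed 1.0 here, i.e.
    images normalized to [0, 1]); this is an assumption, not the paper's setting.
    """
    if len(reference) != len(synthesized) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    # Mean squared error over all voxels.
    mse = sum((r - s) ** 2 for r, s in zip(reference, synthesized)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    # Toy example: a constant error of 0.1 on a [0, 1] scale gives about 20 dB.
    print(psnr([0.0, 0.0, 0.0, 0.0], [0.1, 0.1, 0.1, 0.1]))
```

Higher values mean the synthesized image is closer to the reference; the value depends on the assumed intensity range, so PSNR numbers are only comparable under the same normalization.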

IV Mar 7, 2022
Student Becomes Decathlon Master in Retinal Vessel Segmentation via Dual-teacher Multi-target Domain Adaptation

Linkai Peng, Li Lin, Pujin Cheng et al.

Unsupervised domain adaptation has been proposed recently to tackle the so-called domain shift between training data and test data with different distributions. However, most existing methods focus only on single-target domain adaptation and cannot be applied to scenarios with multiple target domains. In this paper, we propose RVms, a novel unsupervised multi-target domain adaptation approach to segment retinal vessels (RVs) from multimodal and multicenter retinal images. RVms mainly consists of a style augmentation and transfer (SAT) module and a dual-teacher knowledge distillation (DTKD) module. SAT augments and clusters images into source-similar domains and source-dissimilar domains via Bezier and Fourier transformations. DTKD utilizes the augmented and transformed data to train two teachers, one for source-similar domains and the other for source-dissimilar domains. Afterwards, knowledge distillation is performed to iteratively distill different domain knowledge from the teachers to a generic student. The local relative intensity transformation is employed to characterize RVs in a domain-invariant manner and promote the generalizability of the teacher and student models. Moreover, we construct a new multimodal and multicenter vascular segmentation dataset from existing publicly available datasets, which can be used to benchmark various domain adaptation and domain generalization methods. Through extensive experiments, RVms is found to be very close to the target-trained Oracle in terms of segmenting the RVs, largely outperforming other state-of-the-art methods.

30.9 CV Apr 20
CrossPan: A Comprehensive Benchmark for Cross-Sequence Pancreas MRI Segmentation and Generalization

Linkai Peng, Cuiling Sun, Zheyuan Zhang et al.

Automatic pancreas segmentation is fundamental to abdominal MRI analysis, yet deep learning models trained on one MRI sequence often fail catastrophically when applied to another, a challenge that has received little systematic investigation. We introduce CrossPan, a multi-institutional benchmark comprising 1,386 3D scans across three routinely acquired sequences (T1-weighted, T2-weighted, and Out-of-Phase) from eight centers. Our experiments reveal three key findings. First, cross-sequence domain shifts are far more severe than cross-center variability: models achieving Dice scores above 0.85 in-domain collapse to near-zero (<0.02) when transferred across sequences. Second, state-of-the-art domain generalization methods provide negligible benefit under these physics-driven contrast inversions, whereas foundation models like MedSAM2 maintain moderate zero-shot performance through contrast-invariant shape priors. Third, semi-supervised learning offers gains only under stable intensity distributions and becomes unstable on sequences with high intra-organ variability. These results establish cross-sequence generalization, rather than model architecture or center diversity, as the primary barrier to clinically deployable pancreas MRI segmentation. Dataset and code are available at https://crosspan.netlify.app/.

CL Jul 12, 2024
Pronunciation Assessment with Multi-modal Large Language Models

Kaiqi Fu, Linkai Peng, Nan Yang et al.

Large language models (LLMs), renowned for their powerful conversational abilities, are widely recognized as exceptional tools in the field of education, particularly in the context of automated intelligent instruction systems for language learning. In this paper, we propose a scoring system based on LLMs, motivated by their positive impact on text-related scoring tasks. Specifically, the speech encoder first maps the learner's speech into contextual features. The adapter layer then transforms these features to align with the text embedding in latent space. The assessment task-specific prefix and prompt text are embedded and concatenated with the features generated by the modality adapter layer, enabling the LLMs to predict accuracy and fluency scores. Our experiments demonstrate that the proposed scoring systems achieve competitive results compared to the baselines on the Speechocean762 dataset. Moreover, we conducted an ablation study to better understand the contributions of the prompt text and training strategy in the proposed scoring system.

SD Jun 15, 2022
Text-Aware End-to-end Mispronunciation Detection and Diagnosis

Linkai Peng, Yingming Gao, Binghuai Lin et al.

Mispronunciation detection and diagnosis (MDD) technology is a key component of computer-assisted pronunciation training (CAPT) systems. In the field of assessing the pronunciation quality of constrained speech, the given transcriptions can play the role of a teacher. Conventional methods have fully utilized the prior texts for model construction or for improving system performance, e.g. forced alignment and extended recognition networks. Recently, some end-to-end methods have attempted to incorporate the prior texts into model training and have preliminarily shown their effectiveness. However, previous studies mostly apply a raw attention mechanism to fuse audio representations with text representations, without taking possible text-pronunciation mismatch into account. In this paper, we present a gating strategy that assigns more importance to the relevant audio features while suppressing irrelevant text information. Moreover, given the transcriptions, we design an extra contrastive loss to reduce the gap between the learning objective of phoneme recognition and MDD. We conducted experiments using two publicly available datasets (TIMIT and L2-Arctic) and our best model improved the F1 score from 57.51% to 61.75% compared to the baselines. In addition, we provide a detailed analysis to shed light on the effectiveness of the gating mechanism and contrastive learning on MDD.
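
The abstract does not spell out the form of the gate, so the following is only a generic sketch of gated audio-text fusion, not the authors' model. It assumes a scalar sigmoid gate computed from the concatenated audio and text vectors; the names `gated_fusion`, `gate_weights`, and `gate_bias` are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(audio, text, gate_weights, gate_bias=0.0):
    """Fuse one audio frame with its attended text vector through a scalar gate.

    Assumed form (illustrative only): gate = sigmoid(w . [audio; text] + b),
    fused = audio + gate * text. A gate near 0 suppresses the text
    contribution, e.g. when the transcript does not match what was spoken.
    """
    if len(audio) != len(text):
        raise ValueError("audio and text vectors must have the same length")
    if len(gate_weights) != 2 * len(audio):
        raise ValueError("gate_weights must cover the concatenated vector")
    concat = list(audio) + list(text)
    gate = sigmoid(sum(w * x for w, x in zip(gate_weights, concat)) + gate_bias)
    fused = [a + gate * t for a, t in zip(audio, text)]
    return fused, gate


if __name__ == "__main__":
    # With zero weights and zero bias the gate is exactly 0.5.
    fused, gate = gated_fusion([1.0, 2.0], [2.0, 4.0], [0.0] * 4)
    print(fused, gate)
```

In a trained model the gate parameters would be learned jointly with the rest of the network; here they are fixed by hand purely to show the data flow.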

CL Aug 28, 2023
Spoken Language Intelligence of Large Language Models for Language Learning

Linkai Peng, Baorian Nuchged, Yingming Gao

People have long hoped for a conversational system that can assist in real-life situations, and recent progress on large language models (LLMs) is bringing this idea closer to reality. While LLMs are often impressive in performance, their efficacy in real-world scenarios that demand expert knowledge remains unclear. LLMs are believed to hold the most potential and value in education, especially in the development of artificial intelligence (AI)-based virtual teachers capable of facilitating language learning. We focus on evaluating the efficacy of LLMs in education, specifically in the areas of spoken language learning, which encompass phonetics, phonology, and second language acquisition. We introduce a new multiple-choice question dataset to evaluate the effectiveness of LLMs in the aforementioned scenarios, including understanding and application of spoken language knowledge. In addition, we investigate the influence of various prompting techniques such as zero- and few-shot methods (prepending the question with question-answer exemplars), chain-of-thought (CoT, think step-by-step), in-domain exemplars and external tools (Google, Wikipedia). We conducted a large-scale evaluation of popular LLMs (20 distinct models) using these methods. We achieved significant performance improvements compared to the zero-shot baseline in practical question reasoning (GPT-3.5, 49.1% -> 63.1%; LLaMA2-70B-Chat, 42.2% -> 48.6%). We found that models of different sizes have a good understanding of concepts in phonetics, phonology, and second language acquisition, but show limitations in reasoning for real-world problems. Additionally, we explore preliminary findings on conversational communication.
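
The prompting variants named above (zero-shot, few-shot with prepended question-answer exemplars, and chain-of-thought) can be sketched as plain string construction. The template below is an assumption for illustration: the function name `build_prompt`, the "Question:/Answer:" layout, and the sample phonetics question are all made up here and are not taken from the paper's dataset or prompts.

```python
def build_prompt(question, options, exemplars=(), chain_of_thought=False):
    """Build a multiple-choice prompt.

    exemplars: iterable of (question_text, answer_text) pairs prepended to the
    target question; an empty iterable gives a zero-shot prompt.
    chain_of_thought: if True, end with a "think step by step" cue.
    """
    lines = []
    for ex_question, ex_answer in exemplars:
        lines.append(f"Question: {ex_question}")
        lines.append(f"Answer: {ex_answer}")
        lines.append("")  # blank line between exemplars
    lines.append(f"Question: {question}")
    for label, option in zip("ABCD", options):
        lines.append(f"{label}. {option}")
    lines.append("Answer: Let's think step by step." if chain_of_thought else "Answer:")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical exemplar and question, written for this sketch only.
    shots = [("Which of /p/, /b/, /t/, /d/ is a voiced bilabial stop?", "/b/")]
    print(build_prompt(
        "Which sound is a voiceless alveolar stop?",
        ["/p/", "/b/", "/t/", "/d/"],
        exemplars=shots,
        chain_of_thought=True,
    ))
```

Keeping the exemplars as data rather than hard-coding them makes it easy to swap in-domain exemplars for generic ones and compare the two settings.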

CV Apr 12, 2025 Code
VideoAds for Fast-Paced Video Understanding

Zheyuan Zhang, Monica Dou, Linkai Peng et al.

Advertisement videos serve as a rich and valuable source of purpose-driven information, encompassing high-quality visual, textual, and contextual cues designed to engage viewers. They are often more complex than general videos of similar duration due to their structured narratives and rapid scene transitions, posing significant challenges to multi-modal large language models (MLLMs). In this work, we introduce VideoAds, the first dataset tailored for benchmarking the performance of MLLMs on advertisement videos. VideoAds comprises well-curated advertisement videos with complex temporal structures, accompanied by manually annotated diverse questions across three core tasks: visual finding, video summary, and visual reasoning. We propose a quantitative measure to compare VideoAds against existing benchmarks in terms of video complexity. Through extensive experiments, we find that Qwen2.5-VL-72B, an open-source MLLM, achieves 73.35% accuracy on VideoAds, outperforming GPT-4o (66.82%) and Gemini-1.5 Pro (69.66%); the two proprietary models especially fall behind the open-source model in video summarization and reasoning, but perform the best in visual finding. Notably, human experts easily achieve a remarkable accuracy of 94.27%. These results underscore the necessity of advancing MLLMs' temporal modeling capabilities and highlight VideoAds as a potentially pivotal benchmark for future research in understanding video that requires high FPS sampling. The dataset and evaluation code will be publicly available at https://videoadsbenchmark.netlify.app.

IV May 21, 2025 Code
TAGS: 3D Tumor-Adaptive Guidance for SAM

Sirui Li, Linkai Peng, Zheyuan Zhang et al.

Foundation models (FMs) such as CLIP and SAM have recently shown great promise in image segmentation tasks, yet their adaptation to 3D medical imaging, particularly for pathology detection and segmentation, remains underexplored. A critical challenge arises from the domain gap between natural images and medical volumes: existing FMs, pre-trained on 2D data, struggle to capture 3D anatomical context, limiting their utility in clinical applications like tumor segmentation. To address this, we propose an adaptation framework called TAGS: Tumor Adaptive Guidance for SAM, which unlocks 2D FMs for 3D medical tasks through multi-prompt fusion. By preserving most of the pre-trained weights, our approach enhances SAM's spatial feature extraction using CLIP's semantic insights and anatomy-specific prompts. Extensive experiments on three open-source tumor segmentation datasets prove that our model surpasses the state-of-the-art medical image segmentation models (+46.88% over nnUNet), interactive segmentation frameworks, and other established medical FMs, including SAM-Med2D, SAM-Med3D, SegVol, Universal, 3D-Adapter, and SAM-B (at least +13% over them). This highlights the robustness and adaptability of our proposed framework across diverse medical segmentation tasks.

IV May 20, 2024 Code
Large-Scale Multi-Center CT and MRI Segmentation of Pancreas with Deep Learning

Zheyuan Zhang, Elif Keles, Gorkem Durak et al.

Automated volumetric segmentation of the pancreas on cross-sectional imaging is needed for diagnosis and follow-up of pancreatic diseases. While CT-based pancreatic segmentation is more established, MRI-based segmentation methods are understudied, largely due to a lack of publicly available datasets, benchmarking research efforts, and domain-specific deep learning methods. In this retrospective study, we collected a large dataset (767 scans from 499 participants) of T1-weighted (T1W) and T2-weighted (T2W) abdominal MRI series from five centers between March 2004 and November 2022. We also collected CT scans of 1,350 patients from publicly available sources for benchmarking purposes. We developed a new pancreas segmentation method, called PanSegNet, combining the strengths of nnUNet and a Transformer network with a new linear attention module enabling volumetric computation. We tested PanSegNet's accuracy in cross-modality (a total of 2,117 scans) and cross-center settings with Dice and Hausdorff distance (HD95) evaluation metrics. We used Cohen's kappa statistics for intra- and inter-rater agreement evaluation and paired t-tests for volume and Dice comparisons, respectively. For segmentation accuracy, we achieved Dice coefficients of 88.3% (std: 7.2%, at case level) with CT, 85.0% (std: 7.9%) with T1W MRI, and 86.3% (std: 6.4%) with T2W MRI. There was a high correlation for pancreas volume prediction with R^2 of 0.91, 0.84, and 0.85 for CT, T1W, and T2W, respectively. We found moderate inter-observer (0.624 and 0.638 for T1W and T2W MRI, respectively) and high intra-observer agreement scores. All MRI data are available at https://osf.io/kysnj/. Our source code is available at https://github.com/NUBagciLab/PaNSegNet.
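
For readers unfamiliar with the Dice coefficient reported above, a minimal sketch follows. It uses the standard definition 2|A ∩ B| / (|A| + |B|) on binary masks flattened to lists; the function name `dice_coefficient` and the convention of returning 1.0 when both masks are empty are choices made for this example, not details taken from the paper's evaluation code.

```python
def dice_coefficient(pred, truth):
    """Dice overlap between two binary masks given as equal-length sequences.

    Any truthy value counts as foreground. Returns a value in [0, 1];
    two empty masks are treated as a perfect match (a convention assumed here).
    """
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


if __name__ == "__main__":
    # One overlapping voxel, two predicted, one true: 2 * 1 / (2 + 1).
    print(dice_coefficient([1, 1, 0, 0], [1, 0, 0, 0]))
```

A 3D volume can be passed by flattening it first; case-level scores such as those in the abstract are then obtained by computing one value per scan and averaging.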

CV Dec 11, 2022 Code
YoloCurvSeg: You Only Label One Noisy Skeleton for Vessel-style Curvilinear Structure Segmentation

Li Lin, Linkai Peng, Huaqing He et al.

Weakly-supervised learning (WSL) has been proposed to alleviate the conflict between data annotation cost and model performance through employing sparsely-grained (i.e., point-, box-, scribble-wise) supervision and has shown promising performance, particularly in the image segmentation field. However, it is still a very challenging task due to the limited supervision, especially when only a small number of labeled samples are available. Additionally, almost all existing WSL segmentation methods are designed for star-convex structures, which are very different from curvilinear structures such as vessels and nerves. In this paper, we propose a novel sparsely annotated segmentation framework for curvilinear structures, named YoloCurvSeg. An essential component of YoloCurvSeg is image synthesis. Specifically, a background generator delivers image backgrounds that closely match the real distributions through inpainting dilated skeletons. The extracted backgrounds are then combined with randomly emulated curves generated by a Space Colonization Algorithm-based foreground generator and through a multilayer patch-wise contrastive learning synthesizer. In this way, a synthetic dataset with both images and curve segmentation labels is obtained, at the cost of only one or a few noisy skeleton annotations. Finally, a segmenter is trained with the generated dataset and possibly an unlabeled dataset. The proposed YoloCurvSeg is evaluated on four publicly available datasets (OCTA500, CORN, DRIVE and CHASEDB1) and the results show that YoloCurvSeg outperforms state-of-the-art WSL segmentation methods by large margins. With only one noisy skeleton annotation (respectively 0.14%, 0.03%, 1.40%, and 0.65% of the full annotation), YoloCurvSeg achieves more than 97% of the fully-supervised performance on each dataset. Code and datasets will be released at https://github.com/llmir/YoloCurvSeg.

CV Jun 18, 2025
Pediatric Pancreas Segmentation from MRI Scans with Deep Learning

Elif Keles, Merve Yazol, Gorkem Durak et al.

Objective: Our study aimed to evaluate and validate PanSegNet, a deep learning (DL) algorithm for pediatric pancreas segmentation on MRI in children with acute pancreatitis (AP), chronic pancreatitis (CP), and healthy controls. Methods: With IRB approval, we retrospectively collected 84 MRI scans (1.5T/3T Siemens Aera/Verio) from children aged 2-19 years at Gazi University (2015-2024). The dataset includes healthy children as well as patients diagnosed with AP or CP based on clinical criteria. Pediatric and general radiologists manually segmented the pancreas, and the segmentations were then confirmed by a senior pediatric radiologist. PanSegNet-generated segmentations were assessed using Dice Similarity Coefficient (DSC) and 95th percentile Hausdorff distance (HD95). Cohen's kappa measured observer agreement. Results: Pancreas MRI T2W scans were obtained from 42 children with AP/CP (mean age: 11.73 +/- 3.9 years) and 42 healthy children (mean age: 11.19 +/- 4.88 years). PanSegNet achieved DSC scores of 88% (controls), 81% (AP), and 80% (CP), with HD95 values of 3.98 mm (controls), 9.85 mm (AP), and 15.67 mm (CP). Inter-observer kappa was 0.86 (controls) and 0.82 (pancreatitis), and intra-observer agreement reached 0.88 and 0.81. Strong agreement was observed between automated and manual volumes (R^2 = 0.85 in controls, 0.77 in diseased), demonstrating clinical reliability. Conclusion: PanSegNet represents the first validated deep learning solution for pancreatic MRI segmentation, achieving expert-level performance across healthy and diseased states. This tool, along with our annotated dataset, is freely available on GitHub and OSF, advancing accessible, radiation-free pediatric pancreatic imaging and fostering collaborative research in this underserved domain.
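
Cohen's kappa, used above for observer agreement, corrects raw agreement for the agreement expected by chance. The sketch below implements the textbook formula kappa = (p_o - p_e) / (1 - p_e) for two raters labelling the same items. How the study mapped segmentations to per-item labels is not stated in the abstract, so the per-voxel labels in the usage example are purely an assumption, as is the function name `cohens_kappa`.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters' labels over the same items."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("raters must label the same non-empty set of items")
    # Observed agreement: fraction of items with identical labels.
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    # Chance agreement from each rater's marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical per-voxel labels (1 = pancreas, 0 = background).
    print(cohens_kappa([1, 1, 1, 0], [1, 1, 0, 0]))
```

Values near 1 indicate agreement well beyond chance, and values near 0 indicate agreement no better than chance.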

IV Jul 30, 2025
Rethink Domain Generalization in Heterogeneous Sequence MRI Segmentation

Zheyuan Zhang, Linkai Peng, Wanying Dou et al.

Clinical magnetic-resonance (MR) protocols generate many T1 and T2 sequences whose appearance differs more than the acquisition sites that produce them. Existing domain-generalization benchmarks focus almost exclusively on cross-center shifts and overlook this dominant source of variability. Pancreas segmentation remains a major challenge in abdominal imaging: the gland is small, irregularly shaped, surrounded by organs and fat, and often suffers from low T1 contrast. State-of-the-art deep networks that already achieve >90% Dice on the liver or kidneys still miss 20-30% of the pancreas. The organ is also systematically under-represented in public cross-domain benchmarks, despite its clinical importance in early cancer detection, surgery, and diabetes research. To close this gap, we present PancreasDG, a large-scale multi-center 3D MRI pancreas segmentation dataset for investigating domain generalization in medical imaging. The dataset comprises 563 MRI scans from six institutions, spanning both venous phase and out-of-phase sequences, enabling study of both cross-center and cross-sequence variations with pixel-accurate pancreas masks created by a double-blind, two-pass protocol. Through comprehensive analysis, we reveal three insights: (i) limited sampling introduces significant variance that may be mistaken for distribution shifts, (ii) cross-center performance correlates with source domain performance for identical sequences, and (iii) cross-sequence shifts require specialized solutions. We also propose a semi-supervised approach that leverages anatomical invariances, significantly outperforming state-of-the-art domain generalization techniques with 61.63% Dice score improvements and 87.00% on two test centers for cross-sequence segmentation. PancreasDG sets a new benchmark for domain generalization in medical imaging. Dataset, code, and models will be available at https://pancreasdg.netlify.app.

IV Jan 13, 2022
Unsupervised Domain Adaptation for Cross-Modality Retinal Vessel Segmentation via Disentangling Representation Style Transfer and Collaborative Consistency Learning

Linkai Peng, Li Lin, Pujin Cheng et al.

Various deep learning models have been developed to segment anatomical structures from medical images, but they typically have poor performance when tested on another target domain with different data distribution. Recently, unsupervised domain adaptation methods have been proposed to alleviate this so-called domain shift issue, but most of them are designed for scenarios with relatively small domain shifts and are likely to fail when encountering a large domain gap. In this paper, we propose DCDA, a novel cross-modality unsupervised domain adaptation framework for tasks with large domain shifts, e.g., segmenting retinal vessels from OCTA and OCT images. DCDA mainly consists of a disentangling representation style transfer (DRST) module and a collaborative consistency learning (CCL) module. DRST decomposes images into content components and style codes and performs style transfer and image reconstruction. CCL contains two segmentation models, one for source domain and the other for target domain. The two models use labeled data (together with the corresponding transferred images) for supervised learning and perform collaborative consistency learning on unlabeled data. Each model focuses on the corresponding single domain and aims to yield an expertized domain-specific segmentation model. Through extensive experiments on retinal vessel segmentation, our framework achieves Dice scores close to target-trained oracle both from OCTA to OCT and from OCT to OCTA, significantly outperforming other state-of-the-art methods.