Yi Lin

CV
h-index36
85papers
2,067citations
Novelty48%
AI Score59

85 Papers

IVAug 27, 2023Code
Bi-Modality Medical Image Synthesis Using Semi-Supervised Sequential Generative Adversarial Networks

Xin Yang, Yi Lin, Zhiwei Wang et al.

In this paper, we propose a bi-modality medical image synthesis approach based on sequential generative adversarial network (GAN) and semi-supervised learning. Our approach consists of two generative modules that synthesize images of the two modalities in a sequential order. A method for measuring the synthesis complexity is proposed to automatically determine the synthesis order in our sequential GAN. Images of the modality with a lower complexity are synthesized first, and the counterparts with a higher complexity are generated later. Our sequential GAN is trained end-to-end in a semi-supervised manner. In supervised training, the joint distribution of bi-modality images are learned from real paired images of the two modalities by explicitly minimizing the reconstruction losses between the real and synthetic images. To avoid overfitting limited training images, in unsupervised training, the marginal distribution of each modality is learned based on unpaired images by minimizing the Wasserstein distance between the distributions of real and fake images. We comprehensively evaluate the proposed model using two synthesis tasks based on three types of evaluate metrics and user studies. Visual and quantitative results demonstrate the superiority of our method to the state-of-the-art methods, and reasonable visual quality and clinical significance. Code is made publicly available at https://github.com/hustlinyi/Multimodal-Medical-Image-Synthesis.

CVMar 24, 2023Code
Few Shot Medical Image Segmentation with Cross Attention Transformer

Yi Lin, Yufan Chen, Kwang-Ting Cheng et al.

Medical image segmentation has made significant progress in recent years. Deep learning-based methods are recognized as data-hungry techniques, requiring large amounts of data with manual annotations. However, manual annotation is expensive in the field of medical image analysis, which requires domain-specific expertise. To address this challenge, few-shot learning has the potential to learn new classes from only a few examples. In this work, we propose a novel framework for few-shot medical image segmentation, termed CAT-Net, based on cross masked attention Transformer. Our proposed network mines the correlations between the support image and query image, limiting them to focus only on useful foreground information and boosting the representation capacity of both the support prototype and query features. We further design an iterative refinement framework that refines the query image segmentation iteratively and promotes the support feature in turn. We validated the proposed method on three public datasets: Abd-CT, Abd-MRI, and Card-MRI. Experimental results demonstrate the superior performance of our method compared to state-of-the-art methods and the effectiveness of each component. Code: https://github.com/hust-linyi/CAT-Net.

CVAug 24, 2023Code
PromptMRG: Diagnosis-Driven Prompts for Medical Report Generation

Haibo Jin, Haoxuan Che, Yi Lin et al.

Automatic medical report generation (MRG) is of great research value as it has the potential to relieve radiologists from the heavy burden of report writing. Despite recent advancements, accurate MRG remains challenging due to the need for precise clinical understanding and disease identification. Moreover, the imbalanced distribution of diseases makes the challenge even more pronounced, as rare diseases are underrepresented in training data, making their diagnostic performance unreliable. To address these challenges, we propose diagnosis-driven prompts for medical report generation (PromptMRG), a novel framework that aims to improve the diagnostic accuracy of MRG with the guidance of diagnosis-aware prompts. Specifically, PromptMRG is based on encoder-decoder architecture with an extra disease classification branch. When generating reports, the diagnostic results from the classification branch are converted into token prompts to explicitly guide the generation process. To further improve the diagnostic accuracy, we design cross-modal feature enhancement, which retrieves similar reports from the database to assist the diagnosis of a query image by leveraging the knowledge from a pre-trained CLIP. Moreover, the disease imbalanced issue is addressed by applying an adaptive logit-adjusted loss to the classification branch based on the individual learning status of each disease, which overcomes the barrier of text decoder's inability to manipulate disease distributions. Experiments on two MRG benchmarks show the effectiveness of the proposed method, where it obtains state-of-the-art clinical efficacy performance on both datasets. The code is available at https://github.com/jhb86253817/PromptMRG.

IVMar 23, 2023Code
Boosting Convolution with Efficient MLP-Permutation for Volumetric Medical Image Segmentation

Yi Lin, Xiao Fang, Dong Zhang et al.

Recently, the advent of vision Transformer (ViT) has brought substantial advancements in 3D dataset benchmarks, particularly in 3D volumetric medical image segmentation (Vol-MedSeg). Concurrently, multi-layer perceptron (MLP) network has regained popularity among researchers due to their comparable results to ViT, albeit with the exclusion of the resource-intensive self-attention module. In this work, we propose a novel permutable hybrid network for Vol-MedSeg, named PHNet, which capitalizes on the strengths of both convolution neural networks (CNNs) and MLP. PHNet addresses the intrinsic isotropy problem of 3D volumetric data by employing a combination of 2D and 3D CNNs to extract local features. Besides, we propose an efficient multi-layer permute perceptron (MLPP) module that captures long-range dependence while preserving positional information. This is achieved through an axis decomposition operation that permutes the input tensor along different axes, thereby enabling the separate encoding of the positional information. Furthermore, MLPP tackles the resolution sensitivity issue of MLP in Vol-MedSeg with a token segmentation operation, which divides the feature into smaller tokens and processes them individually. Extensive experimental results validate that PHNet outperforms the state-of-the-art methods with lower computational costs on the widely-used yet challenging COVID-19-20 and Synapse benchmarks. The ablation study also demonstrates the effectiveness of PHNet in harnessing the strengths of both CNNs and MLP. The code is available on Github: \href{https://github.com/xiaofang007/PHNet}{https://github.com/xiaofang007/PHNet}.

CVSep 21, 2022Code
Understanding the Tricks of Deep Learning in Medical Image Segmentation: Challenges and Future Directions

Dong Zhang, Yi Lin, Hao Chen et al.

Over the past few years, the rapid development of deep learning technologies for computer vision has significantly improved the performance of medical image segmentation (MedISeg). However, the diverse implementation strategies of various models have led to an extremely complex MedISeg system, resulting in a potential problem of unfair result comparisons. In this paper, we collect a series of MedISeg tricks for different model implementation phases (i.e., pre-training model, data pre-processing, data augmentation, model implementation, model inference, and result post-processing), and experimentally explore the effectiveness of these tricks on consistent baselines. With the extensive experimental results on both the representative 2D and 3D medical image datasets, we explicitly clarify the effect of these tricks. Moreover, based on the surveyed tricks, we also open-sourced a strong MedISeg repository, where each component has the advantage of plug-and-play. We believe that this milestone work not only completes a comprehensive and complementary survey of the state-of-the-art MedISeg approaches, but also offers a practical guide for addressing the future medical image processing challenges including but not limited to small dataset, class imbalance learning, multi-modality learning, and domain adaptation. The code and training weights have been released at: https://github.com/hust-linyi/seg_trick.

LGNov 24, 2022Code
Explainable and Safe Reinforcement Learning for Autonomous Air Mobility

Lei Wang, Hongyu Yang, Yi Lin et al.

Increasing traffic demands, higher levels of automation, and communication enhancements provide novel design opportunities for future air traffic controllers (ATCs). This article presents a novel deep reinforcement learning (DRL) controller to aid conflict resolution for autonomous free flight. Although DRL has achieved important advancements in this field, the existing works pay little attention to the explainability and safety issues related to DRL controllers, particularly the safety under adversarial attacks. To address those two issues, we design a fully explainable DRL framework wherein we: 1) decompose the coupled Q value learning model into a safety-awareness and efficiency (reach the target) one; and 2) use information from surrounding intruders as inputs, eliminating the needs of central controllers. In our simulated experiments, we show that by decoupling the safety-awareness and efficiency, we can exceed performance on free flight control tasks while dramatically improving explainability on practical. In addition, the safety Q learning module provides rich information about the safety situation of environments. To study the safety under adversarial attacks, we additionally propose an adversarial attack strategy that can impose both safety-oriented and efficiency-oriented attacks. The adversarial aims to minimize safety/efficiency by only attacking the agent at a few time steps. In the experiments, our attack strategy increases as many collisions as the uniform attack (i.e., attacking at every time step) by only attacking the agent four times less often, which provide insights into the capabilities and restrictions of the DRL in future ATC designs. The source code is publicly available at https://github.com/WLeiiiii/Gym-ATC-Attack-Project.

CVSep 26, 2024Code
Revisiting Deep Ensemble Uncertainty for Enhanced Medical Anomaly Detection

Yi Gu, Yi Lin, Kwang-Ting Cheng et al.

Medical anomaly detection (AD) is crucial in pathological identification and localization. Current methods typically rely on uncertainty estimation in deep ensembles to detect anomalies, assuming that ensemble learners should agree on normal samples while exhibiting disagreement on unseen anomalies in the output space. However, these methods may suffer from inadequate disagreement on anomalies or diminished agreement on normal samples. To tackle these issues, we propose D2UE, a Diversified Dual-space Uncertainty Estimation framework for medical anomaly detection. To effectively balance agreement and disagreement for anomaly detection, we propose Redundancy-Aware Repulsion (RAR), which uses a similarity kernel that remains invariant to both isotropic scaling and orthogonal transformations, explicitly promoting diversity in learners' feature space. Moreover, to accentuate anomalous regions, we develop Dual-Space Uncertainty (DSU), which utilizes the ensemble's uncertainty in input and output spaces. In input space, we first calculate gradients of reconstruction error with respect to input images. The gradients are then integrated with reconstruction outputs to estimate uncertainty for inputs, enabling effective anomaly discrimination even when output space disagreement is minimal. We conduct a comprehensive evaluation of five medical benchmarks with different backbones. Experimental results demonstrate the superiority of our method to state-of-the-art methods and the effectiveness of each component in our framework. Our code is available at https://github.com/Rubiscol/D2UE.

LGNov 1, 2023Code
WinNet: Make Only One Convolutional Layer Effective for Time Series Forecasting

Wenjie Ou, Zhishuo Zhao, Dongyue Guo et al.

Deep learning models have recently achieved significant performance improvements in time series forecasting. We present a highly accurate and simply structured CNN-based model with only one convolutional layer, called WinNet, including (i) Sub-window Division block to transform the series into 2D tensor, (ii) Dual-Forecasting mechanism to capture the short- and long-term variations, (iii) Two-dimensional Hybrid Decomposition (TDD) block to decompose the 2D tensor into the trend and seasonal terms to eliminate the non-stationarity, and (iv) Decomposition Correlation Block (DCB) to leverage the correlation between the trend and seasonal terms by the convolution layer. Results on eight benchmark datasets demonstrate that WinNet can achieve SOTA performance and lower computational complexity over CNN-, MLP- and Transformer-based methods. The code will be available at: https://github.com/ouwen18/WinNet.

IVApr 13, 2023
Deep Learning in Breast Cancer Imaging: A Decade of Progress and Future Directions

Luyang Luo, Xi Wang, Yi Lin et al.

Breast cancer has reached the highest incidence rate worldwide among all malignancies since 2020. Breast imaging plays a significant role in early diagnosis and intervention to improve the outcome of breast cancer patients. In the past decade, deep learning has shown remarkable progress in breast cancer imaging analysis, holding great promise in interpreting the rich information and complex context of breast imaging modalities. Considering the rapid improvement in deep learning technology and the increasing severity of breast cancer, it is critical to summarize past progress and identify future challenges to be addressed. This paper provides an extensive review of deep learning-based breast cancer imaging research, covering studies on mammogram, ultrasound, magnetic resonance imaging, and digital pathology images over the past decade. The major deep learning methods and applications on imaging-based screening, diagnosis, treatment response prediction, and prognosis are elaborated and discussed. Drawn from the findings of this survey, we present a comprehensive discussion of the challenges and potential avenues for future research in deep learning-based breast cancer imaging.

CLMar 16Code
Mixture-of-Depths Attention

Lianghui Zhu, Yuxin Fang, Bencheng Liao et al.

Scaling depth is a key driver for large language models (LLMs). Yet, as LLMs become deeper, they often suffer from signal degradation: informative features formed in shallow layers are gradually diluted by repeated residual updates, making them harder to recover in deeper layers. We introduce mixture-of-depths attention (MoDA), a mechanism that allows each attention head to attend to sequence KV pairs at the current layer and depth KV pairs from preceding layers. We further describe a hardware-efficient algorithm for MoDA that resolves non-contiguous memory-access patterns, achieving 97.3% of FlashAttention-2's efficiency at a sequence length of 64K. Experiments on 1.5B-parameter models demonstrate that MoDA consistently outperforms strong baselines. Notably, it improves average perplexity by 0.2 across 10 validation benchmarks and increases average performance by 2.11% on 10 downstream tasks, with a negligible 3.7% FLOPs computational overhead. We also find that combining MoDA with post-norm yields better performance than using it with pre-norm. These results suggest that MoDA is a promising primitive for depth scaling. Code is released at https://github.com/hustvl/MoDA .

CVMar 16, 2023
Learning a Room with the Occ-SDF Hybrid: Signed Distance Function Mingled with Occupancy Aids Scene Representation

Xiaoyang Lyu, Peng Dai, Zizhang Li et al.

Implicit neural rendering, which uses signed distance function (SDF) representation with geometric priors (such as depth or surface normal), has led to impressive progress in the surface reconstruction of large-scale scenes. However, applying this method to reconstruct a room-level scene from images may miss structures in low-intensity areas or small and thin objects. We conducted experiments on three datasets to identify limitations of the original color rendering loss and priors-embedded SDF scene representation. We found that the color rendering loss results in optimization bias against low-intensity areas, causing gradient vanishing and leaving these areas unoptimized. To address this issue, we propose a feature-based color rendering loss that utilizes non-zero feature values to bring back optimization signals. Additionally, the SDF representation can be influenced by objects along a ray path, disrupting the monotonic change of SDF values when a single object is present. To counteract this, we explore using the occupancy representation, which encodes each point separately and is unaffected by objects along a querying ray. Our experimental results demonstrate that the joint forces of the feature-based rendering loss and Occ-SDF hybrid representation scheme can provide high-quality reconstruction results, especially in challenging room-level scenarios. The code would be released.

IVJun 30, 2022
InsMix: Towards Realistic Generative Data Augmentation for Nuclei Instance Segmentation

Yi Lin, Zeyu Wang, Kwang-Ting Cheng et al.

Nuclei Segmentation from histology images is a fundamental task in digital pathology analysis. However, deep-learning-based nuclei segmentation methods often suffer from limited annotations. This paper proposes a realistic data augmentation method for nuclei segmentation, named InsMix, that follows a Copy-Paste-Smooth principle and performs morphology-constrained generative instance augmentation. Specifically, we propose morphology constraints that enable the augmented images to acquire luxuriant information about nuclei while maintaining their morphology characteristics (e.g., geometry and location). To fully exploit the pixel redundancy of the background and improve the model's robustness, we further propose a background perturbation method, which randomly shuffles the background patches without disordering the original nuclei distribution. To achieve contextual consistency between original and template instances, a smooth-GAN is designed with a foreground similarity encoder (FSE) and a triplet loss. We validated the proposed method on two datasets, i.e., Kumar and CPS datasets. Experimental results demonstrate the effectiveness of each component and the superior performance achieved by our method to the state-of-the-art methods.

CVAug 26, 2022
Seg4Reg+: Consistency Learning between Spine Segmentation and Cobb Angle Regression

Yi Lin, Luyan Liu, Kai Ma et al.

Automated methods for Cobb angle estimation are of high demand for scoliosis assessment. Existing methods typically calculate the Cobb angle from landmark estimation, or simply combine the low-level task (e.g., landmark detection and spine segmentation) with the Cobb angle regression task, without fully exploring the benefits from each other. In this study, we propose a novel multi-task framework, named Seg4Reg+, which jointly optimizes the segmentation and regression networks. We thoroughly investigate both local and global consistency and knowledge transfer between each other. Specifically, we propose an attention regularization module leveraging class activation maps (CAMs) from image-segmentation pairs to discover additional supervision in the regression network, and the CAMs can serve as a region-of-interest enhancement gate to facilitate the segmentation task in turn. Meanwhile, we design a novel triangle consistency learning to train the two networks jointly for global optimization. The evaluations performed on the public AASCE Challenge dataset demonstrate the effectiveness of each module and superior performance of our model to the state-of-the-art methods.

CVMar 10, 2023
TAKT: Target-Aware Knowledge Transfer for Whole Slide Image Classification

Conghao Xiong, Yi Lin, Hao Chen et al.

Transferring knowledge from a source domain to a target domain can be crucial for whole slide image classification, since the number of samples in a dataset is often limited due to high annotation costs. However, domain shift and task discrepancy between datasets can hinder effective knowledge transfer. In this paper, we propose a Target-Aware Knowledge Transfer framework, employing a teacher-student paradigm. Our framework enables the teacher model to learn common knowledge from the source and target domains by actively incorporating unlabelled target images into the training of the teacher model. The teacher bag features are subsequently adapted to supervise the training of the student model on the target domain. Despite incorporating the target features during training, the teacher model tends to overlook them under the inherent domain shift and task discrepancy. To alleviate this, we introduce a target-aware feature alignment module to establish a transferable latent relationship between the source and target features by solving the optimal transport problem. Experimental results show that models employing knowledge transfer outperform those trained from scratch, and our method achieves state-of-the-art performance among other knowledge transfer methods on various datasets, including TCGA-RCC, TCGA-NSCLC, and Camelyon16.

LGJul 31, 2024Code
ProSpec RL: Plan Ahead, then Execute

Liangliang Liu, Yi Guan, BoRan Wang et al.

Imagining potential outcomes of actions before execution helps agents make more informed decisions, a prospective thinking ability fundamental to human cognition. However, mainstream model-free Reinforcement Learning (RL) methods lack the ability to proactively envision future scenarios, plan, and guide strategies. These methods typically rely on trial and error to adjust policy functions, aiming to maximize cumulative rewards or long-term value, even if such high-reward decisions place the environment in extremely dangerous states. To address this, we propose the Prospective (ProSpec) RL method, which makes higher-value, lower-risk optimal decisions by imagining future n-stream trajectories. Specifically, ProSpec employs a dynamic model to predict future states (termed "imagined states") based on the current state and a series of sampled actions. Furthermore, we integrate the concept of Model Predictive Control and introduce a cycle consistency constraint that allows the agent to evaluate and select the optimal actions from these trajectories. Moreover, ProSpec employs cycle consistency to mitigate two fundamental issues in RL: augmenting state reversibility to avoid irreversible events (low risk) and augmenting actions to generate numerous virtual trajectories, thereby improving data efficiency. We validated the effectiveness of our method on the DMControl benchmarks, where our approach achieved significant performance improvements. Code will be open-sourced upon acceptance.

LGJul 14, 2022
Spatiotemporal Propagation Learning for Network-Wide Flight Delay Prediction

Yuankai Wu, Hongyu Yang, Yi Lin et al.

Demystifying the delay propagation mechanisms among multiple airports is fundamental to precise and interpretable delay prediction, which is crucial during decision-making for all aviation industry stakeholders. The principal challenge lies in effectively leveraging the spatiotemporal dependencies and exogenous factors related to the delay propagation. However, previous works only consider limited spatiotemporal patterns with few factors. To promote more comprehensive propagation modeling for delay prediction, we propose SpatioTemporal Propagation Network (STPN), a space-time separable graph convolutional network, which is novel in spatiotemporal dependency capturing. From the aspect of spatial relation modeling, we propose a multi-graph convolution model considering both geographic proximity and airline schedule. From the aspect of temporal dependency capturing, we propose a multi-head self-attentional mechanism that can be learned end-to-end and explicitly reason multiple kinds of temporal dependency of delay time series. We show that the joint spatial and temporal learning models yield a sum of the Kronecker product, which factors the spatiotemporal dependence into the sum of several spatial and temporal adjacency matrices. By this means, STPN allows cross-talk of spatial and temporal factors for modeling delay propagation. Furthermore, a squeeze and excitation module is added to each layer of STPN to boost meaningful spatiotemporal features. To this end, we apply STPN to multi-step ahead arrival and departure delay prediction in large-scale airport networks. To validate the effectiveness of our model, we experiment with two real-world delay datasets, including U.S and China flight delays; and we show that STPN outperforms state-of-the-art methods. In addition, counterfactuals produced by STPN show that it learns explainable delay propagation patterns.

CVMar 23, 2023
Exploring Visual Prompts for Whole Slide Image Classification with Multiple Instance Learning

Yi Lin, Zhongchen Zhao, Zhengjie ZHU et al.

Multiple instance learning (MIL) has emerged as a popular method for classifying histopathology whole slide images (WSIs). However, existing approaches typically rely on pre-trained models from large natural image datasets, such as ImageNet, to generate instance features, which can be sub-optimal due to the significant differences between natural images and histopathology images that lead to a domain shift. In this paper, we present a novel, simple yet effective method for learning domain-specific knowledge transformation from pre-trained models to histopathology images. Our approach entails using a prompt component to assist the pre-trained model in discerning differences between the pre-trained dataset and the target histopathology dataset, resulting in improved performance of MIL models. We validate our method on two publicly available datasets, Camelyon16 and TCGA-NSCLC. Extensive experimental results demonstrate the significant performance improvement of our method for different MIL models and backbones. Upon publication of this paper, we will release the source code for our method.

AIMay 25
Towards end-to-end LLM-based censoring-aware survival analysis

Yishu Wei, Hexin Dong, Yi Lin et al.

Objective: Survival analysis is central to medical prediction, yet large language models (LLMs) are rarely used as end-to-end survival models because censoring prevents straightforward supervised fine-tuning. Here we present LLMSurvival, a framework that enables censoring-aware survival analysis with unmodified LLMs operating directly on tabular clinical data. Materials and Methods: LLMSurvival reformulates time-to-event prediction as pairwise ranking among comparable subjects, and derives test-time risk by aggregating comparisons against anchor individuals from the training cohort. Results: Across two clinical tasks (ICU mortality prediction in MIMIC-IV and fragility fracture prediction in a NewYork-Presbyterian/Weill Cornell Medicine cohort), LLMSurvival improves overall concordance over Cox proportional hazards modeling by 3.1% for ICU mortality and 0.5% for fracture risk, 2.1% on average for ICU mortality and 2.8% for fracture risk over three established deep learning survival models. Discussion: The results show that survival modeling with censoring can be made compatible with LLM fine-tuning through comparison-based reformulation. The framework demonstrates high portability and superior performance over expert curated scores like SAPS-II and FRAX scores across diverse clinical context. Furthermore, the framework supports local deployment, as compact, publicly available base models provide sufficient performance. Conclusion: The LLMSurvival framework serves as a proof of concept for an integrated, censoring-conscious approach to survival analysis via LLMs.

CVMar 22, 2023
Label-Efficient Deep Learning in Medical Image Analysis: Challenges and Future Directions

Cheng Jin, Zhengrui Guo, Yi Lin et al.

Deep learning has significantly advanced medical imaging analysis (MIA), achieving state-of-the-art performance across diverse clinical tasks. However, its success largely depends on large-scale, high-quality labeled datasets, which are costly and time-consuming to obtain due to the need for expert annotation. To mitigate this limitation, label-efficient deep learning methods have emerged to improve model performance under limited supervision by leveraging labeled, unlabeled, and weakly labeled data. In this survey, we systematically review over 350 peer-reviewed studies and present a comprehensive taxonomy of label-efficient learning methods in MIA. These methods are categorized into four labeling paradigms: no label, insufficient label, inexact label, and label refinement. For each category, we analyze representative techniques across imaging modalities and clinical applications, highlighting shared methodological principles and task-specific adaptations. We also examine the growing role of health foundation models (HFMs) in enabling label-efficient learning through large-scale pre-training and transfer learning, enhancing the use of limited annotations in downstream tasks. Finally, we identify current challenges and future directions to facilitate the translation of label-efficient learning from research promise to everyday clinical care.

CVApr 3, 2023
Disorder-invariant Implicit Neural Representation

Hao Zhu, Shaowen Xie, Zhen Liu et al.

Implicit neural representation (INR) characterizes the attributes of a signal as a function of corresponding coordinates which emerges as a sharp weapon for solving inverse problems. However, the expressive power of INR is limited by the spectral bias in the network training. In this paper, we find that such a frequency-related problem could be greatly solved by re-arranging the coordinates of the input signal, for which we propose the disorder-invariant implicit neural representation (DINER) by augmenting a hash-table to a traditional INR backbone. Given discrete signals sharing the same histogram of attributes and different arrangement orders, the hash-table could project the coordinates into the same distribution for which the mapped signal can be better modeled using the subsequent INR network, leading to significantly alleviated spectral bias. Furthermore, the expressive power of the DINER is determined by the width of the hash-table. Different width corresponds to different geometrical elements in the attribute space, \textit{e.g.}, 1D curve, 2D curved-plane and 3D curved-volume when the width is set as $1$, $2$ and $3$, respectively. More covered areas of the geometrical elements result in stronger expressive power. Experiments not only reveal the generalization of the DINER for different INR backbones (MLP vs. SIREN) and various tasks (image/video representation, phase retrieval, refractive index recovery, and neural radiance field optimization) but also show the superiority over the state-of-the-art algorithms both in quality and speed. \textit{Project page:} \url{https://ezio77.github.io/DINER-website/}

CVAug 31, 2024
Aligning Medical Images with General Knowledge from Large Language Models

Xiao Fang, Yi Lin, Dong Zhang et al.

Pre-trained large vision-language models (VLMs) like CLIP have revolutionized visual representation learning using natural language as supervisions, and demonstrated promising generalization ability. In this work, we propose ViP, a novel visual symptom-guided prompt learning framework for medical image analysis, which facilitates general knowledge transfer from CLIP. ViP consists of two key components: a visual symptom generator (VSG) and a dual-prompt network. Specifically, VSG aims to extract explicable visual symptoms from pre-trained large language models, while the dual-prompt network utilizes these visual symptoms to guide the training on two learnable prompt modules, i.e., context prompt and merge prompt, which effectively adapts our framework to medical image analysis via large VLMs. Extensive experimental results demonstrate that ViP can outperform state-of-the-art methods on two challenging datasets.

IVJul 26, 2024
Towards A Generalizable Pathology Foundation Model via Unified Knowledge Distillation

Jiabo Ma, Zhengrui Guo, Fengtao Zhou et al.

Foundation models pretrained on large-scale datasets are revolutionizing the field of computational pathology (CPath). The generalization ability of foundation models is crucial for the success in various downstream clinical tasks. However, current foundation models have only been evaluated on a limited type and number of tasks, leaving their generalization ability and overall performance unclear. To address this gap, we established a most comprehensive benchmark to evaluate the performance of off-the-shelf foundation models across six distinct clinical task types, encompassing a total of 72 specific tasks, including slide-level classification, survival prediction, ROI-tissue classification, ROI retrieval, visual question answering, and report generation. Our findings reveal that existing foundation models excel at certain task types but struggle to effectively handle the full breadth of clinical tasks. To improve the generalization of pathology foundation models, we propose a unified knowledge distillation framework consisting of both expert and self-knowledge distillation, where the former allows the model to learn from the knowledge of multiple expert models, while the latter leverages self-distillation to enable image representation learning via local-global alignment. Based on this framework, we curated a dataset of 96,000 whole slide images (WSIs) and developed a Generalizable Pathology Foundation Model (GPFM). This advanced model was trained on a substantial dataset comprising 190 million images extracted from approximately 72,000 publicly available slides, encompassing 34 major tissue types. Evaluated on the established benchmark, GPFM achieves an impressive average rank of 1.6, with 42 tasks ranked 1st, while the second-best model, UNI, attains an average rank of 3.7, with only 6 tasks ranked 1st.

CVJun 5, 2022
Accurate Scoliosis Vertebral Landmark Localization on X-ray Images via Shape-constrained Multi-stage Cascaded CNNs

Zhiwei Wang, Jinxin Lv, Yunqiao Yang et al.

Vertebral landmark localization is a crucial step for variant spine-related clinical applications, which requires detecting the corner points of 17 vertebrae. However, the neighbor landmarks often disturb each other for the homogeneous appearance of vertebrae, which makes vertebral landmark localization extremely difficult. In this paper, we propose multi-stage cascaded convolutional neural networks (CNNs) to split the single task into two sequential steps, i.e., center point localization to roughly locate 17 center points of vertebrae, and corner point localization to find 4 corner points for each vertebra without distracted by others. Landmarks in each step are located gradually from a set of initialized points by regressing offsets via cascaded CNNs. Principal Component Analysis (PCA) is employed to preserve a shape constraint in offset regression to resist the mutual attraction of vertebrae. We evaluate our method on the AASCE dataset that consists of 609 tight spinal anterior-posterior X-ray images and each image contains 17 vertebrae composed of the thoracic and lumbar spine for spinal shape characterization. Experimental results demonstrate our superior performance of vertebral landmark localization over other state-of-the-arts with the relative error decreasing from 3.2e-3 to 7.2e-4.

OCSep 23, 2023
Penalties and Rewards for Fair Learning in Paired Kidney Exchange Programs

Margarida Carvalho, Alison Caulfield, Yi Lin et al.

A kidney exchange program, also called a kidney paired donation program, can be viewed as a repeated, dynamic trading and allocation mechanism. This suggests that a dynamic algorithm for transplant exchange selection may have superior performance in comparison to the repeated use of a static algorithm. We confirm this hypothesis using a full scale simulation of the Canadian Kidney Paired Donation Program: learning algorithms, that attempt to learn optimal patient-donor weights in advance via dynamic simulations, do lead to improved outcomes. Specifically, our learning algorithms, designed with the objective of fairness (that is, equity in terms of transplant accessibility across cPRA groups), also lead to an increased number of transplants and shorter average waiting times. Indeed, our highest performing learning algorithm improves egalitarian fairness by 10% whilst also increasing the number of transplants by 6% and decreasing waiting times by 24%. However, our main result is much more surprising. We find that the most critical factor in determining the performance of a kidney exchange program is not the judicious assignment of positive weights (rewards) to patient-donor pairs. Rather, the key factor in increasing the number of transplants, decreasing waiting times and improving group fairness is the judicious assignment of a negative weight (penalty) to the small number of non-directed donors in the kidney exchange program.

CVAug 3, 2024
Leveraging BEV Paradigm for Ground-to-Aerial Image Synthesis

Junyan Ye, Jun He, Weijia Li et al.

Ground-to-aerial image synthesis focuses on generating realistic aerial images from corresponding ground street view images while maintaining consistent content layout, simulating a top-down view. The significant viewpoint difference leads to domain gaps between views, and dense urban scenes limit the visible range of street views, making this cross-view generation task particularly challenging. In this paper, we introduce SkyDiffusion, a novel cross-view generation method for synthesizing aerial images from street view images, utilizing a diffusion model and the Bird's-Eye View (BEV) paradigm. The Curved-BEV method in SkyDiffusion converts street-view images into a BEV perspective, effectively bridging the domain gap, and employs a "multi-to-one" mapping strategy to address occlusion issues in dense urban scenes. Next, SkyDiffusion designed a BEV-guided diffusion model to generate content-consistent and realistic aerial images. Additionally, we introduce a novel dataset, Ground2Aerial-3, designed for diverse ground-to-aerial image synthesis applications, including disaster scene aerial synthesis, low-altitude UAV image synthesis, and historical high-resolution satellite image synthesis tasks. Experimental results demonstrate that SkyDiffusion outperforms state-of-the-art methods on cross-view datasets across natural (CVUSA), suburban (CVACT), urban (VIGOR-Chicago), and various application scenarios (G2A-3), achieving realistic and content-consistent aerial image generation. The code, datasets and more information of this work can be found at https://opendatalab.github.io/skydiffusion/ .

LGApr 16
Logo-LLM: Local and Global Modeling with Large Language Models for Time Series Forecasting

Wenjie Ou, Zhishuo Zhao, Cheng Chen et al.

Time series forecasting is critical across multiple domains, where time series data exhibit both local patterns and global dependencies. While Transformer-based methods effectively capture global dependencies, they often overlook short-term local variations in time series. Recent methods that adapt large language models (LLMs) into time series forecasting inherit this limitation by treating LLMs as black-box encoders, relying solely on the final-layer output and underutilizing hierarchical representations. To address this limitation, we propose Logo-LLM, a novel LLM-based framework that explicitly extracts and models multi-scale temporal features from different layers of a pre-trained LLM. Through empirical analysis, we show that shallow layers of LLMs capture local dynamics in time series, while deeper layers encode global trends. Moreover, Logo-LLM introduces lightweight Local-Mixer and Global-Mixer modules to align and integrate features with the temporal input across layers. Extensive experiments demonstrate that Logo-LLM achieves superior performance across diverse benchmarks, with strong generalization in few-shot and zero-shot settings while maintaining low computational overhead.

CLJan 21
RSNA Large Language Model Benchmark Dataset for Chest Radiographs of Cardiothoracic Disease: Radiologist Evaluation and Validation Enhanced by AI Labels (REVEAL-CXR)

Yishu Wei, Adam E. Flanders, Errol Colak et al.

Multimodal large language models have demonstrated comparable performance to that of radiology trainees on multiple-choice board-style exams. However, to develop clinically useful multimodal LLM tools, high-quality benchmarks curated by domain experts are essential. To curate released and holdout datasets of 100 chest radiographic studies each and propose an artificial intelligence (AI)-assisted expert labeling procedure to allow radiologists to label studies more efficiently. A total of 13,735 deidentified chest radiographs and their corresponding reports from the MIDRC were used. GPT-4o extracted abnormal findings from the reports, which were then mapped to 12 benchmark labels with a locally hosted LLM (Phi-4-Reasoning). From these studies, 1,000 were sampled on the basis of the AI-suggested benchmark labels for expert review; the sampling algorithm ensured that the selected studies were clinically relevant and captured a range of difficulty levels. Seventeen chest radiologists participated, and they marked "Agree all", "Agree mostly" or "Disagree" to indicate their assessment of the correctness of the LLM suggested labels. Each chest radiograph was evaluated by three experts. Of these, at least two radiologists selected "Agree All" for 381 radiographs. From this set, 200 were selected, prioritizing those with less common or multiple finding labels, and divided into 100 released radiographs and 100 reserved as the holdout dataset. The holdout dataset is used exclusively by RSNA to independently evaluate different models. A benchmark of 200 chest radiographic studies with 12 benchmark labels was created and made publicly available https://imaging.rsna.org, with each chest radiograph verified by three radiologists. In addition, an AI-assisted labeling procedure was developed to help radiologists label at scale, minimize unnecessary omissions, and support a semicollaborative environment.

ROFeb 28, 2023
Efficient Implicit Neural Reconstruction Using LiDAR

Dongyu Yan, Xiaoyang Lyu, Jieqi Shi et al.

Modeling scene geometry using implicit neural representation has revealed its advantages in accuracy, flexibility, and low memory usage. Previous approaches have demonstrated impressive results using color or depth images but still have difficulty handling poor light conditions and large-scale scenes. Methods taking global point cloud as input require accurate registration and ground truth coordinate labels, which limits their application scenarios. In this paper, we propose a new method that uses sparse LiDAR point clouds and rough odometry to reconstruct fine-grained implicit occupancy field efficiently within a few minutes. We introduce a new loss function that supervises directly in 3D space without 2D rendering, avoiding information loss. We also manage to refine poses of input frames in an end-to-end manner, creating consistent geometry without global point cloud registration. As far as we know, our method is the first to reconstruct implicit scene representation from LiDAR-only input. Experiments on synthetic and real-world datasets, including indoor and outdoor scenes, prove that our method is effective, efficient, and accurate, obtaining comparable results with existing methods using dense input.

CVApr 16
CXR-LT 2026 Challenge: Multi-Center Long-Tailed and Zero Shot Chest X-ray Classification

Hexin Dong, Yi Lin, Pengyu Zhou et al.

Chest X-ray (CXR) interpretation is hindered by the long-tailed distribution of pathologies and the open-world nature of clinical environments. Existing benchmarks often rely on closed-set classes from a single institution, failing to capture the prevalence of rare diseases or the appearance of novel findings. To address this, we present the CXR-LT challenge. The first event, CXR-LT 2023, established a large-scale benchmark for long-tailed multi-label CXR classification and identified key challenges in rare disease recognition. CXR-LT 2024 further expanded the label space and introduced a zero-shot task to study generalization to unseen findings. Building on the success of CXR-LT 2023 and 2024, this third iteration of the benchmark introduces a multi-center dataset comprising over 145,000 images from PadChest and NIH Chest X-ray datasets. Additionally, all development and test sets in CXR-LT 2026 are annotated by radiologists, providing a more reliable and clinically grounded evaluation than report-derived labels. The challenge defines two core tasks this year: (1) Robust Multi-Label Classification on 30 known classes and (2) Open-World Generalization to 6 unseen (out-of-distribution) rare disease classes. This paper summarizes the overview of the CXR-LT 2026 challenge. We describe the data collection and annotation procedures, analyze solution strategies adopted by participating teams, and evaluate head-versus-tail performance, calibration, and cross-center generalization gaps. Our results show that vision-language foundation models improve both in-distribution and zero-shot performance, but detecting rare findings under multi-center shift remains challenging. Our study provides a foundation for developing and evaluating AI systems in realistic long-tailed and open-world clinical conditions.

CLNov 5, 2025
MME-CC: A Challenging Multi-Modal Evaluation Benchmark of Cognitive Capacity

Kaiyuan Zhang, Chenghao Yang, Zhoufutu Wen et al.

As reasoning models scale rapidly, the essential role of multimodality in human cognition has come into sharp relief, driving a growing need to probe vision-centric cognitive behaviors. Yet, existing multimodal benchmarks either overemphasize textual reasoning or fall short of systematically capturing vision-centric cognitive behaviors, leaving the cognitive capacity of MLLMs insufficiently assessed. To address this limitation, we introduce MME-CC (Multi-Modal Evaluation benchmark of Cognitive Capacity), a vision-grounded benchmark that organizes 11 representative reasoning tasks into three fundamental categories of visual information: spatial, geometric, and knowledge-based reasoning, and provides fine-grained analyses of MLLMs' cognitive capacity across these dimensions. Based on MME-CC, we conduct extensive experiments over 16 representative MLLMs. Our study reveals that closed-source models currently lead overall (e.g., 42.66 for Gemini-2.5-Pro vs. 30.45 for GLM-4.5V), while spatial and geometric reasoning remain broadly weak (less than or equal to 30%). We further identify common error patterns, including orientation mistakes, fragile cross-view identity persistence, and poor adherence to counterfactual instructions, and observe that Chain-of-Thought typically follows a three-stage process (extract -> reason -> verify) with heavy reliance on visual extraction. We hope this work catalyzes a shift toward treating the cognitive capacity of MLLMs as central to both evaluation and model design.

LGNov 14, 2025
Virtual Width Networks

Seed, Baisheng Li, Banggu Wu et al.

We introduce Virtual Width Networks (VWN), a framework that delivers the benefits of wider representations without incurring the quadratic cost of increasing the hidden size. VWN decouples representational width from backbone width, expanding the embedding space while keeping backbone compute nearly constant. In our large-scale experiment, an 8-times expansion accelerates optimization by over 2 times for next-token and 3 times for next-2-token prediction. The advantage amplifies over training as both the loss gap grows and the convergence-speedup ratio increases, showing that VWN is not only token-efficient but also increasingly effective with scale. Moreover, we identify an approximately log-linear scaling relation between virtual width and loss reduction, offering an initial empirical basis and motivation for exploring virtual-width scaling as a new dimension of large-model efficiency.

AIApr 21
Reinforcement Learning Improves LLM Accuracy and Reasoning in Disease Classification from Radiology Reports

Yishu Wei, Yi Lin, Adam Flanders et al.

Accurate disease classification from radiology reports is essential for many applications. While supervised fine-tuning (SFT) of lightweight LLMs improves accuracy, it can degrade reasoning. We propose a two-stage approach: SFT on disease labels followed by Group Relative Policy Optimization (GRPO) to refine predictions by optimizing accuracy and format without reasoning supervision. Across three radiologist-annotated datasets, SFT outperformed baselines and GRPO further improved classification and enhanced reasoning recall and comprehensiveness.

SEApr 10Code
Contract-Coding: Towards Repo-Level Generation via Structured Symbolic Paradigm

Yi Lin, Lujin Zhao, Yijie Shi

The shift toward intent-driven software engineering (often termed "Vibe Coding") exposes a critical Context-Fidelity Trade-off: vague user intents overwhelm linear reasoning chains, leading to architectural collapse in complex repo-level generation. We propose Contract-Coding, a structured symbolic paradigm that bridges unstructured intent and executable code via Autonomous Symbolic Grounding. By projecting ambiguous intents into a formal Language Contract, our framework serves as a Single Source of Truth (SSOT) that enforces topological independence, effectively isolating inter-module implementation details, decreasing topological execution depth and unlocking Architectural Parallelism. Empirically, while state-of-the-art agents suffer from different hallucinations on the Greenfield-5 benchmark, Contract-Coding achieves 47\% functional success while maintaining near-perfect structural integrity. Our work marks a critical step towards repository-scale autonomous engineering: transitioning from strict "specification-following" to robust, intent-driven architecture synthesis. Our code is available at https://github.com/imliinyi/Contract-Coding.

CVJan 15, 2024Code
BoNuS: Boundary Mining for Nuclei Segmentation with Partial Point Labels

Yi Lin, Zeyu Wang, Dong Zhang et al.

Nuclei segmentation is a fundamental prerequisite in the digital pathology workflow. The development of automated methods for nuclei segmentation enables quantitative analysis of the wide existence and large variances in nuclei morphometry in histopathology images. However, manual annotation of tens of thousands of nuclei is tedious and time-consuming, which requires significant amount of human effort and domain-specific expertise. To alleviate this problem, in this paper, we propose a weakly-supervised nuclei segmentation method that only requires partial point labels of nuclei. Specifically, we propose a novel boundary mining framework for nuclei segmentation, named BoNuS, which simultaneously learns nuclei interior and boundary information from the point labels. To achieve this goal, we propose a novel boundary mining loss, which guides the model to learn the boundary information by exploring the pairwise pixel affinity in a multiple-instance learning manner. Then, we consider a more challenging problem, i.e., partial point label, where we propose a nuclei detection module with curriculum learning to detect the missing nuclei with prior morphological knowledge. The proposed method is validated on three public datasets, MoNuSeg, CPM, and CoNIC datasets. Experimental results demonstrate the superior performance of our method to the state-of-the-art weakly-supervised nuclei segmentation methods. Code: https://github.com/hust-linyi/bonus.

IVOct 4, 2023
Continuous 3D Myocardial Motion Tracking via Echocardiography

Chengkang Shen, Hao Zhu, You Zhou et al.

Myocardial motion tracking stands as an essential clinical tool in the prevention and detection of cardiovascular diseases (CVDs), the foremost cause of death globally. However, current techniques suffer from incomplete and inaccurate motion estimation of the myocardium in both spatial and temporal dimensions, hindering the early identification of myocardial dysfunction. To address these challenges, this paper introduces the Neural Cardiac Motion Field (NeuralCMF). NeuralCMF leverages implicit neural representation (INR) to model the 3D structure and the comprehensive 6D forward/backward motion of the heart. This method surpasses pixel-wise limitations by offering the capability to continuously query the precise shape and motion of the myocardium at any specific point throughout the cardiac cycle, enhancing the detailed analysis of cardiac dynamics beyond traditional speckle tracking. Notably, NeuralCMF operates without the need for paired datasets, and its optimization is self-supervised through the physics knowledge priors in both space and time dimensions, ensuring compatibility with both 2D and 3D echocardiogram video inputs. Experimental validations across three representative datasets support the robustness and innovative nature of the NeuralCMF, marking significant advantages over existing state-of-the-art methods in cardiac imaging and motion tracking.

CVAug 18, 2024
Fine-Grained Building Function Recognition from Street-View Images via Geometry-Aware Semi-Supervised Learning

Weijia Li, Jinhua Yu, Dairong Chen et al.

In this work, we propose a geometry-aware semi-supervised framework for fine-grained building function recognition, utilizing geometric relationships among multi-source data to enhance pseudo-label accuracy in semi-supervised learning, broadening its applicability to various building function categorization systems. Firstly, we design an online semi-supervised pre-training stage, which facilitates the precise acquisition of building facade location information in street-view images. In the second stage, we propose a geometry-aware coarse annotation generation module. This module effectively combines GIS data and street-view data based on the geometric relationships, improving the accuracy of pseudo annotations. In the third stage, we combine the newly generated coarse annotations with the existing labeled dataset to achieve fine-grained functional recognition of buildings across multiple cities at a large scale. Extensive experiments demonstrate that our proposed framework exhibits superior performance in fine-grained functional recognition of buildings. Within the same categorization system, it achieves improvements of 7.6\% and 4.8\% compared to fully-supervised methods and state-of-the-art semi-supervised methods, respectively. Additionally, our method also performs well in cross-city scenarios, i.e., extending the model trained on OmniCity (New York) to new cities (i.e., Los Angeles and Boston) with different building function categorization systems. This study offers a new solution for large-scale multi-city applications with minimal annotation requirements, facilitating more efficient data updates and resource allocation in urban management.

CVFeb 25
Overview of the CXR-LT 2026 Challenge: Multi-Center Long-Tailed and Zero Shot Chest X-ray Classification

Hexin Dong, Yi Lin, Pengyu Zhou et al.

Chest X-ray (CXR) interpretation is hindered by the long-tailed distribution of pathologies and the open-world nature of clinical environments. Existing benchmarks often rely on closed-set classes from single institutions, failing to capture the prevalence of rare diseases or the appearance of novel findings. To address this, we present the CXR-LT 2026 challenge. This third iteration of the benchmark introduces a multi-center dataset comprising over 145,000 images from PadChest and NIH Chest X-ray datasets. The challenge defines two core tasks: (1) Robust Multi-Label Classification on 30 known classes and (2) Open-World Generalization to 6 unseen (out-of-distribution) rare disease classes. We report the results of the top-performing teams, evaluating them via mean Average Precision (mAP), AUROC, and F1-score. The winning solutions achieved an mAP of 0.5854 on Task 1 and 0.4315 on Task 2, demonstrating that large-scale vision-language pre-training significantly mitigates the performance drop typically associated with zero-shot diagnosis.

CVJan 3, 2025Code
Merging Context Clustering with Visual State Space Models for Medical Image Segmentation

Yun Zhu, Dong Zhang, Yi Lin et al.

Medical image segmentation demands the aggregation of global and local feature representations, posing a challenge for current methodologies in handling both long-range and short-range feature interactions. Recently, vision mamba (ViM) models have emerged as promising solutions for addressing model complexities by excelling in long-range feature iterations with linear complexity. However, existing ViM approaches overlook the importance of preserving short-range local dependencies by directly flattening spatial tokens and are constrained by fixed scanning patterns that limit the capture of dynamic spatial context information. To address these challenges, we introduce a simple yet effective method named context clustering ViM (CCViM), which incorporates a context clustering module within the existing ViM models to segment image tokens into distinct windows for adaptable local clustering. Our method effectively combines long-range and short-range feature interactions, thereby enhancing spatial contextual representations for medical image segmentation tasks. Extensive experimental evaluations on diverse public datasets, i.e., Kumar, CPM17, ISIC17, ISIC18, and Synapse demonstrate the superior performance of our method compared to current state-of-the-art methods. Our code can be found at https://github.com/zymissy/CCViM.

CVNov 7, 2025
Visual Spatial Tuning

Rui Yang, Ziyu Zhu, Yanwei Li et al.

Capturing spatial relationships from visual inputs is a cornerstone of human-like general intelligence. Several previous studies have tried to enhance the spatial awareness of Vision-Language Models (VLMs) by adding extra expert encoders, which brings extra overhead and usually harms general capabilities. To enhance the spatial ability in general architectures, we introduce Visual Spatial Tuning (VST), a comprehensive framework to cultivate VLMs with human-like visuospatial abilities, from spatial perception to reasoning. We first attempt to enhance spatial perception in VLMs by constructing a large-scale dataset termed VST-P, which comprises 4.1 million samples spanning 19 skills across single views, multiple images, and videos. Then, we present VST-R, a curated dataset with 135K samples that instruct models to reason in space. In particular, we adopt a progressive training pipeline: supervised fine-tuning to build foundational spatial knowledge, followed by reinforcement learning to further improve spatial reasoning abilities. Without the side-effect to general capabilities, the proposed VST consistently achieves state-of-the-art results on several spatial benchmarks, including $34.8\%$ on MMSI-Bench and $61.2\%$ on VSIBench. It turns out that the Vision-Language-Action models can be significantly enhanced with the proposed spatial tuning paradigm, paving the way for more physically grounded AI.

CVMay 13
Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context

Zhaowei Wang, Lishu Luo, Haodong Duan et al.

Long-context modeling is becoming a core capability of modern large vision-language models (LVLMs), enabling sustained context management across long-document understanding, video analysis, and multi-turn tool use in agentic workflows. Yet practical training recipes remain insufficiently explored, particularly for designing and balancing long-context data mixtures. In this work, we present a systematic study of long-context continued pre-training for LVLMs, extending a 7B model from 32K to 128K context with extensive ablations on long-document data. We first show that long-document VQA is substantially more effective than OCR transcription. Building on this observation, our ablations further yield three key findings: i) for sequence-length distribution, balanced data outperforms target-length-focused data (e.g., 128K), suggesting that long-context ability requires generalizable key-information retrieval across various lengths and positions; ii) retrieval remains the primary bottleneck, favoring retrieval-heavy mixtures with modest reasoning data for task diversity; and iii) pure long-document VQA largely preserves short-context capabilities, suggesting that instruction-formatted long data reduces the need for short-data mixing. Based on these findings, we introduce MMProLong, obtained by long-context continued pre-training from Qwen2.5-VL-7B with only a 5B-token budget. MMProLong improves long-document VQA scores by 7.1% and maintains strong performance at 256K and 512K contexts beyond its 128K training window, without additional training. It further generalizes to webpage-based multimodal needle retrieval, long-context vision-text compression, and long-video understanding without task-specific supervision. Overall, our study establishes a practical LongPT recipe and an empirical foundation for advancing long-context vision-language models.

DCMay 9
MegaScale-Omni: A Hyper-Scale, Workload-Resilient System for MultiModal LLM Training in Production

Chunyu Xue, Yangrui Chen, Jianyu Jiang et al.

As the foundational component of versatile AI applications, training an multimodal large language model (MLLM) relies on multimodal datasets with dynamic modality mixture proportions and sample length distributions. However, existing MLLM systems remain inefficient under dynamic workloads, due to statically coupled decisions of resource allocation and model parallelization between encoders and the LLM backbone. This paper presents MegaScale-Omni, an industrial-grade MLLM training system tailored for dynamic workload adaption and hyper-scale deployment. MegaScale-Omni is built upon the training scheme of encoder-LLM multiplexing with three key innovations: (1) Decoupled parallelism strategies with long-short sequence parallelism for encoders to process variable-length samples, and full-fledged 5D parallelism for the LLM backbone, both organized under a communication-efficient parallelization layout. (2) Unified encoder-LLM representations for flexible, extensible colocation, and a new paradigm of encoder-LLM joint pipeline with workload resilience. (3) Workload balancing techniques via decentralized grouped reordering in data loaders and adaptive resharding from encoder to LLM ranks. MegaScale-Omni is deployed as the foundation of our in-house large-scale MLLM training tasks with thousands of GPUs. Our experimental results demonstrate $1.27\times$-$7.57\times$ throughput improvement under production-grade dynamic workloads, as compared to four state-of-the-art systems.

IVMay 6, 2025Code
Rethinking Boundary Detection in Deep Learning-Based Medical Image Segmentation

Yi Lin, Dong Zhang, Xiao Fang et al.

Medical image segmentation is a pivotal task within the realms of medical image analysis and computer vision. While current methods have shown promise in accurately segmenting major regions of interest, the precise segmentation of boundary areas remains challenging. In this study, we propose a novel network architecture named CTO, which combines Convolutional Neural Networks (CNNs), Vision Transformer (ViT) models, and explicit edge detection operators to tackle this challenge. CTO surpasses existing methods in terms of segmentation accuracy and strikes a better balance between accuracy and efficiency, without the need for additional data inputs or label injections. Specifically, CTO adheres to the canonical encoder-decoder network paradigm, with a dual-stream encoder network comprising a mainstream CNN stream for capturing local features and an auxiliary StitchViT stream for integrating long-range dependencies. Furthermore, to enhance the model's ability to learn boundary areas, we introduce a boundary-guided decoder network that employs binary boundary masks generated by dedicated edge detection operators to provide explicit guidance during the decoding process. We validate the performance of CTO through extensive experiments conducted on seven challenging medical image segmentation datasets, namely ISIC 2016, PH2, ISIC 2018, CoNIC, LiTS17, and BTCV. Our experimental results unequivocally demonstrate that CTO achieves state-of-the-art accuracy on these datasets while maintaining competitive model complexity. The codes have been released at: https://github.com/xiaofang007/CTO.

AIApr 17
MARCH: Multi-Agent Radiology Clinical Hierarchy for CT Report Generation

Yi Lin, Yihao Ding, Yonghui Wu et al.

Automated 3D radiology report generation often suffers from clinical hallucinations and a lack of the iterative verification found in human practice. While recent Vision-Language Models (VLMs) have advanced the field, they typically operate as monolithic "black-box" systems without the collaborative oversight characteristic of clinical workflows. To address these challenges, we propose MARCH (Multi-Agent Radiology Clinical Hierarchy), a multi-agent framework that emulates the professional hierarchy of radiology departments and assigns specialized roles to distinct agents. MARCH utilizes a Resident Agent for initial drafting with multi-scale CT feature extraction, multiple Fellow Agents for retrieval-augmented revision, and an Attending Agent that orchestrates an iterative, stance-based consensus discourse to resolve diagnostic discrepancies. On the RadGenome-ChestCT dataset, MARCH significantly outperforms state-of-the-art baselines in both clinical fidelity and linguistic accuracy. Our work demonstrates that modeling human-like organizational structures enhances the reliability of AI in high-stakes medical domains.

AIMay 9
Value-Decomposed Reinforcement Learning Framework for Taxiway Routing with Hierarchical Conflict-Aware Observations

Shizhong Zhou, Haifeng Liu, Zheng Zhang et al.

Taxiway routing and on-surface conflict avoidance are coupled safety-critical decision problems in airport surface operations. Existing planning and optimization methods are often limited by online computational cost, while reinforcement learning methods may struggle to represent downstream traffic conflicts and balance multiple objectives. This paper presents Conflict-aware Taxiway Routing (CaTR), a reinforcement learning framework for real-time multi-aircraft taxiway routing. CaTR constructs a grid-based airport surface environment with action masking, introduces a hierarchical foresight traffic representation to encode current and downstream conflict-related traffic conditions, and adopts a value-decomposed reinforcement learning strategy to prioritize sparse but safety-critical objectives. Experiments are conducted on a realistic environment based on Changsha Huanghua International Airport under multiple traffic density levels. Results show that CaTR achieves better safety--efficiency trade-offs than representative planning, optimization, and reinforcement learning baselines while maintaining practical runtime.

CVMay 11, 2025
Seed1.5-VL Technical Report

Dong Guo, Faming Wu, Feida Zhu et al. · pku

We present Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. Seed1.5-VL is composed with a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM of 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluation suites, achieving the state-of-the-art performance on 38 out of 60 public benchmarks. Moreover, in agent-centric tasks such as GUI control and gameplay, Seed1.5-VL outperforms leading multimodal systems, including OpenAI CUA and Claude 3.7. Beyond visual and video understanding, it also demonstrates strong reasoning abilities, making it particularly effective for multimodal reasoning challenges such as visual puzzles. We believe these capabilities will empower broader applications across diverse tasks. In this report, we mainly provide a comprehensive review of our experiences in building Seed1.5-VL across model design, data construction, and training at various stages, hoping that this report can inspire further research. Seed1.5-VL is now accessible at https://www.volcengine.com/ (Volcano Engine Model ID: doubao-1-5-thinking-vision-pro-250428)

CVOct 4, 2025Code
CoPA: Hierarchical Concept Prompting and Aggregating Network for Explainable Diagnosis

Yiheng Dong, Yi Lin, Xin Yang

The transparency of deep learning models is essential for clinical diagnostics. Concept Bottleneck Model provides clear decision-making processes for diagnosis by transforming the latent space of black-box models into human-understandable concepts. However, concept-based methods still face challenges in concept capture capabilities. These methods often rely on encode features solely from the final layer, neglecting shallow and multiscale features, and lack effective guidance in concept encoding, hindering fine-grained concept extraction. To address these issues, we introduce Concept Prompting and Aggregating (CoPA), a novel framework designed to capture multilayer concepts under prompt guidance. This framework utilizes the Concept-aware Embedding Generator (CEG) to extract concept representations from each layer of the visual encoder. Simultaneously, these representations serve as prompts for Concept Prompt Tuning (CPT), steering the model towards amplifying critical concept-related visual cues. Visual representations from each layer are aggregated to align with textual concept representations. With the proposed method, valuable concept-wise information in the images is captured and utilized effectively, thus improving the performance of concept and disease prediction. Extensive experimental results demonstrate that CoPA outperforms state-of-the-art methods on three public datasets. Code is available at https://github.com/yihengd/CoPA.

CVMay 28, 2025Code
From Controlled Scenarios to Real-World: Cross-Domain Degradation Pattern Matching for All-in-One Image Restoration

Junyu Fan, Chuanlin Liao, Yi Lin

As a fundamental imaging task, All-in-One Image Restoration (AiOIR) aims to achieve image restoration caused by multiple degradation patterns via a single model with unified parameters. Although existing AiOIR approaches obtain promising performance in closed and controlled scenarios, they still suffered from considerable performance reduction in real-world scenarios since the gap of data distributions between the training samples (source domain) and real-world test samples (target domain) can lead inferior degradation awareness ability. To address this issue, a Unified Domain-Adaptive Image Restoration (UDAIR) framework is proposed to effectively achieve AiOIR by leveraging the learned knowledge from source domain to target domain. To improve the degradation identification, a codebook is designed to learn a group of discrete embeddings to denote the degradation patterns, and the cross-sample contrastive learning mechanism is further proposed to capture shared features from different samples of certain degradation. To bridge the data gap, a domain adaptation strategy is proposed to build the feature projection between the source and target domains by dynamically aligning their codebook embeddings, and a correlation alignment-based test-time adaptation mechanism is designed to fine-tune the alignment discrepancies by tightening the degradation embeddings to the corresponding cluster center in the source domain. Experimental results on 10 open-source datasets demonstrate that UDAIR achieves new state-of-the-art performance for the AiOIR task. Most importantly, the feature cluster validate the degradation identification under unknown conditions, and qualitative comparisons showcase robust generalization to real-world scenarios.

CLMay 12, 2025Code
SAS-Bench: A Fine-Grained Benchmark for Evaluating Short Answer Scoring with Large Language Models

Peichao Lai, Kexuan Zhang, Yi Lin et al.

Subjective Answer Grading (SAG) plays a crucial role in education, standardized testing, and automated assessment systems, particularly for evaluating short-form responses in Short Answer Scoring (SAS). However, existing approaches often produce coarse-grained scores and lack detailed reasoning. Although large language models (LLMs) have demonstrated potential as zero-shot evaluators, they remain susceptible to bias, inconsistencies with human judgment, and limited transparency in scoring decisions. To overcome these limitations, we introduce SAS-Bench, a benchmark specifically designed for LLM-based SAS tasks. SAS-Bench provides fine-grained, step-wise scoring, expert-annotated error categories, and a diverse range of question types derived from real-world subject-specific exams. This benchmark facilitates detailed evaluation of model reasoning processes and explainability. We also release an open-source dataset containing 1,030 questions and 4,109 student responses, each annotated by domain experts. Furthermore, we conduct comprehensive experiments with various LLMs, identifying major challenges in scoring science-related questions and highlighting the effectiveness of few-shot prompting in improving scoring accuracy. Our work offers valuable insights into the development of more robust, fair, and educationally meaningful LLM-based evaluation systems.

IVFeb 16, 2022Code
Nuclei Segmentation with Point Annotations from Pathology Images via Self-Supervised Learning and Co-Training

Yi Lin, Zhiyong Qu, Hao Chen et al.

Nuclei segmentation is a crucial task for whole slide image analysis in digital pathology. Generally, the segmentation performance of fully-supervised learning heavily depends on the amount and quality of the annotated data. However, it is time-consuming and expensive for professional pathologists to provide accurate pixel-level ground truth, while it is much easier to get coarse labels such as point annotations. In this paper, we propose a weakly-supervised learning method for nuclei segmentation that only requires point annotations for training. First, coarse pixel-level labels are derived from the point annotations based on the Voronoi diagram and the k-means clustering method to avoid overfitting. Second, a co-training strategy with an exponential moving average method is designed to refine the incomplete supervision of the coarse labels. Third, a self-supervised visual representation learning method is tailored for nuclei segmentation of pathology images that transforms the hematoxylin component images into the H&E stained images to gain better understanding of the relationship between the nuclei and cytoplasm. We comprehensively evaluate the proposed method using two public datasets. Both visual and quantitative results demonstrate the superiority of our method to the state-of-the-art methods, and its competitive performance compared to the fully-supervised methods. Code: https://github.com/hust-linyi/SC-Net

ROMay 21, 2021Code
Fast-Racing: An Open-source Strong Baseline for SE(3) Planning in Autonomous Drone Racing

Zhichao Han, Zhepei Wang, Neng Pan et al.

With the autonomy of aerial robots advances in recent years, autonomous drone racing has drawn increasing attention. In a professional pilot competition, a skilled operator always controls the drone to agilely avoid obstacles in aggressive attitudes, for reaching the destination as fast as possible. Autonomous flight like elite pilots requires planning in SE(3), whose non-triviality and complexity hindering a convincing solution in our community by now. To bridge this gap, this paper proposes an open-source baseline, which includes a high-performance SE(3) planner and a challenging simulation platform tailored for drone racing. We specify the SE(3) trajectory generation as a soft-penalty optimization problem, and speed up the solving process utilizing its underlying parallel structure. Moreover, to provide a testbed for challenging the planner, we develop delicate drone racing tracks which mimic real-world set-up and necessities planning in SE(3). Besides, we provide necessary system components such as common map interfaces and a baseline controller, to make our work plug-in-and-use. With our baseline, we hope to future foster the research of SE(3) planning and the competition of autonomous drone racing.