Yujian Xiong

CV
h-index19
16papers
47citations
Novelty47%
AI Score56

16 Papers

AIJun 1Code
S-SPPO: Semantic-Calibrated Self-Play Preference Optimization

Xiwen Chen, Wenhui Zhu, Jingjing Wang et al.

Aligning Large Language Models (LLMs) with human preferences is often formulated via Direct Preference Optimization (DPO). However, the standard Bradley-Terry instantiation of DPO is limited in modeling common departures from transitivity in human preferences. To address this, recent work has introduced Self-Play Preference Optimization (SPPO), which iteratively refines the policy by training on self-generated win-lose pairs. Our investigation, however, reveals a critical instability in SPPO: the optimization is prone to policy degeneration when the preference oracle assigns overly confident wins to semantically indistinguishable responses. To mitigate this, we propose S-SPPO, a dual-space semantic calibration framework comprising: i) Supervision Calibration via semantic gating, which anneals win rate targets toward the maximum-entropy baseline as semantic overlap increases; and ii) Representation Calibration via latent repulsion to enforce geometric diversity to prevent manifold collapse and maintain latent diversity between chosen and rejected samples. Theoretically, we show that the calibration preserves the constant-sum game structure, facilitating convergence to a Nash Equilibrium. Empirically, S-SPPO avoids the performance degradation seen in prior methods, achieving 52.19% win rate and 47.46% length-controlled win rate on AlpacaEval 2.0 with Llama-3-8B, without using additional human-annotated preferences during training. The code will be available at https://github.com/xiwenc1/s-sppo.

CVMay 27
Mags-RL: Wearing Multimodal LLMs a Magnifying Glass via Agentic Reinforcement Learning For Complex Scene Reasoning

Xuanzhao Dong, Wenhui Zhu, Peijie Qiu et al.

Despite their popularity and success, Multimodal Large Language Models (MLLMs) often struggle to interpret images accurately, which limits their reasoning capability in complex scenarios (e.g., high object density and complex background clutter). Prior work mainly addresses this limitation by incorporating explicit visual cues like bounding boxes that require extra annotations. In addition, the resulting low-resolution crops often miss fine-grained details that MLLMs require for accurate reasoning. Therefore, we propose Mags-RL, an Agentic Reinforcement Learning (RL) framework that equips MLLMs with an external super-resolution "magnifying glass" agent for high-resolution fine-grained inspection. Specifically, the model performs two-round reasoning: in the first round, it generates an initial rationale and autonomously identifies regions of interest without relying on additional annotations; in the second round, it invokes a super-resolution agent to crop and upscale those regions, then revisits and verifies its earlier reasoning to produce the final answer. We also introduce a novel curriculum learning strategy that enables data-efficient RL training, needing as few as only 40 training samples to achieve reasonable performance. Experiments on VSR, TallyQA, and GQA subsets show its superior performance against recent strong competing methods, demonstrating high-quality reasoning with precise visual grounding. Code and weights will be released soon.

CVMay 27
OphIn-500K: Curating Web-Scale Visual Instructions for Scaling Ophthalmic Multimodal Large Language Models

Xuanzhao Dong, Wenhui Zhu, Xiwen Chen et al.

The advancement of general medical Multimodal Large Language Models (MLLMs) has shown great potential for building conversational assistants to support clinical diagnosis. However, their adaptation to highly specialized domains such as ophthalmology remains underexplored, primarily due to the scarcity of large-scale, domain-specific instruction-tuning data. Existing ophthalmic datasets for conversational agents are often limited in scale and largely rely on images from established public benchmarks, limiting the scalability of ophthalmic MLLMs and their ability to capture real-world clinical complexity. To address this gap, we propose $\textbf{OphIn-Engine}$, an ophthalmology-specific instruction data curation pipeline that constructs high-quality instruction data from open-access ophthalmology web-scale videos. The pipeline integrates multimodal transcription for extracting image-transcript pairs, visual cue separation and scoring for identifying clinically relevant visual descriptions, and instruction synthesis with quality control for generating accurate and diverse clinical dialogues. Using this engine, we introduce $\textbf{OphIn-500K}$, a large-scale multimodal ophthalmology instruction-tuning dataset containing over 500,000 instruction instances and more than 151,000 unique images from over 29,000 video clips, formatted as visual question answering (VQA), multi-turn conversational interactions, and chain-of-thought (CoT) reasoning. Built upon this dataset, we further develop $\textbf{OphIn-VL}$, an ophthalmology-specific MLLM with advanced visual understanding and conversational capabilities. Comprehensive experiments and case studies demonstrate that OphIn-VL achieves superior performance compared with state-of-the-art general medical and domain-specific MLLMs.

IVSep 12, 2024Code
Context-Aware Optimal Transport Learning for Retinal Fundus Image Enhancement

Vamsi Krishna Vasa, Peijie Qiu, Wenhui Zhu et al.

Retinal fundus photography offers a non-invasive way to diagnose and monitor a variety of retinal diseases, but is prone to inherent quality glitches arising from systemic imperfections or operator/patient-related factors. However, high-quality retinal images are crucial for carrying out accurate diagnoses and automated analyses. The fundus image enhancement is typically formulated as a distribution alignment problem, by finding a one-to-one mapping between a low-quality image and its high-quality counterpart. This paper proposes a context-informed optimal transport (OT) learning framework for tackling unpaired fundus image enhancement. In contrast to standard generative image enhancement methods, which struggle with handling contextual information (e.g., over-tampered local structures and unwanted artifacts), the proposed context-aware OT learning paradigm better preserves local structures and minimizes unwanted artifacts. Leveraging deep contextual features, we derive the proposed context-aware OT using the earth mover's distance and show that the proposed context-OT has a solid theoretical guarantee. Experimental results on a large-scale dataset demonstrate the superiority of the proposed method over several state-of-the-art supervised and unsupervised methods in terms of signal-to-noise ratio, structural similarity index, as well as two downstream tasks. The code is available at \url{https://github.com/Retinal-Research/Contextual-OT}.

IVSep 17, 2024Code
CUNSB-RFIE: Context-aware Unpaired Neural Schrödinger Bridge in Retinal Fundus Image Enhancement

Xuanzhao Dong, Vamsi Krishna Vasa, Wenhui Zhu et al.

Retinal fundus photography is significant in diagnosing and monitoring retinal diseases. However, systemic imperfections and operator/patient-related factors can hinder the acquisition of high-quality retinal images. Previous efforts in retinal image enhancement primarily relied on GANs, which are limited by the trade-off between training stability and output diversity. In contrast, the Schrödinger Bridge (SB), offers a more stable solution by utilizing Optimal Transport (OT) theory to model a stochastic differential equation (SDE) between two arbitrary distributions. This allows SB to effectively transform low-quality retinal images into their high-quality counterparts. In this work, we leverage the SB framework to propose an image-to-image translation pipeline for retinal image enhancement. Additionally, previous methods often fail to capture fine structural details, such as blood vessels. To address this, we enhance our pipeline by introducing Dynamic Snake Convolution, whose tortuous receptive field can better preserve tubular structures. We name the resulting retinal fundus image enhancement framework the Context-aware Unpaired Neural Schrödinger Bridge (CUNSB-RFIE). To the best of our knowledge, this is the first endeavor to use the SB approach for retinal image enhancement. Experimental results on a large-scale dataset demonstrate the advantage of the proposed method compared to several state-of-the-art supervised and unsupervised methods in terms of image quality and performance on downstream tasks.The code is available at https://github.com/Retinal-Research/CUNSB-RFIE .

SDDec 30, 2025Code
AHA: Aligning Large Audio-Language Models for Reasoning Hallucinations via Counterfactual Hard Negatives

Yanxi Chen, Wenhui Zhu, Xiwen Chen et al.

Although Large Audio-Language Models (LALMs) deliver state-of-the-art (SOTA) performance, they frequently suffer from hallucinations, e.g. generating text not grounded in the audio input. We analyze these grounding failures and identify a distinct taxonomy: Event Omission, False Event Identity, Temporal Relation Error, and Quantitative Temporal Error. To address this, we introduce the AHA (Audio Hallucination Alignment) framework. By leveraging counterfactual hard negative mining, our pipeline constructs a high-quality preference dataset that forces models to distinguish strict acoustic evidence from linguistically plausible fabrications. Additionally, we establish AHA-Eval, a diagnostic benchmark designed to rigorously test these fine-grained temporal reasoning capabilities. We apply this data to align Qwen2.5-Omni. The resulting model, Qwen-Audio-AHA, achieves a 13.7% improvement on AHA-Eval. Crucially, this benefit generalizes beyond our diagnostic set. Our model shows substantial gains on public benchmarks, including 1.3% on MMAU-Test and 1.6% on MMAR, outperforming latest SOTA methods. The model and dataset are open-sourced at https://github.com/LLM-VLM-GSL/AHA.

CVDec 1, 2025
nnMobileNet++: Towards Efficient Hybrid Networks for Retinal Image Analysis

Xin Li, Wenhui Zhu, Xuanzhao Dong et al.

Retinal imaging is a critical, non-invasive modality for the early detection and monitoring of ocular and systemic diseases. Deep learning, particularly convolutional neural networks (CNNs), has significant progress in automated retinal analysis, supporting tasks such as fundus image classification, lesion detection, and vessel segmentation. As a representative lightweight network, nnMobileNet has demonstrated strong performance across multiple retinal benchmarks while remaining computationally efficient. However, purely convolutional architectures inherently struggle to capture long-range dependencies and model the irregular lesions and elongated vascular patterns that characterize on retinal images, despite the critical importance of vascular features for reliable clinical diagnosis. To further advance this line of work and extend the original vision of nnMobileNet, we propose nnMobileNet++, a hybrid architecture that progressively bridges convolutional and transformer representations. The framework integrates three key components: (i) dynamic snake convolution for boundary-aware feature extraction, (ii) stage-specific transformer blocks introduced after the second down-sampling stage for global context modeling, and (iii) retinal image pretraining to improve generalization. Experiments on multiple public retinal datasets for classification, together with ablation studies, demonstrate that nnMobileNet++ achieves state-of-the-art or highly competitive accuracy while maintaining low computational cost, underscoring its potential as a lightweight yet effective framework for retinal image analysis.

CVApr 6
Hierarchical Mesh Transformers with Topology-Guided Pretraining for Morphometric Analysis of Brain Structures

Yujian Xiong, Mohammad Farazi, Yanxi Chen et al.

Representation learning on large-scale unstructured volumetric and surface meshes poses significant challenges in neuroimaging, especially when models must incorporate diverse vertex-level morphometric descriptors, such as cortical thickness, curvature, sulcal depth, and myelin content, which carry subtle disease-related signals. Current approaches either ignore these clinically informative features or support only a single mesh topology, restricting their use across imaging pipelines. We introduce a hierarchical transformer framework designed for heterogeneous mesh analysis that operates on spatially adaptive tree partitions constructed from simplicial complexes of arbitrary order. This design accommodates both volumetric and surface discretizations within a single architecture, enabling efficient multi-scale attention without topology-specific modifications. A feature projection module maps variable-length per-vertex clinical descriptors into the spatial hierarchy, separating geometric structure from feature dimensionality and allowing seamless integration of different neuroimaging feature sets. Self-supervised pretraining via masked reconstruction of both coordinates and morphometric channels on large unlabeled cohorts yields a transferable encoder backbone applicable to diverse downstream tasks and mesh modalities. We validate our approach on Alzheimer's disease classification and amyloid burden prediction using volumetric brain meshes from ADNI, as well as focal cortical dysplasia detection on cortical surface meshes from the MELD dataset, achieving state-of-the-art results across all benchmarks.

CVApr 4
Bridging Restoration and Diagnosis: A Comprehensive Benchmark for Retinal Fundus Enhancement

Xuanzhao Dong, Wenhui Zhu, Xiwen Chen et al.

Over the past decade, generative models have demonstrated success in enhancing fundus images. However, the evaluation of these models remains a challenge. A benchmark for fundus image enhancement is needed for three main reasons:(1) Conventional denoising metrics such as PSNR and SSIM fail to capture clinically relevant features, such as lesion preservation and vessel morphology consistency, limiting their applicability in real-world settings; (2) There is a lack of unified evaluation protocols that address both paired and unpaired enhancement methods, particularly those guided by clinical expertise; and (3) An evaluation framework should provide actionable insights to guide future advancements in clinically aligned enhancement models. To address these gaps, we introduce EyeBench-V2, a benchmark designed to bridge the gap between enhancement model performance and clinical utility. Our work offers three key contributions:(1) Multi-dimensional clinical-alignment through downstream evaluations: Beyond standard enhancement metrics, we assess performance across clinically meaningful tasks including vessel segmentation, diabetic retinopathy (DR) grading, generalization to unseen noise patterns, and lesion segmentation. (2) Expert-guided evaluation design: We curate a novel dataset enabling fair comparisons between paired and unpaired enhancement methods, accompanied by a structured manual assessment protocol by medical experts, which evaluates clinically critical aspects such as lesion structure alterations, background color shifts, and the introduction of artificial structures. (3) Actionable insights: Our benchmark provides a rigorous, task-oriented analysis of existing generative models, equipping clinical researchers with the evidence needed to make informed decisions, while also identifying limitations in current methods to inform the design of next-generation enhancement models.

IVFeb 20, 2025Code
EyeBench: A Call for More Rigorous Evaluation of Retinal Image Enhancement

Wenhui Zhu, Xuanzhao Dong, Xin Li et al.

Over the past decade, generative models have achieved significant success in enhancement fundus images.However, the evaluation of these models still presents a considerable challenge. A comprehensive evaluation benchmark for fundus image enhancement is indispensable for three main reasons: 1) The existing denoising metrics (e.g., PSNR, SSIM) are hardly to extend to downstream real-world clinical research (e.g., Vessel morphology consistency). 2) There is a lack of comprehensive evaluation for both paired and unpaired enhancement methods, along with the need for expert protocols to accurately assess clinical value. 3) An ideal evaluation system should provide insights to inform future developments of fundus image enhancement. To this end, we propose a novel comprehensive benchmark, EyeBench, to provide insights that align enhancement models with clinical needs, offering a foundation for future work to improve the clinical relevance and applicability of generative models for fundus image enhancement. EyeBench has three appealing properties: 1) multi-dimensional clinical alignment downstream evaluation: In addition to evaluating the enhancement task, we provide several clinically significant downstream tasks for fundus images, including vessel segmentation, DR grading, denoising generalization, and lesion segmentation. 2) Medical expert-guided evaluation design: We introduce a novel dataset that promote comprehensive and fair comparisons between paired and unpaired methods and includes a manual evaluation protocol by medical experts. 3) Valuable insights: Our benchmark study provides a comprehensive and rigorous evaluation of existing methods across different downstream tasks, assisting medical experts in making informed choices. Additionally, we offer further analysis of the challenges faced by existing methods. The code is available at \url{https://github.com/Retinal-Research/EyeBench}

CVFeb 22Code
OTPrune: Distribution-Aligned Visual Token Pruning via Optimal Transport

Xiwen Chen, Wenhui Zhu, Gen Li et al.

Multi-modal large language models (MLLMs) achieve strong visual-language reasoning but suffer from high inference cost due to redundant visual tokens. Recent work explores visual token pruning to accelerate inference, while existing pruning methods overlook the underlying distributional structure of visual representations. We propose OTPrune, a training-free framework that formulates pruning as distribution alignment via optimal transport (OT). By minimizing the 2-Wasserstein distance between the full and pruned token distributions, OTPrune preserves both local diversity and global representativeness while reducing inference cost. Moreover, we derive a tractable submodular objective that enables efficient optimization, and theoretically prove its monotonicity and submodularity, providing a principled foundation for stable and efficient pruning. We further provide a comprehensive analysis that explains how distributional alignment contributes to stable and semantically faithful pruning. Comprehensive experiments on wider benchmarks demonstrate that OTPrune achieves superior performance-efficiency tradeoffs compared to state-of-the-art methods. The code is available at https://github.com/xiwenc1/OTPrune.

CVNov 24, 2025Code
VAOT: Vessel-Aware Optimal Transport for Retinal Fundus Enhancement

Xuanzhao Dong, Wenhui Zhu, Yujian Xiong et al.

Color fundus photography (CFP) is central to diagnosing and monitoring retinal disease, yet its acquisition variability (e.g., illumination changes) often degrades image quality, which motivates robust enhancement methods. Unpaired enhancement pipelines are typically GAN-based, however, they can distort clinically critical vasculature, altering vessel topology and endpoint integrity. Motivated by these structural alterations, we propose Vessel-Aware Optimal Transport (\textbf{VAOT}), a framework that combines an optimal-transport objective with two structure-preserving regularizers: (i) a skeleton-based loss to maintain global vascular connectivity and (ii) an endpoint-aware loss to stabilize local termini. These constraints guide learning in the unpaired setting, reducing noise while preserving vessel structure. Experimental results on synthetic degradation benchmark and downstream evaluations in vessel and lesion segmentation demonstrate the superiority of the proposed methods against several state-of-the art baselines. The code is available at https://github.com/Retinal-Research/VAOT

CVDec 3, 2024
Many-MobileNet: Multi-Model Augmentation for Robust Retinal Disease Classification

Hao Wang, Wenhui Zhu, Xuanzhao Dong et al.

In this work, we propose Many-MobileNet, an efficient model fusion strategy for retinal disease classification using lightweight CNN architecture. Our method addresses key challenges such as overfitting and limited dataset variability by training multiple models with distinct data augmentation strategies and different model complexities. Through this fusion technique, we achieved robust generalization in data-scarce domains while balancing computational efficiency with feature extraction capabilities.

CVApr 7, 2024
Reconstructing Retinal Visual Images from 3T fMRI Data Enhanced by Unsupervised Learning

Yujian Xiong, Wenhui Zhu, Zhong-Lin Lu et al.

The reconstruction of human visual inputs from brain activity, particularly through functional Magnetic Resonance Imaging (fMRI), holds promising avenues for unraveling the mechanisms of the human visual system. Despite the significant strides made by deep learning methods in improving the quality and interpretability of visual reconstruction, there remains a substantial demand for high-quality, long-duration, subject-specific 7-Tesla fMRI experiments. The challenge arises in integrating diverse smaller 3-Tesla datasets or accommodating new subjects with brief and low-quality fMRI scans. In response to these constraints, we propose a novel framework that generates enhanced 3T fMRI data through an unsupervised Generative Adversarial Network (GAN), leveraging unpaired training across two distinct fMRI datasets in 7T and 3T, respectively. This approach aims to overcome the limitations of the scarcity of high-quality 7-Tesla data and the challenges associated with brief and low-quality scans in 3-Tesla experiments. In this paper, we demonstrate the reconstruction capabilities of the enhanced 3T fMRI data, highlighting its proficiency in generating superior input visual images compared to data-intensive methods trained and tested on a single subject.

CVJan 19
SGW-GAN: Sliced Gromov-Wasserstein Guided GANs for Retinal Fundus Image Enhancement

Yujian Xiong, Xuanzhao Dong, Wenhui Zhu et al.

Retinal fundus photography is indispensable for ophthalmic screening and diagnosis, yet image quality is often degraded by noise, artifacts, and uneven illumination. Recent GAN- and diffusion-based enhancement methods improve perceptual quality by aligning degraded images with high-quality distributions, but our analysis shows that this focus can distort intra-class geometry: clinically related samples become dispersed, disease-class boundaries blur, and downstream tasks such as grading or lesion detection are harmed. The Gromov Wasserstein (GW) discrepancy offers a principled solution by aligning distributions through internal pairwise distances, naturally preserving intra-class structure, but its high computational cost restricts practical use. To overcome this, we propose SGW-GAN, the first framework to incorporate Sliced GW (SGW) into retinal image enhancement. SGW approximates GW via random projections, retaining relational fidelity while greatly reducing cost. Experiments on public datasets show that SGW-GAN produces visually compelling enhancements, achieves superior diabetic retinopathy grading, and reports the lowest GW discrepancy across disease labels, demonstrating both efficiency and clinical fidelity for unpaired medical image enhancement.

CVApr 1, 2025
Schrödinger Diffusion Driven Signal Recovery in 3T BOLD fMRI Using Unmatched 7T Observations

Yujian Xiong, Xuanzhao Dong, Sebastian Waz et al.

Ultra-high-field (7 Tesla) BOLD fMRI offers exceptional detail in both spatial and temporal domains, along with robust signal-to-noise characteristics, making it a powerful modality for studying visual information processing in the brain. However, due to the limited accessibility of 7T scanners, the majority of neuroimaging studies are still conducted using 3T systems, which inherently suffer from reduced fidelity in both resolution and SNR. To mitigate this limitation, we introduce a new computational approach designed to enhance the quality of 3T BOLD fMRI acquisitions. Specifically, we project both 3T and 7T datasets, sourced from different individuals and experimental setups, into a shared low-dimensional representation space. Within this space, we employ a lightweight, unsupervised Schrödinger Bridge framework to infer a high-SNR, high-resolution counterpart of the 3T data, without relying on paired supervision. This methodology is evaluated across multiple fMRI retinotopy datasets, including synthetically generated samples, and demonstrates a marked improvement in the reliability and fit of population receptive field (pRF) models applied to the enhanced 3T outputs. Our findings suggest that it is feasible to computationally approximate 7T-level quality from standard 3T acquisitions.