Suyang Xi

CV
h-index5
4papers
8citations
Novelty59%
AI Score46

4 Papers

75.6CVApr 10
MedLVR: Latent Visual Reasoning for Reliable Medical Visual Question Answering

Suyang Xi, Songtao Hu, Yuxiang Lai et al.

Medical vision--language models (VLMs) have shown strong potential for medical visual question answering (VQA), yet their reasoning remains largely text-centric: images are encoded once as static context, and subsequent inference is dominated by language. This paradigm is fundamentally limited in clinical scenarios, where accurate answers often depend on subtle, localized visual evidence that cannot be reliably preserved in static embeddings. We propose \textsc{MedLVR}, a latent visual reasoning framework that introduces an explicit visual evidence state into autoregressive decoding. Instead of relying solely on text-based intermediate reasoning, \textsc{MedLVR} interleaves a short latent reasoning segment within the decoder by reusing hidden states as continuous latent steps, enabling iterative preservation and refinement of query-relevant visual evidence before answer generation. To support effective visual supervision, we adopt a two-stage training strategy: region of interest (ROI)-supervised fine-tuning aligns latent states with clinically relevant image evidence, and Visual-Latent Policy Optimization (VLPO) further optimizes latent reasoning and answer generation under outcome-level rewards. Experiments on OmniMedVQA and five external medical VQA benchmarks show that \textsc{MedLVR} consistently outperforms recent reasoning baselines and improves the average score over the Qwen2.5-VL-7B backbone from 48.3\% to 53.4\%. These results show that latent visual reasoning provides an effective mechanism for preserving diagnostically relevant visual evidence and improving the reliability of medical VQA.

IVJun 22, 2025
Multimodal Medical Image Binding via Shared Text Embeddings

Yunhao Liu, Suyang Xi, Shiqi Liu et al.

Medical image analysis increasingly relies on the integration of multiple imaging modalities to capture complementary anatomical and functional information, enabling more accurate diagnosis and treatment planning. Achieving aligned feature representations across these diverse modalities is therefore important for effective multimodal analysis. While contrastive language-image pre-training (CLIP) and its variant have enabled image-text alignments, they require explicitly paired data between arbitrary two modalities, which is difficult to acquire in medical contexts. To address the gap, we present Multimodal Medical Image Binding with Text (M\textsuperscript{3}Bind), a novel pre-training framework that enables seamless alignment of multiple medical imaging modalities through a shared text representation space without requiring explicit paired data between any two medical image modalities. Specifically, based on the insight that different images can naturally bind with text, M\textsuperscript{3}Bind first fine-tunes pre-trained CLIP-like image-text models to align their modality-specific text embedding space while preserving their original image-text alignments. Subsequently, we distill these modality-specific text encoders into a unified model, creating a shared text embedding space. Experiments on X-ray, CT, retina, ECG, and pathological images on multiple downstream tasks demonstrate that M\textsuperscript{3}Bind achieves state-of-the-art performance in zero-shot, few-shot classification and cross-modal retrieval tasks compared to its CLIP-like counterparts. These results validate M\textsuperscript{3}Bind's effectiveness in achieving cross-image-modal alignment for medical analysis.

CVOct 12, 2025
Taming a Retrieval Framework to Read Images in Humanlike Manner for Augmenting Generation of MLLMs

Suyang Xi, Chenxi Yang, Hong Ding et al.

Multimodal large language models (MLLMs) often fail in fine-grained visual question answering, producing hallucinations about object identities, positions, and relations because textual queries are not explicitly anchored to visual referents. Retrieval-augmented generation (RAG) alleviates some errors, but it fails to align with human-like processing at both the retrieval and augmentation levels. Specifically, it focuses only on global-level image information but lacks local detail and limits reasoning about fine-grained interactions. To overcome this limitation, we present Human-Like Retrieval-Augmented Generation (HuLiRAG), a framework that stages multimodal reasoning as a ``what--where--reweight'' cascade. Queries are first anchored to candidate referents via open-vocabulary detection (what), then spatially resolved with SAM-derived masks to recover fine-grained precision (where), and adaptively prioritized through the trade-off between local and global alignment (reweight). Mask-guided fine-tuning further injects spatial evidence into the generation process, transforming grounding from a passive bias into an explicit constraint on answer formulation. Extensive experiments demonstrate that this human-like cascade improves grounding fidelity and factual consistency while reducing hallucinations, advancing multimodal question answering toward trustworthy reasoning.

RONov 22, 2024
Enhancing Autonomous Driving Safety through World Model-Based Predictive Navigation and Adaptive Learning Algorithms for 5G Wireless Applications

Hong Ding, Ziming Wang, Yi Ding et al.

Addressing the challenge of ensuring safety in ever-changing and unpredictable environments, particularly in the swiftly advancing realm of autonomous driving in today's 5G wireless communication world, we present Navigation Secure (NavSecure). This vision-based navigation framework merges the strengths of world models with crucial safety-focused decision-making capabilities, enabling autonomous vehicles to navigate real-world complexities securely. Our approach anticipates potential threats and formulates safer routes by harnessing the predictive capabilities of world models, thus significantly reducing the need for extensive real-world trial-and-error learning. Additionally, our method empowers vehicles to autonomously learn and develop through continuous practice, ensuring the system evolves and adapts to new challenges. Incorporating radio frequency technology, NavSecure leverages 5G networks to enhance real-time data exchange, improving communication and responsiveness. Validated through rigorous experiments under simulation-to-real driving conditions, NavSecure has shown exceptional performance in safety-critical scenarios, such as sudden obstacle avoidance. Results indicate that NavSecure excels in key safety metrics, including collision prevention and risk reduction, surpassing other end-to-end methodologies. This framework not only advances autonomous driving safety but also demonstrates how world models can enhance decision-making in critical applications. NavSecure sets a new standard for developing more robust and trustworthy autonomous driving systems, capable of handling the inherent dynamics and uncertainties of real-world environments.