CV | Oct 9, 2025

Hulu-Med: A Transparent Generalist Model towards Holistic Medical Vision-Language Understanding

arXiv:2510.08668v255 citations | h-index: 18 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses the inability of existing AI systems to unify diverse medical data for clinical applications and offers a transparent, reproducible solution. The advance is incremental, since it builds on existing vision-language model paradigms.

The paper tackles the integration of heterogeneous medical data (text, 2D/3D images, and videos) for clinical decision-making. It introduces Hulu-Med, a transparent generalist medical vision-language model that unifies these signals, surpassing existing open-source models on 27 of 30 benchmarks and outperforming proprietary systems such as GPT-4o on 16 benchmarks.

Real-world clinical decision-making requires integrating heterogeneous data, including medical text, 2D images, 3D volumes, and videos, while existing AI systems fail to unify all these signals, limiting their utility. In this paper, we introduce Hulu-Med, a transparent, generalist medical Vision-Language Model (VLM) designed to unify language-only, 2D/3D vision-language, and video understanding within a single architecture. Hulu-Med is trained on a curated corpus of 16.7 million samples, comprising exclusively public or synthetic data, spanning 12 major anatomical systems and 14 medical imaging modalities. Hulu-Med employs a medical-aware token-reduction strategy that prunes redundant visual tokens, achieving up to a 55% reduction for 3D and video inputs, improving cross-modal efficiency, and enabling training at 7B-32B parameter scales in approximately 4,000-40,000 GPU hours. Across 30 public in-domain and out-of-domain medical benchmarks (covering text reasoning, visual question answering, report generation, multilingual dialogue, video understanding, and rare disease diagnosis), Hulu-Med surpasses existing open-source models on 27 of 30 benchmarks and outperforms proprietary systems such as GPT-4o on 16 benchmarks. Despite being a VLM, Hulu-Med outperforms GPT-4o and matches GPT-o1 on the text-only HealthBench. For the first time in the community, we provide a fully transparent, reproducible and cost-effective pipeline for holistic medical vision-language understanding by releasing our end-to-end data curation, training procedures, and model parameters. Code and models are available at https://github.com/ZJUI-AI4H/Hulu-Med.
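
The abstract does not spell out how the token-reduction strategy decides which visual tokens are redundant. As a rough illustration of the general idea only, the sketch below drops a patch token from a slice or frame when it is nearly identical (by cosine similarity) to the last token kept at the same patch position. The function names, the similarity criterion, and the 0.95 threshold are all assumptions made for this example and are not taken from the Hulu-Med code.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def prune_redundant_tokens(frames, threshold=0.95):
    """Hypothetical pruning rule, NOT the paper's method.

    frames: list of frames (or 3D slices), each a list of patch-token vectors,
            with the same number of patches per frame.
    Returns a list of (frame_index, patch_index, vector) for the kept tokens.
    """
    kept = []
    reference = None  # last kept token at each patch position
    for t, frame in enumerate(frames):
        if reference is None:
            # Always keep every token of the first frame.
            reference = list(frame)
            kept.extend((t, p, tok) for p, tok in enumerate(frame))
            continue
        for p, tok in enumerate(frame):
            # Keep a token only if it differs enough from the last kept one.
            if cosine(tok, reference[p]) < threshold:
                kept.append((t, p, tok))
                reference[p] = tok
    return kept


def reduction_ratio(frames, kept):
    """Fraction of tokens removed by pruning."""
    total = sum(len(frame) for frame in frames)
    if total == 0:
        return 0.0
    return 1.0 - len(kept) / total


# Toy input: 3 frames, 2 patches each, 2-dimensional tokens.
frames = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [1.0, 0.0]],  # patch 1 changes, patch 0 is unchanged
    [[1.0, 0.0], [1.0, 0.0]],  # nothing changes
]
kept = prune_redundant_tokens(frames)
print(len(kept), reduction_ratio(frames, kept))
```

On this toy input, three of the six tokens survive, so half of them are pruned. A real model would operate on encoder embeddings and would likely need a criterion tuned per modality.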

