Citrus-V: Advancing Medical Foundation Models with Unified Medical Image Grounding for Clinical Reasoning
Citrus-V addresses the need for unified, precise visual grounding and reasoning in clinical applications, offering a single framework for tasks such as lesion localization and automated reporting.
The paper tackles the limited generalization of narrowly focused medical imaging models by introducing Citrus-V, a multimodal medical foundation model that integrates detection, segmentation, and chain-of-thought reasoning. In the authors' evaluations, it outperforms existing open-source medical models and expert-level imaging systems across multiple benchmarks.
Medical imaging provides critical evidence for clinical diagnosis, treatment planning, and surgical decisions, yet most existing imaging models are narrowly focused and require multiple specialized networks, limiting their generalization. Although large-scale language and multimodal models exhibit strong reasoning and multi-task capabilities, real-world clinical applications demand precise visual grounding, multimodal integration, and chain-of-thought reasoning. We introduce Citrus-V, a multimodal medical foundation model that combines image analysis with textual reasoning. The model integrates detection, segmentation, and multimodal chain-of-thought reasoning, enabling pixel-level lesion localization, structured report generation, and physician-like diagnostic inference in a single framework. We propose a novel multimodal training approach and release a curated open-source data suite covering reasoning, detection, segmentation, and document understanding tasks. Evaluations demonstrate that Citrus-V outperforms existing open-source medical models and expert-level imaging systems across multiple benchmarks, delivering a unified pipeline from visual grounding to clinical reasoning and supporting precise lesion quantification, automated reporting, and reliable second opinions.