CV AI LG · Sep 1, 2025

Unified Supervision For Vision-Language Modeling in 3D Computed Tomography

Hao-Chih Lee, Zelong Liu, Hamza Ahmed, Spencer Kim, Sean Huver, Vishwesh Nath, Zahi A. Fayad, Timothy Deyer, Xueyan Mei

arXiv:2509.01554v1 · 10.24 citations · h-index: 14 · 2025 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)

Originality: Highly original

AI Analysis

This work addresses the unreliability of general-purpose VLMs in high-stakes clinical radiology, a problem that directly affects medical professionals, and represents a strong, specific gain rather than an incremental improvement.

The paper tackles the insufficient discriminative precision of general-purpose vision-language models in diagnostic radiology by introducing Uniferum, a volumetric VLM that unifies diverse supervision signals from classification labels and segmentation masks. It reports a 7% improvement in AUROC on the CT-RATE benchmark over CLIP-based and conventional multi-label convolutional models.

General-purpose vision-language models (VLMs) have emerged as promising tools in radiology, offering zero-shot capabilities that mitigate the need for large labeled datasets. However, in high-stakes domains like diagnostic radiology, these models often lack the discriminative precision required for reliable clinical use. This challenge is compounded by the scarcity and heterogeneity of publicly available volumetric CT datasets, which vary widely in annotation formats and granularity. To address these limitations, we introduce Uniferum, a volumetric VLM that unifies diverse supervision signals, encoded in classification labels and segmentation masks, into a single training framework. By harmonizing three public 3D CT datasets with distinct annotations, Uniferum achieves state-of-the-art performance, improving AUROC on the CT-RATE benchmark by 7% compared to CLIP-based and conventional multi-label convolutional models. The model demonstrates robust out-of-distribution generalization, with observed evidence of unexpected zero-shot performance on the RAD-CHEST and INSPECT datasets. Our results highlight the effectiveness of integrating heterogeneous annotations and body segmentation to enhance model performance, setting a new direction for clinically reliable, data-efficient VLMs in 3D medical imaging.
