CV · Feb 28

U-VLM: Hierarchical Vision Language Modeling for Report Generation

Pengcheng Shi, Minghui Zhang, Kehan Song, Jiaqi Liu, Yun Gu, Xinglin Zhang

arXiv:2603.00479v12.81 citations · h-index: 36 · Has Code

Originality: Incremental advance

AI Analysis

This work aims to reduce radiologist workload and improve diagnostic consistency in medical imaging. It is an incremental advance: it builds on existing vision-language models and adds specific architectural enhancements.

The paper tackles automated radiology report generation for 3D medical imaging by proposing U-VLM, a hierarchical vision-language model that uses progressive training and multi-layer visual injection, achieving state-of-the-art performance with improvements such as F1 scores of 0.414 vs. 0.258 on CT-RATE and 0.624 vs. 0.518 on AbdomenAtlas 3.0.
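To put the quoted F1 gaps in relative terms, the short sketch below computes the absolute and relative differences from the figures stated above. The helper name `relative_gain` and the `results` dictionary are illustrative, not part of the paper or its code.

```python
def relative_gain(new: float, baseline: float) -> float:
    """Return the improvement of `new` over `baseline` as a percentage of `baseline`."""
    return (new - baseline) / baseline * 100.0


# (U-VLM score, compared baseline score), as quoted in the summary above.
results = {
    "CT-RATE F1": (0.414, 0.258),
    "AbdomenAtlas 3.0 F1": (0.624, 0.518),
}

for name, (new, baseline) in results.items():
    print(
        f"{name}: +{new - baseline:.3f} absolute, "
        f"+{relative_gain(new, baseline):.1f}% relative"
    )
```

The relative figure depends on the baseline's scale, so the absolute gap is the safer number to compare across datasets.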

Automated radiology report generation is key for reducing radiologist workload and improving diagnostic consistency, yet generating accurate reports for 3D medical imaging remains challenging. Existing vision-language models face two limitations: they do not leverage segmentation-pretrained encoders, and they inject visual features only at the input layer of language models, losing multi-scale information. We propose U-VLM, which enables hierarchical vision-language modeling in both training and architecture: (1) progressive training from segmentation to classification to report generation, and (2) multi-layer visual injection that routes U-Net encoder features to corresponding language model layers. Each training stage can leverage different datasets without unified annotations. U-VLM achieves state-of-the-art performance on CT-RATE (F1: 0.414 vs 0.258, BLEU-mean: 0.349 vs 0.305) and AbdomenAtlas 3.0 (F1: 0.624 vs 0.518 for segmentation-based detection) using only a 0.1B decoder trained from scratch, demonstrating that well-designed vision encoder pretraining outweighs the benefits of 7B+ pre-trained language models. Ablation studies show that progressive pretraining significantly improves F1, while multi-layer injection improves BLEU-mean. Code is available at https://github.com/yinghemedical/U-VLM.
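As a rough illustration of what "routing U-Net encoder features to corresponding language model layers" could look like, the sketch below builds a mapping from decoder layers to encoder stages. The abstract does not specify the routing rule, so the even spacing, the function name `route_stages_to_layers`, and the example sizes (4 stages, 12 layers) are all assumptions made for illustration and are not taken from the U-VLM code.

```python
def route_stages_to_layers(num_stages: int, num_layers: int) -> dict[int, int]:
    """Map decoder layer index -> encoder stage index.

    Assumption (not from the paper): stages are spread evenly over the
    decoder depth, with stage 0 injected at decoder layer 0.
    """
    if num_stages < 1:
        raise ValueError("need at least one encoder stage")
    if num_layers < num_stages:
        raise ValueError("need at least one decoder layer per encoder stage")
    # Integer arithmetic keeps the chosen layer indices exact and distinct.
    return {stage * num_layers // num_stages: stage for stage in range(num_stages)}


# Hypothetical sizes: a 4-stage encoder feeding a 12-layer decoder.
routing = route_stages_to_layers(num_stages=4, num_layers=12)
for layer in range(12):
    stage = routing.get(layer)
    source = f"encoder stage {stage}" if stage is not None else "no injection"
    print(f"decoder layer {layer:2d}: {source}")
```

By contrast, input-only injection, which the abstract identifies as a limitation of existing models, would correspond to a mapping with a single entry at layer 0.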
