CV · AI · Apr 14

Representation geometry shapes task performance in vision-language modeling for CT enterography

Cristian Minoccheri, Emily Wittrup, Kayvan Najarian, Ryan Stidham

arXiv:2604.13021 · 24.8 · h-index: 39

Predicted impact: top 88% in CV · last 90 days · Originality: Incremental advance

AI Analysis

This work provides the first baselines and practical guidance for vision-language modeling in CT enterography, an underexplored modality for inflammatory bowel disease assessment.

This paper presents the first study of vision-language transfer learning on abdominal CT enterography. Mean pooling outperforms attention pooling for categorical disease assessment (59.2% three-class accuracy), while attention pooling is better for cross-modal retrieval (0.235 text-to-image MRR). Multi-window RGB encoding outperforms multiplanar sampling, and retrieval-augmented generation raises within-1 severity accuracy in report generation to 7-14 percentage points above the chance baseline.
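The two slice aggregators compared above can be illustrated with a minimal sketch. It assumes per-slice embeddings are already available as a `(num_slices, dim)` array and uses a single query vector for attention scoring; the function names, the scaled dot-product scoring, and the random inputs are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def mean_pool(slice_embeddings: np.ndarray) -> np.ndarray:
    """Average per-slice embeddings (num_slices, dim) into one vector (dim,)."""
    return slice_embeddings.mean(axis=0)


def attention_pool(slice_embeddings: np.ndarray, query: np.ndarray) -> np.ndarray:
    """Softmax-weighted average of slice embeddings, scored against a query.

    The single-query, scaled dot-product form is an assumption made for
    illustration; the source does not describe its attention pooling head.
    """
    dim = slice_embeddings.shape[1]
    scores = slice_embeddings @ query / np.sqrt(dim)  # shape: (num_slices,)
    scores = scores - scores.max()                    # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax over slices
    return weights @ slice_embeddings                 # shape: (dim,)


# Synthetic stand-ins: 40 slices with 512-dimensional embeddings.
rng = np.random.default_rng(0)
slices = rng.normal(size=(40, 512))
query = rng.normal(size=512)  # would be learned in practice; random here

volume_mean = mean_pool(slices)
volume_attn = attention_pool(slices, query)
```

Mean pooling weights every slice equally, whereas attention pooling can concentrate weight on a few slices. With an all-zero query the softmax weights become uniform and the two aggregators coincide.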

Computed tomography (CT) enterography is a primary imaging modality for assessing inflammatory bowel disease (IBD), yet the representational choices that best support automated analysis of this modality are unknown. We present the first study of vision-language transfer learning on abdominal CT enterography and identify two main findings. First, mean pooling of slice embeddings gives better categorical disease assessment (59.2% three-class accuracy), whereas attention pooling gives better cross-modal retrieval (0.235 text-to-image MRR). This pattern holds across all LoRA configurations tested and suggests that the two aggregators emphasize different properties of the learned representation. Second, per-slice tissue contrast matters more than broader spatial coverage: multi-window RGB encoding, which maps complementary Hounsfield Unit windows to RGB channels, outperforms all strategies that increase spatial coverage through multiplanar sampling, and in this setting adding coronal and sagittal views reduces classification performance. For report generation, fine-tuning without retrieval context yields within-1 severity accuracy at the prevalence-matched chance level (70.4% vs. 71% random), suggesting little learned ordering beyond the class distribution. Retrieval-augmented generation (RAG) improves this across all configurations, scoring 7-14 percentage points above the chance baseline and improving ordinal MAE from 0.98 to 0.80-0.89. A three-teacher pseudolabel framework enables all comparisons without expert annotations. Together, these findings provide the first baselines for this underexplored modality and offer practical guidance for building vision-language systems for volumetric medical imaging.
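The multi-window RGB encoding described in the abstract can be sketched as follows. The abstract does not state which Hounsfield Unit windows were used, so the `(center, width)` pairs in `ASSUMED_WINDOWS` are hypothetical placeholders, and the helper names are illustrative rather than taken from the paper.

```python
import numpy as np

# Hypothetical (center, width) windows in Hounsfield Units. The windows
# actually used in the paper are not given in the abstract.
ASSUMED_WINDOWS = ((40, 400), (50, 150), (300, 1500))


def apply_window(hu: np.ndarray, center: float, width: float) -> np.ndarray:
    """Clip HU values to one window and rescale them linearly to [0, 1]."""
    lo = center - width / 2.0
    hi = center + width / 2.0
    return (np.clip(hu, lo, hi) - lo) / (hi - lo)


def multi_window_rgb(slice_hu: np.ndarray, windows=ASSUMED_WINDOWS) -> np.ndarray:
    """Map three HU windows of one slice (H, W) to an RGB image (H, W, 3)."""
    channels = [apply_window(slice_hu, c, w) for c, w in windows]
    return np.stack(channels, axis=-1).astype(np.float32)


# Synthetic 4x4 "slice" spanning -1000 to 1000 HU, for demonstration only.
slice_hu = np.linspace(-1000, 1000, 16, dtype=np.float32).reshape(4, 4)
rgb = multi_window_rgb(slice_hu)
```

Each channel shows the same slice at a different contrast setting, so an RGB image encoder receives three tissue-contrast views of one axial slice without any added spatial coverage.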
