CV · Mar 6, 2025

Semantic Alignment of Unimodal Medical Text and Vision Representations

Maxime Di Folco, Emily Chan, Marta Hasny, Cosmin I. Bercea, Julia A. Schnabel

arXiv:2503.04478v1 · 3.6 · h-index: 8

Originality: Incremental advance

AI Analysis

This work addresses the need for better AI integration in medical domains without extensive retraining, though the advance is incremental because it builds on existing alignment techniques.

The paper tackles the problem of general-purpose AI models underperforming in medical imaging by using semantic alignment to bridge them with specialized knowledge, demonstrating improved performance on chest X-ray tasks and introducing a zero-shot classification approach that outperforms general multimodal models.

General-purpose AI models, particularly those designed for text and vision, demonstrate impressive versatility across a wide range of deep-learning tasks. However, they often underperform in specialised domains like medical imaging, where domain-specific solutions or alternative knowledge transfer approaches are typically required. Recent studies have noted that general-purpose models can exhibit similar latent spaces when processing semantically related data, although this alignment does not occur naturally. Building on this insight, it has been shown that applying a simple transformation (at most affine), estimated from a subset of semantically corresponding samples known as anchors, enables model stitching across diverse training paradigms, architectures, and modalities. In this paper, we explore how semantic alignment (estimating transformations between anchors) can bridge general-purpose AI with specialised medical knowledge. Using multiple public chest X-ray datasets, we demonstrate that model stitching across model architectures allows general models to integrate domain-specific knowledge without additional training, leading to improved performance on medical tasks. Furthermore, we introduce a novel zero-shot classification approach for unimodal vision encoders that leverages semantic alignment across modalities. Our results show that our method not only outperforms general multimodal models but also approaches the performance levels of fully trained, medical-specific multimodal solutions.
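
The anchor-based alignment described in the abstract can be illustrated with a minimal sketch. It assumes the affine map is estimated by ordinary least squares, which is one plausible estimator and not necessarily the one the authors use. The data are synthetic, and the function names (fit_affine, apply_affine, zero_shot_classify) are hypothetical. The zero-shot step assumes classification by cosine similarity between mapped image embeddings and one text embedding per class; the paper's actual procedure may differ.

```python
import numpy as np


def fit_affine(src_anchors, tgt_anchors):
    """Least-squares affine map (W, b) such that src @ W + b approximates tgt.

    Assumption: ordinary least squares on paired anchor embeddings.
    """
    n = src_anchors.shape[0]
    # Append a column of ones so the bias is estimated jointly with W.
    src_h = np.hstack([src_anchors, np.ones((n, 1))])
    sol, *_ = np.linalg.lstsq(src_h, tgt_anchors, rcond=None)
    return sol[:-1], sol[-1]


def apply_affine(x, W, b):
    """Map embeddings from the source latent space into the target space."""
    return x @ W + b


def zero_shot_classify(image_emb, class_text_emb, W, b):
    """Assumed zero-shot rule: nearest class text embedding by cosine similarity."""
    mapped = apply_affine(image_emb, W, b)
    mapped = mapped / np.linalg.norm(mapped, axis=1, keepdims=True)
    text = class_text_emb / np.linalg.norm(class_text_emb, axis=1, keepdims=True)
    return (mapped @ text.T).argmax(axis=1)


# Synthetic stand-ins for two encoders' latent spaces (hypothetical sizes).
rng = np.random.default_rng(0)
d_src, d_tgt, n_anchors = 16, 8, 200
W_true = rng.normal(size=(d_src, d_tgt))
b_true = rng.normal(size=d_tgt)

# Anchors: semantically corresponding samples embedded by both encoders.
src_anchors = rng.normal(size=(n_anchors, d_src))
tgt_anchors = src_anchors @ W_true + b_true

W, b = fit_affine(src_anchors, tgt_anchors)

# Classify unseen source-space embeddings against three class text embeddings.
class_text_emb = rng.normal(size=(3, d_tgt))
test_images = rng.normal(size=(10, d_src))
predictions = zero_shot_classify(test_images, class_text_emb, W, b)
```

Because the synthetic target space is an exact affine function of the source space, the fit recovers the map almost exactly here. With real encoders the relation is only approximate, so the number and choice of anchors would matter.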
