CV · CL · Feb 5

PhenoLIP: Integrating Phenotype Ontology Knowledge into Medical Vision-Language Pretraining

arXiv:2602.06184v1
h-index: 20
Originality: Highly original
AI Analysis

This work addresses the need for more structured and interpretable medical image understanding in medical AI. It is incremental in that it builds on existing CLIP-like models, but its knowledge integration approach is novel.

The paper tackles the failure of medical vision-language models to capture the systematic visual knowledge encoded in phenotype ontologies. It constructs PhenoKG, a large-scale phenotype-centric multimodal knowledge graph, and proposes PhenoLIP, a pretraining framework that integrates structured phenotype knowledge. PhenoLIP outperforms previous state-of-the-art baselines, improving phenotype classification accuracy by 8.85% over BiomedCLIP and cross-modal retrieval by 15.03% over BIOMEDICA.

Recent progress in large-scale CLIP-like vision-language models (VLMs) has greatly advanced medical image analysis. However, most existing medical VLMs still rely on coarse image-text contrastive objectives and fail to capture the systematic visual knowledge encoded in well-defined medical phenotype ontologies. To address this gap, we construct PhenoKG, the first large-scale, phenotype-centric multimodal knowledge graph that encompasses over 520K high-quality image-text pairs linked to more than 3,000 phenotypes. Building upon PhenoKG, we propose PhenoLIP, a novel pretraining framework that explicitly incorporates structured phenotype knowledge into medical VLMs through a two-stage process. We first learn a knowledge-enhanced phenotype embedding space from textual ontology data and then distill this structured knowledge into multimodal pretraining via a teacher-guided knowledge distillation objective. To support evaluation, we further introduce PhenoBench, an expert-verified benchmark designed for phenotype recognition, comprising over 7,800 image-caption pairs covering more than 1,000 phenotypes. Extensive experiments demonstrate that PhenoLIP outperforms previous state-of-the-art baselines, improving upon BiomedCLIP in phenotype classification accuracy by 8.85% and BIOMEDICA in cross-modal retrieval by 15.03%, underscoring the value of integrating phenotype-centric priors into medical VLMs for structured and interpretable medical image understanding.
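
The abstract does not spell out the training objective, so the following is only a minimal sketch of what a CLIP-style contrastive loss combined with a teacher-guided distillation term could look like. Everything here is an assumption for illustration: the function names, the temperature of 0.07, the choice of KL divergence over similarities to phenotype anchor embeddings, and the `weight` that balances the two terms. It is not the authors' implementation.

```python
import math

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss: the i-th image is paired with the i-th text."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    loss = 0.0
    for i in range(n):
        img_to_txt = softmax([dot(imgs[i], t) for t in txts], temperature)
        txt_to_img = softmax([dot(txts[i], m) for m in imgs], temperature)
        loss -= math.log(img_to_txt[i]) + math.log(txt_to_img[i])
    return loss / (2 * n)

def distillation_loss(student_embs, teacher_embs, phenotype_embs, temperature=0.07):
    """Mean KL(teacher || student) between similarity distributions over
    phenotype anchors. Hypothetical formulation, assumed for this sketch;
    all embeddings are assumed to share one dimension."""
    anchors = [normalize(p) for p in phenotype_embs]
    total = 0.0
    for s, t in zip(student_embs, teacher_embs):
        p_teacher = softmax([dot(normalize(t), a) for a in anchors], temperature)
        p_student = softmax([dot(normalize(s), a) for a in anchors], temperature)
        total += sum(
            pt * math.log(pt / ps)
            for pt, ps in zip(p_teacher, p_student)
            if pt > 0.0
        )
    return total / len(student_embs)

def total_loss(image_embs, text_embs, teacher_embs, phenotype_embs, weight=1.0):
    """Contrastive term plus weighted distillation of the student text
    embeddings toward the teacher (which embeddings are distilled is an
    assumption of this sketch)."""
    return contrastive_loss(image_embs, text_embs) + weight * distillation_loss(
        text_embs, teacher_embs, phenotype_embs
    )
```

In this sketch the distillation term is zero when the student embeddings already induce the same distribution over phenotype anchors as the teacher, and grows as the two diverge, which is one plausible way a frozen, ontology-trained teacher could shape the multimodal embedding space.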

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
