CV | AI | May 14, 2025

Endo-CLIP: Progressive Self-Supervised Pre-training on Raw Colonoscopy Records

arXiv:2505.09435v1 | 2 citations | h-index: 5 | MICCAI
Originality: Incremental advance
AI Analysis

This work addresses challenges in endoscopic image analysis for medical applications, representing an incremental advancement in domain-specific pre-training.

The paper tackles pre-training on colonoscopy image-text records by introducing Endo-CLIP, a self-supervised framework that improves polyp detection and classification and, according to the authors' experiments, significantly outperforms state-of-the-art pre-training methods in zero-shot and few-shot settings.

Pre-training on image-text colonoscopy records offers substantial potential for improving endoscopic image analysis, but faces challenges including non-informative background images, complex medical terminology, and ambiguous multi-lesion descriptions. We introduce Endo-CLIP, a novel self-supervised framework that enhances Contrastive Language-Image Pre-training (CLIP) for this domain. Endo-CLIP's three-stage framework (cleansing, attunement, and unification) addresses these challenges by (1) removing background frames, (2) leveraging large language models to extract clinical attributes for fine-grained contrastive learning, and (3) employing patient-level cross-attention to resolve multi-polyp ambiguities. Extensive experiments demonstrate that Endo-CLIP significantly outperforms state-of-the-art pre-training methods in zero-shot and few-shot polyp detection and classification, paving the way for more accurate and clinically relevant endoscopic analysis.
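
For readers unfamiliar with the CLIP objective that Endo-CLIP builds on, the following is a minimal sketch of the standard symmetric contrastive (InfoNCE) loss over a batch of matched image-text embedding pairs. It is not the authors' implementation and does not model the cleansing, attunement, or unification stages. The function names (`l2_normalize`, `log_softmax`, `clip_loss`), the toy 2-D embeddings, and the temperature value of 0.07 are assumptions chosen for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def clip_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair i of images and texts is the positive.

    temperature=0.07 is an illustrative assumption, not a value from the paper.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j]: cosine similarity of image i and text j, divided by temperature
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Image-to-text direction: the correct "class" for row i is column i.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: same computation on the transposed logits.
    transposed = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(transposed[j])[j] for j in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)


# Toy usage: correctly paired embeddings should score a lower loss than swapped ones.
aligned = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this sketch every non-matching pair in the batch acts as a negative, which hints at why the non-informative background frames and ambiguous multi-lesion descriptions named in the abstract could be harmful to a plain CLIP objective: they would supply noisy positives and negatives.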
