Effortless Vision-Language Model Specialization in Histopathology without Annotation
This work addresses the need for efficient VLM specialization in histopathology without costly annotations, though the contribution is incremental because it builds on existing VLM frameworks.
The paper tackles the suboptimal performance of general-purpose Vision-Language Models (VLMs) on specific histopathology tasks. It proposes an annotation-free adaptation method based on continued pretraining on domain- and task-relevant image-caption pairs, which improves both zero-shot and few-shot performance and, with larger training sizes, matches few-shot methods without manual labeling.
Recent advances in Vision-Language Models (VLMs) in histopathology, such as CONCH and QuiltNet, have demonstrated impressive zero-shot classification capabilities across various tasks. However, their general-purpose design may lead to suboptimal performance in specific downstream applications. While supervised fine-tuning methods address this issue, they require manually labeled samples for adaptation. This paper investigates annotation-free adaptation of VLMs through continued pretraining on domain- and task-relevant image-caption pairs extracted from existing databases. Our experiments on two VLMs, CONCH and QuiltNet, across three downstream tasks reveal that these pairs substantially enhance both zero-shot and few-shot performance. Notably, with larger training sizes, continued pretraining matches the performance of few-shot methods while eliminating manual labeling. Its effectiveness, task-agnostic design, and annotation-free workflow make it a promising pathway for adapting VLMs to new histopathology tasks. Code is available at https://github.com/DeepMicroscopy/Annotation-free-VLM-specialization.
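To make the idea of continued pretraining on image-caption pairs more concrete, the following is a minimal sketch of a CLIP-style symmetric contrastive loss over a batch of paired embeddings. The abstract does not state which training objective is used, so the choice of objective here is an assumption made for illustration; the function names and the temperature value of 0.07 are likewise hypothetical and are not taken from the linked repository.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Assumed CLIP-style objective: the i-th image should match the i-th caption.

    image_embs and text_embs are equally long lists of embedding vectors,
    where position i in both lists belongs to the same image-caption pair.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)

    # Pairwise cosine similarities, scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]

    # Image-to-text direction: each row should peak on its diagonal entry.
    loss_i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n

    # Text-to-image direction: same computation on the transposed matrix.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    loss_t2i = -sum(log_softmax(columns[k])[k] for k in range(n)) / n

    return 0.5 * (loss_i2t + loss_t2i)


# Toy embeddings: correctly paired captions should give a lower loss
# than the same captions assigned to the wrong images.
toy = [[1.0, 0.0], [0.0, 1.0]]
loss_matched = symmetric_contrastive_loss(toy, toy)
loss_shuffled = symmetric_contrastive_loss(toy, toy[::-1])
```

In an actual training loop the embeddings would come from the VLM's image and text encoders, and the loss would be minimized over the retrieved domain- and task-relevant pairs; no class labels enter the computation, which is what makes such a workflow annotation-free.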