CV, Feb 21

Initialization matters in few-shot adaptation of vision-language models for histopathological image classification

arXiv:2602.18766v1
Originality: Incremental advance
AI Analysis

This paper addresses efficient transfer learning for whole-slide image classification in medical imaging. The advance is incremental: it builds on existing MIL frameworks and initialization techniques.

The paper tackles few-shot adaptation of vision-language models for histopathological image classification. It proposes Zero-Shot Multiple-Instance Learning (ZS-MIL), which initializes the classifier with class-level embeddings from the text encoder. In the reported experiments, this initialization is more robust and performs better than random initialization.

Vision-language models (VLMs) pre-trained on datasets of histopathological image-caption pairs have enabled zero-shot slide-level classification. The ability of VLM image encoders to extract discriminative features also opens the door to supervised fine-tuning for whole-slide image (WSI) classification, ideally using few labeled samples. Because WSIs are gigapixel-sized, slide-level prediction frameworks must incorporate multiple instance learning (MIL). Following patch-level feature extraction and aggregation, MIL frameworks rely on linear classifiers trained on top of the aggregated slide-level features. Classifier weight initialization has a large influence on linear probing performance in efficient transfer learning (ETL) approaches based on few-shot learning. In this work, we propose Zero-Shot Multiple-Instance Learning (ZS-MIL) to address the limitations of random classifier initialization, which underperforms zero-shot prediction in MIL problems. ZS-MIL uses the class-level embeddings of the VLM text encoder as the classification layer's starting point to compute each sample's bag-level probabilities. Through multiple experiments, we demonstrate the robustness of ZS-MIL compared to well-known weight initialization techniques, in terms of both performance and variability, in an ETL few-shot scenario for subtyping prediction.
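
The initialization idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes mean pooling as the MIL aggregator, cosine-similarity logits, and a softmax temperature of 0.01, none of which are specified in the abstract. All function names and the toy 4-dimensional embeddings are hypothetical, and only the Python standard library is used.

```python
import math
import random


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mean_pool(patch_features):
    """Aggregate patch-level features into one slide-level vector.

    Assumption: mean pooling stands in for whatever MIL aggregator is used.
    """
    n = len(patch_features)
    return [sum(column) / n for column in zip(*patch_features)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_init(text_embeddings):
    """Classifier weights start as the normalized class-level text embeddings."""
    return [l2_normalize(e) for e in text_embeddings]


def random_init(num_classes, dim, seed=0):
    """Baseline for comparison: random unit-norm weights."""
    rng = random.Random(seed)
    return [l2_normalize([rng.gauss(0.0, 1.0) for _ in range(dim)])
            for _ in range(num_classes)]


def bag_probabilities(patch_features, weights, temperature=0.01):
    """Bag-level class probabilities from a linear layer on the pooled feature.

    Assumption: the temperature value is illustrative, not taken from the paper.
    """
    slide = l2_normalize(mean_pool(patch_features))
    logits = [sum(w_i * s_i for w_i, s_i in zip(w, slide)) / temperature
              for w in weights]
    return softmax(logits)


if __name__ == "__main__":
    # Toy stand-ins for text-encoder outputs of two subtype prompts.
    class_text_embeddings = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
    # Toy patch features for one slide, lying close to class 0.
    bag = [[0.9, 0.1, 0.0, 0.0], [0.8, 0.2, 0.1, 0.0]]

    print("zero-shot init:", bag_probabilities(bag, zero_shot_init(class_text_embeddings)))
    print("random init:   ", bag_probabilities(bag, random_init(2, 4)))
```

With the zero-shot weights, the untrained classifier already reproduces the zero-shot prediction, so few-shot fine-tuning starts from that point. With random weights, the starting prediction carries no class information.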

