To What Extent Do Token-Level Representations from Pathology Foundation Models Improve Dense Prediction?
This work addresses how to select and adapt PFMs for real-world dense pathology tasks and provides a reproducible benchmark for researchers and practitioners. Its contribution is incremental: it builds on existing PFMs and does not introduce new methods.
The authors address the lack of a clear understanding of how pathology foundation models (PFMs) perform on dense prediction tasks such as segmentation. They introduce PFM-DenseBench, a large-scale benchmark that evaluates 17 PFMs across 18 public segmentation datasets, and report findings on when and why different models and tuning strategies succeed or fail.
Pathology foundation models (PFMs) have rapidly advanced and are becoming a common backbone for downstream clinical tasks, offering strong transferability across tissues and institutions. However, for dense prediction (e.g., segmentation), practical deployment still lacks a clear, reproducible understanding of how different PFMs behave across datasets and how adaptation choices affect performance and stability. We present PFM-DenseBench, a large-scale benchmark for dense pathology prediction, evaluating 17 PFMs across 18 public segmentation datasets. Under a unified protocol, we systematically assess PFMs with multiple adaptation and fine-tuning strategies, and derive insightful, practice-oriented findings on when and why different PFMs and tuning choices succeed or fail across heterogeneous datasets. We release containers, configs, and dataset cards to enable reproducible evaluation and informed PFM selection for real-world dense pathology tasks. Project Website: https://m4a1tastegood.github.io/PFM-DenseBench
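To make "dense prediction from token-level representations" concrete, the following is a minimal sketch of one generic adaptation: a per-token linear classifier on frozen patch tokens, followed by nearest-neighbour upsampling to a pixel mask. It is not taken from PFM-DenseBench. The function names (`linear_probe_logits`, `tokens_to_mask`), the toy weights, and the assumption that the backbone emits one token per square patch in row-major order are all illustrative.

```python
from typing import List, Sequence


def linear_probe_logits(
    tokens: Sequence[Sequence[float]],
    weights: Sequence[Sequence[float]],
    bias: Sequence[float],
) -> List[List[float]]:
    """Per-token linear classifier: logits[n][c] = dot(tokens[n], weights[c]) + bias[c]."""
    return [
        [sum(t * w for t, w in zip(token, w_c)) + b_c for w_c, b_c in zip(weights, bias)]
        for token in tokens
    ]


def tokens_to_mask(
    tokens: Sequence[Sequence[float]],
    weights: Sequence[Sequence[float]],
    bias: Sequence[float],
    grid_h: int,
    grid_w: int,
    patch: int,
) -> List[List[int]]:
    """Turn N = grid_h * grid_w patch tokens into a (grid_h*patch) x (grid_w*patch) label mask.

    Assumes tokens are ordered row-major over the patch grid (hypothetical layout).
    """
    if len(tokens) != grid_h * grid_w:
        raise ValueError("number of tokens must equal grid_h * grid_w")
    logits = linear_probe_logits(tokens, weights, bias)
    # Arg-max class per token.
    labels = [max(range(len(row)), key=row.__getitem__) for row in logits]
    # Nearest-neighbour upsampling: every pixel inherits the label of its patch.
    return [
        [labels[(y // patch) * grid_w + (x // patch)] for x in range(grid_w * patch)]
        for y in range(grid_h * patch)
    ]


# Toy example: a 2x2 token grid, 2-dimensional tokens, 2 classes, 2x2-pixel patches.
toy_tokens = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
toy_weights = [[1.0, 0.0], [0.0, 1.0]]  # illustrative weights, not learned
toy_bias = [0.0, 0.0]
toy_mask = tokens_to_mask(toy_tokens, toy_weights, toy_bias, grid_h=2, grid_w=2, patch=2)
for mask_row in toy_mask:
    print(mask_row)
```

In practice the linear head would be trained, and heavier decoders or fine-tuning of the backbone are alternative adaptation choices. The sketch only shows the shape bookkeeping that any token-to-mask pipeline has to perform.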