CVJul 10, 2025

THUNDER: Tile-level Histopathology image UNDERstanding benchmark

Pierre Marza, Leo Fillioux, Sofiène Boutaj, Kunal Mahatha, Christian Desrosiers, Pablo Piantanida, Jose Dolz, Stergios Christodoulidis, Maria Vakalopoulou

arXiv:2507.07860v28 citationsh-index: 40Has Code

Originality Synthesis-oriented

AI Analysis

This work addresses the need for reliable benchmarking in digital pathology, a critical healthcare domain, by offering a comprehensive tool for researchers and practitioners, though it is incremental as it builds on existing benchmarking concepts.

The authors tackled the challenge of assessing progress in digital pathology by introducing THUNDER, a tile-level benchmark for comparing foundation models, which evaluated 23 models on 16 datasets across diverse tasks and provided insights into feature spaces, robustness, and uncertainty.

Progress in a research field can be hard to assess, in particular when many concurrent methods are proposed in a short period of time. This is the case in digital pathology, where many foundation models have been released recently to serve as feature extractors for tile-level images, being used in a variety of downstream tasks, both for tile- and slide-level problems. Benchmarking available methods then becomes paramount to get a clearer view of the research landscape. In particular, in critical domains such as healthcare, a benchmark should not only focus on evaluating downstream performance, but also provide insights about the main differences between methods, and importantly, further consider uncertainty and robustness to ensure a reliable usage of proposed models. For these reasons, we introduce THUNDER, a tile-level benchmark for digital pathology foundation models, allowing for efficient comparison of many models on diverse datasets with a series of downstream tasks, studying their feature spaces and assessing the robustness and uncertainty of predictions informed by their embeddings. THUNDER is a fast, easy-to-use, dynamic benchmark that can already support a large variety of state-of-the-art foundation, as well as local user-defined models for direct tile-based comparison. In this paper, we provide a comprehensive comparison of 23 foundation models on 16 different datasets covering diverse tasks, feature analysis, and robustness. The code for THUNDER is publicly available at https://github.com/MICS-Lab/thunder.

View on arXiv PDF Code

Similar