LG · AI · Oct 7, 2025

How NOT to benchmark your SITE metric: Beyond Static Leaderboards and Towards Realistic Evaluation

arXiv:2510.06448v11 citations · h-index: 6
Originality: Synthesis-oriented
AI Analysis

This work addresses misleading evaluation protocols in transferability estimation and offers incremental improvements to benchmarking practice.

The paper identifies fundamental flaws in existing benchmarks for transferability estimation metrics, showing that unrealistic model spaces and static performance hierarchies artificially inflate metric performance, allowing simple heuristics to outperform sophisticated methods. It provides recommendations for constructing more robust benchmarks to better reflect real-world model selection complexities.

Transferability estimation metrics are used to find a high-performing pre-trained model for a given target task without fine-tuning models and without access to the source dataset. Despite the growing interest in developing such metrics, the benchmarks used to measure their progress have gone largely unexamined. In this work, we empirically show the shortcomings of widely used benchmark setups to evaluate transferability estimation metrics. We argue that the benchmarks on which these metrics are evaluated are fundamentally flawed. We empirically demonstrate that their unrealistic model spaces and static performance hierarchies artificially inflate the perceived performance of existing metrics, to the point where simple, dataset-agnostic heuristics can outperform sophisticated methods. Our analysis reveals a critical disconnect between current evaluation protocols and the complexities of real-world model selection. To address this, we provide concrete recommendations for constructing more robust and realistic benchmarks to guide future research in a more meaningful direction.
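The claim about static performance hierarchies can be illustrated with a minimal sketch. It assumes that a metric is judged by the rank correlation (here plain Kendall's tau, an assumption rather than the paper's stated protocol) between its per-model scores and the fine-tuned accuracies on each target dataset. All names and numbers below are hypothetical toy values chosen for illustration, not results from the paper. When the model ranking barely changes across datasets, a fixed, dataset-agnostic ordering scores as well as any metric could.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a between two equal-length score lists (tied pairs count as 0)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    total = 0
    for i, j in combinations(range(len(xs)), 2):
        prod = (xs[i] - xs[j]) * (ys[i] - ys[j])
        total += (prod > 0) - (prod < 0)  # +1 concordant, -1 discordant, 0 tie
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return total / n_pairs


# Hypothetical fine-tuned accuracies for four candidate models on three
# target datasets (toy values; the model ordering is identical everywhere).
accuracies = {
    "dataset_a": [0.70, 0.78, 0.85, 0.91],
    "dataset_b": [0.55, 0.63, 0.72, 0.80],
    "dataset_c": [0.60, 0.66, 0.79, 0.83],
}

# Hypothetical dataset-agnostic heuristic: one fixed score per model
# (for example, a proxy such as model size), reused for every dataset.
static_prior = [1.0, 2.0, 3.0, 4.0]


def evaluate_heuristic(accs_by_dataset, prior):
    """Rank correlation of a fixed per-model prior with accuracy on each dataset."""
    return {name: kendall_tau(prior, accs) for name, accs in accs_by_dataset.items()}


def hierarchy_stability(accs_by_dataset):
    """Mean pairwise tau between datasets' accuracy lists; 1.0 means a fully static hierarchy."""
    taus = [
        kendall_tau(accs_by_dataset[a], accs_by_dataset[b])
        for a, b in combinations(sorted(accs_by_dataset), 2)
    ]
    return sum(taus) / len(taus)


heuristic_scores = evaluate_heuristic(accuracies, static_prior)
stability = hierarchy_stability(accuracies)
print(heuristic_scores, stability)
```

In this toy setup the fixed prior reaches a tau of 1.0 on every dataset without looking at any target data, which is the kind of inflated score the abstract warns about. A benchmark where `hierarchy_stability` is well below 1.0 would force a metric to use target-task information to rank models well.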
