Analysis of Transferability Estimation Metrics for Surgical Phase Recognition
This work addresses model selection in surgical video analysis, where expert annotations are costly. Its contribution is incremental, however: it benchmarks existing metrics without introducing new methods.
The paper tackles the problem of selecting pre-trained models for surgical phase recognition by benchmarking three transferability estimation metrics (LogME, H-Score, and TransRate) on two datasets. It finds that LogME with minimum per-subset aggregation aligns best with fine-tuning accuracy, whereas H-Score and TransRate perform poorly.
Fine-tuning pre-trained models has become a cornerstone of modern machine learning, allowing practitioners to achieve high performance with limited labeled data. In surgical video analysis, where expert annotations are especially time-consuming and costly, identifying the most suitable pre-trained model for a downstream task is both critical and challenging. Source-independent transferability estimation (SITE) offers a solution by predicting how well a model will fine-tune on target data using only its embeddings or outputs, without requiring full retraining. In this work, we formalize SITE for surgical phase recognition and provide the first comprehensive benchmark of three representative metrics, LogME, H-Score, and TransRate, on two diverse datasets (RAMIE and AutoLaparo). Our results show that LogME, particularly when aggregated by the minimum per-subset score, aligns most closely with fine-tuning accuracy; H-Score yields only weak predictive power; and TransRate often inverts the true model rankings. Ablation studies show that when candidate models perform similarly, transferability estimates lose discriminative power, which underscores the importance of maintaining model diversity or using additional validation. We conclude with practical guidelines for model selection and outline future directions toward domain-specific metrics, theoretical foundations, and interactive benchmarking tools.
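To make the evaluation idea concrete, the following is a minimal sketch of how a minimum per-subset aggregation could be compared against fine-tuning accuracy. It is not the paper's implementation: the function names, the model names, all numbers, and the choice of Kendall's tau as the agreement measure are assumptions made for illustration, and the definition of a "subset" is left to the paper.

```python
from itertools import combinations


def aggregate_min(per_subset_scores):
    """Collapse per-subset transferability scores to one score per model.

    Taking the minimum means a model is judged by its weakest subset.
    """
    return {model: min(scores) for model, scores in per_subset_scores.items()}


def kendall_tau(xs, ys):
    """Kendall's tau-a between two equal-length sequences.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical per-subset scores (e.g. LogME values) for three candidate models.
per_subset_scores = {
    "model_a": [0.62, 0.58, 0.60],
    "model_b": [0.70, 0.31, 0.66],
    "model_c": [0.45, 0.44, 0.47],
}
# Hypothetical fine-tuning accuracies for the same models.
finetune_accuracy = {"model_a": 0.84, "model_b": 0.71, "model_c": 0.78}

agg = aggregate_min(per_subset_scores)
models = sorted(agg)
tau = kendall_tau(
    [agg[m] for m in models],
    [finetune_accuracy[m] for m in models],
)
print(agg, tau)
```

A tau near 1 would indicate that the aggregated scores rank models in the same order as fine-tuning does, while a negative tau would correspond to the inverted rankings described for TransRate. With only a handful of candidate models, or with models of near-identical accuracy, such a correlation is fragile, which is consistent with the ablation finding above.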