LG · Oct 13, 2021

Newer is not always better: Rethinking transferability metrics, their peculiarities, stability and performance

Shibal Ibrahim, Natalia Ponomareva, Rahul Mazumder

arXiv:2110.06893v3 · 11.922 citations

Originality: Incremental advance

AI Analysis

This work addresses the efficient selection of pre-trained models for fine-tuning, which matters most to practitioners with limited resources. The contribution is incremental: it builds on existing transferability metrics.

The paper traces the poor performance of the H-score transferability metric to statistical problems in covariance estimation and proposes a shrinkage-based estimator. The estimator yields up to 80% absolute gain in H-score correlation performance, making it competitive with the state-of-the-art LogME measure while being 3 to 10 times faster to compute. The paper also identifies and corrects overlooked issues in the target task selection setting for metrics such as NCE and LEEP, and supports its findings with extensive experiments on vision models and graph neural networks.

Fine-tuning of large pre-trained image and language models on small customized datasets has become increasingly popular for improved prediction and efficient use of limited resources. Fine-tuning requires identifying the best models to transfer-learn from, and quantifying transferability avoids expensive re-training on all candidate model/task pairs. In this paper, we show that statistical problems with covariance estimation drive the poor performance of H-score, a common baseline for newer metrics, and propose a shrinkage-based estimator. This results in up to 80% absolute gain in H-score correlation performance, making it competitive with the state-of-the-art LogME measure. Our shrinkage-based H-score is $3\times$ to $10\times$ faster to compute than LogME. Additionally, we look into the less common setting of target (as opposed to source) task selection. We demonstrate previously overlooked problems in such settings (different numbers of labels, class-imbalance ratios, etc.) for some recent metrics, e.g., NCE and LEEP, that resulted in them being misrepresented as leading measures. We propose a correction and recommend measuring correlation performance against relative accuracy in such settings. We support our findings with ~164,000 experiments (fine-tuning trials) on both vision models and graph neural networks.
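To illustrate why covariance estimation matters for H-score, here is a minimal sketch in Python with NumPy. It assumes the standard H-score form, $\mathrm{tr}(\mathrm{cov}(f)^{-1}\,\mathrm{cov}(\mathbb{E}[f \mid y]))$, and regularizes the feature covariance by plain linear shrinkage toward a scaled identity. The function name `h_score` and the hand-set parameter `alpha` are hypothetical choices for illustration; this is not the authors' exact estimator, which the abstract does not spell out.

```python
import numpy as np


def h_score(features, labels, alpha=0.0):
    """H-score tr(cov(f)^-1 cov(E[f|y])) with optional linear shrinkage.

    alpha in [0, 1] blends the sample covariance with a scaled identity.
    It is a hypothetical, hand-set parameter; alpha=0 gives the
    unregularized score.
    """
    f = np.asarray(features, dtype=float)
    y = np.asarray(labels)
    f = f - f.mean(axis=0)  # center the features
    n, d = f.shape

    # Sample covariance of the features.
    cov_f = f.T @ f / n

    # Replace every row by its class mean to get cov(E[f|y]).
    g = np.zeros_like(f)
    for c in np.unique(y):
        mask = y == c
        g[mask] = f[mask].mean(axis=0)
    cov_g = g.T @ g / n

    # Shrink toward mu * I, where mu is the average feature variance.
    mu = np.trace(cov_f) / d
    cov_shrunk = (1.0 - alpha) * cov_f + alpha * mu * np.eye(d)

    # pinv also handles the singular case that arises when d > n and alpha=0.
    return float(np.trace(np.linalg.pinv(cov_shrunk) @ cov_g))
```

When the feature dimension exceeds the number of samples, the sample covariance is singular and its inverse is unreliable; any `alpha > 0` makes the shrunk matrix positive definite, which is the intuition behind a shrinkage-based estimator. In practice the shrinkage intensity would be chosen by a data-driven rule rather than by hand.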
