LG CV · Feb 29, 2024

Efficient Lifelong Model Evaluation in an Era of Rapid Progress

Ameya Prabhu, Vishaal Udandarao, Philip Torr, Matthias Bethge, Adel Bibi, Samuel Albanie

Cambridge

arXiv:2402.19472v2 · 15.08 citations · h-index: 37 · Has Code · NIPS

Originality: Incremental advance

AI Analysis

This paper addresses the problem of efficient model evaluation for machine learning researchers and practitioners. The contribution is incremental, since it builds on existing dynamic programming and ranking techniques.

The paper tackles the high computational cost of evaluating many models on large-scale benchmarks by introducing the Sort & Search framework, which reduces compute cost from 180 GPU days to 5 GPU hours (about 1000x reduction) on Lifelong-CIFAR10 and Lifelong-ImageNet with low error.

Standardized benchmarks drive progress in machine learning. However, with repeated testing, the risk of overfitting grows as algorithms over-exploit benchmark idiosyncrasies. In our work, we seek to mitigate this challenge by compiling ever-expanding large-scale benchmarks called Lifelong Benchmarks. These benchmarks introduce a major challenge: the high cost of evaluating a growing number of models across very large sample sets. To address this challenge, we introduce an efficient framework for model evaluation, Sort & Search (S&S), which reuses previously evaluated models by leveraging dynamic programming algorithms to selectively rank and sub-select test samples. To test our approach at scale, we create Lifelong-CIFAR10 and Lifelong-ImageNet, containing 1.69M and 1.98M test samples for classification. Extensive empirical evaluations across over 31,000 models demonstrate that S&S achieves highly efficient approximate accuracy measurement, reducing compute cost from 180 GPU days to 5 GPU hours (about 1000x reduction) on a single A100 GPU, with low approximation error and memory cost of <100MB. Our work also highlights issues with current accuracy prediction metrics, suggesting a need to move towards sample-level evaluation metrics. We hope to guide future research by showing our method's bottleneck lies primarily in generalizing Sort beyond a single rank order and not in improving Search.
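The rank-then-sub-select idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`sort_samples`, `search_threshold`, `estimate_accuracy`), the difficulty measure (number of earlier models that solved a sample), the uniform sub-sampling, and the simple prefix scan standing in for the dynamic programming step are all assumptions made for illustration. The sketch assumes a single easy-to-hard order in which a new model gets roughly the first k samples right and the rest wrong.

```python
import random


def sort_samples(correct):
    """Order sample indices from easiest to hardest.

    correct[m][i] is 1 if previously evaluated model m classified
    sample i correctly, else 0 (hypothetical input format).
    """
    n_samples = len(correct[0])
    solved_by = [sum(row[i] for row in correct) for i in range(n_samples)]
    # sorted() is stable, so ties keep their original index order.
    return sorted(range(n_samples), key=lambda i: -solved_by[i])


def search_threshold(outcomes):
    """Find k minimizing mismatches with the pattern: k ones, then zeros.

    outcomes are a new model's 0/1 results listed in easy-to-hard order.
    """
    errors = sum(outcomes)  # k = 0 predicts all wrong; every 1 is a mismatch
    best_k, best_errors = 0, errors
    for k, y in enumerate(outcomes, start=1):
        # Moving the threshold past y fixes a mismatch if y == 1,
        # and introduces one if y == 0.
        errors += 1 if y == 0 else -1
        if errors < best_errors:
            best_k, best_errors = k, errors
    return best_k


def estimate_accuracy(order, is_correct, budget, seed=0):
    """Estimate accuracy by evaluating only `budget` of the sorted samples."""
    rng = random.Random(seed)
    positions = sorted(rng.sample(range(len(order)), budget))
    outcomes = [is_correct(order[p]) for p in positions]
    return search_threshold(outcomes) / budget
```

Under these assumptions, a new model costs only `budget` forward passes instead of one per test sample, and the fitted threshold also yields a per-sample prediction (samples ranked before the threshold are predicted correct), which is the kind of sample-level output the abstract argues for.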
