LG, CV | Feb 29, 2024

Efficient Lifelong Model Evaluation in an Era of Rapid Progress

Cambridge
arXiv:2402.19472v2 | 8 citations | h-index: 37 | NIPS
AI Analysis

This paper addresses efficient model evaluation for machine learning researchers and practitioners. Its contribution is incremental, as it builds on existing dynamic programming and ranking techniques.

The paper tackles the high computational cost of evaluating many models on large-scale benchmarks by introducing the Sort & Search framework, which reduces compute cost from 180 GPU days to 5 GPU hours (about 1000x reduction) on Lifelong-CIFAR10 and Lifelong-ImageNet with low error.

Standardized benchmarks drive progress in machine learning. However, with repeated testing, the risk of overfitting grows as algorithms over-exploit benchmark idiosyncrasies. In our work, we seek to mitigate this challenge by compiling ever-expanding large-scale benchmarks called Lifelong Benchmarks. These benchmarks introduce a major challenge: the high cost of evaluating a growing number of models across very large sample sets. To address this challenge, we introduce an efficient framework for model evaluation, Sort & Search (S&S), which reuses previously evaluated models by leveraging dynamic programming algorithms to selectively rank and sub-select test samples. To test our approach at scale, we create Lifelong-CIFAR10 and Lifelong-ImageNet, containing 1.69M and 1.98M test samples for classification. Extensive empirical evaluations across over 31,000 models demonstrate that S&S achieves highly efficient approximate accuracy measurement, reducing compute cost from 180 GPU days to 5 GPU hours (about 1000x reduction) on a single A100 GPU, with low approximation error and memory cost of <100MB. Our work also highlights issues with current accuracy prediction metrics, suggesting a need to move towards sample-level evaluation metrics. We hope to guide future research by showing our method's bottleneck lies primarily in generalizing Sort beyond a single rank order and not in improving Search.
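To make the rank-then-sub-select idea concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes that sample difficulty is approximated by the fraction of earlier models that answered correctly, that a new model is run on only a few probe samples, and that a single cut point along the easy-to-hard order is fitted to those probes. The function names `sort_samples` and `search_threshold` are hypothetical, and a prefix-sum scan stands in for the dynamic programming the abstract mentions.

```python
import numpy as np


def sort_samples(past_correct):
    """Order sample indices from easiest to hardest.

    past_correct: (n_models, n_samples) 0/1 matrix of earlier models' results.
    Assumption: difficulty is the fraction of earlier models that were correct.
    """
    solve_rate = np.asarray(past_correct).mean(axis=0)
    return np.argsort(-solve_rate, kind="stable")


def search_threshold(order, probe_ranks, probe_correct):
    """Fit a cut k: ranks below k are predicted correct, ranks from k on wrong.

    probe_ranks: positions in `order` where the new model was actually run.
    probe_correct: 0/1 outcomes of the new model at those positions.
    Returns the k with the fewest disagreements on the probes.
    """
    n = len(order)
    idx = np.argsort(probe_ranks)
    ranks = np.asarray(probe_ranks)[idx]
    y = np.asarray(probe_correct).astype(int)[idx]
    # wrong_before[j]: wrong probes among the first j probes (predicted correct)
    wrong_before = np.concatenate(([0], np.cumsum(1 - y)))
    # right_after[j]: correct probes from probe j onward (predicted wrong)
    right_after = np.concatenate((np.cumsum(y[::-1])[::-1], [0]))
    j = int(np.argmin(wrong_before + right_after))
    # The cut sits at the first probe predicted wrong, or at n if there is none.
    return n if j == len(ranks) else int(ranks[j])


# Toy usage: 10 samples already in easy-to-hard order, 5 probes.
order = np.arange(10)
k = search_threshold(order, [0, 3, 5, 7, 9], [1, 1, 1, 0, 0])
estimated_accuracy = k / len(order)
```

Under these assumptions, only the probe samples require a forward pass of the new model, which is where the compute saving would come from. The estimate is only as good as the single rank order, which matches the abstract's remark that the bottleneck lies in Sort rather than Search.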

Code Implementations: 1 repo
