CL | Oct 17, 2024

BenTo: Benchmark Task Reduction with In-Context Transferability

arXiv:2410.13804v3 | 4 citations | h-index: 13
Originality: Incremental advance
AI Analysis

The work addresses the cost of LLM evaluation for researchers and practitioners, though it is incremental, building on existing benchmark reduction ideas.

The paper tackles the high cost of evaluating large language models by proposing a method that reduces the number of benchmark tasks without significantly affecting evaluation quality. It reports cutting benchmarks to 5% of their tasks with less than a 4% difference in evaluation results.

Evaluating large language models (LLMs) is costly: it requires the generation and examination of LLM outputs on a large-scale benchmark of various tasks. This paper investigates how to efficiently reduce the tasks used to benchmark LLMs without affecting the evaluation quality. Our study reveals that task transferability and relevance provide critical information to identify the most representative subset of tasks via optimizing a facility location function. We propose a practically efficient metric for estimating the transferability between two tasks via in-context learning (ICL). By analyzing the pairwise transferability, we can reduce tasks in a modern LLM benchmark (e.g., MMLU or FLAN) to 5% while inducing only a <4% difference to the evaluation on the original benchmark. Compared to prior works, our method is training-free, gradient-free, and highly efficient, requiring ICL only.
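To make the facility location step concrete, here is a minimal sketch of subset selection over a pairwise task-similarity matrix. It is an illustration under stated assumptions, not the paper's implementation. It assumes the matrix `sim` is already available (in the paper it would be derived from ICL transferability; here it is hand-written toy data), that its entries are non-negative, and that a plain greedy maximizer is used, which is a common choice for facility location objectives but is not confirmed by the abstract. The function names are hypothetical.

```python
from typing import List, Sequence


def facility_location_value(sim: Sequence[Sequence[float]],
                            selected: Sequence[int]) -> float:
    """F(S) = sum over all tasks i of max over selected tasks j of sim[i][j].

    sim[i][j] is read as "how well task j represents task i".
    """
    if not selected:
        return 0.0
    return sum(max(row[j] for j in selected) for row in sim)


def greedy_select(sim: Sequence[Sequence[float]], k: int) -> List[int]:
    """Greedily pick k tasks, each time adding the task with the largest
    marginal gain in the facility location objective.

    Assumption: similarities are non-negative, so an empty selection
    covers every task with value 0.
    """
    n = len(sim)
    if not 0 <= k <= n:
        raise ValueError("k must be between 0 and the number of tasks")

    selected: List[int] = []
    best_cover = [0.0] * n  # best similarity to any selected task so far

    for _ in range(k):
        best_gain, best_j = -1.0, -1
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain of adding task j to the current selection.
            gain = sum(max(sim[i][j] - best_cover[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_gain, best_j = gain, j
        selected.append(best_j)
        best_cover = [max(best_cover[i], sim[i][best_j]) for i in range(n)]

    return selected


# Toy data (made up for illustration): tasks 0 and 1 resemble each other,
# as do tasks 2 and 3.
toy_sim = [
    [1.0, 0.9, 0.1, 0.1],
    [0.8, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.7],
    [0.1, 0.1, 0.6, 1.0],
]

chosen = greedy_select(toy_sim, k=2)
print(chosen)  # one task from each cluster: [1, 3]
```

On the toy matrix the greedy pass picks one task from each of the two clusters, which is the behaviour a representative subset should show: once a cluster is covered, adding a second task from it yields little marginal gain.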

Code Implementations: 1 repo
