LG AI SY | Aug 1, 2022

A Case for Dataset Specific Profiling

Seth Ockerman, John Wu, Christopher Stewart

arXiv:2208.03315v1 | 1.8 | h-index: 3

Originality: Highly original

AI Analysis

This paper addresses suboptimal model selection in data-driven science and proposes a dataset-aware benchmarking paradigm that reduces bias and accounts for each dataset's unique characteristics.

The paper argues that benchmarking on representative datasets biases model selection toward models suited to those datasets. It asks whether scientific datasets can significantly permute the rank order of models compared to representative datasets and, if so, whether lightweight model execution could improve benchmarking accuracy.

Data-driven science is an emerging paradigm where scientific discoveries depend on the execution of computational AI models against rich, discipline-specific datasets. With modern machine learning frameworks, anyone can develop and execute computational models that reveal concepts hidden in the data that could enable scientific applications. For important and widely used datasets, computing the performance of every computational model that can run against a dataset is cost-prohibitive in terms of cloud resources. Benchmarking approaches used in practice rely on representative datasets to infer performance without actually executing models. While practicable, these approaches limit extensive dataset profiling to a few datasets and introduce bias that favors models suited for representative datasets. As a result, each dataset's unique characteristics are left unexplored and subpar models are selected based on inference from generalized datasets. This necessitates a new paradigm that introduces dataset profiling into the model selection process. To demonstrate the need for dataset-specific profiling, we answer two questions: (1) Can scientific datasets significantly permute the rank order of computational models compared to widely used representative datasets? (2) If so, could lightweight model execution improve benchmarking accuracy? Taken together, the answers to these questions lay the foundation for a new dataset-aware benchmarking paradigm.
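As an illustration of question (1), the sketch below shows one way a permutation in model rank order between two datasets could be quantified. It is not the authors' method: the choice of Kendall's tau, the model names, and every score are assumptions made up for this example.

```python
from itertools import combinations


def rank_order(scores):
    """Return model names sorted from best to worst (higher score is better)."""
    return sorted(scores, key=scores.get, reverse=True)


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score tables over the same models.

    Assumes no tied scores. 1.0 means identical rankings, -1.0 means
    fully reversed rankings, and values near 0 mean little agreement.
    """
    models = sorted(scores_a)
    concordant = discordant = 0
    for m, n in combinations(models, 2):
        # Positive product: both datasets order this pair the same way.
        sign = (scores_a[m] - scores_a[n]) * (scores_b[m] - scores_b[n])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    total_pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / total_pairs


# Hypothetical accuracies, invented purely for illustration.
representative = {"model_a": 0.91, "model_b": 0.88, "model_c": 0.84, "model_d": 0.80}
scientific = {"model_a": 0.72, "model_b": 0.79, "model_c": 0.81, "model_d": 0.70}

print(rank_order(representative))
print(rank_order(scientific))
print(kendall_tau(representative, scientific))
```

With these invented numbers the best model on the representative dataset ranks third on the scientific dataset, and the rank correlation between the two orderings is zero. That is the kind of disagreement dataset-specific profiling is meant to surface.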
