Data Acquisition: A New Frontier in Data-centric AI
This work addresses data acquisition issues for the data-centric AI community, but it appears incremental because its benchmark builds on the existing DataPerf benchmark.
The paper tackles data acquisition challenges in machine learning by investigating current data marketplaces and introducing the DAM challenge benchmark. Its evaluation of submitted strategies underlines the need for effective data acquisition, though no concrete numerical results are provided.
As Machine Learning (ML) systems continue to grow, the demand for relevant and comprehensive datasets becomes imperative. The challenges of data acquisition have received limited study, owing to ad-hoc processes and a lack of consistent methodologies. We first present an investigation of current data marketplaces, revealing a lack of platforms offering detailed information about datasets, transparent pricing, and standardized data formats. With the objective of encouraging participation from the data-centric AI community, we then introduce the DAM challenge, a benchmark that models the interaction between data providers and acquirers. The benchmark was released as part of DataPerf. Our evaluation of the submitted strategies underlines the need for effective data acquisition strategies in ML.