AI Competitions and Benchmarks: Dataset Development
This work addresses dataset development for machine learning practitioners. Its contribution is incremental: it synthesizes existing tools and experience rather than introducing new methods.
Collecting and transforming data for machine learning is an intricate process that often requires meticulous manual preparation, and rushing it can lead to risks such as social discrimination or critical failures. This chapter provides a comprehensive overview of established methodological tools and practical insights for dataset development, covering requirements, design, implementation, evaluation, distribution, and maintenance.
Machine learning is now used in many applications thanks to its ability to predict, generate, or discover patterns from large quantities of data. However, the process of collecting and transforming data for practical use is intricate. Even in today's digital era, where substantial data is generated daily, such data is rarely readily usable; most often, it requires meticulous manual preparation. Haste in developing new models frequently results in shortcomings that pose risks once the models are deployed in real-world scenarios (e.g., social discrimination, critical failures), and can cause AI-based projects to fail or their costs to escalate substantially. This chapter provides a comprehensive overview of established methodological tools, enriched by our practical experience, for the development of datasets for machine learning. First, we describe the tasks involved in dataset development (requirements, design, implementation, evaluation, distribution, and maintenance) and offer insights into their effective management. Then, we provide more details about the implementation process, which includes data collection, transformation, and quality evaluation. Finally, we address practical considerations regarding dataset distribution and maintenance.