How Hyper-Datafication Impacts the Sustainability Costs in Frontier AI
For AI researchers and policymakers, this work highlights the often-overlooked sustainability costs of data creation in frontier AI, but the analysis is largely descriptive and the recommendations are high-level.
This paper introduces the concept of 'hyper-datafication' to describe the shift from building models from existing data to actively creating data for frontier AI. It analyzes the environmental, social, and economic costs of that shift by examining approximately 550,000 datasets from the Hugging Face Hub and qualitative responses from data workers in Kenya. It finds that hyper-datafication systematically redistributes burdens toward the Global South, precarious data workers, and under-represented cultures, and proposes the Data PROOFS recommendations (provenance, resource awareness, ownership, openness, frugality, and standards) to mitigate these costs.
Large-scale data has fuelled the success of frontier artificial intelligence (AI) models over the past decade. This expansion has relied on sustained efforts by large technology corporations to aggregate and curate internet-scale datasets. In this work, we examine the environmental, social, and economic costs of large-scale data in AI through a sustainability lens. We argue that the field is shifting from building models from data to actively creating data for building models. We characterise this transition as hyper-datafication, which marks a critical juncture for the future of frontier AI and its societal impacts. To quantify and contextualise data-related costs, we analyse approximately 550,000 datasets from the Hugging Face Hub, focusing on dataset growth, storage-related energy consumption and carbon footprint, and societal representation using language data. We complement this analysis with qualitative responses from data workers in Kenya to examine the labour involved, including direct employment by big tech corporations and exposure to graphic content. We further draw on external data sources to substantiate our findings by illustrating the global disparity in data centre infrastructure. Our analyses reveal that hyper-datafication does not merely increase resource consumption but systematically redistributes environmental burdens, labour risks, and representational harms toward the Global South, precarious data workers, and under-represented cultures. Thus, we propose Data PROOFS recommendations spanning provenance, resource awareness, ownership, openness, frugality, and standards to mitigate these costs. Our work aims to make visible the often-overlooked costs of data that underpin frontier AI and to stimulate broader debate within the research community and beyond.
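To make the idea of a storage-related energy and carbon estimate concrete, the following is a minimal back-of-the-envelope sketch in Python. It is not the authors' method: the `StorageAssumptions` fields, the function names, and every numeric value in the usage example are hypothetical placeholders chosen for illustration, not figures from the paper. The sketch assumes that annual energy scales linearly with stored volume, replica count, per-terabyte power draw, and data centre power usage effectiveness (PUE), and that carbon is energy multiplied by a grid carbon intensity.

```python
from dataclasses import dataclass

HOURS_PER_YEAR = 8760  # 365 days * 24 hours


@dataclass(frozen=True)
class StorageAssumptions:
    """Hypothetical parameters; all values must be supplied by the user."""
    watts_per_tb: float          # average power draw per stored terabyte (W/TB)
    pue: float                   # power usage effectiveness of the facility (>= 1.0)
    replicas: int                # number of stored copies of the data
    grid_kg_co2e_per_kwh: float  # carbon intensity of the local grid


def annual_storage_energy_kwh(size_tb: float, a: StorageAssumptions) -> float:
    """Estimate yearly energy (kWh) to keep `size_tb` terabytes online."""
    if size_tb < 0 or a.pue < 1.0 or a.replicas < 1:
        raise ValueError("size_tb >= 0, pue >= 1.0 and replicas >= 1 are required")
    watts = size_tb * a.replicas * a.watts_per_tb * a.pue
    return watts * HOURS_PER_YEAR / 1000.0  # W * h -> kWh


def annual_storage_carbon_kg(size_tb: float, a: StorageAssumptions) -> float:
    """Estimate yearly emissions (kg CO2e) from the energy estimate."""
    return annual_storage_energy_kwh(size_tb, a) * a.grid_kg_co2e_per_kwh


if __name__ == "__main__":
    # Illustrative placeholder values only; not taken from the paper.
    example = StorageAssumptions(
        watts_per_tb=1.0, pue=1.5, replicas=3, grid_kg_co2e_per_kwh=0.4
    )
    print(annual_storage_energy_kwh(100.0, example), "kWh per year")
    print(annual_storage_carbon_kg(100.0, example), "kg CO2e per year")
```

Because the grid carbon intensity is an explicit parameter, the same stored volume can be evaluated under different regional grids, which is one simple way to explore how the location of data centre infrastructure changes the estimated footprint.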