LG · Dec 9, 2020

Data and its (dis)contents: A survey of dataset development and use in machine learning research

arXiv:2012.05345v1 · 652 citations
AI Analysis

This survey addresses practical and ethical problems in machine learning research that affect the entire ML community, and emphasizes the need for improved data practices.

This paper surveys the limitations of current practices in dataset collection and use within machine learning research. It highlights various practical and ethical issues stemming from how data is handled, advocating for a more cautious and thorough understanding of data.

Datasets have played a foundational role in the advancement of machine learning research. They form the basis for the models we design and deploy, as well as our primary medium for benchmarking and evaluation. Furthermore, the ways in which we collect, construct and share these datasets inform the kinds of problems the field pursues and the methods explored in algorithm development. However, recent work from a breadth of perspectives has revealed the limitations of predominant practices in dataset collection and use. In this paper, we survey the many concerns raised about the way we collect and use data in machine learning and advocate that a more cautious and thorough understanding of data is necessary to address several of the practical and ethical issues of the field.

