AI, LG | Dec 9, 2022

Measuring Data

Hugging Face, Salesforce
arXiv:2212.05129v2 | 21 citations | h-index: 39
AI Analysis

This work addresses the need for standardized data analysis methods in ML research and practice. Its contribution is incremental: it builds on existing research without introducing a new paradigm or demonstrating specific gains.

The paper tackles the problem of quantitatively characterizing machine learning data and datasets by proposing data measurements as a way to systematically analyze and compare data attributes, aiming to support responsible AI development and better control over what ML systems learn.

We identify the task of measuring data to quantitatively characterize the composition of machine learning data and datasets. Similar to an object's height, width, and volume, data measurements quantify different attributes of data along common dimensions that support comparison. Several lines of research have proposed what we refer to as measurements, with differing terminology; we bring some of this work together, particularly in fields of computer vision and language, and build from it to motivate measuring data as a critical component of responsible AI development. Measuring data aids in systematically building and analyzing machine learning (ML) data towards specific goals and gaining better control of what modern ML systems will learn. We conclude with a discussion of the many avenues of future work, the limitations of data measurements, and how to leverage these measurement approaches in research and practice.
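
As a rough illustration of what "measurements along common dimensions" could look like for text data, the sketch below computes a few simple statistics for two toy datasets so they can be compared side by side. The function name `measure_text_dataset`, the whitespace tokenization, and the particular statistics chosen are assumptions made for this example; they are not taken from the paper, which surveys a much broader set of measurements.

```python
from collections import Counter


def measure_text_dataset(texts):
    """Compute a few illustrative measurements for a list of strings.

    Hypothetical helper: tokenization is naive lowercase whitespace
    splitting, chosen only to keep the sketch self-contained.
    """
    tokens_per_text = [text.lower().split() for text in texts]
    lengths = [len(tokens) for tokens in tokens_per_text]
    vocab = Counter(tok for tokens in tokens_per_text for tok in tokens)
    total_tokens = sum(lengths)
    return {
        "num_examples": len(texts),
        # Average number of tokens per example.
        "mean_length": total_tokens / len(texts) if texts else 0.0,
        # Number of distinct tokens.
        "vocab_size": len(vocab),
        # Distinct tokens divided by total tokens (a crude diversity measure).
        "type_token_ratio": len(vocab) / total_tokens if total_tokens else 0.0,
    }


# Two toy datasets measured along the same dimensions.
dataset_a = ["the cat sat", "the dog sat"]
dataset_b = ["a quick brown fox jumps", "over the lazy dog"]

measurements_a = measure_text_dataset(dataset_a)
measurements_b = measure_text_dataset(dataset_b)

for name in measurements_a:
    print(f"{name}: A={measurements_a[name]:.3f}  B={measurements_b[name]:.3f}")
```

Because both datasets are described with the same keys, their values can be compared directly, which is the property the abstract attributes to measurements such as height or width.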

