LG · Nov 22, 2021

On Data-centric Myths

arXiv:2111.11514v14.42 citations

Originality: Incremental advance

AI Analysis

This work addresses a foundational problem for the ML community, the lack of theory-informed guidelines for building data sets, by challenging common intuitions in data-centric theory. It is rated incremental because it critiques existing ideas rather than introducing new methods.

The paper challenges existing intuitions about data-centric principles. Using empirical counter-examples, it shows that data dimension need not always be minimized and that preserving the data distribution when manipulating data is inessential.

The community lacks theory-informed guidelines for building good data sets. We analyse theoretical directions relating to what aspects of the data matter and conclude that the intuitions derived from the existing literature are incorrect and misleading. Using empirical counter-examples, we show that 1) data dimension should not necessarily be minimised and 2) when manipulating data, preserving the distribution is inessential. This calls for a more data-aware theoretical understanding. Although not explored in this work, we propose the study of the impact of data modification on learned representations as a promising research direction.
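
As a toy illustration of the first claim (this is not one of the paper's counter-examples, and the data and the helper `linear_accuracy` are assumptions made up for this sketch), consider XOR-style points. No linear classifier separates them in two dimensions, but adding a third feature, the product of the two coordinates, makes them linearly separable. In this small case a higher data dimension helps rather than hurts.

```python
import numpy as np

# Hypothetical toy data: four XOR-style points in 2D, labels in {-1, +1}.
X = np.array([[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]])
y = np.array([1.0, -1.0, -1.0, 1.0])


def linear_accuracy(features, labels):
    """Fit a least-squares linear classifier with a bias term
    and return its accuracy on the same points."""
    A = np.hstack([features, np.ones((features.shape[0], 1))])
    w, *_ = np.linalg.lstsq(A, labels, rcond=None)
    preds = np.where(A @ w >= 0, 1.0, -1.0)
    return float(np.mean(preds == labels))


# Original 2D representation: XOR is not linearly separable.
acc_2d = linear_accuracy(X, y)

# Add one dimension: the product of the two coordinates.
X_3d = np.hstack([X, (X[:, 0] * X[:, 1]).reshape(-1, 1)])
acc_3d = linear_accuracy(X_3d, y)

print("accuracy in 2D:", acc_2d)
print("accuracy in 3D:", acc_3d)
```

The sketch only shows that a lower dimension is not automatically better for a fixed model class. It says nothing about the specific settings studied in the paper.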
