Data Excellence for AI: Why Should You Care
It highlights a foundational problem in ML for researchers and practitioners, but is incremental as it critiques existing practices without introducing new methods.
The paper argues that machine learning research overly focuses on algorithms while neglecting the optimization of data, which is crucial for model efficacy, but does not present new experimental results or concrete numbers.
The efficacy of machine learning (ML) models depends on both algorithms and data. Training data defines what we want our models to learn, and testing data provides the means by which their empirical progress is measured. Benchmark datasets define the entire world within which models exist and operate, yet research continues to focus on critiquing and improving the algorithmic aspect of the models rather than critiquing and improving the data with which our models operate. If "data is the new oil," we are still missing work on the refineries by which the data itself could be optimized for more effective use.