Data-Centric AI Requires Rethinking Data Notion
This foundational work addresses a theoretical problem for AI researchers and practitioners by rethinking notions of data, although its contribution is incremental because it builds on existing mathematical concepts.
The paper tackles the need for a unified definition of data in data-centric AI by proposing categorical and cochain notions as unifying principles, which could impact the development and use of machine learning packages.
The transition towards data-centric AI requires revisiting notions of data from both mathematical and implementational standpoints in order to obtain unified data-centric machine learning packages. To this end, this work proposes unifying principles offered by the categorical and cochain notions of data, and discusses the importance of these principles in the transition to data-centric AI. In the categorical notion, data is viewed as a mathematical structure that we act upon via morphisms that preserve this structure. In the cochain notion, data is viewed as a function defined on a discrete domain of interest and acted upon via operators. While these notions are almost orthogonal, together they provide a unifying definition of data, ultimately impacting the way machine learning packages are developed, implemented, and utilized by practitioners.
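To make the two notions concrete, the following is a minimal Python sketch and not an implementation taken from this work. It assumes a small directed path graph as the discrete domain, uses the graph coboundary (the difference of vertex values along each edge) as one example of an operator acting on a cochain, and uses a graph homomorphism check as one example of a structure-preserving morphism. All names (`coboundary`, `is_graph_morphism`, `vertex_signal`, and so on) are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: names and the choice of graph are assumptions,
# not constructs defined in the paper.

# A small discrete domain: a directed path graph 0 -> 1 -> 2.
vertices = [0, 1, 2]
edges = [(0, 1), (1, 2)]

# Cochain notion: data is a function on the domain (one value per vertex).
vertex_signal = {0: 1.0, 1: 4.0, 2: 9.0}


def coboundary(signal, edge_list):
    """Operator mapping a vertex function (0-cochain) to an edge function (1-cochain)."""
    return {(u, v): signal[v] - signal[u] for (u, v) in edge_list}


edge_signal = coboundary(vertex_signal, edges)

# Categorical notion: data is a structure (here, the graph itself),
# acted upon by maps that preserve that structure.
def is_graph_morphism(vertex_map, source_edges, target_edges):
    """Return True if every source edge is sent to an edge of the target graph."""
    target = set(target_edges)
    return all((vertex_map[u], vertex_map[v]) in target for (u, v) in source_edges)


# Target structure: a directed 3-cycle (assumed for illustration).
cycle_edges = [(0, 1), (1, 2), (2, 0)]

inclusion = {0: 0, 1: 1, 2: 2}  # sends each path edge to a cycle edge
collapse = {0: 0, 1: 0, 2: 0}   # sends edges to (0, 0), which is not a cycle edge

preserves = is_graph_morphism(inclusion, edges, cycle_edges)
breaks = is_graph_morphism(collapse, edges, cycle_edges)
```

In this sketch the cochain view never inspects the target graph, and the categorical view never inspects the signal values, which is one informal way to read the claim that the two notions are almost orthogonal.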