DB, AI, Jan 2, 2020

Informal Data Transformation Considered Harmful

arXiv:2001.00338v1, 3 citations

AI Analysis

This paper addresses inefficiencies in enterprise data management for data scientists, though it appears incremental, since it builds on existing ideas of formal guarantees.

The paper tackles the problem of data integrity in AI systems by proposing to formally and automatically guarantee integrity during data transformations, rather than cleaning data after the fact, an approach that can lead to data scientists spending 80% of their time cleaning data.

In this paper we take the common position that AI systems are limited more by the integrity of the data they are learning from than by the sophistication of their algorithms, and we take the uncommon position that the solution to achieving better data integrity in the enterprise is not to clean and validate data ex post facto whenever needed (the so-called data lake approach to data management, which can lead to data scientists spending 80% of their time cleaning data), but rather to formally and automatically guarantee that data integrity is preserved as it is transformed (migrated, integrated, composed, queried, viewed, etc.) throughout the enterprise, so that data and programs that depend on that data need not constantly be re-validated for every particular use.
