LG, AI | Dec 8, 2024

Towards Modeling Data Quality and Machine Learning Model Performance

Usman Anjum, Chris Trentman, Elrod Caden, Justin Zhan

arXiv:2412.05882v1 | 4.61 citations | h-index: 4 | Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for better trust and performance measurement in machine learning by modeling data quality, but the contribution is incremental because it builds on existing concepts such as signal-to-noise ratio (SNR).

The paper tackles the problem of quantifying how data uncertainty and noise affect machine learning model performance. It proposes a new metric, the deterministic-non-deterministic ratio (DDR), based on signal-to-noise ratio, and shows through synthetic-data experiments that accuracy varies with DDR, so DDR-accuracy curves can be used to assess model performance.

Understanding how uncertainty and noise in data affect machine learning models (MLMs) is crucial for developing trust and measuring performance. In this paper, a new model is proposed to quantify the effect of uncertainty and noise in data on MLMs. Building on the concept of signal-to-noise ratio (SNR), a new metric called the deterministic-non-deterministic ratio (DDR) is proposed to formulate the performance of a model. Using synthetic data in experiments, we show how accuracy changes with DDR and how DDR-accuracy curves can be used to determine the performance of a model.
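The abstract does not give the exact definition of DDR, so the following is only a rough sketch of the general idea of a ratio-versus-accuracy curve on synthetic data. It assumes a plain SNR-style ratio (signal power over noise power) as a stand-in for DDR, a two-class 1D dataset, and a simple nearest-centroid classifier. All function names and parameters here are hypothetical and are not taken from the paper or its code.

```python
import math
import random


def make_dataset(n, ratio, rng):
    """Two classes with deterministic means -1 and +1 plus Gaussian noise.

    ASSUMPTION: `ratio` is signal power / noise power (an SNR-style stand-in
    for DDR). The signal power is fixed at 1.0, so sigma = sqrt(1 / ratio).
    """
    sigma = math.sqrt(1.0 / ratio)
    data = []
    for _ in range(n):
        label = rng.randrange(2)
        mean = 1.0 if label == 1 else -1.0
        data.append((mean + rng.gauss(0.0, sigma), label))
    return data


def fit_threshold(train):
    """Nearest-centroid classifier in 1D: midpoint of the two class means."""
    xs0 = [x for x, y in train if y == 0]
    xs1 = [x for x, y in train if y == 1]
    return (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2.0


def accuracy(threshold, test):
    """Fraction of test points on the correct side of the threshold."""
    correct = sum((x > threshold) == (y == 1) for x, y in test)
    return correct / len(test)


def ratio_accuracy_curve(ratios, n_train=2000, n_test=2000, seed=0):
    """Return (ratio, accuracy) pairs, one per requested ratio."""
    rng = random.Random(seed)
    curve = []
    for ratio in ratios:
        train = make_dataset(n_train, ratio, rng)
        test = make_dataset(n_test, ratio, rng)
        curve.append((ratio, accuracy(fit_threshold(train), test)))
    return curve


if __name__ == "__main__":
    for ratio, acc in ratio_accuracy_curve([0.01, 0.1, 1.0, 10.0, 100.0]):
        print(f"ratio={ratio:>6}: accuracy={acc:.3f}")
```

Under these assumptions, accuracy should sit near chance when the ratio is small and approach 1 as the ratio grows, which is the kind of curve the abstract describes using to characterize a model.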
