LGAIDec 24, 2024

Navigating Data Corruption in Machine Learning: Balancing Quality, Quantity, and Imputation Strategies

arXiv:2412.18296v24 citationsh-index: 2Future Internet
AI Analysis

This work addresses data corruption challenges for practitioners in real-world ML applications, offering incremental insights into preprocessing and imputation strategies.

The study investigated the impact of data corruption on machine learning performance, finding that noisy data causes more severe degradation than missing data, and that increasing dataset size only partially mitigates these effects, with an empirical rule identifying 30% of data as critical for performance.

Data corruption, including missing and noisy data, poses significant challenges in real-world machine learning. This study investigates the effects of data corruption on model performance and explores strategies to mitigate these effects through two experimental setups: supervised learning with NLP tasks (NLP-SL) and deep reinforcement learning for traffic signal optimization (Signal-RL). We analyze the relationship between data corruption levels and model performance, evaluate the effectiveness of data imputation methods, and assess the utility of enlarging datasets to address data corruption. Our results show that model performance under data corruption follows a diminishing return curve, modeled by the exponential function. Missing data, while detrimental, is less harmful than noisy data, which causes severe performance degradation and training instability, particularly in sequential decision-making tasks like Signal-RL. Imputation strategies involve a trade-off: they recover missing information but may introduce noise. Their effectiveness depends on imputation accuracy and corruption ratio. We identify distinct regions in the imputation advantage heatmap, including an "imputation advantageous corner" and an "imputation disadvantageous edge" and classify tasks as "noise-sensitive" or "noise-insensitive" based on their decision boundaries. Furthermore, we find that increasing dataset size mitigates but cannot fully overcome the effects of data corruption. The marginal utility of additional data diminishes as corruption increases. An empirical rule emerges: approximately 30% of the data is critical for determining performance, while the remaining 70% has minimal impact. These findings provide actionable insights into data preprocessing, imputation strategies, and data collection practices, guiding the development of robust machine learning systems in noisy environments.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes