LG DATA-AN Jul 15, 2025

Exploring the Frontiers of kNN Noisy Feature Detection and Recovery for Self-Driving Labs

Qiuyu Shi, Kangming Li, Yao Fehlis, Daniel Persaud, Robert Black, Jason Hattrick-Simpers

arXiv:2507.16833v14.1 h-index: 11 Machine Learning: Science and Technology

Originality Synthesis-oriented

AI Analysis

This work addresses data quality issues in automated materials discovery, but it is incremental as it applies existing kNN methods to a specific domain.

The study tackles noisy features that corrupt data in self-driving labs for materials discovery. It develops an automated workflow to detect such features and recover their correct values. It finds that high-intensity noise and large datasets improve detection and correction, and that features with continuous distributions are more recoverable.

Self-driving laboratories (SDLs) have shown promise for accelerating materials discovery by integrating machine learning with automated experimental platforms. However, errors in the capture of input parameters may corrupt the features used to model system performance, compromising current and future campaigns. This study develops an automated workflow to systematically detect noisy features, determine sample-feature pairings that can be corrected, and finally recover the correct feature values. A systematic study is then performed to examine how dataset size, noise intensity, and feature value distribution affect both the detectability and recoverability of noisy features. In general, high-intensity noise and large training datasets are conducive to the detection and correction of noisy features. Low-intensity noise reduces detection and recovery, but this can be compensated for by larger clean training datasets. Detection and correction results vary between features, with continuous and dispersed feature distributions showing greater recoverability than discrete or narrow distributions. This systematic study not only demonstrates a model-agnostic framework for rational data recovery in the presence of noise, limited data, and differing feature distributions but also provides a tangible benchmark of kNN imputation in materials datasets. Ultimately, it aims to enhance data quality and experimental precision in automated materials discovery.
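
The detect-then-recover idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' workflow: the function names `knn_estimate` and `detect_and_recover`, the fixed deviation threshold, and the toy data are all assumptions made for illustration. Each feature of a sample is estimated from the k nearest clean rows (nearest in the remaining features), and a feature is flagged and replaced when its recorded value deviates from that estimate by more than the threshold.

```python
import math


def knn_estimate(clean_rows, sample, feature_idx, k=3):
    """Estimate sample[feature_idx] as the mean of that feature over the
    k clean rows closest to the sample in all *other* features."""
    def distance(row):
        # Euclidean distance that ignores the feature being estimated.
        return math.sqrt(sum((row[j] - sample[j]) ** 2
                             for j in range(len(sample)) if j != feature_idx))

    neighbours = sorted(clean_rows, key=distance)[:k]
    return sum(row[feature_idx] for row in neighbours) / len(neighbours)


def detect_and_recover(clean_rows, sample, threshold, k=3):
    """Flag features whose recorded value is far from the kNN estimate
    and replace them with that estimate (hypothetical decision rule)."""
    recovered = list(sample)
    flagged = []
    for idx in range(len(sample)):
        estimate = knn_estimate(clean_rows, sample, idx, k)
        if abs(sample[idx] - estimate) > threshold:
            flagged.append(idx)
            recovered[idx] = estimate
    return flagged, recovered


# Toy clean data (assumed): three perfectly correlated features.
clean = [(1.0, 2.0, 3.0), (2.0, 4.0, 6.0), (3.0, 6.0, 9.0),
         (4.0, 8.0, 12.0), (5.0, 10.0, 15.0)]

# The third feature was recorded as 30.0 instead of 9.0 (high-intensity noise).
noisy_sample = [3.0, 6.0, 30.0]

flagged, recovered = detect_and_recover(clean, noisy_sample, threshold=5.0, k=3)
print(flagged, recovered)
```

In this toy case the large error is easy to separate from the small estimation error on the clean features, which mirrors the abstract's observation that high-intensity noise is easier to detect. With low-intensity noise, a fixed threshold like this one would struggle to distinguish a corrupted value from ordinary neighbour-to-neighbour variation.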
