ML, LG | Nov 25, 2025

When Features Beat Noise: A Feature Selection Technique Through Noise-Based Hypothesis Testing

Mousam Sinha, Tirtha Sarathi Ghosh, Ridam Pal

arXiv:2511.20851v24.5

Originality: Incremental advance

AI Analysis

The paper offers machine learning practitioners a more statistically principled feature selection method for noisy, high-dimensional data, though it appears incremental relative to existing noise-based approaches.

The paper tackles feature selection in high-dimensional datasets by proposing a method that adds random noise features and uses bootstrap-based hypothesis testing to identify significant predictors. The authors report stronger recovery of meaningful signal than Boruta and Knockoff-based methods on simulated datasets, and better results than Boruta, RFE, and Extra Trees on real-world datasets.

Feature selection has remained a daunting challenge in machine learning and artificial intelligence, where increasingly complex, high-dimensional datasets demand principled strategies for isolating the most informative predictors. Despite widespread adoption, many established techniques suffer from notable limitations: some incur substantial computational cost, while others offer no definite, statistically driven stopping criterion and do not assess the significance of their importance scores. A common heuristic approach introduces multiple random noise features and retains all predictors ranked above the strongest noise feature. Although intuitive, this strategy lacks theoretical justification and depends heavily on heuristics. This paper proposes a novel feature selection method that addresses these limitations. Our approach introduces multiple random noise features and evaluates each feature's importance against the maximum importance value among these noise features, incorporating a non-parametric, bootstrap-based hypothesis testing framework to establish a solid theoretical foundation. We establish the conceptual soundness of our approach through statistical derivations that articulate the principles guiding the design of our algorithm. To evaluate its reliability, we generated simulated datasets under controlled statistical settings and benchmarked performance against Boruta and Knockoff-based methods, observing consistently stronger recovery of meaningful signal. As a demonstration of practical utility, we applied the technique across diverse real-world datasets, where it surpassed feature selection techniques including Boruta, RFE, and Extra Trees. Hence, the method emerges as a robust algorithm for principled feature selection, enabling the distillation of informative predictors that support reliable inference, enhanced predictive performance, and efficient computation.
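The general idea in the abstract (append random noise features, then use bootstrap resampling to test whether each real feature's importance exceeds the maximum noise importance) can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not specify the importance measure or the test statistic, so the sketch assumes absolute Pearson correlation as the importance score and uses the fraction of bootstrap resamples in which a feature fails to beat the strongest noise feature as a rough empirical p-value. The names `abs_corr` and `select_features`, and the parameters `n_noise`, `n_boot`, and `alpha`, are hypothetical.

```python
import random
from math import sqrt


def abs_corr(x, y):
    """Absolute Pearson correlation, used here as an assumed importance score."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no signal
    return abs(sxy) / sqrt(sxx * syy)


def select_features(columns, y, n_noise=10, n_boot=200, alpha=0.05, seed=0):
    """Keep features that beat the strongest noise feature across bootstraps.

    columns: list of feature columns, each a list of floats of length len(y).
    Returns (selected_indices, p_values).
    """
    rng = random.Random(seed)
    n = len(y)
    # Random noise features that are independent of the target by construction.
    noise = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n_noise)]
    fails = [0] * len(columns)

    for _ in range(n_boot):
        # Non-parametric bootstrap: resample rows with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        yb = [y[i] for i in idx]
        max_noise = max(abs_corr([col[i] for i in idx], yb) for col in noise)
        for j, col in enumerate(columns):
            if abs_corr([col[i] for i in idx], yb) <= max_noise:
                fails[j] += 1

    # Empirical p-value with a +1 correction so it is never exactly zero.
    p_values = [(f + 1) / (n_boot + 1) for f in fails]
    selected = [j for j, p in enumerate(p_values) if p < alpha]
    return selected, p_values
```

Calling `select_features(columns, y)` returns the indices of the retained features together with their empirical p-values. In practice the correlation score would likely be replaced by a model-based importance (for example, from a tree ensemble), and a multiple-testing correction may be appropriate when many features are tested; both choices are left open here.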

