Surprisingly High Redundancy in Electronic Structure Data

arXiv:2507.09001v1, 2 citations, h-index: 17
Originality: Incremental advance
AI Analysis

This work addresses the high computational cost of generating electronic structure data for materials science by offering a way to reduce dataset sizes and training times. The advance is incremental, since it builds on existing pruning methods.

The study tackles redundancy in electronic structure datasets used for machine learning. It finds that random pruning can substantially reduce dataset size with minimal accuracy loss, while a coverage-based strategy retains chemical accuracy with up to 100-fold less data and at least threefold shorter training time.

Machine Learning (ML) models for electronic structure rely on large datasets generated through expensive Kohn-Sham Density Functional Theory simulations. This study reveals a surprisingly high level of redundancy in such datasets across various material systems, including molecules, simple metals, and complex alloys. Our findings challenge the prevailing assumption that large, exhaustive datasets are necessary for accurate ML predictions of electronic structure. We demonstrate that even random pruning can substantially reduce dataset size with minimal loss in predictive accuracy, while a state-of-the-art coverage-based pruning strategy retains chemical accuracy and model generalizability using up to 100-fold less data and reducing training time by threefold or more. By contrast, widely used importance-based pruning methods, which eliminate seemingly redundant data, can catastrophically fail at higher pruning factors, possibly due to the significant reduction in data coverage. This heretofore unexplored high degree of redundancy in electronic structure data holds the potential to identify a minimal, essential dataset representative of each material class.
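
To make the contrast between random and coverage-oriented pruning concrete, here is a minimal sketch. It is not the paper's method: greedy farthest-point (k-center) selection stands in as a simple coverage-style heuristic, the 2D "descriptors" are made-up toy data, and every function name is hypothetical. The coverage radius it reports (the largest distance from any sample to its nearest retained sample) is only one possible way to quantify coverage.

```python
import math
import random


def random_prune(points, keep, seed=0):
    """Keep `keep` points chosen uniformly at random (random-pruning baseline)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(len(points)), keep))


def farthest_point_prune(points, keep):
    """Greedy k-center selection: repeatedly keep the point farthest from the kept set.

    Assumption: used here only as a stand-in for a coverage-oriented strategy.
    """
    kept = [0]  # start from an arbitrary point
    # dist[i] = distance from point i to its nearest kept point so far
    dist = [math.dist(p, points[0]) for p in points]
    while len(kept) < keep:
        nxt = max(range(len(points)), key=dist.__getitem__)
        kept.append(nxt)
        dist = [min(d, math.dist(p, points[nxt])) for d, p in zip(dist, points)]
    return sorted(kept)


def coverage_radius(points, kept):
    """Largest distance from any point to its nearest kept point (smaller is better)."""
    return max(min(math.dist(p, points[k]) for k in kept) for p in points)


# Toy, made-up descriptors: one dense cluster plus three sparse outliers.
rng = random.Random(1)
data = [(rng.gauss(0, 0.1), rng.gauss(0, 0.1)) for _ in range(200)]
data += [(5.0, 5.0), (-5.0, 5.0), (5.0, -5.0)]

keep = 10
rand_idx = random_prune(data, keep)
fps_idx = farthest_point_prune(data, keep)
print("random pruning coverage radius:", coverage_radius(data, rand_idx))
print("farthest-point coverage radius:", coverage_radius(data, fps_idx))
```

In this toy setting the farthest-point selection always retains the three outliers, whereas a uniform random draw of 10 samples out of 203 is likely to miss at least one of them. This is the kind of coverage loss the abstract suggests may hurt pruning at high pruning factors.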
