Building informative materials datasets beyond targeted objectives
This work addresses the problem of dataset bias in materials science, where ignoring certain properties can degrade future learning tasks, by providing a method to build datasets that remain broadly informative for both targeted and untargeted outcomes.
The paper introduces a diversity-aware framework for constructing materials datasets that maximizes informativeness for targeted properties while preserving performance on untargeted ones. In noisy experimental datasets, the framework improves prediction performance on untargeted properties by up to 10% and on targeted properties by up to 25%, compared to random sampling.
Data collection in materials science can be expensive, making the reuse and long-term utility of datasets critically important for future discovery campaigns. In practice, researchers prioritize a subset of properties according to their research interests. However, ignoring a subset of outcomes during data collection can yield datasets that are poorly suited to future learning tasks. Here, we present a framework for dataset construction that maximizes informativeness for targeted properties of interest while preserving performance on untargeted ones. Our approach uses diversity-aware selection to ensure broad coverage of the materials space. In noisy experimental dataset construction, we find that without our diversity-aware framework, prediction performance on untargeted properties can degrade by up to 40% relative to random sampling, whereas applying our framework yields improvements of up to 10%. For targeted properties, performance can degrade by up to 12.5% relative to random sampling without diversity, while our framework achieves gains of up to 25%. Incorporating diversity into dataset construction not only preserves informativeness for the targeted properties but also improves materials coverage for potential future objectives. As a result, the constructed datasets remain broadly informative across both considered and unconsidered outcomes, providing unbiased, high-quality entries and mitigating cold-start limitations in subsequent modeling and discovery campaigns.
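
To make the idea of diversity-aware selection concrete, the following is a minimal sketch and not the paper's actual algorithm, which the abstract does not specify. It assumes each candidate material has a feature vector and a precomputed informativeness score for the targeted property (for example, a model's predictive uncertainty). It then greedily picks candidates by mixing that score with the distance to the nearest already-selected candidate. The function name `diversity_aware_select`, the `weight` parameter, and the linear trade-off are all hypothetical choices made for illustration.

```python
import numpy as np

def diversity_aware_select(X, scores, k, weight=0.5):
    """Greedily pick k rows of X, trading off a per-candidate score
    against distance to the nearest already-selected row.

    Illustrative sketch only: `weight` = 0 ranks purely by score,
    `weight` = 1 reduces to farthest-point (max-min distance) sampling.
    """
    X = np.asarray(X, dtype=float)
    scores = np.asarray(scores, dtype=float)
    n = X.shape[0]
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and the number of candidates")

    # Rescale scores to [0, 1] so they are comparable with normalized distances.
    span = scores.max() - scores.min()
    s = (scores - scores.min()) / span if span > 0 else np.zeros(n)

    # Seed with the highest-scoring candidate.
    selected = [int(np.argmax(s))]
    # min_dist[i] is the distance from candidate i to its nearest selected row.
    min_dist = np.linalg.norm(X - X[selected[0]], axis=1)

    while len(selected) < k:
        d_max = min_dist.max()
        d = min_dist / d_max if d_max > 0 else np.zeros(n)
        utility = (1.0 - weight) * s + weight * d
        utility[selected] = -np.inf  # never pick the same candidate twice
        nxt = int(np.argmax(utility))
        selected.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[nxt], axis=1))
    return selected

# Toy usage: two clusters of candidates, with high scores only in the first.
X_demo = [[0.0, 0.0], [0.1, 0.0], [0.2, 0.0], [10.0, 0.0], [10.1, 0.0]]
scores_demo = [1.0, 0.9, 0.8, 0.1, 0.0]
picked = diversity_aware_select(X_demo, scores_demo, k=3, weight=0.5)
```

With `weight=0` the selection concentrates on the high-scoring cluster, which mirrors the failure mode described above where untargeted properties lose coverage. A nonzero `weight` forces at least some picks from the second cluster. How the score and the diversity term should actually be combined is a design choice that this sketch does not settle.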