Censoring chemical data to mitigate dual use risk
This work addresses dual-use risks in open-source chemistry models and is aimed at researchers and policymakers. It is incremental in that it builds on existing risk frameworks.
The paper addresses the dual-use potential of machine learning models in chemistry. It introduces a model-agnostic, data-level noising method that increases prediction error in sensitive regions. Selective noise induces variance and attenuation bias, whereas omitting sensitive data fails to prevent extrapolation.
Machine learning models are dual-use: they can serve both beneficial and malicious purposes. The development of open-source models in chemistry has specifically surfaced dual-use concerns around toxicological data and chemical warfare agents. We discuss a chain risk framework that identifies three misuse pathways and their corresponding mitigation strategies: inference-level, model-level, and data-level. At the data level, we introduce a model-agnostic noising method that increases prediction error in designated sensitive regions. Our results show that selective noise induces variance and attenuation bias, whereas simply omitting sensitive data fails to prevent extrapolation. These findings hold for both molecular feature multilayer perceptrons and graph neural networks. Thus, noising molecular structures can enable open sharing of potential dual-use molecular data.
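The contrast between omitting and noising sensitive data can be illustrated with a toy example. The sketch below is not the paper's method: it assumes a single scalar feature, a linear property, Gaussian feature noise, and a hypothetical sensitive region `x > 0`. The names `fit_line`, `rmse`, `TRUE_SLOPE`, and `NOISE_SD` are illustrative. A line fit only to non-sensitive points stands in for the omission strategy, and a line fit to the noised sensitive points stands in for a model's behavior in that region.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(model, xs, ys):
    """Root mean squared error of a (slope, intercept) model."""
    slope, intercept = model
    sq_err = sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys))
    return (sq_err / len(xs)) ** 0.5


rng = random.Random(0)
TRUE_SLOPE = 2.0  # assumed ground-truth structure-property relation
NOISE_SD = 1.0    # assumed scale of the noise added to sensitive features

# Toy data set: one feature per "molecule" and a nearly linear property.
xs = [rng.gauss(0.0, 1.0) for _ in range(20000)]
ys = [TRUE_SLOPE * x + rng.gauss(0.0, 0.1) for x in xs]
sensitive = [x > 0.0 for x in xs]  # hypothetical sensitive region

sens_x = [x for x, s in zip(xs, sensitive) if s]
sens_y = [y for y, s in zip(ys, sensitive) if s]

# Strategy A: omit sensitive points, then extrapolate into the region.
safe_x = [x for x, s in zip(xs, sensitive) if not s]
safe_y = [y for y, s in zip(ys, sensitive) if not s]
omit_model = fit_line(safe_x, safe_y)

# Strategy B: keep sensitive points but add noise to their features.
noised_sens_x = [x + rng.gauss(0.0, NOISE_SD) for x in sens_x]
noise_model = fit_line(noised_sens_x, sens_y)

# Evaluate both models on the clean sensitive points.
omit_rmse = rmse(omit_model, sens_x, sens_y)
noise_rmse = rmse(noise_model, sens_x, sens_y)

print(f"omission: slope={omit_model[0]:.2f}, sensitive RMSE={omit_rmse:.2f}")
print(f"noising:  slope={noise_model[0]:.2f}, sensitive RMSE={noise_rmse:.2f}")
```

In this toy setting the omission model recovers the true slope from non-sensitive data alone and extrapolates accurately into the sensitive region. The model fit to noised features has a slope shrunk toward zero, which is the classical errors-in-variables attenuation, and a correspondingly larger error on clean sensitive points. Real molecular models are nonlinear and high-dimensional, so this only conveys the intuition.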