Clever Hans in Chemistry: Chemist Style Signals Confound Activity Prediction on Public Benchmarks
Chemist-style signals in public bioactivity benchmarks are a failure mode that could mislead drug discovery efforts, highlighting the need for dataset practices that avoid such confounding.
The study finds that machine learning models can predict bioactivity by inferring which chemist made a molecule rather than by learning causal chemistry: an author classifier reaches 60% top-5 accuracy, and an activity model given only author predictions has predictive power comparable to a simple baseline that has access to structure.
Can machine learning models identify which chemist made a molecule from structure alone? If so, models trained on literature data may exploit chemist intent rather than learning causal structure-activity relationships. We test this by linking ChEMBL assays to publication authors and training a 1,815-class classifier to predict authors from molecular fingerprints, achieving 60% top-5 accuracy under scaffold-based splitting. We then train an activity model that receives only a protein identifier and an author-probability vector derived from structure, with no direct access to molecular descriptors. This author-only model achieves predictive power comparable to a simple baseline that has access to structure. This reveals a "Clever Hans" failure mode: models can predict bioactivity largely by inferring chemist goals and favorite targets, without requiring a lab-independent understanding of chemistry. We analyze the sources of this leakage, propose author-disjoint splits, and recommend dataset practices to decouple chemist intent from biological outcomes.
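The abstract does not say how the author-disjoint splits are built, so the following is only a minimal sketch of one plausible construction, not the authors' procedure. It assumes each publication comes with a list of author names, treats publications that share any author (directly or through a chain of co-authors) as one group, and assigns whole groups to train or test. The function name `author_disjoint_split`, the `test_fraction` parameter, and the toy document ids are all hypothetical.

```python
from collections import defaultdict


def author_disjoint_split(doc_authors, test_fraction=0.2):
    """Split documents so that no author appears in both train and test.

    doc_authors: dict mapping a document id to an iterable of author names.
    Returns (train_docs, test_docs) as sorted lists of document ids.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Link each document to its authors; shared authors merge components.
    for doc, authors in doc_authors.items():
        find(("doc", doc))
        for author in authors:
            union(("doc", doc), ("author", author))

    # Collect the documents belonging to each connected component.
    groups = defaultdict(list)
    for doc in doc_authors:
        groups[find(("doc", doc))].append(doc)

    # Assumption: fill the test set with the smallest components first and
    # never exceed the requested budget; everything else goes to train.
    target = test_fraction * len(doc_authors)
    ordered = sorted(groups.values(), key=lambda g: (len(g), sorted(map(str, g))))
    train, test = [], []
    for members in ordered:
        if len(test) + len(members) <= target:
            test.extend(members)
        else:
            train.extend(members)
    return sorted(train), sorted(test)


# Toy example with hypothetical document ids and author names.
docs = {
    "d1": ["A", "B"],
    "d2": ["B", "C"],  # shares author B with d1, so d1 and d2 stay together
    "d3": ["D"],
    "d4": ["E"],
    "d5": ["E", "F"],
}
train_docs, test_docs = author_disjoint_split(docs, test_fraction=0.6)
train_authors = {a for d in train_docs for a in docs[d]}
test_authors = {a for d in test_docs for a in docs[d]}
```

In real co-authorship data a single connected component can cover a large share of the corpus, in which case a strict split of this kind leaves little for the test set; relaxing the grouping (for example, to a single designated author per publication) is a design choice the abstract does not address.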