A Transformer Model for Predicting Chemical Products from Generic SMARTS Templates with Data Augmentation
This work provides a novel and practical alternative for computational chemists by addressing limitations in template-based and template-free reaction prediction methods.
The paper tackles the challenge of predicting chemical reaction outcomes by introducing the Broad Reaction Set (BRS), a set of 20 generic SMARTS templates, and ProPreT5, a T5-based model that directly handles SMARTS templates and achieves strong predictive performance and generalization to unseen reactions.
The accurate prediction of chemical reaction outcomes is a major challenge in computational chemistry. Current models rely heavily on either highly specific reaction templates or template-free methods, both of which present limitations. To address these limitations, we propose the Broad Reaction Set (BRS), a set of 20 generic reaction templates written in SMARTS, a pattern-based notation for describing substructures and reactivity. Additionally, we introduce ProPreT5, a T5-based model specifically adapted for chemistry and, to the best of our knowledge, the first language model capable of directly handling and applying SMARTS reaction templates. To further improve generalization, we propose the first augmentation strategy for SMARTS, which injects structural diversity at the pattern level. Trained on augmented templates, ProPreT5 demonstrates strong predictive performance and generalization to unseen reactions. Together, these contributions provide a novel and practical alternative to current methods, advancing the field of template-based reaction prediction.
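
To give a concrete sense of what diversity "at the pattern level" can mean, the sketch below rewrites a reaction template into equivalent strings by reordering its reactant patterns. It is an illustration only and is not the augmentation strategy proposed in this work, whose details are not described here. The helper names `split_top_level` and `reorder_reactants` and the amide-formation template are hypothetical, and the code assumes the usual `reactants>agents>products` layout with no `>` inside atom expressions.

```python
from itertools import permutations


def split_top_level(side):
    """Split one side of a reaction SMARTS on '.' outside brackets and parentheses."""
    parts, current, depth = [], [], 0
    for ch in side:
        if ch in "([":
            depth += 1
        elif ch in ")]":
            depth -= 1
        if ch == "." and depth == 0:
            # A top-level dot separates two reactant patterns.
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


def reorder_reactants(template):
    """Return the distinct templates obtained by permuting the reactant patterns.

    Assumption: the template has exactly two '>' separators.
    """
    reactants, agents, products = template.split(">")
    patterns = split_top_level(reactants)
    variants = []
    for order in permutations(patterns):
        variant = ".".join(order) + ">" + agents + ">" + products
        if variant not in variants:
            variants.append(variant)
    return variants


# Hypothetical generic template: carboxylic acid + primary amine -> amide.
template = "[C:1](=[O:2])[OH].[N;H2:3]>>[C:1](=[O:2])[N:3]"
for variant in reorder_reactants(template):
    print(variant)
```

Each variant lists the same reactant patterns and the same product, so a model trained on all of them sees several surface forms of one transformation. Toolkits that apply templates usually expect the reactants in the order in which the template lists them, so inputs would need to be reordered to match.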