LGAIBM Oct 18, 2025

Atom-anchored LLMs speak Chemistry: A Retrosynthesis Demonstration

arXiv:2510.16590v1
2 citations
h-index: 27
Originality: Incremental advance
AI Analysis

This work addresses data scarcity in chemistry, a constraint that affects both academic researchers and drug discovery. It is an incremental advance: it builds on existing LLM methods and adds a novel atom-anchoring approach.

The paper tackles data scarcity in chemistry by introducing a framework in which Large Language Models (LLMs) use atomic identifiers to reason about molecules without labeled data. On retrosynthesis tasks, it reports success rates of ≥90% for identifying plausible reaction sites and ≥74% for predicting final reactants.

Applications of machine learning in chemistry are often limited by the scarcity and expense of labeled data, restricting traditional supervised methods. In this work, we introduce a framework for molecular reasoning using general-purpose Large Language Models (LLMs) that operates without requiring labeled training data. Our method anchors chain-of-thought reasoning to the molecular structure by using unique atomic identifiers. First, the LLM performs a one-shot task to identify relevant fragments and their associated chemical labels or transformation classes. In an optional second step, this position-aware information is used in a few-shot task with provided class examples to predict the chemical transformation. We apply our framework to single-step retrosynthesis, a task where LLMs have previously underperformed. Across academic benchmarks and expert-validated drug discovery molecules, our work enables LLMs to achieve high success rates in identifying chemically plausible reaction sites ($\geq90\%$), named reaction classes ($\geq40\%$), and final reactants ($\geq74\%$). Beyond solving complex chemical tasks, our work also provides a method to generate theoretically grounded synthetic datasets by mapping chemical knowledge onto the molecular structure and thereby addressing data scarcity.
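
To make the idea of anchoring reasoning to unique atomic identifiers concrete, here is a minimal sketch. It is not the authors' implementation: the abstract does not specify the identifier format, so the numbering scheme (reading order in the SMILES string, starting at 1), the simplified regular expression, and the names `label_atoms` and `build_prompt` are all assumptions made for illustration. The sketch leaves the SMILES string untouched and lists the identifiers alongside it, so an LLM could be asked to name reaction sites by identifier.

```python
import re

# Simplified atom tokenizer (an assumption, not a full SMILES parser):
# bracket atoms first, then two-letter halogens, then one-letter
# organic-subset atoms in aliphatic and aromatic form.
ATOM_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")


def label_atoms(smiles):
    """Return (atom_id, atom_token) pairs in SMILES reading order, IDs from 1."""
    matches = ATOM_PATTERN.finditer(smiles)
    return [(i, m.group()) for i, m in enumerate(matches, start=1)]


def build_prompt(smiles):
    """Build a hypothetical prompt that exposes the atom identifiers to an LLM."""
    listing = ", ".join(f"{token}:{atom_id}" for atom_id, token in label_atoms(smiles))
    return (
        f"Molecule (SMILES): {smiles}\n"
        f"Atom identifiers: {listing}\n"
        "Refer to atoms only by these identifiers when naming reaction sites."
    )


if __name__ == "__main__":
    # Acetic acid: two carbons and two oxygens, numbered in reading order.
    print(build_prompt("CC(=O)O"))
```

With identifiers of this kind, a model's answer such as "disconnect the bond between atoms 2 and 4" can be checked against the structure instead of being parsed from free text.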

