AI · Jun 3

From Answers to States: Verifiable Process-Level Evaluation of Chemical Reasoning in Large Language Models

Hongyu Guo, Hao Li, He Cao, Gongbo Zhang, Li Yuan

arXiv:2606.03660 · 69.4 · h-index: 3

Predicted impact: top 52% in AI · last 90 days · Originality: Incremental advance

AI Analysis

Provides a low-cost, auditable evaluation tool for diagnosing step-level reasoning failures in LLMs applied to chemistry, addressing the limitations of existing process-level evaluators.

ChemCoTBench-V2 introduces a rule-verifiable benchmark for evaluating chemical reasoning in LLMs, revealing a persistent gap between final-answer correctness and structured reasoning consistency across frontier models.

Large language models are increasingly used as chemistry assistants, yet most chemistry benchmarks still score only final answers. This masks a critical failure mode: a model may output the correct molecule, product, or option while its reasoning violates chemical logic. Existing process-level evaluators are hard to scale because LLM judges and human step-level process annotation are costly, inconsistent, and vulnerable to hallucination. We introduce ChemCoTBench-V2, a rule-verifiable diagnostic benchmark for low-cost, auditable evaluation of structured, verifier-addressable chemical reasoning traces. It spans molecular understanding, molecule editing, molecular optimization, and reaction prediction, with 5,620 evaluation samples across 18 reporting tasks. Models must expose key intermediate steps in expert-designed templates, and those steps are checked with deterministic chemistry rules and, for closed-answer tasks, reference traces rather than another LLM judge. Open-ended molecular optimization is evaluated with oracle-verifiable state constraints rather than strict trace matching. The benchmark reports three separate signals: final-answer correctness, template adherence, and step-wise verifier correctness over expert-refined intermediate commitments. Experiments on frontier models reveal a persistent gap between final-answer success and structured-reasoning-state consistency: models often follow the requested format while failing chemical-step checks, or answer correctly with weak supporting reasoning. ChemCoTBench-V2 enables fine-grained model comparison and identifies the concrete step at which the trace first violates the verifier.
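The three reporting signals described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from ChemCoTBench-V2: the task (counting atoms in a molecular formula), the template keys in `REQUIRED_STEPS`, and the names `parse_formula`, `Report`, and `evaluate` are all hypothetical. The sketch shows only the general idea of checking intermediate steps with a deterministic rule instead of an LLM judge, and of reporting the first step that fails.

```python
import re
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical template: the intermediate steps a trace must expose, in order.
REQUIRED_STEPS = ("carbon_count", "hydrogen_count", "oxygen_count")
STEP_TO_ELEMENT = {"carbon_count": "C", "hydrogen_count": "H", "oxygen_count": "O"}


def parse_formula(formula: str) -> Dict[str, int]:
    """Deterministic rule: count atoms per element in a flat formula like 'C2H6O'."""
    counts: Dict[str, int] = {}
    for element, num in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(num) if num else 1)
    return counts


@dataclass
class Report:
    final_answer_correct: bool        # signal 1: final-answer correctness
    template_adherent: bool           # signal 2: template adherence
    step_accuracy: float              # signal 3: fraction of steps passing the rule
    first_failed_step: Optional[str]  # earliest step that violates the rule, if any


def evaluate(formula: str, trace: Dict[str, int], final_answer: int) -> Report:
    """Score a toy trace whose final answer is the total atom count."""
    counts = parse_formula(formula)

    # Template adherence: exactly the required keys, in the required order.
    template_ok = tuple(trace.keys()) == REQUIRED_STEPS

    passed = 0
    first_failed: Optional[str] = None
    for step in REQUIRED_STEPS:
        expected = counts.get(STEP_TO_ELEMENT[step], 0)
        if trace.get(step) == expected:
            passed += 1
        elif first_failed is None:
            first_failed = step

    return Report(
        final_answer_correct=(final_answer == sum(counts.values())),
        template_adherent=template_ok,
        step_accuracy=passed / len(REQUIRED_STEPS),
        first_failed_step=first_failed,
    )


# A trace that follows the template and reaches the right total (9 atoms in
# ethanol, C2H6O) while one intermediate step is wrong.
report = evaluate(
    "C2H6O",
    {"carbon_count": 2, "hydrogen_count": 5, "oxygen_count": 1},
    final_answer=9,
)
print(report)
```

In this toy case the final answer and the template check both pass, yet the step-level check fails at `hydrogen_count`, which mirrors the gap the abstract describes between final-answer success and step-level consistency.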
