IR AI LG QM · Feb 20, 2024

ChemMiner: A Large Language Model Agent System for Chemical Literature Data Mining

Kexin Chen, Yuyang Du, Junyou Li, Hanqun Cao, Menghao Guo, Xilin Dang, Lanqing Li, Jiezhong Qiu, Pheng Ann Heng, Guangyong Chen

arXiv:2402.12993v2 · 8.19 citations · h-index: 10 · Has Code · 2025 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)

Originality: Incremental advance

AI Analysis

ChemMiner addresses the need for comprehensive chemical datasets for AI-assisted synthesis tools by offering a scalable way to mine underexplored literature data. The advance is incremental, since it builds on existing LLM and agent technologies.

The paper tackles the extraction of chemical reaction data from literature, a task made difficult by varied writing styles and multimodal content. It proposes ChemMiner, an LLM-based agent system that identifies reactions at rates comparable to human chemists, with high accuracy, recall, and F1 scores, while reducing processing time.

The development of AI-assisted chemical synthesis tools requires comprehensive datasets covering diverse reaction types, yet current high-throughput experimental (HTE) approaches are expensive and limited in scope. Chemical literature represents a vast, underexplored data source containing thousands of reactions published annually. However, extracting reaction information from literature faces significant challenges including varied writing styles, complex coreference relationships, and multimodal information presentation. This paper proposes ChemMiner, a novel end-to-end framework leveraging multiple agents powered by large language models (LLMs) to extract high-fidelity chemical data from literature. ChemMiner incorporates three specialized agents: a text analysis agent for coreference mapping, a multimodal agent for non-textual information extraction, and a synthesis analysis agent for data generation. Furthermore, we developed a comprehensive benchmark with expert-annotated chemical literature to evaluate both extraction efficiency and precision. Experimental results demonstrate reaction identification rates comparable to human chemists while significantly reducing processing time, with high accuracy, recall, and F1 scores. Our open-sourced benchmark facilitates future research in chemical literature data mining.
