IR | AI | LG | QM | Feb 20, 2024

ChemMiner: A Large Language Model Agent System for Chemical Literature Data Mining

arXiv:2402.12993v2 | 9 citations | h-index: 10 | Has Code | 2025 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)
Originality: Incremental advance
AI Analysis

This work addresses the need for comprehensive chemical datasets for AI-assisted synthesis tools by offering a scalable way to mine underexplored literature data. It is rated incremental because it builds on existing LLM and agent technologies.

The paper tackles the extraction of chemical reaction data from literature, which is difficult because of varied writing styles and multimodal content. It proposes ChemMiner, an LLM-based agent system that identifies reactions at a rate comparable to human chemists, with high accuracy, recall, and F1 scores and reduced processing time.

The development of AI-assisted chemical synthesis tools requires comprehensive datasets covering diverse reaction types, yet current high-throughput experimental (HTE) approaches are expensive and limited in scope. Chemical literature represents a vast, underexplored data source containing thousands of reactions published annually. However, extracting reaction information from literature faces significant challenges including varied writing styles, complex coreference relationships, and multimodal information presentation. This paper proposes ChemMiner, a novel end-to-end framework leveraging multiple agents powered by large language models (LLMs) to extract high-fidelity chemical data from literature. ChemMiner incorporates three specialized agents: a text analysis agent for coreference mapping, a multimodal agent for non-textual information extraction, and a synthesis analysis agent for data generation. Furthermore, we developed a comprehensive benchmark with expert-annotated chemical literature to evaluate both extraction efficiency and precision. Experimental results demonstrate reaction identification rates comparable to human chemists while significantly reducing processing time, with high accuracy, recall, and F1 scores. Our open-sourced benchmark facilitates future research in chemical literature data mining.
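The abstract reports recall and F1 against expert annotations but does not say here how an extracted reaction is matched to an annotated one. As a minimal sketch, assuming exact matching on a normalized record, the set-based scoring could look like the following. The `Reaction` fields and the `score_extraction` helper are hypothetical names chosen for illustration, not part of ChemMiner or its benchmark.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Reaction:
    """Hypothetical normalized reaction record (fields are assumptions)."""
    reactants: frozenset[str]  # order-independent set of reactant names
    product: str


def score_extraction(predicted: set[Reaction], gold: set[Reaction]) -> dict[str, float]:
    """Score extracted reactions against expert annotations by exact match."""
    true_positives = len(predicted & gold)
    # Guard against empty sets so the ratios are always defined.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy data: one extracted reaction matches an annotation, one does not.
gold = {
    Reaction(frozenset({"A", "B"}), "C"),
    Reaction(frozenset({"D"}), "E"),
}
predicted = {
    Reaction(frozenset({"B", "A"}), "C"),  # same reactants, different order
    Reaction(frozenset({"D"}), "F"),       # wrong product
}
scores = score_extraction(predicted, gold)
print(scores)
```

Exact matching is the strictest choice; a real benchmark might instead compare canonicalized structures or score individual fields such as yield and conditions, which would change the numbers.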

