LG · Apr 30, 2025

MolMole: Molecule Mining from Scientific Literature

arXiv:2505.03777v21 citations · h-index: 6
Originality: Incremental advance
AI Analysis

This work addresses the challenge of automating chemical data extraction from documents, a need shared by researchers and industry, and represents a strong domain-specific advance.

The authors tackle the extraction of molecular structures and reaction data from unstructured scientific documents. They introduce MolMole, a vision-based deep learning framework that unifies detection, parsing, and recognition into a single pipeline. MolMole outperformed existing toolkits on benchmarks, including a new testset of 550 pages.

The extraction of molecular structures and reaction data from scientific documents is challenging due to their varied, unstructured chemical formats and complex document layouts. To address this, we introduce MolMole, a vision-based deep learning framework that unifies molecule detection, reaction diagram parsing, and optical chemical structure recognition (OCSR) into a single pipeline for automating the extraction of chemical data directly from page-level documents. Recognizing the lack of a standard page-level benchmark and evaluation metric, we also present a testset of 550 pages annotated with molecule bounding boxes, reaction labels, and MOLfiles, along with a novel evaluation metric. Experimental results demonstrate that MolMole outperforms existing toolkits on both our benchmark and public datasets. The benchmark testset will be publicly available, and the MolMole toolkit will be accessible soon through an interactive demo on the LG AI Research website. For commercial inquiries, please contact us at contact_ddu@lgresearch.ai.

