CVNov 17, 2024

MolParser: End-to-end Visual Recognition of Molecule Structures in the Wild

Xi Fang, Jiankun Wang, Xiaochen Cai, Shangqian Chen, Shuwen Yang, Haoyi Tao, Nan Wang, Lin Yao, Linfeng Zhang, Guolin Ke

arXiv:2411.11098v415.317 citationsh-index: 12

Originality Highly original

AI Analysis

This work addresses a critical bottleneck for large-scale literature searches and applications of large language models in biology, chemistry, and pharmaceuticals by improving the recognition of molecule structures in noisy, real-world documents.

The authors tackled the problem of automatically extracting precise chemical structures from real-world documents, which is hindered by variations in image quality and complex Markush structures, and developed MolParser, an end-to-end method that significantly outperforms existing optical chemical structure recognition methods across most scenarios.

In recent decades, chemistry publications and patents have increased rapidly. A significant portion of key information is embedded in molecular structure figures, complicating large-scale literature searches and limiting the application of large language models in fields such as biology, chemistry, and pharmaceuticals. The automatic extraction of precise chemical structures is of critical importance. However, the presence of numerous Markush structures in real-world documents, along with variations in molecular image quality, drawing styles, and noise, significantly limits the performance of existing optical chemical structure recognition (OCSR) methods. We present MolParser, a novel end-to-end OCSR method that efficiently and accurately recognizes chemical structures from real-world documents, including difficult Markush structure. We use a extended SMILES encoding rule to annotate our training dataset. Under this rule, we build MolParser-7M, the largest annotated molecular image dataset to our knowledge. While utilizing a large amount of synthetic data, we employed active learning methods to incorporate substantial in-the-wild data, specifically samples cropped from real patents and scientific literature, into the training process. We trained an end-to-end molecular image captioning model, MolParser, using a curriculum learning approach. MolParser significantly outperforms classical and learning-based methods across most scenarios, with potential for broader downstream applications. The dataset is publicly available in huggingface.

View on arXiv PDF

Similar