SubGrapher: Visual Fingerprinting of Chemical Structures
This work addresses the need for accessible chemical structure information in fields such as drug discovery and materials science, offering an alternative to traditional optical chemical structure recognition: it extracts fingerprints directly from images rather than reconstructing full molecular graphs.
The paper tackles the problem of extracting chemical structures from images in scientific literature, particularly patents. It introduces SubGrapher, a method that extracts molecular fingerprints directly from images, and reports superior retrieval performance and robustness compared with existing OCSR and fingerprinting methods.
Automatic extraction of chemical structures from scientific literature plays a crucial role in accelerating research across fields ranging from drug discovery to materials science. Patent documents, in particular, contain molecular information in visual form, which is often inaccessible through traditional text-based searches. In this work, we introduce SubGrapher, a method for the visual fingerprinting of chemical structure images. Unlike conventional Optical Chemical Structure Recognition (OCSR) models that attempt to reconstruct full molecular graphs, SubGrapher focuses on extracting molecular fingerprints directly from chemical structure images. Using learning-based instance segmentation, SubGrapher identifies functional groups and carbon backbones, constructing a substructure-based fingerprint that enables chemical structure retrieval. Our approach is evaluated against state-of-the-art OCSR and fingerprinting methods, demonstrating superior retrieval performance and robustness across diverse molecular depictions. The dataset, models, and code are publicly available.
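To make the idea of a substructure-based fingerprint used for retrieval more concrete, the following is a minimal sketch, not the SubGrapher implementation. It assumes a segmentation model has already produced a list of detected substructure labels for each image. The vocabulary `VOCAB`, the count-vector encoding, the Tanimoto-style similarity, and the helper names `fingerprint`, `tanimoto`, and `retrieve` are all hypothetical choices made for illustration; the abstract does not specify any of them.

```python
from collections import Counter

# Hypothetical vocabulary of substructure classes (functional groups and
# carbon backbones). The actual classes used by SubGrapher are not given here.
VOCAB = ["carboxylic_acid", "amine", "hydroxyl", "benzene_ring", "alkyl_chain"]


def fingerprint(detections):
    """Turn a list of detected substructure labels into a count vector.

    Labels that are not in VOCAB are ignored.
    """
    counts = Counter(detections)
    return [counts.get(name, 0) for name in VOCAB]


def tanimoto(a, b):
    """Generalized Tanimoto similarity for count vectors (assumed metric)."""
    shared = sum(min(x, y) for x, y in zip(a, b))
    total = sum(max(x, y) for x, y in zip(a, b))
    # Two all-zero fingerprints are treated as identical.
    return shared / total if total else 1.0


def retrieve(query_fp, library, top_k=3):
    """Rank library entries by similarity to the query fingerprint."""
    scored = [(tanimoto(query_fp, fp), name) for name, fp in library.items()]
    # Highest similarity first; ties broken alphabetically by name.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(name, score) for score, name in scored[:top_k]]


# Usage: detections would come from the segmentation model in practice.
library = {
    "mol_a": fingerprint(["amine", "benzene_ring", "benzene_ring"]),
    "mol_b": fingerprint(["carboxylic_acid", "benzene_ring"]),
}
query = fingerprint(["benzene_ring", "amine", "benzene_ring"])
ranking = retrieve(query, library)
```

In this sketch, retrieval never requires a complete molecular graph: two depictions match when the same substructures are detected in them, which is the intuition behind fingerprinting images directly.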