KnowMol: Advancing Molecular Large Language Models with Multi-Level Chemical Knowledge
This work addresses inadequate textual descriptions and suboptimal molecular representations in molecular large language models, a problem relevant to researchers in computational chemistry and drug discovery. It represents a strong, specific gain rather than a foundational advance.
The authors address these limitations by introducing KnowMol-100K, a dataset of 100K fine-grained molecular annotations, together with a chemically informative molecular representation. The resulting model, KnowMol, achieves state-of-the-art performance on molecular understanding and generation tasks.
Molecular large language models have garnered widespread attention due to their promising potential in molecular applications. However, current molecular large language models face significant limitations in understanding molecules because of inadequate textual descriptions and suboptimal molecular representation strategies during pretraining. To address these challenges, we introduce KnowMol-100K, a large-scale dataset with 100K fine-grained molecular annotations across multiple levels, bridging the gap between molecules and textual descriptions. Additionally, we propose a chemically informative molecular representation that addresses the limitations of existing molecular representation strategies. Building upon these innovations, we develop KnowMol, a state-of-the-art multi-modal molecular large language model. Extensive experiments demonstrate that KnowMol achieves superior performance across molecular understanding and generation tasks.
GitHub: https://github.com/yzf-code/KnowMol
Hugging Face: https://hf.co/datasets/yzf1102/KnowMol-100K