LG · CL · Feb 8, 2025

Graph-based Molecular In-context Learning Grounded on Morgan Fingerprints

arXiv:2502.05414v1 · 4 citations · h-index: 7
Originality: Highly original
AI Analysis

This work addresses the limitations of current prompt retrieval methods for molecular tasks, a concern for researchers and practitioners who apply large language models to molecular property prediction and molecule captioning.

The authors tackle in-context learning for molecular tasks with GAMIC, a technique that aligns global molecular structures with textual captions while leveraging local feature similarity through Morgan fingerprints. GAMIC outperforms simple Morgan-based ICL retrieval methods across all tasks, by up to 45%.

In-context learning (ICL) effectively conditions large language models (LLMs) for molecular tasks, such as property prediction and molecule captioning, by embedding carefully selected demonstration examples into the input prompt. This approach avoids the computational overhead of extensive pretraining and fine-tuning. However, current prompt retrieval methods for molecular tasks have relied on molecule feature similarity, such as Morgan fingerprints, which do not adequately capture the global molecular and atom-binding relationships. As a result, these methods fail to represent the full complexity of molecular structures during inference. Moreover, small-to-medium-sized LLMs, which offer simpler deployment requirements in specialized systems, have remained largely unexplored in the molecular ICL literature. To address these gaps, we propose a self-supervised learning technique, GAMIC (Graph-Aligned Molecular In-Context learning), which aligns global molecular structures, represented by graph neural networks (GNNs), with textual captions (descriptions) while leveraging local feature similarity through Morgan fingerprints. In addition, we introduce a Maximum Marginal Relevance (MMR) based diversity heuristic during retrieval to optimize input prompt demonstration samples. Our experimental findings using diverse benchmark datasets show that GAMIC outperforms simple Morgan-based ICL retrieval methods across all tasks by up to 45%.
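The MMR-based diversity heuristic mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes fingerprints are represented as sets of "on" bit indices (a stand-in for Morgan fingerprints), uses Tanimoto similarity for both relevance and redundancy, and the names `tanimoto`, `mmr_select`, and the trade-off weight `lam` are hypothetical. GAMIC itself retrieves with learned graph-aligned embeddings, which this sketch does not model.

```python
from typing import FrozenSet, List, Sequence

# Assumption: a fingerprint is the set of "on" bit indices,
# standing in for a Morgan fingerprint bit vector.
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto (Jaccard) similarity between two bit sets."""
    if not a and not b:
        return 0.0
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)


def mmr_select(
    query: Fingerprint,
    candidates: Sequence[Fingerprint],
    k: int,
    lam: float = 0.7,  # hypothetical weight: 1.0 means pure relevance
) -> List[int]:
    """Greedily pick up to k candidate indices by Maximum Marginal Relevance.

    Each step picks the candidate maximizing
    lam * sim(query, c) - (1 - lam) * max similarity to already-picked items.
    """
    relevance = [tanimoto(query, c) for c in candidates]
    selected: List[int] = []
    remaining = list(range(len(candidates)))

    while remaining and len(selected) < k:
        def score(i: int) -> float:
            # Redundancy is 0 while nothing has been selected yet.
            redundancy = max(
                (tanimoto(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance[i] - (1.0 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)

    return selected
```

With `lam` close to 1 the selection reduces to plain nearest-neighbour retrieval; lowering it penalizes demonstrations that duplicate ones already chosen, which is the diversity effect the heuristic is meant to provide.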
