CL | AI | BM | QM | Feb 22, 2024

L+M-24: Building a Dataset for Language + Molecules @ ACL 2024

arXiv:2403.00791v2 | 38 citations | h-index: 13 | LANGMOL
Originality: Synthesis-oriented
AI Analysis

The dataset addresses a data bottleneck for researchers in molecular discovery and understanding, but the contribution is incremental because it builds on existing dataset types.

The paper tackles the scarcity of high-quality molecule-language pair datasets by introducing the L+M-24 dataset, which is designed to focus on compositionality, functionality, and abstraction for training language-molecule models.

Language-molecule models have emerged as an exciting direction for molecular discovery and understanding. However, training these models is challenging due to the scarcity of molecule-language pair datasets. At this point, datasets have been released which are 1) small and scraped from existing databases, 2) large but noisy and constructed by performing entity linking on the scientific literature, and 3) built by converting property prediction datasets to natural language using templates. In this document, we detail the $\textit{L+M-24}$ dataset, which has been created for the Language + Molecules Workshop shared task at ACL 2024. In particular, $\textit{L+M-24}$ is designed to focus on three key benefits of natural language in molecule design: compositionality, functionality, and abstraction.
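The third dataset type mentioned above, converting property labels to natural language with templates, can be illustrated with a minimal sketch. This is not the L+M-24 construction pipeline: the function names `join_labels` and `make_pair`, the template wording, and the example labels are all hypothetical, chosen only to show how several property labels compose into one caption.

```python
# Hypothetical sketch of template-based caption generation.
# The template text and property labels are illustrative assumptions,
# not taken from the L+M-24 dataset.

def join_labels(labels):
    """Join property labels into a readable, deterministic phrase."""
    labels = sorted(set(labels))  # drop duplicates, fix the order
    if len(labels) <= 2:
        return " and ".join(labels)
    return ", ".join(labels[:-1]) + ", and " + labels[-1]


def make_pair(smiles, labels):
    """Build one molecule-language pair from a SMILES string and labels."""
    if not labels:
        raise ValueError("at least one property label is required")
    caption = f"The molecule has the following properties: {join_labels(labels)}."
    return {"molecule": smiles, "caption": caption}


pair = make_pair("CCO", ["solvent", "antiseptic"])
print(pair["caption"])
```

Because the caption is assembled from individual labels, any combination of properties yields a valid sentence, which is a simple form of the compositionality the abstract refers to.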

Code Implementations: 1 repo