CL · AI · Apr 25, 2022

Translation between Molecules and Natural Language

arXiv:2204.11817v3 · 377 citations · h-index: 21
Originality: Highly original
AI Analysis

The paper addresses data scarcity in chemistry, a problem for both researchers and practitioners, and represents a novel application rather than an incremental improvement.

The paper tackles the problem of translating between molecules and natural language by introducing MolT5, a self-supervised learning framework that enables tasks like molecule captioning and text-based molecule generation, with results showing high-quality outputs in many cases.

We present $\textbf{MolT5}$ $-$ a self-supervised learning framework for pretraining models on a vast amount of unlabeled natural language text and molecule strings. $\textbf{MolT5}$ allows for new, useful, and challenging analogs of traditional vision-language tasks, such as molecule captioning and text-based de novo molecule generation (altogether: translation between molecules and language), which we explore for the first time. Since $\textbf{MolT5}$ pretrains models on single-modal data, it helps overcome the chemistry domain shortcoming of data scarcity. Furthermore, we consider several metrics, including a new cross-modal embedding-based metric, to evaluate the tasks of molecule captioning and text-based molecule generation. Our results show that $\textbf{MolT5}$-based models are able to generate outputs, both molecules and captions, which in many cases are high quality.
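The abstract does not spell out the pretraining objective. As a minimal sketch, assuming a T5-style span-masking ("fill in the blank") objective, which the model's name suggests but this page does not confirm, the same routine can be applied to a molecule string and to plain text separately, so no paired molecule-caption data is needed. The function name `span_corrupt`, the SMILES example, the character-level molecule tokenization, and the hand-picked spans are all illustrative assumptions, not the authors' implementation.

```python
from typing import List, Sequence, Tuple


def span_corrupt(
    tokens: Sequence[str], spans: Sequence[Tuple[int, int]]
) -> Tuple[List[str], List[str]]:
    """Mask each (start, length) span with a sentinel token.

    Returns (inputs, targets) in the T5 convention: the input keeps the
    unmasked tokens with one sentinel per masked span, and the target lists
    each sentinel followed by the tokens it replaced, ending with one final
    closing sentinel.
    """
    inputs: List[str] = []
    targets: List[str] = []
    cursor = 0
    for i, (start, length) in enumerate(sorted(spans)):
        if start < cursor or length <= 0 or start + length > len(tokens):
            raise ValueError("spans must be non-empty, in range, and non-overlapping")
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[cursor:start])      # copy the unmasked prefix
        inputs.append(sentinel)                  # stand-in for the masked span
        targets.append(sentinel)
        targets.extend(tokens[start:start + length])
        cursor = start + length
    inputs.extend(tokens[cursor:])               # copy the unmasked tail
    targets.append(f"<extra_id_{len(spans)}>")   # closing sentinel
    return inputs, targets


# Assumption: molecules are written as SMILES and split per character.
molecule_tokens = list("CC(=O)O")  # acetic acid
mol_inputs, mol_targets = span_corrupt(molecule_tokens, [(2, 3)])

# Assumption: text is split on whitespace; a real model would use subwords.
text_tokens = "the molecule is a weak acid".split()
txt_inputs, txt_targets = span_corrupt(text_tokens, [(1, 1), (4, 2)])

print(mol_inputs, mol_targets)
print(txt_inputs, txt_targets)
```

In practice the spans would be sampled at random rather than chosen by hand; fixed spans are used here only to keep the example deterministic.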

Code Implementations: 1 repo
