LG | AI | BM | Feb 10, 2025

Automatic Annotation Augmentation Boosts Translation between Molecules and Natural Language

arXiv:2502.06634v1 | 12 citations | h-index: 3 | NAACL
Originality: Incremental advance
AI Analysis

This work addresses a bottleneck in AI for biological research, specifically for drug discovery applications, by providing a method to enhance dataset quality, though it appears incremental as it builds on existing architectures and datasets.

The paper tackles the scarcity of high-quality annotations for integrating molecular data with natural language in drug discovery. It introduces LA$^3$, a framework that uses large language models to augment existing datasets, and reports improvements of up to 301% over the benchmark architecture on text-based molecule generation and molecule captioning.

Recent advancements in AI for biological research focus on integrating molecular data with natural language to accelerate drug discovery. However, the scarcity of high-quality annotations limits progress in this area. This paper introduces LA$^3$, a Language-based Automatic Annotation Augmentation framework that leverages large language models to augment existing datasets, thereby improving AI training. We demonstrate the effectiveness of LA$^3$ by creating an enhanced dataset, LaChEBI-20, where we systematically rewrite the annotations of molecules from an established dataset. These rewritten annotations preserve essential molecular information while providing more varied sentence structures and vocabulary. Using LaChEBI-20, we train LaMolT5 based on a benchmark architecture to learn the mapping between molecular representations and augmented annotations. Experimental results on text-based *de novo* molecule generation and molecule captioning demonstrate that LaMolT5 outperforms state-of-the-art models. Notably, incorporating LA$^3$ leads to improvements of up to 301% over the benchmark architecture. Furthermore, we validate the effectiveness of LA$^3$ in notable applications across *image*, *text* and *graph* tasks, affirming its versatility and utility.
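The annotation-rewriting step described in the abstract can be pictured with a minimal sketch. This is an illustration under stated assumptions, not the authors' implementation: the names `Pair`, `augment`, and `toy_rewrite` are hypothetical, the rewriter is a template-based stand-in for a large language model call, and keeping the original annotation alongside its rewrites is an assumption the abstract does not confirm.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Pair:
    """One training example: a molecule and one annotation of it."""
    smiles: str   # molecular representation (SMILES string)
    caption: str  # natural-language annotation


def augment(
    pairs: Iterable[Pair],
    rewrite: Callable[[str, int], str],
    n_rewrites: int = 2,
) -> List[Pair]:
    """Pair each molecule with its original caption plus rewritten variants.

    `rewrite(caption, variant_index)` is a hypothetical hook where an LLM
    call would go. Empty or duplicate rewrites are dropped.
    """
    augmented: List[Pair] = []
    for pair in pairs:
        augmented.append(pair)  # assumption: the original annotation is kept
        seen = {pair.caption}
        for variant in range(n_rewrites):
            candidate = rewrite(pair.caption, variant).strip()
            if candidate and candidate not in seen:
                seen.add(candidate)
                augmented.append(Pair(pair.smiles, candidate))
    return augmented


def toy_rewrite(caption: str, variant: int) -> str:
    """Stand-in for an LLM rewriter: wraps the caption in a fixed template."""
    templates = ["In short: {}", "Put differently: {}"]
    return templates[variant % len(templates)].format(caption)


example = augment([Pair("CCO", "Ethanol is a primary alcohol.")], toy_rewrite)
```

Here `example` holds the original pair followed by its rewritten variants, all sharing the same SMILES string. A real rewriter would also need a check that the essential molecular information survives the rewrite, which this sketch does not attempt.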

Code Implementations: 1 repo