Speak-to-Structure: Evaluating LLMs in Open-domain Natural Language-Driven Molecule Generation
This work addresses the need for better evaluation of LLMs in creative molecule discovery and is aimed at researchers in computational chemistry and drug design; the contribution is incremental in that it builds on existing LLM capabilities.
The authors tackled the problem of evaluating LLMs for natural language-driven molecule generation by proposing Speak-to-Structure (S^2-Bench), a benchmark for open-domain tasks with one-to-many relationships, and introduced OpenMolIns, a dataset that enabled Llama-3.1-8B to outperform models like GPT-4o and Claude-3.5 on this benchmark.
Recently, Large Language Models (LLMs) have shown great potential in natural language-driven molecule discovery. However, existing datasets and benchmarks for molecule-text alignment are predominantly built on a one-to-one mapping, measuring LLMs' ability to retrieve a single, pre-defined answer, rather than their creative potential to generate diverse, yet equally valid, molecular candidates. To address this critical gap, we propose Speak-to-Structure (S^2-Bench), the first benchmark to evaluate LLMs in open-domain natural language-driven molecule generation. S^2-Bench is specifically designed for one-to-many relationships, challenging LLMs to demonstrate genuine molecular understanding and generation capabilities. Our benchmark includes three key tasks: molecule editing (MolEdit), molecule optimization (MolOpt), and customized molecule generation (MolCustom), each probing a different aspect of molecule discovery. We also introduce OpenMolIns, a large-scale instruction tuning dataset that enables Llama-3.1-8B to surpass the most powerful LLMs like GPT-4o and Claude-3.5 on S^2-Bench. Our comprehensive evaluation of 28 LLMs shifts the focus from simple pattern recall to realistic molecular design, paving the way for more capable LLMs in natural language-driven molecule discovery.
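The difference between one-to-one and one-to-many evaluation can be illustrated with a minimal sketch. It is not the benchmark's actual metric: the instruction ("a molecule with exactly two oxygen atoms"), the function names, and the toy atom counter are all assumptions made for illustration, and a real evaluation would parse molecules with a cheminformatics toolkit rather than count characters.

```python
# Toy contrast between exact-match scoring (one-to-one) and
# constraint-based scoring (one-to-many). All names are hypothetical.

def count_oxygens(smiles: str) -> int:
    # Toy counter: only valid for simple SMILES strings that contain no
    # two-letter element symbols with an "O" or "o" in them (e.g. Os, Co).
    return sum(ch in "Oo" for ch in smiles)

def exact_match_score(candidates: list[str], reference: str) -> float:
    # One-to-one view: a candidate counts only if it equals the reference.
    return sum(c == reference for c in candidates) / len(candidates)

def constraint_score(candidates: list[str], n_oxygens: int) -> float:
    # One-to-many view: any candidate satisfying the instruction counts.
    return sum(count_oxygens(c) == n_oxygens for c in candidates) / len(candidates)

# Hypothetical model outputs for "a molecule with exactly two oxygen atoms".
candidates = ["OCCO", "CC(=O)O", "CCO"]  # ethylene glycol, acetic acid, ethanol
reference = "OCCO"

exact = exact_match_score(candidates, reference)
open_domain = constraint_score(candidates, 2)
print(f"exact match: {exact:.2f}, constraint satisfied: {open_domain:.2f}")
```

Under exact matching, acetic acid is marked wrong even though it satisfies the instruction; under the constraint-based view, both valid candidates are credited and only ethanol fails.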