CL MTRL-SCI APP-PH | Aug 25, 2023

DARWIN Series: Domain Specific Large Language Models for Natural Science

Tong Xie, Yuwei Wan, Wei Huang, Zhenyu Yin, Yixuan Liu, Shaozhou Wang, Qingyuan Linghu, Chunyu Kit, Clara Grazian, Wenjie Zhang, Imran Razzak, Bram Hoex

arXiv:2308.13565v19.952 citations | h-index: 39 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the need for automated, parallel processes in natural science (e.g., physics, chemistry, materials science) to accelerate discovery. The advance is incremental: it builds on existing LLMs through domain-specific fine-tuning.

The authors tackle the automation of scientific discovery in the natural sciences by developing DARWIN, a series of domain-specific large language models fine-tuned on over 60,000 instruction data points. The models achieve state-of-the-art results on various scientific tasks and reduce reliance on closed-source AI models.

Emerging tools bring forth fresh approaches to work, and the field of natural science is no different. In natural science, traditional manual, serial, and labour-intensive work is being augmented by automated, parallel, and iterative processes driven by artificial intelligence-based experimental automation and more. To add new capabilities in natural science, accelerating and enriching the automation of the discovery process, we present DARWIN, a series of tailored LLMs for natural science, mainly in physics, chemistry, and materials science. This series relies on open-source LLMs, incorporating structured and unstructured scientific knowledge from public datasets and literature. We fine-tuned the models using over 60,000 instruction data points, emphasizing factual correctness. During fine-tuning, we introduce the Scientific Instruction Generation (SIG) model, which automates instruction generation from scientific texts. This eliminates the need for manual extraction or domain-specific knowledge graphs and efficiently injects scientific knowledge into the model. We also explore multi-task training strategies, revealing interconnections between scientific tasks. The DARWIN series not only achieves state-of-the-art results on various scientific tasks but also diminishes reliance on closed-source AI models. Our research showcases the ability of LLMs in the scientific domain, with the overarching goal of fostering prosperity within the broader AI for science community.
