AI CL Jun 8, 2024

ChatSR: Multimodal Large Language Models for Scientific Formula Discovery

Yanjie Li, Lina Yu, Weijun Li, Min Wu, Jingyi Liu, Wenqiang Li, Shu Wei, Yusong Deng

arXiv:2406.05410v27.33 citations h-index: 63

Originality Highly original

AI Analysis

This work addresses the challenge of integrating prior knowledge into symbolic regression, offering AI researchers a more flexible, human-like approach.

The authors tackled the symbolic regression problem by developing ChatSR, a multimodal large language model that incorporates natural language prior knowledge to guide formula generation, achieving state-of-the-art performance on 13 datasets and demonstrating strong zero-shot capabilities.

Formulas are the language of communication between humans and nature. Discovering formulas that describe natural laws from observational data is a central purpose of scientific research, and it is also an important research topic in artificial intelligence, where it is known as the symbolic regression problem. Most existing symbolic regression methods generate expressions directly from observed data. Some methods allow prior knowledge to be injected into the model by adding constraints or introducing special character hints, but they can only introduce a limited amount of prior knowledge specified in advance, and they cannot understand natural language instructions. In this article, building on the extensive knowledge and language understanding ability of multimodal large language models, we present ChatSR, which acts like a knowledgeable human scientist: any prior knowledge can be conveyed to it in natural language to guide formula generation. In tests on 13 datasets, ChatSR not only shows state-of-the-art performance on traditional symbolic regression tasks but, more notably, also understands the prior knowledge contained in natural language prompts and uses it to improve the quality of the generated expressions. In addition, ChatSR shows good zero-shot capability, understanding prior knowledge that is not present in the training data.
