Biology-Instructions: A Dataset and Benchmark for Multi-Omics Sequence Understanding Capability of Large Language Models
This work addresses the limited capabilities of LLMs in multi-omics sequence analysis, a problem relevant to researchers in computational biology. It is an incremental advance that provides a new dataset and a training method for specialized training.
The authors tackle the underexplored application of large language models (LLMs) to multi-omics biology with two contributions: Biology-Instructions, a large-scale instruction-tuning dataset covering DNA, RNA, proteins, and multi-molecules, and ChatMultiOmics, a baseline model trained with a novel three-stage pipeline that demonstrates superior biological understanding.
Large language models (LLMs) have shown remarkable capabilities in general domains, but their application to multi-omics biology remains underexplored. To address this gap, we introduce Biology-Instructions, the first large-scale instruction-tuning dataset for multi-omics biological sequences, including DNA, RNA, proteins, and multi-molecules. This dataset bridges LLMs and complex biological sequence-related tasks, enhancing their versatility and reasoning while maintaining conversational fluency. We also highlight significant limitations of current state-of-the-art LLMs on multi-omics tasks without specialized training. To overcome this, we propose ChatMultiOmics, a strong baseline with a novel three-stage training pipeline, demonstrating superior biological understanding through Biology-Instructions. Both resources are publicly available, paving the way for better integration of LLMs in multi-omics analysis. Biology-Instructions is available at: https://github.com/hhnqqq/Biology-Instructions.
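To make the idea of an instruction-tuning record for biological sequences concrete, the following is a minimal sketch of how one such record might be represented and turned into a prompt. The abstract does not specify the dataset's schema, so every name here (`InstructionRecord`, `make_record`, `to_prompt`, the field names, and the `<DNA>...</DNA>` style tags) is a hypothetical illustration, and the example task and answer are invented placeholders rather than entries from Biology-Instructions.

```python
from dataclasses import dataclass

# Hypothetical schema: these field names are illustrative assumptions,
# not the actual format used by Biology-Instructions.
@dataclass
class InstructionRecord:
    molecule: str     # e.g. "DNA", "RNA", "protein"
    instruction: str  # natural-language task description
    sequence: str     # raw biological sequence, upper-cased
    response: str     # natural-language answer

# Allowed characters per molecule type (standard nucleotide and
# 20-amino-acid alphabets, with N for an unknown nucleotide).
ALPHABETS = {
    "DNA": set("ACGTN"),
    "RNA": set("ACGUN"),
    "protein": set("ACDEFGHIKLMNPQRSTVWY"),
}

def make_record(molecule: str, instruction: str,
                sequence: str, response: str) -> InstructionRecord:
    """Build a record, rejecting sequences outside the molecule's alphabet."""
    seq = sequence.upper()
    alphabet = ALPHABETS.get(molecule)
    if alphabet is not None and not set(seq) <= alphabet:
        bad = sorted(set(seq) - alphabet)
        raise ValueError(f"invalid {molecule} characters: {bad}")
    return InstructionRecord(molecule, instruction, seq, response)

def to_prompt(record: InstructionRecord) -> str:
    """Render the instruction and the tagged sequence as one prompt string.

    Wrapping the sequence in molecule tags is an assumed convention here.
    """
    return (f"{record.instruction}\n"
            f"<{record.molecule}>{record.sequence}</{record.molecule}>")

# Invented placeholder example, not taken from the dataset.
example = make_record(
    "DNA",
    "Does this sequence contain a promoter region?",
    "tataaagc",
    "placeholder answer",
)
print(to_prompt(example))
```

Validating the alphabet before building a record is one simple way to keep DNA, RNA, and protein inputs from being mixed up when several molecule types share a single instruction format.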