BMAILG Dec 26, 2024

Biology-Instructions: A Dataset and Benchmark for Multi-Omics Sequence Understanding Capability of Large Language Models

arXiv:2412.19191v2 | 5 citations | h-index: 16 | Has Code | EMNLP
Originality: Incremental advance
AI Analysis

This work addresses the limited ability of LLMs to analyze multi-omics sequences, a problem relevant to researchers in computational biology. It is an incremental advance that contributes a new dataset and a training method for specialized models.

The authors tackle the underexplored application of large language models (LLMs) to multi-omics biology with two contributions. The first is Biology-Instructions, a large-scale instruction-tuning dataset covering DNA, RNA, proteins, and multi-molecules. The second is ChatMultiOmics, a baseline model trained with a three-stage pipeline, which the authors report shows stronger biological understanding than current LLMs that lack specialized training.

Large language models (LLMs) have shown remarkable capabilities in general domains, but their application to multi-omics biology remains underexplored. To address this gap, we introduce Biology-Instructions, the first large-scale instruction-tuning dataset for multi-omics biological sequences, including DNA, RNA, proteins, and multi-molecules. This dataset bridges LLMs and complex biological sequence-related tasks, enhancing their versatility and reasoning while maintaining conversational fluency. We also highlight significant limitations of current state-of-the-art LLMs on multi-omics tasks without specialized training. To overcome this, we propose ChatMultiOmics, a strong baseline with a novel three-stage training pipeline, demonstrating superior biological understanding through Biology-Instructions. Both resources are publicly available, paving the way for better integration of LLMs in multi-omics analysis. Biology-Instructions is available at: https://github.com/hhnqqq/Biology-Instructions.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
