LG, AI, CL, QM | Jun 21, 2024

Geneverse: A collection of Open-source Multimodal Large Language Models for Genomic and Proteomic Research

arXiv:2406.15534v1 | 23 citations | Has Code
Originality: Incremental advance
AI Analysis

This work addresses a gap in applying LLMs to genomic and proteomic research by offering open-source models for domain-specific tasks. It is incremental, however, because it builds on existing finetuning techniques.

The authors address the limited application of large language models in genomics and proteomics by proposing Geneverse, a collection of finetuned and multimodal LLMs for three novel tasks. They report that these models perform well and may outperform closed-source large-scale models in evaluations focusing on truthfulness and structural correctness.

The applications of large language models (LLMs) are promising for biomedical and healthcare research. Despite the availability of open-source LLMs trained using a wide range of biomedical data, current research on the applications of LLMs to genomics and proteomics is still limited. To fill this gap, we propose a collection of finetuned LLMs and multimodal LLMs (MLLMs), known as Geneverse, for three novel tasks in genomic and proteomic research. The models in Geneverse are trained and evaluated based on domain-specific datasets, and we use advanced parameter-efficient finetuning techniques to achieve the model adaptation for tasks including the generation of descriptions for gene functions, protein function inference from its structure, and marker gene selection from spatial transcriptomic data. We demonstrate that adapted LLMs and MLLMs perform well for these tasks and may outperform closed-source large-scale models based on our evaluations focusing on both truthfulness and structural correctness. All of the training strategies and base models we used are freely accessible.

Code Implementations: 1 repo
