LG · CL · BM · Aug 17, 2021

Modeling Protein Using Large-scale Pretrain Language Model

arXiv:2108.07435v2 · 40 citations · Has Code
AI Analysis

This work addresses the problem of labor-intensive protein analysis for researchers in biology and medicine. Its deep learning approach is an incremental step: it applies existing language-model techniques to protein sequences.

The authors tackled protein sequence analysis by using a large-scale pretrained language model to capture evolutionary information, achieving significant improvements in both token-level and sequence-level tasks.

Proteins are linked to almost every life process. Therefore, analyzing the biological structures and properties of protein sequences is critical to the exploration of life, as well as to disease detection and drug discovery. Traditional protein analysis methods tend to be labor-intensive and time-consuming. The emergence of deep learning models makes it possible to model patterns in large quantities of data. Interdisciplinary researchers have begun to leverage deep learning methods to model large biological datasets, e.g., using long short-term memory and convolutional neural networks for protein sequence classification. Over millions of years of evolution, evolutionary information has been encoded in protein sequences. Inspired by the similarity between natural language and protein sequences, we use large-scale language models to model evolutionary-scale protein sequences, encoding protein biology information in the learned representations. Significant improvements are observed in both token-level and sequence-level tasks, demonstrating that our large-scale model can accurately capture evolutionary information from pretraining on evolutionary-scale individual sequences. Our code and model are available at https://github.com/THUDM/ProteinLM.
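
The analogy between natural language and protein sequences can be made concrete with a small sketch. The abstract does not specify the tokenization or the pretraining objective, so everything below is an assumption for illustration: one token per amino acid residue, a hypothetical vocabulary with four special tokens, and a simplified BERT-style masking step in which every selected position is replaced by a mask token. The names `VOCAB`, `tokenize`, and `mask_tokens` are illustrative and are not taken from the ProteinLM repository.

```python
import random

# Assumption: the 20 standard amino acids, one token per residue.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumption: hypothetical special tokens, in the style of BERT vocabularies.
SPECIAL_TOKENS = ["[PAD]", "[MASK]", "[CLS]", "[SEP]"]
VOCAB = {tok: i for i, tok in enumerate(SPECIAL_TOKENS + list(AMINO_ACIDS))}

# Label value for positions that are not prediction targets
# (-100 is a common "ignore" convention; chosen here as an assumption).
IGNORE_LABEL = -100


def tokenize(sequence):
    """Map a protein sequence to token ids, wrapped in [CLS] ... [SEP].

    Raises KeyError for residues outside the 20 standard amino acids.
    """
    return [VOCAB["[CLS]"]] + [VOCAB[aa] for aa in sequence] + [VOCAB["[SEP]"]]


def mask_tokens(token_ids, mask_prob=0.15, seed=0):
    """Replace each residue token with [MASK] with probability mask_prob.

    Returns (inputs, labels). labels holds the original id at masked
    positions and IGNORE_LABEL elsewhere. [CLS] and [SEP] are never masked.
    """
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [IGNORE_LABEL] * len(token_ids)
    for i in range(1, len(token_ids) - 1):
        if rng.random() < mask_prob:
            labels[i] = token_ids[i]
            inputs[i] = VOCAB["[MASK]"]
    return inputs, labels


if __name__ == "__main__":
    ids = tokenize("MKTAYIAKQR")
    inputs, labels = mask_tokens(ids)
    print(inputs)
    print(labels)
```

Under this framing, a token-level task predicts a label for each residue position, while a sequence-level task predicts a single label for the whole sequence, for example from the representation at the `[CLS]` position.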

Code Implementations: 1 repo
