LG | QM | Jan 13, 2025

Scaling Up ESM2 Architectures for Long Protein Sequences Analysis: Long and Quantized Approaches

arXiv:2501.07747v1 | 1 citation | h-index: 4 | BSB
Originality: Incremental advance
AI Analysis

The work addresses a bottleneck for researchers analyzing long protein sequences, but the advance is incremental because it builds directly on existing ESM2 architectures.

The paper tackles the input-length limitation of ESM2 architectures by developing long and quantized versions, roughly doubling the input size limit from 1,022 to 2,048 amino acids.

Various approaches using Transformer architectures have achieved state-of-the-art results in Natural Language Processing (NLP). Building on this success, numerous architectures have been proposed for other types of data, including biological data and protein sequences in particular. Notable among these are the ESM2 architectures, pre-trained on billions of proteins, which form the basis of various state-of-the-art approaches in the field. However, the ESM2 architectures limit input size to 1,022 amino acids, so sequences longer than this limit require preprocessing before they can be handled. In this paper, we present the long and quantized versions of the ESM2 architectures, doubling the input size limit to 2,048 amino acids.
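To make the input-size limitation concrete, here is a minimal sketch of one common preprocessing technique for over-long sequences: splitting them into fixed-size windows. The abstract does not say which preprocessing techniques it has in mind, so this is an illustrative assumption rather than the authors' method. The function name `split_sequence`, the `overlap` parameter, and the two constants are hypothetical; only the values 1,022 and 2,048 come from the text.

```python
from typing import List

# Limits stated in the abstract; the constant names are hypothetical.
ESM2_MAX_RESIDUES = 1022
LONG_MAX_RESIDUES = 2048


def split_sequence(seq: str, max_len: int = ESM2_MAX_RESIDUES, overlap: int = 0) -> List[str]:
    """Split a protein sequence into windows of at most max_len residues.

    Consecutive windows share `overlap` residues. This is an assumed,
    generic windowing scheme, not the procedure used in the paper.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    if len(seq) <= max_len:
        # Short enough to pass to the model in one piece.
        return [seq]

    step = max_len - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(seq[start:start + max_len])
        if start + max_len >= len(seq):
            # The current window reaches the end of the sequence.
            break
        start += step
    return chunks


# A toy 2,048-residue sequence (all alanine, purely for illustration).
protein = "A" * 2048
windows_short = split_sequence(protein, max_len=ESM2_MAX_RESIDUES)
windows_long = split_sequence(protein, max_len=LONG_MAX_RESIDUES)
```

Under the 1,022-residue limit this toy sequence has to be cut into three windows, whereas it fits in a single window under the 2,048-residue limit, which is the practical benefit the long versions aim for.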
