BM LG Apr 25, 2021

Random Embeddings and Linear Regression can Predict Protein Function

arXiv:2104.14661v1
Originality: Synthesis-oriented
AI Analysis

This work addresses a methodological gap for researchers in computational biology: it provides simple, pretraining-free baselines against which the benefit of pretraining in protein function prediction can be assessed.

The paper asks whether pretrained protein sequence embeddings learn information that is useful for function prediction. It shows that one-hot encoding and random embeddings, neither of which requires pretraining, serve as strong baselines across 14 diverse tasks.

Large self-supervised models pretrained on millions of protein sequences have recently gained popularity for generating embeddings of protein sequences for protein function prediction. However, the absence of random baselines makes it difficult to conclude whether pretraining has learned useful information for protein function prediction. Here we show that one-hot encoding and random embeddings, neither of which requires any pretraining, are strong baselines for protein function prediction across 14 diverse sequence-to-function tasks.
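To make the two baselines concrete, here is a minimal sketch of one-hot encoding and a fixed random embedding table, each feeding a linear (ridge) regression. It is an illustration under stated assumptions, not the paper's implementation: the function names, the embedding width of 8, the ridge penalty, the equal-length toy sequences, and the synthetic target are all hypothetical choices that this summary does not specify.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(seq):
    """Return a (len(seq), 20) matrix with exactly one 1 per row."""
    out = np.zeros((len(seq), len(AMINO_ACIDS)))
    out[np.arange(len(seq)), [AA_INDEX[aa] for aa in seq]] = 1.0
    return out


def random_embed(seq, table):
    """Look up each residue in a fixed, untrained random table."""
    return table[[AA_INDEX[aa] for aa in seq]]


def featurize(seqs, encode):
    """Flatten per-residue features (assumes equal-length sequences)."""
    return np.stack([encode(s).ravel() for s in seqs])


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    gram = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    w = np.linalg.solve(gram, Xc.T @ (y - y_mean))
    b = y_mean - x_mean @ w
    return w, b


# Synthetic toy data (assumption): random length-10 sequences, and a
# made-up target equal to the number of alanines in each sequence.
rng = np.random.default_rng(0)
seqs = ["".join(rng.choice(list(AMINO_ACIDS), size=10)) for _ in range(50)]
y = np.array([s.count("A") for s in seqs], dtype=float)

# Random embedding table: drawn once, never trained (width 8 is arbitrary).
table = rng.normal(size=(len(AMINO_ACIDS), 8))

X_onehot = featurize(seqs, one_hot_encode)
X_random = featurize(seqs, lambda s: random_embed(s, table))

for name, X in [("one-hot", X_onehot), ("random", X_random)]:
    w, b = fit_ridge(X, y)
    preds = X @ w + b
    print(name, X.shape, "train MSE:", float(np.mean((preds - y) ** 2)))
```

Neither feature map involves any learned parameters, which is the point of the baseline: the only fitting happens in the linear model. A comparison against a pretrained embedding would swap in a different `encode` function and keep the regression unchanged.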
