GN AI CL QM

Nov 11, 2024

LA4SR: illuminating the dark proteome with generative AI

David R. Nelson, Ashish Kumar Jaiswal, Noha Ismail, Alexandra Mystikou, Kourosh Salehi-Ashtiani

arXiv:2411.06798v2

h-index: 38

Has Code

Originality: Incremental advance

AI Analysis

This work addresses the challenge of analyzing the dark proteome, the uncharacterized proteins that make up about 65% of total proteins. For researchers in microbiology and bioinformatics it represents a strong, specific gain rather than a broad paradigm shift.

The authors classify uncharacterized proteins in the dark proteome using re-engineered generative AI language models. The models achieve F1 scores up to 95, run 16,580x faster than BLASTP at 2.9x its recall, and reach high accuracy (F1 > 86 for models above 1B parameters) when trained on less than 2% of the available data.

AI language models (LMs) show promise for biological sequence analysis. We re-engineered open-source LMs (GPT-2, BLOOM, DistilRoBERTa, ELECTRA, and Mamba, ranging from 70M to 12B parameters) for microbial sequence classification. The models achieved F1 scores up to 95 and operated 16,580x faster and at 2.9x the recall of BLASTP. They effectively classified the algal dark proteome (uncharacterized proteins comprising about 65% of total proteins), validated on new data including a new, complete Hi-C/PacBio Chlamydomonas genome. Larger (>1B) LA4SR models reached high accuracy (F1 > 86) when trained on less than 2% of available data, rapidly achieving strong generalization capacity. High accuracy was achieved when training data had intact or scrambled terminal information, demonstrating robust generalization to incomplete sequences. Finally, we provide custom AI explainability software tools for attributing amino acid patterns to AI generative processes and interpreting their outputs in evolutionary and biophysical contexts.
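For readers comparing the F1 and recall figures quoted above, the following is a minimal sketch of how these metrics are conventionally computed from classification counts. The function name and the counts are hypothetical and are not taken from the paper or its code; the abstract reports F1 on a 0 to 100 scale, so the sketch multiplies by 100 at the end.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) as fractions in [0, 1].

    tp, fp, fn are true-positive, false-positive, and false-negative counts.
    Undefined ratios (zero denominators) are reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for illustration only (not results from the paper).
precision, recall, f1 = precision_recall_f1(tp=90, fp=10, fn=10)
f1_percent = 100 * f1  # same 0-100 scale the abstract uses
```

A statement such as "2.9x the recall of BLASTP" then corresponds to the ratio of two recall values computed this way on the same test set.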
