GN · LG · Jul 19, 2023

ProtiGeno: a prokaryotic short gene finder using protein language models

arXiv:2307.10343v1 · 2 citations · h-index: 10 · Has Code
Originality: Highly original
AI Analysis

ProtiGeno addresses a bottleneck in prokaryotic gene prediction, a task with applications in medicine and biotechnology. The contribution is incremental in that it targets short genes specifically rather than reworking gene finding as a whole.

The paper tackles the low sensitivity of existing gene finders on short prokaryotic genes (<180 nts), which stems from insufficient annotated data. It introduces ProtiGeno, a deep learning method built on a protein language model, which achieved higher accuracy and recall than state-of-the-art gene finders in experiments on 4,288 prokaryotic genomes.

Prokaryotic gene prediction plays an important role in understanding the biology of organisms and their function with applications in medicine and biotechnology. Although the current gene finders are highly sensitive in finding long genes, their sensitivity decreases noticeably in finding shorter genes (<180 nts). The culprit is insufficient annotated gene data to identify distinguishing features in short open reading frames (ORFs). We develop a deep learning-based method called ProtiGeno, specifically targeting short prokaryotic genes using a protein language model trained on millions of evolved proteins. In systematic large-scale experiments on 4,288 prokaryotic genomes, we demonstrate that ProtiGeno predicts short coding and noncoding genes with higher accuracy and recall than the current state-of-the-art gene finders. We discuss the predictive features of ProtiGeno and possible limitations by visualizing the three-dimensional structure of the predicted short genes. Data, codes, and models are available at https://github.com/tonytu16/protigeno.
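To make the notion of a short ORF concrete, the following is a minimal Python sketch that enumerates candidate ORFs shorter than the 180 nt threshold on both strands of a DNA sequence. It is an illustration only and not ProtiGeno's pipeline: the function names (`short_orfs`, `reverse_complement`) are hypothetical, and it assumes ATG as the only start codon and TAA/TAG/TGA as stop codons, whereas real prokaryotic gene finders typically also consider alternative starts such as GTG and TTG. Candidates like these are what a classifier would then have to label as coding or noncoding.

```python
from typing import List, Tuple

# Assumptions for this sketch (not taken from the paper's code):
START_CODON = "ATG"                  # only the canonical start codon
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard stop codons
MAX_SHORT_LEN = 180                  # "<180 nts" threshold from the abstract

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def short_orfs(seq: str, max_len: int = MAX_SHORT_LEN) -> List[Tuple[str, int, str]]:
    """List ORFs shorter than max_len nucleotides in all six reading frames.

    Each result is (strand, start, orf), where start is the 0-based offset
    on the scanned strand and orf includes both the start and stop codon.
    For each stop codon, the ORF begins at the first in-frame ATG seen
    since the previous stop.
    """
    seq = seq.upper()
    found = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == START_CODON:
                    start = i                      # open a candidate ORF
                elif start is not None and codon in STOP_CODONS:
                    orf = s[start:i + 3]           # close it at the stop codon
                    if len(orf) < max_len:
                        found.append((strand, start, orf))
                    start = None
    return found


if __name__ == "__main__":
    for strand, start, orf in short_orfs("CCATGAAATGACC"):
        print(strand, start, orf)
```

Scanning each frame independently keeps the logic simple at the cost of reporting overlapping candidates across frames and strands, which is acceptable for an illustration since the downstream model, not the scanner, is meant to do the filtering.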

Code Implementations: 1 repo