LGMN Feb 27, 2024

Predicting O-GlcNAcylation Sites in Mammalian Proteins with Transformers and RNNs Trained with a New Loss Function

arXiv:2402.17131v3, 3 citations, h-index: 2
Originality: Incremental advance
AI Analysis

This work addresses a specific bioinformatics challenge for researchers and potential therapeutic developers, but it is incremental as it builds on existing RNN models with a novel loss function.

The paper tackles the prediction of O-GlcNAcylation sites in mammalian proteins, a problem that previously lacked reliable methods, by introducing a new loss function called the weighted focal differentiable MCC. An RNN trained with this loss achieves state-of-the-art performance, with an F1 score of 38.88% and an MCC of 38.20% on an independent test set.

O-GlcNAcylation, a subtype of glycosylation, has the potential to be an important target for therapeutics, but methods to reliably predict O-GlcNAcylation sites had not been available until 2023; a 2021 review correctly noted that published models were insufficient and failed to generalize. Moreover, many are no longer usable. In 2023, a considerably better recurrent neural network (RNN) model was published. This article creates improved models by using a new loss function, which we call the weighted focal differentiable MCC. RNN models trained with this new loss display superior performance to models trained using the weighted cross-entropy loss; this new function can also be used to fine-tune trained models. An RNN trained with this loss achieves state-of-the-art performance in O-GlcNAcylation site prediction with an F$_1$ score of 38.88% and an MCC of 38.20% on an independent test set from the largest dataset available.
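The abstract names the loss but does not give its formula, so the following is only a minimal sketch of the general idea behind a class-weighted differentiable MCC, not the authors' definition. It assumes the usual "soft" confusion matrix, in which predicted probabilities replace hard 0/1 decisions so that the Matthews correlation coefficient becomes a smooth function of the model output. The function name `soft_mcc_loss` and the parameters `pos_weight` and `eps` are hypothetical, and the focal term is left out because the abstract does not define it.

```python
import math

def soft_mcc_loss(probs, labels, pos_weight=1.0, eps=1e-12):
    """Return 1 - soft MCC for predicted probabilities and 0/1 labels.

    pos_weight is a hypothetical class weight that scales the contribution
    of positive (modified) sites; it is not a value taken from the paper.
    """
    tp = fp = fn = tn = 0.0
    for p, y in zip(probs, labels):
        if y == 1:
            tp += pos_weight * p          # probability mass on a true site
            fn += pos_weight * (1.0 - p)  # probability mass that misses it
        else:
            fp += p                       # probability mass on a false site
            tn += 1.0 - p                 # probability mass correctly withheld
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # eps only guards against division by zero in degenerate batches
    return 1.0 - numerator / (denominator + eps)

# Toy, made-up batch: 2 positive sites among 8 candidates
labels = [1, 0, 0, 0, 0, 0, 0, 1]
good = [0.9, 0.1, 0.2, 0.1, 0.05, 0.1, 0.3, 0.8]
bad = [0.4, 0.6, 0.5, 0.4, 0.5, 0.6, 0.5, 0.3]
print(soft_mcc_loss(good, labels, pos_weight=3.0))
print(soft_mcc_loss(bad, labels, pos_weight=3.0))
```

With hard 0/1 predictions this expression reduces to one minus the ordinary MCC, so the loss lies between 0 (perfect agreement) and 2 (perfect disagreement). Optimizing it targets the reported metric directly, whereas weighted cross-entropy is only a proxy for it.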

