Dec 6, 2020

Align-gram : Rethinking the Skip-gram Model for Protein Sequence Analysis

arXiv:2012.03324v1
AI Analysis

This study offers an incremental improvement in protein sequence representation for researchers using deep learning in bioinformatics.

The authors propose Align-gram, a novel k-mer embedding scheme for protein sequences that maps similar k-mers closer in a vector space. Their experiments show that Align-gram embeddings improve the performance of deep learning models like LSTM and CNN (DeepGoPlus) for protein sequence analysis.

Background: The advent of next-generation sequencing technologies has exponentially increased the volume of biological sequence data. Protein sequences, often described as the 'language of life', have been analyzed for a multitude of applications and inferences.

Motivation: Owing to the rapid development of deep learning, recent years have seen a number of breakthroughs in Natural Language Processing. Since these methods can perform different tasks when trained with a sufficient amount of data, off-the-shelf models are often applied to biological problems. In this study, we investigate the applicability of the popular Skip-gram model to protein sequence analysis and attempt to incorporate some biological insights into it.

Results: We propose a novel $k$-mer embedding scheme, Align-gram, which is capable of mapping similar $k$-mers close to each other in a vector space. Furthermore, we experiment with other sequence-based protein representations and observe that the embeddings derived from Align-gram are better at aiding the modeling and training of deep learning models. Our experiments with a simple baseline LSTM model and the much more complex CNN model of DeepGoPlus show the potential of Align-gram for different types of deep learning applications in protein sequence analysis.
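To make the Skip-gram setting concrete, the following is a minimal sketch of how a protein sequence might be turned into $k$-mer tokens and (center, context) training pairs for a standard Skip-gram model. It is not the Align-gram method itself, whose alignment-based modification is not detailed in this abstract. The function names `kmers` and `skipgram_pairs`, the use of overlapping $k$-mers with stride 1, and the values of `k` and `window` are all assumptions chosen for illustration.

```python
from typing import List, Tuple


def kmers(sequence: str, k: int = 3) -> List[str]:
    """Split a protein sequence into overlapping k-mers (stride 1, an assumption)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Return (center, context) pairs for tokens within `window` positions of each other."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a token is never its own context
                pairs.append((center, tokens[j]))
    return pairs


if __name__ == "__main__":
    # Hypothetical toy sequence, used only to show the shape of the output.
    tokens = kmers("MKTAYIAK", k=3)
    print(tokens)
    print(skipgram_pairs(tokens, window=1))
```

In plain Skip-gram, these pairs are the only training signal, so two $k$-mers end up close in the vector space only if they occur in similar contexts. The abstract's stated goal of placing similar $k$-mers near each other is what the proposed scheme adds on top of this.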
