ME CR AP Nov 23, 2020

Beta-CoRM: A Bayesian Approach for $n$-gram Profiles Analysis

arXiv:2011.11558v3
AI Analysis

This work addresses the limitation of existing machine learning algorithms in discovering hidden structures and providing full probabilistic representations for n-gram profile analysis, which is relevant for researchers working with sequence data.

This paper introduces a new class of Bayesian generative models for analyzing n-gram profiles, treating them as binary attributes. The models enable the discovery of hidden structures and provide a probabilistic representation of the data, demonstrating improved classification accuracy through feature selection on both synthetic and real datasets.

$n$-gram profiles have been successfully and widely used to analyse long sequences of potentially differing lengths for clustering or classification. Machine learning algorithms have mainly been used for this purpose but, despite their predictive performance, these methods cannot discover hidden structures or provide a full probabilistic representation of the data. A novel class of Bayesian generative models for $n$-gram profiles used as binary attributes has been designed to address this. The flexibility of the proposed modelling allows a straightforward approach to feature selection in the generative model. Furthermore, a slice sampling algorithm is derived for a fast inferential procedure, which is applied to synthetic and real data scenarios and shows that feature selection can improve classification accuracy.
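To make "$n$-gram profiles used as binary attributes" concrete, the following is a minimal sketch of how sequences of differing lengths could be mapped to equal-length 0/1 vectors. It illustrates only the data representation, not the Beta-CoRM model or its slice sampler. The function names `ngrams` and `binary_profiles` and the toy sequences are assumptions made for this example and do not come from the paper.

```python
def ngrams(sequence, n):
    """Return the set of distinct contiguous n-grams in a sequence."""
    # A sequence shorter than n yields an empty set.
    return {tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)}


def binary_profiles(sequences, n):
    """Encode each sequence as a 0/1 vector over the observed n-gram vocabulary.

    Entry j of a row is 1 if the j-th n-gram occurs at least once in that
    sequence and 0 otherwise; counts are deliberately discarded.
    """
    per_sequence = [ngrams(seq, n) for seq in sequences]
    vocabulary = sorted(set().union(*per_sequence))
    matrix = [
        [1 if gram in grams else 0 for gram in vocabulary]
        for grams in per_sequence
    ]
    return vocabulary, matrix


# Toy sequences (hypothetical data) of lengths 5, 6 and 3.
sequences = ["abcab", "bcabca", "aab"]
vocabulary, matrix = binary_profiles(sequences, n=2)

for seq, row in zip(sequences, matrix):
    print(seq, row)
```

Every row has the same length regardless of the length of the original sequence, which is what allows sequences of differing lengths to be compared attribute by attribute.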

