ML LG · Jun 12, 2020

Fast Maximum Likelihood Estimation and Supervised Classification for the Beta-Liouville Multinomial

arXiv:2006.07454v1

Originality: Incremental advance

AI Analysis

This work addresses the need for more flexible categorical data modeling in fields like bioinformatics and NLP, though it appears incremental as it builds on existing distribution variants.

The paper tackles the problem of modeling categorical data with distributions that make strict assumptions, which can lead to poor parameter estimates and classification accuracy. It shows that the Beta-Liouville multinomial matches or exceeds the efficiency and performance of standard distributions on simulated data and outperforms them on two out of four gold standard datasets.

The multinomial and related distributions have long been used to model categorical, count-based data in fields ranging from bioinformatics to natural language processing. Commonly utilized variants include the standard multinomial and the Dirichlet multinomial distributions due to their computational efficiency and straightforward parameter estimation process. However, these distributions make strict assumptions about the mean, variance, and covariance between the categorical features being modeled. If these assumptions are not met by the data, it may result in poor parameter estimates and loss in accuracy for downstream applications like classification. Here, we explore efficient parameter estimation and supervised classification methods using an alternative distribution, called the Beta-Liouville multinomial, which relaxes some of the multinomial assumptions. We show that the Beta-Liouville multinomial is comparable in efficiency to the Dirichlet multinomial for Newton-Raphson maximum likelihood estimation, and that its performance on simulated data matches or exceeds that of the multinomial and Dirichlet multinomial distributions. Finally, we demonstrate that the Beta-Liouville multinomial outperforms the multinomial and Dirichlet multinomial on two out of four gold standard datasets, supporting its use in modeling data with low to medium class overlap in a supervised classification context.
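To make the comparison between the two compound distributions concrete, here is a minimal sketch of their log-probability mass functions. It is not taken from the paper. It assumes the usual construction in which the Beta-Liouville multinomial places a Beta(alpha, beta) prior on the total probability of the first D categories and a Dirichlet prior on how that mass is split among them. The function names and the parameter names `alphas`, `alpha`, and `beta` are illustrative choices.

```python
from itertools import product
from math import exp, lgamma


def log_multinomial_coef(x):
    """log of n! / prod(x_k!) for a count vector x."""
    n = sum(x)
    return lgamma(n + 1) - sum(lgamma(xi + 1) for xi in x)


def log_beta(a, b):
    """log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def dm_logpmf(x, alphas):
    """Dirichlet multinomial log-pmf; one concentration parameter per category."""
    n, a_sum = sum(x), sum(alphas)
    return (log_multinomial_coef(x)
            + lgamma(a_sum) - lgamma(n + a_sum)
            + sum(lgamma(xi + ai) - lgamma(ai) for xi, ai in zip(x, alphas)))


def blm_logpmf(x, alphas, alpha, beta):
    """Beta-Liouville multinomial log-pmf under the assumed construction.

    x has D + 1 counts; alphas has D entries (hypothetical naming).
    alpha and beta govern the split between the first D categories
    taken together and the last category.
    """
    head, last = x[:-1], x[-1]
    s, a_sum = sum(head), sum(alphas)
    return (log_multinomial_coef(x)
            + log_beta(alpha + s, beta + last) - log_beta(alpha, beta)
            + lgamma(a_sum) - lgamma(a_sum + s)
            + sum(lgamma(xi + ai) - lgamma(ai) for xi, ai in zip(head, alphas)))


def count_vectors(n, k):
    """All length-k vectors of non-negative integers summing to n."""
    return [x for x in product(range(n + 1), repeat=k) if sum(x) == n]
```

Under this parameterization, setting `alpha` to the sum of `alphas` and `beta` to the last Dirichlet parameter recovers the Dirichlet multinomial, so the two extra parameters are what loosen its variance and covariance structure. A maximum likelihood routine such as Newton-Raphson would maximize the sum of these log-probabilities over the training count vectors.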
