LG | ML | Apr 28, 2024

Naive Bayes Classifiers and One-hot Encoding of Categorical Variables

arXiv:2404.18190v1 | 4 citations | h-index: 1
Originality: Synthesis-oriented
AI Analysis

This work addresses a modeling error that practitioners can make when applying Naïve Bayes to categorical data. The contribution is incremental, since it focuses on a single encoding issue.

The paper investigates what happens when a categorical variable is incorrectly one-hot encoded for a Naïve Bayes classifier, which imposes a product-of-Bernoullis (PoB) assumption in place of the correct categorical model. In the experiments, the two classifiers agree on the maximum a posteriori class label in most cases, but the PoB classifier usually yields higher posterior probabilities.

This paper investigates the consequences of encoding a $K$-valued categorical variable incorrectly as $K$ bits via one-hot encoding, when using a Naïve Bayes classifier. This gives rise to a product-of-Bernoullis (PoB) assumption, rather than the correct categorical Naïve Bayes classifier. The differences between the two classifiers are analysed mathematically and experimentally. In our experiments using probability vectors drawn from a Dirichlet distribution, the two classifiers are found to agree on the maximum a posteriori class label for most cases, although the posterior probabilities are usually greater for the PoB case.
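The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' experimental code: it assumes a single $K$-valued feature, a uniform class prior, a symmetric Dirichlet with concentration `alpha`, and an observed value drawn uniformly at random. The function names (`categorical_posterior`, `pob_posterior`, `compare`) and all default parameters are hypothetical choices made for illustration.

```python
import numpy as np

def categorical_posterior(theta, prior, k):
    """Class posterior under the categorical model: p(c | x=k) is proportional to prior[c] * theta[c, k]."""
    joint = prior * theta[:, k]
    return joint / joint.sum()

def pob_posterior(theta, prior, k):
    """Class posterior when the K one-hot bits are treated as independent Bernoullis.

    Bit k is on and every other bit is off, so the likelihood is
    theta[c, k] * prod_{j != k} (1 - theta[c, j]).
    """
    off_bits = np.prod(np.delete(1.0 - theta, k, axis=1), axis=1)
    joint = prior * theta[:, k] * off_bits
    return joint / joint.sum()

def compare(n_trials=2000, n_classes=2, K=4, alpha=1.0, seed=0):
    """Return (fraction of trials with the same MAP label,
    fraction of trials where the PoB maximum posterior is larger)."""
    rng = np.random.default_rng(seed)
    prior = np.full(n_classes, 1.0 / n_classes)  # assumed uniform prior
    agree = 0
    pob_higher = 0
    for _ in range(n_trials):
        # One probability vector per class, each drawn from Dirichlet(alpha, ..., alpha).
        theta = rng.dirichlet(np.full(K, alpha), size=n_classes)
        k = rng.integers(K)  # assumed: observed category chosen uniformly
        p_cat = categorical_posterior(theta, prior, k)
        p_pob = pob_posterior(theta, prior, k)
        agree += int(p_cat.argmax() == p_pob.argmax())
        pob_higher += int(p_pob.max() > p_cat.max())
    return agree / n_trials, pob_higher / n_trials
```

Calling `compare()` returns the two fractions for the assumed setup; varying `K`, `n_classes`, or `alpha` shows how the agreement rate depends on those choices. For $K = 2$ the PoB likelihood reduces to the square of the categorical one, which is one way to see why the PoB posterior tends to be more extreme.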

