ML, CL, IR, LG · Feb 9, 2016

Toward Optimal Feature Selection in Naive Bayes for Text Categorization

arXiv:1602.02850v1 · 245 citations
Originality: Incremental advance
AI Analysis

This work addresses feature selection to improve efficiency and performance in text categorization, presenting an incremental advancement over existing methods.

The paper tackles feature selection in naive Bayes for text categorization by introducing a new divergence measure, Jeffreys-Multi-Hypothesis (JMH) divergence, and developing two efficient methods based on it, achieving promising results in experiments.

Automated feature selection is important for text categorization to reduce the feature size and to speed up the learning process of classifiers. In this paper, we present a novel and efficient feature selection framework based on information theory, which aims to rank the features by their discriminative capacity for classification. We first revisit two information measures, Kullback-Leibler divergence and Jeffreys divergence, for binary hypothesis testing, and analyze their asymptotic properties relating to type I and type II errors of a Bayesian classifier. We then introduce a new divergence measure, called Jeffreys-Multi-Hypothesis (JMH) divergence, to measure multi-distribution divergence for multi-class classification. Based on the JMH-divergence, we develop two efficient feature selection methods, termed maximum discrimination ($MD$) and $MD$-$\chi^2$ methods, for text categorization. The promising results of extensive experiments demonstrate the effectiveness of the proposed approaches.
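As a rough illustration of the two binary-hypothesis measures the abstract revisits, the sketch below computes the Kullback-Leibler divergence and the Jeffreys divergence (the symmetrized sum KL(P||Q) + KL(Q||P)) for discrete distributions, then ranks terms in a two-class toy problem by the Jeffreys divergence between their per-class occurrence distributions. This is not the paper's MD or MD-χ² method, and the JMH divergence is not reproduced here because the abstract does not define it. The function names, the Bernoulli (term present or absent) model, and the toy probabilities are all assumptions made for illustration.

```python
import math


def kl_divergence(p, q):
    """KL(P || Q) in nats for two discrete distributions of equal length."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf  # P puts mass where Q has none
            total += pi * math.log(pi / qi)
    return total


def jeffreys_divergence(p, q):
    """Jeffreys divergence: the symmetric sum KL(P || Q) + KL(Q || P)."""
    return kl_divergence(p, q) + kl_divergence(q, p)


def rank_terms(probs_class_a, probs_class_b):
    """Rank terms by Jeffreys divergence between per-class Bernoulli models.

    Hypothetical helper: each dict maps a term to the probability that the
    term occurs in a document of that class.
    """
    scores = {}
    for term, pa in probs_class_a.items():
        pb = probs_class_b[term]
        scores[term] = jeffreys_divergence([pa, 1 - pa], [pb, 1 - pb])
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # Made-up occurrence probabilities for a sports class and a politics class.
    sports = {"goal": 0.60, "the": 0.90, "vote": 0.10}
    politics = {"goal": 0.05, "the": 0.90, "vote": 0.50}
    print(rank_terms(sports, politics))
```

A term that occurs equally often in both classes scores zero and falls to the bottom of the ranking, while a term whose occurrence rate differs sharply between classes rises to the top. That is the sense of "discriminative capacity" the abstract refers to.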

