LG IR ML

May 3, 2013

Feature Selection Based on Term Frequency and T-Test for Text Categorization

Deqing Wang, Hui Zhang, Rui Liu, Weifeng Lv

arXiv:1305.0638v1

29 citations

Originality: Incremental advance

AI Analysis

This work addresses feature selection for text categorization. It reports a slight improvement over existing methods, so the advance is incremental.

Existing feature selection methods are unreliable for low-frequency terms and ignore how often a term occurs within a document. The paper proposes a t-test-based approach that measures how differently a term is distributed in a specific category compared with the entire corpus. On two text corpora, the method is reported as comparable to or slightly better than state-of-the-art methods such as Chi-Square and Information Gain in terms of macro-F1 and micro-F1 scores.

Much work has been done on feature selection. Existing methods are based on document frequency, such as the Chi-Square Statistic, Information Gain, etc. However, these methods have two shortcomings: one is that they are not reliable for low-frequency terms, and the other is that they only count whether a term occurs in a document and ignore the term frequency. In fact, high-frequency terms within a specific category are often regarded as discriminators. This paper focuses on how to construct the feature selection function based on term frequency, and proposes a new approach based on the $t$-test, which is used to measure the diversity of the distributions of a term between the specific category and the entire corpus. Extensive comparative experiments on two text corpora using three classifiers show that our new approach is comparable to or slightly better than the state-of-the-art feature selection methods (i.e., $\chi^2$ and IG) in terms of macro-$F_1$ and micro-$F_1$.
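The idea of scoring a term by how far its frequency in one category departs from its frequency in the whole corpus can be sketched in a few lines of Python. The abstract does not state the scoring function, so this sketch assumes a Welch-style $t$ statistic over per-document term frequencies. The names `t_score` and `select_features` and the toy data are hypothetical, and this is not the authors' exact formula.

```python
import math
from statistics import mean, variance


def t_score(in_category, corpus):
    """Welch-style t statistic between a term's per-document frequencies
    in one category and in the whole corpus.

    Assumption: the paper's own statistic may be defined differently.
    Both lists need at least two values for the sample variance.
    """
    n1, n2 = len(in_category), len(corpus)
    m1, m2 = mean(in_category), mean(corpus)
    v1, v2 = variance(in_category), variance(corpus)
    se = math.sqrt(v1 / n1 + v2 / n2)
    # A term with constant frequency everywhere carries no signal.
    return 0.0 if se == 0 else abs(m1 - m2) / se


def select_features(tf, labels, category, k):
    """Return the k terms with the largest t score for `category`.

    tf: dict mapping term -> list of term frequencies, one per document.
    labels: list of category labels, aligned with the documents.
    """
    idx = [i for i, y in enumerate(labels) if y == category]
    scores = {
        term: t_score([freqs[i] for i in idx], freqs)
        for term, freqs in tf.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Made-up toy corpus: six documents, two categories.
labels = ["sport", "sport", "sport", "politics", "politics", "politics"]
tf = {
    "goal": [5, 4, 6, 0, 0, 1],
    "the": [3, 3, 3, 3, 3, 3],
    "vote": [0, 0, 1, 4, 5, 6],
}

print(select_features(tf, labels, "sport", 2))
```

Because the sketch takes the absolute difference of means, a term that is conspicuously rare in a category scores as highly as one that is conspicuously frequent. Whether to keep both directions or only positive deviations is a design choice the abstract leaves open.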
