LG · IT · Feb 19, 2019

An entropic feature selection method in perspective of Turing formula

arXiv:1902.07115v1 · 1 citation
Originality: Incremental advance
AI Analysis

This work addresses feature selection for healthcare data analytics. It is incremental: it builds on existing entropy-based methods and improves their behavior at small sample sizes.

The paper tackles feature selection in healthcare datasets, which are complex in type and small in sample size, by developing a method based on Coverage Adjusted Standardized Mutual Information (CASMI). In a simulation study, the proposed method outperforms six existing methods as measured by the Information Recovery Ratio, especially when the sample size is small.

Health data are generally complex in type and small in sample size. Such domain-specific challenges make it difficult to capture information reliably and further aggravate the problem of generalization. To assist the analytics of healthcare datasets, we develop a feature selection method based on the concept of Coverage Adjusted Standardized Mutual Information (CASMI). The main advantages of the proposed method are: 1) it selects features more efficiently with the help of an improved entropy estimator, particularly when the sample size is small, and 2) it automatically learns the number of features to be selected based on the information from sample data. Additionally, the proposed method handles feature redundancy from the perspective of the joint distribution. The proposed method focuses on non-ordinal data, but it also works with numerical data given an appropriate binning method. A simulation study comparing the proposed method to six widely cited feature selection methods shows that the proposed method performs better when measured by the Information Recovery Ratio, particularly when the sample size is small.
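
To make the idea of a coverage adjustment concrete, here is a minimal Python sketch, not the paper's implementation. It makes three assumptions that the abstract does not confirm: the score is the mutual information divided by the outcome entropy, multiplied by a sample coverage estimate; the coverage is computed on the feature alone; and the coverage estimate is Turing's formula, 1 minus the fraction of observations whose value appears exactly once. It also substitutes the simple plug-in entropy estimator for the improved estimator the abstract mentions. The function names are hypothetical.

```python
from collections import Counter
from math import log


def plugin_entropy(values):
    """Plug-in (empirical) Shannon entropy in nats.

    Stand-in for the improved entropy estimator mentioned in the abstract.
    """
    n = len(values)
    return -sum((c / n) * log(c / n) for c in Counter(values).values())


def turing_coverage(values):
    """Turing's formula: 1 - (number of values seen exactly once) / n."""
    counts = Counter(values)
    singletons = sum(1 for c in counts.values() if c == 1)
    return 1.0 - singletons / len(values)


def coverage_adjusted_smi(feature, outcome):
    """Assumed form: (MI / H(outcome)) * coverage(feature)."""
    h_x = plugin_entropy(feature)
    h_y = plugin_entropy(outcome)
    h_xy = plugin_entropy(list(zip(feature, outcome)))
    if h_y == 0.0:
        return 0.0  # a constant outcome carries no information to recover
    mutual_information = h_x + h_y - h_xy
    return (mutual_information / h_y) * turing_coverage(feature)


# Toy data with eight samples (illustrative only).
outcome = [0, 1, 0, 1, 0, 1, 0, 1]
informative = ["a", "b", "a", "b", "a", "b", "a", "b"]  # mirrors the outcome
noise = ["a", "a", "b", "b", "a", "a", "b", "b"]        # independent of the outcome
id_like = list(range(8))                                # every value is unique

for name, feature in [("informative", informative),
                      ("noise", noise),
                      ("id_like", id_like)]:
    print(name, round(coverage_adjusted_smi(feature, outcome), 6))
```

The ID-like feature shows why such an adjustment matters with small samples: its plug-in mutual information with the outcome is maximal because every value is unique, yet its estimated coverage is zero, so the adjusted score discards it.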

