ML LG Apr 30, 2018

On the Effect of Suboptimal Estimation of Mutual Information in Feature Selection and Classification

arXiv:1804.11021v3

Originality: Incremental advance

AI Analysis

This work addresses the challenge of accurately ranking dependencies between continuous and discrete variables for researchers in statistics and machine learning, though it is incremental as it builds on existing estimators.

The paper addresses suboptimal estimation of mutual information in feature selection and classification. It introduces the estimator response curve, a new property for assessing estimator performance, and reports that the CIM estimator compares favorably to others such as kNN and vME. Estimators that the response curve flags as suboptimal also perform worse on real-world datasets of varying dimensionality and size.

This paper introduces a new property of estimators of the strength of statistical association, which helps characterize how well an estimator will perform in scenarios where dependencies between continuous and discrete random variables need to be rank ordered. The new property, termed the estimator response curve, is easily computable and provides a marginal distribution agnostic way to assess an estimator's performance. It overcomes notable drawbacks of current metrics of assessment, including statistical power, bias, and consistency. We utilize the estimator response curve to test various measures of the strength of association that satisfy the data processing inequality (DPI), and show that the CIM estimator's performance compares favorably to kNN, vME, AP, and H_{MI} estimators of mutual information. The estimators which were identified to be suboptimal, according to the estimator response curve, perform worse than the more optimal estimators when tested with real-world data from four different areas of science, all with varying dimensionalities and sizes.
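
To make the rank-ordering setting in the abstract concrete, here is a minimal sketch of estimating mutual information between a continuous variable and a discrete one at several dependence strengths. It is an illustration under stated assumptions, not the paper's estimator response curve and not any of the estimators it compares (CIM, kNN, vME, AP, H_{MI}): the equal-width binned plug-in estimator, the Gaussian toy data, and the names `binned_mutual_information` and `simulate` are all hypothetical choices made for this example.

```python
import math
import random
from collections import Counter


def binned_mutual_information(xs, ys, n_bins=10):
    """Plug-in estimate (in nats) of I(X;Y) for continuous xs and discrete ys.

    Assumption for this sketch: xs is discretized into equal-width bins.
    """
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all xs are equal
    # Map each continuous value to a bin index in [0, n_bins - 1].
    bins = [min(int((x - lo) / width), n_bins - 1) for x in xs]
    n = len(xs)
    joint = Counter(zip(bins, ys))
    px = Counter(bins)
    py = Counter(ys)
    mi = 0.0
    for (b, y), c in joint.items():
        # Empirical p(b, y) * log(p(b, y) / (p(b) * p(y)))
        mi += (c / n) * math.log(c * n / (px[b] * py[y]))
    return mi


def simulate(shift, n=2000, seed=0):
    """Toy data: Y is a fair binary label, X ~ Normal(shift * Y, 1)."""
    rng = random.Random(seed)
    ys = [rng.randint(0, 1) for _ in range(n)]
    xs = [rng.gauss(shift * y, 1.0) for y in ys]
    return xs, ys


# Larger shifts mean a stronger dependence between X and Y.
shifts = [0.0, 1.0, 3.0]
estimates = [binned_mutual_information(*simulate(s)) for s in shifts]
for s, mi in zip(shifts, estimates):
    print(f"shift={s:.1f}  estimated MI={mi:.3f} nats")
```

An estimator is useful for ranking dependencies when its output increases with the true strength of the association, as it should across these three shifts. Because Y is binary, every estimate here is bounded above by ln 2 nats.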
