LG · AI · Feb 7, 2024

An information theoretic approach to quantify the stability of feature selection and ranking algorithms

arXiv:2402.05295v1 · 15 citations · h-index: 38 · Knowledge-Based Systems
Originality: Incremental advance
AI Analysis

This work addresses the need for reliable stability assessment in feature selection, which matters in applications such as knowledge discovery and food quality assessment. The contribution is incremental: it builds on existing stability measures.

The paper tackles instability in feature selection algorithms, where small variations in the data lead to inconsistent feature rankings. It proposes an information-theoretic metric based on the Jensen-Shannon divergence that quantifies robustness across different types of algorithm outcome, and demonstrates its effectiveness in controlled and real-world experiments.

Feature selection is a key step when dealing with high-dimensional data. In particular, these techniques simplify the process of knowledge discovery from the data by selecting the most relevant features out of the noisy, redundant and irrelevant ones. A problem that arises in many practical applications is that the outcome of the feature selection algorithm is not stable: small variations in the data may yield very different feature rankings. Assessing the stability of these methods therefore becomes an important issue. We propose an information-theoretic approach based on the Jensen-Shannon divergence to quantify this robustness. Unlike other stability measures, this metric is suitable for different algorithm outcomes: full ranked lists, feature subsets, and the lesser-studied partial ranked lists. This generalized metric quantifies the difference among a whole set of lists of the same size, following a probabilistic approach and being able to give more importance to the disagreements that appear at the top of the list. Moreover, it possesses desirable properties including correction for chance, upper/lower bounds and conditions for a deterministic selection. We illustrate the use of this stability metric with data generated in a fully controlled way and compare it with popular metrics, including Spearman's rank correlation and Kuncheva's index, on feature ranking and selection outcomes, respectively. Additionally, experimental validation of the proposed approach is carried out on a real-world problem of food quality assessment, showing its potential to quantify stability from different perspectives.
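To make the idea concrete, here is a minimal Python sketch of a Jensen-Shannon-based stability score next to the two baselines named in the abstract. It is not the paper's exact definition. The 1/position weighting in `rank_to_distribution`, the normalization by log2(m) in `js_stability`, and all function names are assumptions made for illustration, and the paper's correction for chance is not reproduced. The formulas for Kuncheva's index and Spearman's rank correlation are the standard textbook ones. Features are assumed to be integer indices 0..n-1, with each list ordered best feature first.

```python
import numpy as np


def rank_to_distribution(ranking, n_features, weighted=True):
    """Map one list of feature indices (best first) to a probability vector.

    ASSUMPTION: weights decay as 1/position so that top-ranked features carry
    more mass; weighted=False treats the list as an unordered subset.
    Features missing from a partial list or subset get zero mass.
    """
    k = len(ranking)
    weights = 1.0 / np.arange(1, k + 1) if weighted else np.ones(k)
    p = np.zeros(n_features)
    p[list(ranking)] = weights
    return p / p.sum()


def entropy(p):
    """Shannon entropy in bits, using the convention 0 * log(0) = 0."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())


def js_divergence(dists):
    """Generalized Jensen-Shannon divergence with uniform weights:
    entropy of the mixture minus the mean entropy of the members."""
    dists = np.asarray(dists, dtype=float)
    mixture = dists.mean(axis=0)
    return entropy(mixture) - float(np.mean([entropy(p) for p in dists]))


def js_stability(rankings, n_features, weighted=True):
    """ASSUMED normalization: 1 - JSD / log2(m), where log2(m) is the generic
    upper bound of the divergence among m distributions."""
    dists = [rank_to_distribution(r, n_features, weighted) for r in rankings]
    return 1.0 - js_divergence(dists) / np.log2(len(dists))


def kuncheva_index(a, b, n_features):
    """Kuncheva's consistency index for two subsets of equal size k,
    with 0 < k < n_features."""
    a, b = set(a), set(b)
    k, r = len(a), len(a & b)
    return (r * n_features - k * k) / (k * (n_features - k))


def spearman_rho(rank_a, rank_b):
    """Spearman's rank correlation for two full rankings without ties,
    each a permutation of 0..n-1 ordered best feature first."""
    n = len(rank_a)
    pos_a = np.argsort(rank_a)  # pos_a[f] = position of feature f in rank_a
    pos_b = np.argsort(rank_b)
    d2 = float(((pos_a - pos_b) ** 2).sum())
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))
```

Under these assumptions, a set of identical lists scores 1.0, and two disjoint subsets of equal size score 0.0. With the 1/position weighting, swapping the top two features lowers the score more than swapping the bottom two, which mirrors the emphasis on disagreements at the top of the list. For full rankings this simple normalization never reaches 0, because all lists share the same support; bounds specific to each outcome type, as the abstract describes, would be needed to rescale it.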

