LG, ML | Mar 23, 2025

Interpretable Feature Interaction via Statistical Self-supervised Learning on Tabular Data

arXiv:2503.18048v1 | h-index: 1 | Machine Learning: Science and Technology

Originality: Incremental advance

AI Analysis

This work addresses the need for transparent and statistically rigorous feature extraction in high-stakes domains such as healthcare and finance. It appears incremental, however, because it builds on existing techniques such as kernel PCA and knockoff selection.

The paper tackles the challenge of achieving both interpretability and statistical guarantees in feature extraction from complex tabular data. It introduces Spofe, a self-supervised pipeline that combines kernel principal components with sparse polynomial representations and knockoff selection. The authors report improved feature selection performance over methods such as KPCA and SKPCA in regression and classification tasks.

In high-dimensional and high-stakes contexts, ensuring both rigorous statistical guarantees and interpretability in feature extraction from complex tabular data remains a formidable challenge. Traditional methods such as Principal Component Analysis (PCA) reduce dimensionality and identify key features that explain the most variance, but are constrained by their reliance on linear assumptions. In contrast, neural networks offer assumption-free feature extraction through self-supervised learning techniques such as autoencoders, though their interpretability remains a challenge in fields requiring transparency. To address this gap, this paper introduces Spofe, a novel self-supervised machine learning pipeline that marries the power of kernel principal components for capturing nonlinear dependencies with a sparse and principled polynomial representation to achieve clear interpretability with statistical rigor. Underpinning our approach is a robust theoretical framework that delivers precise error bounds and rigorous false discovery rate (FDR) control via a multi-objective knockoff selection procedure; it effectively bridges the gap between data-driven complexity and statistical reliability via three stages: (1) generating self-supervised signals using kernel principal components to model complex patterns, (2) distilling these signals into sparse polynomial functions for improved interpretability, and (3) applying a multi-objective knockoff selection procedure with significance testing to rigorously identify important features. Extensive experiments on diverse real-world datasets demonstrate the effectiveness of Spofe, consistently surpassing KPCA, SKPCA, and other methods in feature selection for regression and classification tasks. Visualization and case studies highlight its ability to uncover key insights, enhancing interpretability and practical utility.
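The third stage relies on knockoff selection to control the false discovery rate. As a rough illustration only, the sketch below implements the standard single-objective knockoff+ threshold: given feature statistics W_j (large positive values suggest a real feature beats its knockoff copy), it picks the smallest threshold t at which the estimated false discovery proportion, (1 + #{j: W_j <= -t}) / max(1, #{j: W_j >= t}), drops to the target level q or below. This is not the paper's multi-objective procedure, which the abstract does not specify. The function names and the W values are hypothetical and chosen purely for demonstration.

```python
def knockoff_plus_threshold(w, q):
    """Return the knockoff+ threshold for statistics w at target FDR level q.

    Returns float('inf') when no candidate threshold meets the bound,
    in which case nothing is selected.
    """
    # Candidate thresholds are the distinct nonzero magnitudes of w, ascending.
    candidates = sorted({abs(x) for x in w if x != 0})
    for t in candidates:
        negatives = sum(1 for x in w if x <= -t)  # proxy count of false discoveries
        positives = sum(1 for x in w if x >= t)   # features selected at threshold t
        if (1 + negatives) / max(1, positives) <= q:
            return t
    return float("inf")


def knockoff_select(w, q):
    """Return indices of features whose statistic reaches the threshold."""
    t = knockoff_plus_threshold(w, q)
    return [j for j, x in enumerate(w) if x >= t]


# Hypothetical statistics: six strong positive signals, four near-zero values.
w = [5.0, 4.2, 3.9, 3.1, 2.8, 2.5, -0.4, 0.3, -0.2, 0.1]
selected = knockoff_select(w, q=0.2)
print(selected)
```

With these made-up statistics, the small thresholds fail because the negative values inflate the estimated false discovery proportion; the first threshold that passes is 2.5, so the six strong features are kept. The "+1" in the numerator is what makes the procedure conservative: with very few features, it may select nothing at all.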
