LG, Feb 28, 2023

Towards Personalized Preprocessing Pipeline Search

Diego Martinez, Daochen Zha, Qiaoyu Tan, Xia Hu

arXiv:2302.14329v13.82 citations, h-index: 33

Originality: Incremental advance

AI Analysis

This work addresses a bottleneck in AutoML by letting the search tailor preprocessing to individual features, though it is an incremental improvement over existing methods.

The paper addresses sub-optimal AutoML performance caused by a limited preprocessing pipeline search space. It proposes searching for a personalized preprocessing pipeline for each feature, which the authors report improves performance on benchmark classification datasets.

Feature preprocessing, which transforms raw input features into numerical representations, is a crucial step in automated machine learning (AutoML) systems. However, the existing systems often have a very small search space for feature preprocessing with the same preprocessing pipeline applied to all the numerical features. This may result in sub-optimal performance since different datasets often have various feature characteristics, and features within a dataset may also have their own preprocessing preferences. To bridge this gap, we explore personalized preprocessing pipeline search, where the search algorithm is allowed to adopt a different preprocessing pipeline for each feature. This is a challenging task because the search space grows exponentially with more features. To tackle this challenge, we propose ClusterP3S, a novel framework for Personalized Preprocessing Pipeline Search via Clustering. The key idea is to learn feature clusters such that the search space can be significantly reduced by using the same preprocessing pipeline for the features within a cluster. To this end, we propose a hierarchical search strategy to jointly learn the clusters and search for the optimal pipelines, where the upper-level search optimizes the feature clustering to enable better pipelines built upon the clusters, and the lower-level search optimizes the pipeline given a specific cluster assignment. We instantiate this idea with a deep clustering network that is trained with reinforcement learning at the upper level, and random search at the lower level. Experiments on benchmark classification datasets demonstrate the effectiveness of enabling feature-wise preprocessing pipeline search.
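The abstract's two central points can be illustrated with a minimal sketch: the search space grows exponentially when every feature gets its own pipeline, and sharing one pipeline per cluster shrinks it. This is not the authors' implementation. The four primitives, the hand-written cluster assignment, and `toy_score` are all hypothetical stand-ins: in the paper the clusters are learned by a deep clustering network trained with reinforcement learning, and the score would be the validation performance of a downstream model. Each "pipeline" here is a single primitive for brevity. Only the lower-level random search over a fixed cluster assignment is shown.

```python
import math
import random

# Hypothetical preprocessing primitives (not the paper's actual search space).
def identity(col):
    return list(col)

def standardize(col):
    mean = sum(col) / len(col)
    var = sum((x - mean) ** 2 for x in col) / len(col)
    std = math.sqrt(var) or 1.0  # avoid dividing by zero on constant columns
    return [(x - mean) / std for x in col]

def min_max(col):
    lo, hi = min(col), max(col)
    span = (hi - lo) or 1.0
    return [(x - lo) / span for x in col]

def log1p_abs(col):
    return [math.log1p(abs(x)) for x in col]

PRIMITIVES = [identity, standardize, min_max, log1p_abs]

def search_space_size(num_groups, num_pipelines=len(PRIMITIVES)):
    """Number of joint choices when each group picks one pipeline."""
    return num_pipelines ** num_groups

def apply_pipelines(columns, assignment, pipelines):
    """Transform each column with the pipeline chosen for its cluster."""
    return [pipelines[assignment[j]](col) for j, col in enumerate(columns)]

def random_search(columns, assignment, num_clusters, score_fn, trials=50, seed=0):
    """Lower-level search: sample one pipeline per cluster, keep the best."""
    rng = random.Random(seed)
    best_score, best = float("-inf"), None
    for _ in range(trials):
        candidate = [rng.choice(PRIMITIVES) for _ in range(num_clusters)]
        score = score_fn(apply_pipelines(columns, assignment, candidate))
        if score > best_score:
            best_score, best = score, candidate
    return best, best_score

def toy_score(transformed):
    """Placeholder objective (assumption): prefer small feature magnitudes.
    A real system would train and validate a model here instead."""
    return -max(abs(x) for col in transformed for x in col)

if __name__ == "__main__":
    # 20 features searched individually vs. grouped into 3 clusters.
    print(search_space_size(20))  # 4**20 = 1099511627776
    print(search_space_size(3))   # 4**3 = 64

    columns = [[1.0, 2.0, 3.0], [100.0, 200.0, 400.0],
               [0.1, 0.2, 0.3], [5.0, 5.0, 5.0]]
    assignment = [0, 1, 0, 2]  # hand-picked cluster index per feature
    best, score = random_search(columns, assignment, 3, toy_score)
    print([p.__name__ for p in best], score)
```

With a fixed assignment into k clusters and P candidate pipelines, the lower level explores P^k combinations rather than P^n for n features. The upper level, omitted here, would adjust the assignment itself based on the scores the lower level returns.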
