A Survey on Data-Centric AI: Tabular Learning from Reinforcement Learning and Generative AI Perspective
This survey is relevant to researchers and practitioners who work with tabular data in domains such as bioinformatics, healthcare, and marketing. It is an incremental work that consolidates and builds upon existing techniques rather than introducing new ones, and it offers insights intended to support continued innovation in data-centric AI.
This survey addresses the problem of improving tabular data quality, focusing on reinforcement learning and generative approaches to feature selection and feature generation. Its contribution is a comprehensive review of existing methods together with a discussion of future research directions. It reports no quantitative results of its own; its stated aim is to help enhance model performance across domains.
Tabular data is one of the most widely used data formats across various domains such as bioinformatics, healthcare, and marketing. As artificial intelligence moves towards a data-centric perspective, improving data quality is essential for enhancing model performance in tabular data-driven applications. This survey focuses on data-driven tabular data optimization, specifically exploring reinforcement learning (RL) and generative approaches for feature selection and feature generation as fundamental techniques for refining data spaces. Feature selection aims to identify and retain the most informative attributes, while feature generation constructs new features to better capture complex data patterns. We systematically review existing generative methods for tabular data engineering, analyzing their latest advancements, real-world applications, and respective strengths and limitations. This survey emphasizes how RL-based and generative techniques contribute to the automation and intelligence of feature engineering. Finally, we summarize the existing challenges and discuss future research directions, aiming to provide insights that drive continued innovation in this field.
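To make the distinction between feature generation and feature selection concrete, the following is a minimal sketch on synthetic data. It is not one of the RL-based or generative methods the survey reviews: generation is approximated by pairwise column products and selection by a simple correlation filter, both chosen purely for illustration. The function names `generate_features` and `select_features`, the column names, and the synthetic target are all assumptions introduced here.

```python
import numpy as np


def generate_features(X, names):
    """Feature generation (illustrative): append pairwise products of columns."""
    cols = [X[:, j] for j in range(X.shape[1])]
    new_names = list(names)
    for i in range(X.shape[1]):
        for j in range(i + 1, X.shape[1]):
            cols.append(X[:, i] * X[:, j])
            new_names.append(f"{names[i]}*{names[j]}")
    return np.column_stack(cols), new_names


def select_features(X, y, names, k):
    """Feature selection (illustrative): keep the k columns with the
    largest absolute Pearson correlation with the target y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    # Guard against constant columns (zero variance) to avoid division by zero.
    scores = np.abs(Xc.T @ yc) / np.where(denom == 0, 1.0, denom)
    top = np.argsort(scores)[::-1][:k]
    return X[:, top], [names[j] for j in top], scores


# Synthetic table: the target depends on an interaction of columns a and b,
# which no single original column captures on its own.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=500)

X_gen, gen_names = generate_features(X, ["a", "b", "c"])
X_sel, sel_names, scores = select_features(X_gen, y, gen_names, k=2)
print(gen_names)
print(sel_names)
```

In this toy setting the generated column `a*b` receives the highest score, which illustrates why the two steps are usually paired: generation enlarges the feature space so that useful patterns become expressible, and selection prunes it back to the informative subset. The methods covered by the survey replace these fixed heuristics with learned policies or generative models.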