LG | AI | Sep 20, 2025

Multi-level Diagnosis and Evaluation for Robust Tabular Feature Engineering with Large Language Models

arXiv:2509.25207v1 | 1 citation | EMNLP

Originality: Incremental advance

AI Analysis

This work addresses reliability concerns for practitioners who use LLMs for feature engineering, though the advance is incremental because it builds on existing LLM applications.

The paper tackles unreliable feature engineering by large language models (LLMs) on tabular data. It introduces a multi-level diagnosis and evaluation framework to assess robustness across domains, and shows that high-quality LLM-generated features can improve few-shot prediction performance by up to 10.52%.

Recent advancements in large language models (LLMs) have shown promise in feature engineering for tabular data, but concerns about their reliability persist, especially due to variability in generated outputs. We introduce a multi-level diagnosis and evaluation framework to assess the robustness of LLMs in feature engineering across diverse domains, focusing on three main factors: key variables, relationships, and decision boundary values for predicting target classes. We demonstrate that the robustness of LLMs varies significantly across datasets, and that high-quality LLM-generated features can improve few-shot prediction performance by up to 10.52%. This work opens a new direction for assessing and enhancing the reliability of LLM-driven feature engineering in various domains.
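To make "variability in generated outputs" concrete, the sketch below scores how consistently repeated LLM runs pick the same key variables, using mean pairwise Jaccard similarity. This is an illustrative assumption, not the paper's metric: the abstract does not specify how robustness is measured, and the function names and the example variable sets are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two collections of variable names."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty selections are treated as identical
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(runs):
    """Average Jaccard similarity over all pairs of runs (1.0 = fully consistent)."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        raise ValueError("need at least two runs to compare")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical key variables chosen by an LLM in three repeated runs
# on the same dataset (made-up names, for illustration only).
runs = [
    {"age", "income", "balance"},
    {"age", "income", "tenure"},
    {"age", "income", "balance"},
]

score = mean_pairwise_jaccard(runs)
print(f"key-variable consistency: {score:.3f}")
```

A score near 1.0 would indicate that the model selects nearly the same variables on every run, while a low score would flag the kind of output variability the abstract raises. Analogous scores for relationships or decision boundary values would need their own comparison functions.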

