Automatic Demonstration Selection for LLM-based Tabular Data Classification
This work addresses a specific bottleneck in applying LLMs to tabular data classification: choosing how many demonstrations to include in the prompt. It offers an incremental improvement for machine learning researchers and practitioners.
The paper tackles the problem of choosing the number of demonstrations for in-context learning in tabular data classification. It introduces an algorithm that estimates this number automatically from the data distribution, the prompt template, and the specific LLM, and it reports improved performance over random selection in experiments.
A fundamental question in applying In-Context Learning (ICL) for tabular data classification is how to determine the ideal number of demonstrations in the prompt. This work addresses this challenge by presenting an algorithm to automatically select a reasonable number of required demonstrations. Our method distinguishes itself by integrating not only the tabular data's distribution but also the user's selected prompt template and the specific Large Language Model (LLM) into its estimation. Rooted in Spectral Graph Theory, our proposed algorithm defines a novel metric to quantify the similarities between different demonstrations. We then construct a similarity graph and analyze the eigenvalues of its Laplacian to derive the minimum number of demonstrations capable of representing the data within the LLM's intrinsic representation space. We validate the efficacy of our approach through experiments comparing its performance against conventional random selection algorithms on diverse datasets and LLMs.
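The spectral step described above can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract does not specify the similarity metric or the rule applied to the eigenvalues, so this example assumes cosine similarity over hypothetical per-demonstration embeddings, a hypothetical edge `threshold`, and the standard eigengap heuristic from spectral clustering. The function name `estimate_num_demonstrations` is also illustrative.

```python
import numpy as np


def estimate_num_demonstrations(embeddings, threshold=0.5):
    """Estimate a demonstration count from the Laplacian spectrum.

    Assumptions (not from the paper): cosine similarity as the metric,
    a fixed edge threshold, and the eigengap heuristic as the rule.
    """
    X = np.asarray(embeddings, dtype=float)

    # Normalize rows so that the dot product equals cosine similarity.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X = X / np.clip(norms, 1e-12, None)
    S = X @ X.T

    # Weighted adjacency: keep only sufficiently similar pairs, no self-loops.
    W = np.where(S >= threshold, S, 0.0)
    np.fill_diagonal(W, 0.0)

    # Unnormalized graph Laplacian L = D - W.
    L = np.diag(W.sum(axis=1)) - W

    # L is symmetric, so eigvalsh returns real eigenvalues in ascending order.
    eigvals = np.linalg.eigvalsh(L)

    # Eigengap heuristic: the largest jump between consecutive eigenvalues
    # marks the number of well-separated groups in the graph.
    gaps = np.diff(eigvals)
    k = int(np.argmax(gaps)) + 1
    return k, eigvals


if __name__ == "__main__":
    # Toy data: three tight groups of five points around orthogonal directions.
    rng = np.random.default_rng(0)
    toy = np.vstack([c + 0.05 * rng.standard_normal((5, 3)) for c in np.eye(3)])
    k, _ = estimate_num_demonstrations(toy)
    print("estimated number of demonstrations:", k)
```

The multiplicity of the zero eigenvalue of a graph Laplacian equals the number of connected components, which is why the spectrum carries information about how many representative groups the data contains. In the paper's setting the embeddings would presumably come from the chosen LLM applied to demonstrations rendered with the chosen prompt template; here they are synthetic vectors.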