LG · May 10

LLM-Driven Performance-Space Augmentation for Meta-Learning-Based Algorithm Selection

arXiv:2605.09518 · 17.2


AI Analysis

For researchers in meta-learning and automated machine learning, this work provides a method to improve algorithm selection by augmenting sparse meta-datasets with LLM-generated synthetic data.

The paper addresses the sparsity of meta-datasets in meta-learning for algorithm selection by augmenting them with synthetic regression datasets generated by an LLM. Relative to the unaugmented baseline, uniform augmentation across the performance space achieved a 17.47% relative reduction in Hamming loss, a 100.41% relative improvement in subset accuracy, and a 6.09% relative gain in pooled out-of-fold $R^2$.

Meta-learning for algorithm selection relies on a meta-dataset in which each row corresponds to a supervised learning dataset described by meta-features and labelled with a target value that is associated with algorithm choice (typically, some function of algorithm performance). A persistent limitation is that the number of curated real-world datasets is small, resulting in sparse meta-datasets that constrain meta-learner generalisation. In this paper, we address this problem by augmenting the meta-dataset with synthetic regression datasets produced via a large language model (LLM), with generation steered toward target regions of a low-dimensional performance space. In our experiments, we adopt a two-dimensional geometric setting defined by the cross-validated $R^2$ scores of two anchor algorithms, known as landmarkers. We compare two augmentation strategies: (1) uniform sampling, which distributes synthetic datasets across the performance space; and (2) margin-based sampling, which concentrates them near the decision boundary where landmarker preference is most ambiguous. Across 42 real-world UCI regression datasets and 730 synthetic datasets, both strategies substantially improve meta-learner performance over the unaugmented baseline under regression and multi-label evaluation formulations. However, uniform augmentation consistently outperforms margin-based augmentation, achieving a 17.47% relative reduction in Hamming loss, a 100.41% relative improvement in subset accuracy, and a 6.09% relative gain in pooled out-of-fold $R^2$. These results lead us to postulate a central thesis: the performance of algorithms resides on a low-dimensional performance manifold, whose reconstruction bias may be minimised by user-guided LLMs that seek to maximise uniform $\varepsilon$-cover, which in turn leads to improved meta-learning for algorithm selection.
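
The two augmentation strategies can be pictured as two ways of choosing target points in the two-dimensional performance space before asking the LLM for a dataset that lands near each point. The sketch below is illustrative only and is not the paper's implementation. It assumes that both landmarker $R^2$ scores lie in $[0, 1]$, that the decision boundary is the diagonal where the two scores are equal, and that a fixed band of half-width `width` around that diagonal counts as the margin. The function names `uniform_targets`, `margin_targets`, and `preferred_landmarker` are hypothetical, and the LLM generation step itself is not shown.

```python
import random


def uniform_targets(n, lo=0.0, hi=1.0, seed=0):
    """Draw n target points spread uniformly over the square [lo, hi]^2.

    Each point is (r2_a, r2_b): the desired cross-validated R^2 of
    landmarker A and landmarker B for one synthetic dataset.
    """
    rng = random.Random(seed)
    return [(rng.uniform(lo, hi), rng.uniform(lo, hi)) for _ in range(n)]


def margin_targets(n, width=0.05, lo=0.0, hi=1.0, seed=0):
    """Draw n target points within `width` of the diagonal r2_a == r2_b.

    Assumption: the diagonal is the decision boundary, since that is
    where neither landmarker is clearly preferred.
    """
    rng = random.Random(seed)
    points = []
    while len(points) < n:
        r2_a = rng.uniform(lo, hi)
        r2_b = r2_a + rng.uniform(-width, width)
        # Reject points that fall outside the assumed score range.
        if lo <= r2_b <= hi:
            points.append((r2_a, r2_b))
    return points


def preferred_landmarker(point):
    """Label a point by whichever landmarker has the higher R^2 (ties go to A)."""
    r2_a, r2_b = point
    return "A" if r2_a >= r2_b else "B"


if __name__ == "__main__":
    for name, targets in [("uniform", uniform_targets(5)),
                          ("margin", margin_targets(5))]:
        labels = [preferred_landmarker(p) for p in targets]
        print(name, [(round(a, 2), round(b, 2)) for a, b in targets], labels)
```

Under these assumptions, uniform targets cover the whole square, while margin targets cluster where the preference label is most likely to flip. The abstract reports that the first kind of coverage helped the meta-learner more.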
