LG · Mar 18

Embedding World Knowledge into Tabular Models: Towards Best Practices for Embedding Pipeline Design

arXiv:2603.17737 · 46.8 · h-index: 3
AI Analysis

This work provides incremental guidance for researchers and practitioners on designing effective embedding pipelines to incorporate world knowledge into tabular models.

The authors systematically benchmarked 256 LLM-based embedding pipeline configurations for tabular prediction and found that performance depends heavily on design choices. Concatenating embeddings with the original columns and using larger embedding models generally improved results, and gradient boosting decision trees were strong downstream models.

Embeddings are a powerful way to enrich data-driven machine learning models with the world knowledge of large language models (LLMs). Yet there is limited evidence on how to design effective LLM-based embedding pipelines for tabular prediction. In this work, we systematically benchmark 256 pipeline configurations, covering 8 preprocessing strategies, 16 embedding models, and 2 downstream models. Our results show that whether incorporating the prior knowledge of LLMs improves predictive performance depends strongly on the specific pipeline design. In general, concatenating embeddings with the original columns tends to outperform replacing those columns with embeddings. Larger embedding models tend to yield better results, while public leaderboard rankings and model popularity are poor indicators of performance. Finally, gradient boosting decision trees tend to be strong downstream models. Our findings provide researchers and practitioners with guidance for building more effective embedding pipelines for tabular prediction tasks.
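To make the design space concrete, the following is a minimal sketch of two ideas from the abstract: the 8 x 16 x 2 configuration grid, and the difference between concatenating embeddings and replacing the original columns. It is not the authors' code. The option names (`prep_0`, `embedder_0`, `other_model`), the `toy_embed` function, and the `build_features` helper are hypothetical stand-ins. In particular, `toy_embed` is a hash-based placeholder for a real LLM embedding model and carries no world knowledge.

```python
import hashlib
from itertools import product

# Hypothetical placeholder names: the abstract gives only the counts
# (8 preprocessing strategies, 16 embedding models, 2 downstream models).
PREPROCESSING = [f"prep_{i}" for i in range(8)]
EMBEDDERS = [f"embedder_{i}" for i in range(16)]
DOWNSTREAM = ["gbdt", "other_model"]

# Full grid of pipeline configurations: 8 * 16 * 2 = 256.
configs = list(product(PREPROCESSING, EMBEDDERS, DOWNSTREAM))


def toy_embed(text, dim=4):
    """Stand-in for an LLM embedding model (assumption, not a real model).

    Returns a deterministic vector derived from a SHA-256 hash of the text.
    """
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dim]]


def build_features(row, embedded_columns, mode, dim=4):
    """Build a flat feature list for one table row.

    mode="concat":  keep every original column and append the embeddings.
    mode="replace": drop the embedded columns and keep only their embeddings.
    """
    if mode not in ("concat", "replace"):
        raise ValueError(f"unknown mode: {mode!r}")
    features = []
    for name, value in row.items():
        if mode == "concat" or name not in embedded_columns:
            features.append(value)
    for name in embedded_columns:
        features.extend(toy_embed(str(row[name]), dim))
    return features


row = {"city": "Lisbon", "age": 41}
concat_features = build_features(row, ["city"], "concat")
replace_features = build_features(row, ["city"], "replace")
```

In the `concat` variant the raw `city` value stays in the feature list alongside its embedding, so a downstream model would still need its usual categorical handling for that column. In the `replace` variant only the embedding represents it.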

