DB · AI · Sep 4, 2025

Schema Inference for Tabular Data Repositories Using Large Language Models

Zhenyu Wu, Jiaoyan Chen, Norman W. Paton

arXiv:2509.04632v11.21 citations · h-index: 2 · Has Code

Originality: Incremental advance

AI Analysis

The paper addresses a challenge that data scientists and analysts face when working with heterogeneous, minimally curated tabular data. It appears incremental, however, because it builds on prior dataset discovery work.

The paper tackles schema inference for tabular data with limited metadata. It presents SI-LLM, which uses large language models to infer hierarchical schemas from column headers and cell values. On web tables and open data, it reports promising end-to-end results and per-step results that are better than or comparable to state-of-the-art methods.

Minimally curated tabular data often contain representational inconsistencies across heterogeneous sources, and are accompanied by sparse metadata. Working with such data is intimidating. While prior work has advanced dataset discovery and exploration, schema inference remains difficult when metadata are limited. We present SI-LLM (Schema Inference using Large Language Models), which infers a concise conceptual schema for tabular data using only column headers and cell values. The inferred schema comprises hierarchical entity types, attributes, and inter-type relationships. In extensive evaluation on two datasets from web tables and open data, SI-LLM achieves promising end-to-end results, as well as better or comparable results to state-of-the-art methods at each step. All source code, full prompts, and datasets of SI-LLM are available at https://github.com/PierreWoL/SILLM.
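To make the shape of such a schema concrete, here is a minimal Python sketch of a container for hierarchical entity types, attributes, and inter-type relationships, plus a helper that gathers the two inputs the abstract mentions (column headers and cell values). It is not taken from the SI-LLM repository: every class, function, and field name (`EntityType`, `Relationship`, `ConceptualSchema`, `sample_columns`) and the idea of attribute inheritance along the hierarchy are assumptions made for illustration, and no LLM call is shown.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EntityType:
    """One node in the type hierarchy (hypothetical structure)."""
    name: str
    parent: Optional[str] = None          # name of the parent type, if any
    attributes: list = field(default_factory=list)


@dataclass
class Relationship:
    """A labelled link between two entity types (hypothetical structure)."""
    source: str
    target: str
    label: str


@dataclass
class ConceptualSchema:
    types: dict = field(default_factory=dict)
    relationships: list = field(default_factory=list)

    def add_type(self, name, parent=None, attributes=()):
        # A parent must be registered before any of its children.
        if parent is not None and parent not in self.types:
            raise KeyError(f"unknown parent type: {parent}")
        self.types[name] = EntityType(name, parent, list(attributes))

    def add_relationship(self, source, target, label):
        for name in (source, target):
            if name not in self.types:
                raise KeyError(f"unknown type: {name}")
        self.relationships.append(Relationship(source, target, label))

    def ancestors(self, name):
        """Parent chain of a type, nearest ancestor first."""
        chain = []
        parent = self.types[name].parent
        while parent is not None:
            chain.append(parent)
            parent = self.types[parent].parent
        return chain

    def all_attributes(self, name):
        """Own attributes followed by inherited ones, without duplicates.

        Inheritance along the hierarchy is an assumption of this sketch.
        """
        result = []
        for type_name in [name] + self.ancestors(name):
            for attr in self.types[type_name].attributes:
                if attr not in result:
                    result.append(attr)
        return result


def sample_columns(headers, rows, max_values=3):
    """Collect up to max_values distinct non-empty cell values per header."""
    samples = {header: [] for header in headers}
    for row in rows:
        for header, cell in zip(headers, row):
            bucket = samples[header]
            if cell != "" and cell not in bucket and len(bucket) < max_values:
                bucket.append(cell)
    return samples
```

As a usage illustration with made-up type names: after registering `Organization` with attribute `name` and then `University` with parent `Organization` and attribute `founded`, `all_attributes("University")` returns the type's own attribute followed by the inherited one. The output of `sample_columns` is the kind of compact per-column summary that could be placed in a prompt, though the actual prompts are in the authors' repository.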
