DB · AI · Sep 4, 2025

Schema Inference for Tabular Data Repositories Using Large Language Models

arXiv:2509.04632v1 · 1 citation · h-index: 2 · Has Code
Originality: Incremental advance
AI Analysis

The work addresses the difficulty data scientists and analysts face when working with heterogeneous, minimally curated tabular data. It appears incremental, since it builds on prior dataset discovery work.

The paper tackles schema inference for tabular data with limited metadata. It presents SI-LLM, which uses large language models to infer a hierarchical schema from column headers and cell values alone. On web tables and open data, it reports promising end-to-end results and per-step results that are better than or comparable to state-of-the-art methods.

Minimally curated tabular data often contain representational inconsistencies across heterogeneous sources, and are accompanied by sparse metadata. Working with such data is intimidating. While prior work has advanced dataset discovery and exploration, schema inference remains difficult when metadata are limited. We present SI-LLM (Schema Inference using Large Language Models), which infers a concise conceptual schema for tabular data using only column headers and cell values. The inferred schema comprises hierarchical entity types, attributes, and inter-type relationships. In extensive evaluation on two datasets from web tables and open data, SI-LLM achieves promising end-to-end results, as well as better or comparable results to state-of-the-art methods at each step. All source code, full prompts, and datasets of SI-LLM are available at https://github.com/PierreWoL/SILLM.
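To make the shape of the inferred schema concrete, the following is a minimal sketch of a data structure that holds hierarchical entity types, attributes, and inter-type relationships, plus a helper that flattens a table into the two inputs the abstract names (column headers and cell values). It is not the authors' implementation: every class, function, and field name here (`EntityType`, `Relationship`, `ConceptualSchema`, `serialize_table`) and the example data are hypothetical, and no LLM call is shown.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Sequence


@dataclass
class EntityType:
    """A node in the type hierarchy (hypothetical structure)."""
    name: str
    parent: Optional[str] = None            # name of the parent type, if any
    attributes: List[str] = field(default_factory=list)


@dataclass
class Relationship:
    """A labelled, directed link between two entity types."""
    source: str
    target: str
    label: str


@dataclass
class ConceptualSchema:
    types: Dict[str, EntityType] = field(default_factory=dict)
    relationships: List[Relationship] = field(default_factory=list)

    def add_type(self, name: str, parent: Optional[str] = None,
                 attributes: Sequence[str] = ()) -> None:
        self.types[name] = EntityType(name, parent, list(attributes))

    def ancestors(self, name: str) -> List[str]:
        """Walk parent links upward, nearest ancestor first."""
        chain: List[str] = []
        seen = {name}                        # guards against cyclic parents
        parent = self.types[name].parent
        while parent is not None and parent not in seen:
            chain.append(parent)
            seen.add(parent)
            parent = self.types[parent].parent if parent in self.types else None
        return chain


def serialize_table(headers: Sequence[str], rows: Sequence[Sequence[object]],
                    max_rows: int = 3) -> str:
    """Render each column as 'header: v1, v2, ...' using a few sample cells.

    Assumption: a column-wise text rendering is one plausible way to present
    headers and cell values to a language model; the paper's actual prompts
    are in its repository.
    """
    lines = []
    for i, header in enumerate(headers):
        values = [str(row[i]) for row in rows[:max_rows]]
        lines.append(f"{header}: {', '.join(values)}")
    return "\n".join(lines)


# Illustrative usage with made-up data.
schema = ConceptualSchema()
schema.add_type("Organisation", attributes=["name"])
schema.add_type("Company", parent="Organisation", attributes=["revenue"])
schema.add_type("City", attributes=["name", "population"])
schema.relationships.append(Relationship("Company", "City", "headquartered_in"))

table_text = serialize_table(["name", "city"],
                             [["Acme", "Leeds"], ["Bolt", "York"]])
```

Storing parents by name rather than by object reference keeps the structure easy to serialise to JSON, at the cost of a lookup on each step of `ancestors`.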

Code Implementations: 1 repo