DB, AI, LG · Jul 18, 2025

SCHEMORA: Schema Matching via Multi-Stage Recommendation and Metadata Enrichment Using Off-the-Shelf LLMs

arXiv:2507.14376v1 · Has Code
Originality: Highly original
AI Analysis

The paper addresses the resource-intensive problem of schema matching for data integration and dataset discovery, offering an LLM-based solution with an open-source implementation.

It introduces SCHEMORA, a framework that combines large language models with hybrid retrieval to match schemas across heterogeneous data sources without labeled training data. On the MIMIC-OMOP benchmark it reports state-of-the-art results, with gains of 7.49% in HitRate@5 and 3.75% in HitRate@3 over the previous best.

Schema matching is essential for integrating heterogeneous data sources and enhancing dataset discovery, yet it remains a complex and resource-intensive problem. We introduce SCHEMORA, a schema matching framework that combines large language models with hybrid retrieval techniques in a prompt-based approach, enabling efficient identification of candidate matches without relying on labeled training data or exhaustive pairwise comparisons. By enriching schema metadata and leveraging both vector-based and lexical retrieval, SCHEMORA improves matching accuracy and scalability. Evaluated on the MIMIC-OMOP benchmark, it establishes new state-of-the-art performance, with gains of 7.49% in HitRate@5 and 3.75% in HitRate@3 over previous best results. To our knowledge, this is the first LLM-based schema matching method with an open-source implementation, accompanied by analysis that underscores the critical role of retrieval and provides practical guidance on model selection.
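
The HitRate@k figures quoted above can be made concrete with a small sketch. It assumes the common definition of the metric: the fraction of source attributes for which at least one correct target appears among the top k ranked candidates. The paper's exact evaluation protocol may differ. The function name `hit_rate_at_k` and the column names below are illustrative placeholders, not taken from the SCHEMORA code or from the benchmark's ground truth.

```python
from typing import Dict, List, Set


def hit_rate_at_k(
    ranked: Dict[str, List[str]],
    gold: Dict[str, Set[str]],
    k: int,
) -> float:
    """Fraction of source attributes with a correct match in the top-k candidates.

    ranked: source attribute -> candidate targets, best first.
    gold:   source attribute -> set of correct targets.
    Assumed definition; not necessarily the paper's exact protocol.
    """
    if not gold:
        return 0.0
    hits = sum(
        1
        for source, targets in gold.items()
        if targets & set(ranked.get(source, [])[:k])
    )
    return hits / len(gold)


# Hypothetical rankings for two source columns (made-up example data).
ranked = {
    "patients.dob": [
        "person.year_of_birth",
        "person.birth_datetime",
        "death.death_date",
    ],
    "admissions.admittime": [
        "visit_occurrence.visit_end_datetime",
        "person.birth_datetime",
        "visit_occurrence.visit_start_datetime",
    ],
}
gold = {
    "patients.dob": {"person.birth_datetime"},
    "admissions.admittime": {"visit_occurrence.visit_start_datetime"},
}

for k in (1, 2, 3):
    print(f"HitRate@{k} = {hit_rate_at_k(ranked, gold, k):.2f}")
```

In this toy data the correct target sits at rank 2 for the first column and rank 3 for the second, so the score rises as k grows. A retrieval stage that places correct candidates nearer the top raises HitRate at small k, which is why the metric is reported at several cutoffs.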
