DB, AI | Jul 16, 2024

Schema Matching with Large Language Models: an Experimental Study

arXiv:2407.11852v1 | 45 citations | h-index: 15
Originality: Synthesis-oriented
AI Analysis

This study shows that LLMs can bootstrap schema matching for data engineers without access to data instances. Its contribution is incremental: it compares several prompting methods against a string similarity baseline.

The paper uses an off-the-shelf large language model (LLM) to identify semantic correspondences between relational schemas based only on element names and descriptions. Matching quality depends on how much context the prompt contains, and newer LLM versions are more decisive. Some task scopes keep verification effort acceptable while still identifying a significant number of true matches.

Large Language Models (LLMs) have shown useful applications in a variety of tasks, including data wrangling. In this paper, we investigate the use of an off-the-shelf LLM for schema matching. Our objective is to identify semantic correspondences between elements of two relational schemas using only names and descriptions. Using a newly created benchmark from the health domain, we propose different so-called task scopes. These are methods for prompting the LLM to do schema matching, which vary in the amount of context information contained in the prompt. Using these task scopes we compare LLM-based schema matching against a string similarity baseline, investigating matching quality, verification effort, decisiveness, and complementarity of the approaches. We find that matching quality suffers from a lack of context information, but also from providing too much context information. In general, using newer LLM versions increases decisiveness. We identify task scopes that have acceptable verification effort and succeed in identifying a significant number of true semantic matches. Our study shows that LLMs have potential in bootstrapping the schema matching process and are able to assist data engineers in speeding up this task solely based on schema element names and descriptions without the need for data instances.
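
To make the comparison concrete, the following is a minimal sketch of what a string similarity baseline over schema element names could look like. It is not the authors' implementation: the similarity measure (Python's `difflib.SequenceMatcher`), the normalization step, the `0.6` threshold, and the example attribute names are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import product


def name_similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] between two element names.

    Names are lowercased and underscores are replaced by spaces first
    (an assumed normalization, not taken from the paper).
    """
    def normalize(s: str) -> str:
        return s.lower().replace("_", " ").strip()

    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def candidate_matches(source, target, threshold=0.6):
    """Score every source/target name pair and keep those at or above
    the threshold, sorted from most to least similar."""
    scored = [(s, t, name_similarity(s, t)) for s, t in product(source, target)]
    kept = [pair for pair in scored if pair[2] >= threshold]
    return sorted(kept, key=lambda pair: pair[2], reverse=True)


# Hypothetical attribute names from two health-domain schemas.
source_attrs = ["patient_id", "birth_date", "gender"]
target_attrs = ["person_id", "date_of_birth", "sex"]

matches = candidate_matches(source_attrs, target_attrs)
for src, tgt, score in matches:
    print(f"{src} -> {tgt}: {score:.2f}")
```

A purely lexical baseline like this scores a pair such as `gender` and `sex` very low even though the two are semantically equivalent, which is the kind of correspondence an LLM prompted with names and descriptions might still recover.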

Code Implementations: 1 repo
