DB | AI | LG | Dec 29, 2024

Mind the Data Gap: Bridging LLMs to Enterprise Data Integration

arXiv:2412.20331v1 | 13 citations | h-index: 12
Originality: Incremental advance
AI Analysis

This paper addresses the data gap that enterprises face when using LLMs for data integration, but the advance is incremental because it adapts existing methods to a new domain.

The paper tackles the problem that LLMs, being trained on public data, perform poorly on enterprise data. It shows that three proposed techniques (hierarchical annotation, runtime class-learning, and ontology synthesis) bring performance on enterprise data up to par with performance on public data.

Leading large language models (LLMs) are trained on public data. However, most of the world's data is dark data that is not publicly accessible, mainly in the form of private organizational or enterprise data. We show that the performance of methods based on LLMs seriously degrades when tested on real-world enterprise datasets. Current benchmarks, based on public data, overestimate the performance of LLMs. We release a new benchmark dataset, the GOBY Benchmark, to advance discovery in enterprise data integration. Based on our experience with this enterprise benchmark, we propose techniques to uplift the performance of LLMs on enterprise data, including (1) hierarchical annotation, (2) runtime class-learning, and (3) ontology synthesis. We show that, once these techniques are deployed, the performance on enterprise data becomes on par with that on public data. The GOBY Benchmark can be obtained at https://goby-benchmark.github.io/.

