DB AI

Jul 9, 2023

LakeBench: Benchmarks for Data Discovery over Data Lakes

Kavitha Srinivas, Julian Dolby, Ibrahim Abdelaziz, Oktie Hassanzadeh, Harsha Kokel, Aamod Khatiwada, Tejaswini Pedapati, Subhajit Chaudhury, Horst Samulowitz

IBM

arXiv:2307.04217v19.217 citations

h-index: 35

Originality: Synthesis-oriented

AI Analysis

This work addresses data discovery for enterprises managing data lakes. Its contribution is incremental: it establishes new benchmarks rather than introducing novel methods.

The authors address the lack of public benchmarks for data discovery in data lakes by creating LakeBench, a set of benchmarks for finding unionable, joinable, or subset-related tables drawn from diverse data sources. The existing tabular foundational models they evaluated left significant room for improvement on these tasks.

Within enterprises, there is a growing need to intelligently navigate data lakes, specifically focusing on data discovery. Of particular importance to enterprises is the ability to find related tables in data repositories. These tables can be unionable, joinable, or subsets of each other. There is a dearth of benchmarks for these tasks in the public domain, with related work targeting private datasets. In LakeBench, we develop multiple benchmarks for these tasks by using the tables that are drawn from a diverse set of data sources such as government data from CKAN, Socrata, and the European Central Bank. We compare the performance of 4 publicly available tabular foundational models on these tasks. None of the existing models had been trained on the data discovery tasks that we developed for this benchmark; not surprisingly, their performance shows significant room for improvement. The results suggest that the establishment of such benchmarks may be useful to the community to build tabular models usable for data discovery in data lakes.
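The three table relationships named in the abstract can be made concrete with a toy sketch. The definitions below are deliberately simplified assumptions for illustration (exact column-name match for unionability, Jaccard overlap of distinct values for joinability, row containment for subsets); they are not LakeBench's task definitions, and every function name, threshold, and data value is hypothetical.

```python
# Toy tables as {column name: list of values}. All names and data are made up.

def rows(table):
    """Return the table's rows as a set of tuples, columns in sorted name order."""
    cols = sorted(table)
    return set(zip(*(table[c] for c in cols)))

def is_unionable(a, b):
    """Simplified assumption: tables are unionable if their column names match."""
    return set(a) == set(b)

def joinable_columns(a, b, min_overlap=0.5):
    """Column pairs whose distinct values have Jaccard overlap >= min_overlap."""
    pairs = []
    for col_a, vals_a in a.items():
        for col_b, vals_b in b.items():
            sa, sb = set(vals_a), set(vals_b)
            union = sa | sb
            if union and len(sa & sb) / len(union) >= min_overlap:
                pairs.append((col_a, col_b))
    return pairs

def is_subset(a, b):
    """Simplified assumption: a is a subset of b if columns match and b has every row of a."""
    return set(a) == set(b) and rows(a) <= rows(b)

# Hypothetical example data.
cities_small = {"city": ["Oslo", "Rome"], "population": [1, 2]}
cities_all = {"city": ["Oslo", "Rome", "Lima"], "population": [1, 2, 3]}
mayors = {"name": ["Oslo", "Rome", "Lima"], "mayor": ["mayor_a", "mayor_b", "mayor_c"]}

print(is_unionable(cities_small, cities_all))  # same column names
print(is_subset(cities_small, cities_all))     # every small row appears in the full table
print(joinable_columns(cities_all, mayors))    # columns sharing most of their values
```

Real data lakes rarely have matching column names or clean values, which is part of why the abstract frames these as tasks for learned tabular models rather than exact-match checks like the ones sketched here.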
