DB AI Sep 9, 2024

A System and Benchmark for LLM-based Q&A on Heterogeneous Data

IBM
arXiv:2409.05735v2, 1 citation, h-index: 26
AI Analysis

This paper addresses the challenge faced by users in industrial environments who need to query multiple, siloed data sources without technical expertise, though it appears incremental because it builds on existing Text-to-SQL methods.

The paper tackles the problem of enabling natural language question answering across heterogeneous data sources like databases and APIs in industrial settings, introducing the siwarex platform and demonstrating its effectiveness on a modified Spider benchmark.

In many industrial settings, users wish to ask questions whose answers may be found in structured data sources such as spreadsheets, databases, APIs, or combinations thereof. Often, the user doesn't know how to identify or access the right data source. This problem is compounded even further if multiple (and potentially siloed) data sources must be assembled to derive the answer. Recently, various Text-to-SQL applications that leverage Large Language Models (LLMs) have addressed some of these problems by enabling users to ask questions in natural language. However, these applications remain impractical in realistic industrial settings because they fail to cope with the data source heterogeneity that typifies such environments. In this paper, we address heterogeneity by introducing the siwarex platform, which enables seamless natural language access to both databases and APIs. To demonstrate the effectiveness of siwarex, we extend the popular Spider dataset and benchmark by replacing some of its tables with data retrieval APIs. We find that siwarex does a good job of coping with data source heterogeneity. Our modified Spider benchmark will soon be available to the research community.
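To make the benchmark modification concrete, the following is a minimal sketch of what answering a question looks like once one table sits behind a data retrieval API while the rest stay in a database. It is not the siwarex implementation, which the abstract does not describe. The function names (fetch_singers, materialize_api, concerts_per_singer), the schema, and the rows are all hypothetical, and the approach shown (copying API results into a temporary table so that ordinary SQL can join them) is only one possible way to bridge the two kinds of source.

```python
import sqlite3


def fetch_singers():
    """Hypothetical stand-in for a data retrieval API that replaces a 'singer' table."""
    return [
        {"singer_id": 1, "name": "Ana", "country": "France"},
        {"singer_id": 2, "name": "Bo", "country": "Japan"},
    ]


def build_database():
    """Create an in-memory database holding the tables that were NOT replaced."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE concert (concert_id INTEGER, singer_id INTEGER, year INTEGER)"
    )
    conn.executemany(
        "INSERT INTO concert VALUES (?, ?, ?)",
        [(10, 1, 2014), (11, 1, 2015), (12, 2, 2014)],
    )
    return conn


def materialize_api(conn, table_name, rows):
    """Load API rows into a temporary table so SQL can join them with real tables."""
    columns = list(rows[0].keys())
    col_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    # table_name and column names come from trusted, hard-coded values in this sketch
    conn.execute(f"CREATE TEMP TABLE {table_name} ({col_list})")
    conn.executemany(
        f"INSERT INTO {table_name} VALUES ({placeholders})",
        [tuple(row[c] for c in columns) for row in rows],
    )


def concerts_per_singer(conn):
    """Answer 'how many concerts did each singer give?' across the API and the database."""
    materialize_api(conn, "singer", fetch_singers())
    query = """
        SELECT s.name, COUNT(*) AS n_concerts
        FROM singer AS s
        JOIN concert AS c ON s.singer_id = c.singer_id
        GROUP BY s.name
        ORDER BY s.name
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    print(concerts_per_singer(build_database()))
```

In this sketch the SQL query is written by hand. In a Text-to-SQL setting an LLM would generate it from the user's question, and the system would also need to decide which API to call and with which arguments.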

