Can LLMs Narrate Tabular Data? An Evaluation Framework for Natural Language Representations of Text-to-SQL System Outputs
This work addresses the need for reliable evaluation of LLM-generated natural language outputs in text-to-SQL systems, which is crucial for improving chat agents. The contribution is incremental, however, because it builds on existing evaluation methods.
The paper tackles the problem of evaluating how well large language models (LLMs) convert tabular database results into natural language representations (NLRs) for chat-based interactions. It introduces Combo-Eval, an evaluation method that reduces LLM calls by 25-61% and aligns better with human judgments, and NLR-BIRD, the first dedicated dataset for NLR benchmarking.
In modern industry systems like multi-turn chat agents, Text-to-SQL technology bridges natural language (NL) questions and database (DB) querying. The conversion of tabular DB results into NL representations (NLRs) enables chat-based interaction. Currently, NLR generation is typically handled by large language models (LLMs), but information loss and errors in presenting tabular results in NL remain largely unexplored. This paper introduces a novel evaluation method, Combo-Eval, for judging LLM-generated NLRs. It combines the benefits of multiple existing methods, optimizing evaluation fidelity and reducing LLM calls by 25-61%. Accompanying our method is NLR-BIRD, the first dedicated dataset for NLR benchmarking. Through human evaluations, we demonstrate that Combo-Eval aligns more closely with human judgments, in scenarios both with and without ground truth references.
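The abstract does not spell out how Combo-Eval combines existing methods, so the following is not the paper's algorithm. It is a minimal Python sketch of one generic way that a combined evaluator can save LLM calls: a cheap lexical check settles the clear cases, and only ambiguous NLRs are passed to an LLM judge. The names `cell_coverage`, `cascade_eval`, and `stub_judge`, as well as the `low` and `high` thresholds, are hypothetical and chosen purely for illustration.

```python
from typing import Callable, Sequence, Tuple

Row = Sequence[object]
Judge = Callable[[Sequence[Row], str], bool]


def cell_coverage(rows: Sequence[Row], nlr: str) -> float:
    """Fraction of table cells whose string form appears in the NLR (case-insensitive)."""
    cells = [str(cell).lower() for row in rows for cell in row]
    if not cells:
        return 1.0
    text = nlr.lower()
    return sum(1 for cell in cells if cell in text) / len(cells)


def cascade_eval(
    rows: Sequence[Row],
    nlr: str,
    llm_judge: Judge,
    low: float = 0.2,   # assumed threshold: at or below this, reject without an LLM call
    high: float = 1.0,  # assumed threshold: at or above this, accept without an LLM call
) -> Tuple[bool, bool]:
    """Return (verdict, used_llm). The LLM judge is called only for ambiguous scores."""
    score = cell_coverage(rows, nlr)
    if score >= high:
        return True, False
    if score <= low:
        return False, False
    return llm_judge(rows, nlr), True


# Stand-in for a real LLM judge; it records each call so the savings are visible.
judge_calls = []


def stub_judge(rows: Sequence[Row], nlr: str) -> bool:
    judge_calls.append(nlr)
    return False


rows = [("Alice", 30), ("Bob", 25)]
results = [
    cascade_eval(rows, "Alice is 30 and Bob is 25.", stub_judge),
    cascade_eval(rows, "Alice is 30 and there is one more person.", stub_judge),
    cascade_eval(rows, "No results found.", stub_judge),
]
print(results, len(judge_calls))
```

In this toy run only one of the three NLRs reaches the judge. A substring check is a crude proxy (it cannot detect hallucinated extra content, for example), so a practical evaluator would need stronger cheap checks and thresholds tuned against human labels.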