Text-to-SQL Calibration: No Need to Ask -- Just Rescale Model Probabilities
This work addresses the need for reliable confidence estimation in commercial database applications. It is incremental, however: it refines existing calibration techniques rather than introducing a new paradigm.
The paper tackles confidence calibration for large language models on Text-to-SQL tasks. It shows that a simple baseline using full-sequence probabilities outperforms more complex methods such as self-checking prompts, based on evaluations across two benchmarks and multiple models.
Calibration is crucial as large language models (LLMs) are increasingly deployed to convert natural language queries into SQL for commercial databases. In this work, we investigate calibration techniques for assigning confidence to generated SQL queries. We show that a straightforward baseline -- deriving confidence from the model's full-sequence probability -- outperforms recent methods that rely on follow-up prompts for self-checking and confidence verbalization. Our comprehensive evaluation, conducted across two widely-used Text-to-SQL benchmarks and multiple LLM architectures, provides valuable insights into the effectiveness of various calibration strategies.