Multilingual Text-to-SQL: Benchmarking the Limits of Language Models with Collaborative Language Agents
This work addresses the multilingual gap in text-to-SQL database access, but the results are incremental, showing limited improvement over existing methods.
The paper tackles the problem of multilingual text-to-SQL by introducing MultiSpider 2.0, a benchmark extending Spider 2.0 to eight languages, and finds that state-of-the-art LLMs achieve only 4% execution accuracy with intrinsic reasoning, while a collaboration-driven language agents baseline improves it to 15%.
Text-to-SQL enables natural access to databases, yet most benchmarks are English-only, limiting multilingual progress. We introduce MultiSpider 2.0, extending Spider 2.0 to eight languages (English, German, French, Spanish, Portuguese, Japanese, Chinese, Vietnamese). It preserves Spider 2.0's structural difficulty while adding linguistic and dialectal variability, demanding deeper reasoning for complex SQL. On this benchmark, state-of-the-art LLMs (such as DeepSeek-R1 and OpenAI o1) reach only 4% execution accuracy when relying on intrinsic reasoning, versus 60% on MultiSpider 1.0. To narrow this gap, we provide a collaboration-driven language agents baseline that iteratively refines queries, improving accuracy to 15%. These results reveal a substantial multilingual gap and motivate methods that are robust across languages and ready for real-world enterprise deployment. Our benchmark is available at https://github.com/phkhanhtrinh23/Multilingual_Text_to_SQL.
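
For readers unfamiliar with the metric, execution accuracy is usually computed by running the predicted and the reference SQL against the same database and counting a prediction as correct when the two result sets agree. The following is a minimal sketch of that idea and not the benchmark's actual evaluation script. It assumes an in-memory SQLite database, a toy `orders` table, and an order-insensitive comparison of rows; the helper names `run_query` and `execution_accuracy` are hypothetical.

```python
import sqlite3
from collections import Counter


def run_query(conn, sql):
    """Return the result rows as a multiset, or None if the query fails."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    return Counter(rows)


def execution_accuracy(conn, pairs):
    """Fraction of (predicted, gold) SQL pairs whose result multisets match."""
    if not pairs:
        return 0.0
    correct = 0
    for predicted_sql, gold_sql in pairs:
        gold = run_query(conn, gold_sql)
        pred = run_query(conn, predicted_sql)
        # A prediction that fails to execute never counts as correct.
        if gold is not None and pred == gold:
            correct += 1
    return correct / len(pairs)


# Toy in-memory database (hypothetical schema, for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, country TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "DE", 10.0), (2, "FR", 20.0), (3, "DE", 5.0)],
)

pairs = [
    # Identical queries: correct.
    ("SELECT SUM(total) FROM orders WHERE country = 'DE'",
     "SELECT SUM(total) FROM orders WHERE country = 'DE'"),
    # Executes, but returns a different result: incorrect.
    ("SELECT COUNT(*) FROM orders",
     "SELECT SUM(total) FROM orders WHERE country = 'DE'"),
    # Misspelled column, so the prediction fails to execute: incorrect.
    ("SELECT totl FROM orders", "SELECT total FROM orders"),
    # Same rows in a different order: correct under this comparison.
    ("SELECT total FROM orders ORDER BY id DESC", "SELECT total FROM orders"),
]
score = execution_accuracy(conn, pairs)
print(f"execution accuracy: {score:.2f}")
```

Comparing rows as a multiset ignores ordering, which is a simplification: a stricter evaluator would respect row order whenever the reference query contains an `ORDER BY` clause.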