IR, AI, CL, DB | Aug 18, 2025

Text-to-SQL Oriented to the Process Mining Domain: A PT-EN Dataset for Query Translation

arXiv:2509.09684v1 | h-index: 6
Originality: Synthesis-oriented
AI Analysis

The paper addresses the challenge of natural language querying of databases in process mining, which makes them more accessible to non-experts. Its contribution is incremental, however, because it creates a dataset rather than a new method.

This paper tackles the problem of text-to-SQL conversion in the process mining domain by introducing a bilingual Portuguese-English dataset, text-2-SQL-4-PM, comprising 1,655 natural language utterances and 205 SQL statements, with a baseline study using GPT-3.5 Turbo demonstrating its feasibility.

This paper introduces text-2-SQL-4-PM, a bilingual (Portuguese-English) benchmark dataset designed for the text-to-SQL task in the process mining domain. Text-to-SQL conversion facilitates natural language querying of databases, increasing accessibility for users without SQL expertise and productivity for those who are experts. The text-2-SQL-4-PM dataset is customized to address the unique challenges of process mining, including specialized vocabularies and single-table relational structures derived from event logs. The dataset comprises 1,655 natural language utterances (including human-generated paraphrases), 205 SQL statements, and ten qualifiers. Methods include manual curation by experts, professional translations, and a detailed annotation process to enable nuanced analyses of task complexity. Additionally, a baseline study using GPT-3.5 Turbo demonstrates the feasibility and utility of the dataset for text-to-SQL applications. The results show that text-2-SQL-4-PM supports evaluation of text-to-SQL implementations, offering broader applicability for semantic parsing and other natural language processing tasks.
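To make the "single-table event log" setting concrete, here is a minimal sketch of what one utterance/SQL pair and an execution-based comparison might look like. Everything in it is an assumption made for illustration: the table name `event_log`, its columns, the sample rows, the utterance, and the helper names `run_query` and `execution_match` are all hypothetical and are not taken from text-2-SQL-4-PM. Comparing result sets is one common way to score text-to-SQL output; the abstract does not say which metric the baseline study uses.

```python
import sqlite3

# Hypothetical single-table event log. The table and column names are
# illustrative assumptions, not the schema used by text-2-SQL-4-PM.
EVENTS = [
    ("c1", "register", "2024-01-01T09:00"),
    ("c1", "approve", "2024-01-01T10:00"),
    ("c2", "register", "2024-01-02T09:00"),
    ("c2", "reject", "2024-01-02T11:00"),
    ("c3", "register", "2024-01-03T09:00"),
]

# An invented utterance/SQL pair (not from the dataset).
UTTERANCE = "How many cases include the activity 'register'?"
GOLD_SQL = (
    "SELECT COUNT(DISTINCT case_id) FROM event_log "
    "WHERE activity = 'register'"
)


def run_query(sql, rows=EVENTS):
    """Load the rows into an in-memory SQLite table and run one query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE event_log (case_id TEXT, activity TEXT, timestamp TEXT)"
        )
        conn.executemany("INSERT INTO event_log VALUES (?, ?, ?)", rows)
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


def execution_match(predicted_sql, gold_sql):
    """Return True when both queries yield the same rows, ignoring order."""
    return sorted(run_query(predicted_sql)) == sorted(run_query(gold_sql))


if __name__ == "__main__":
    print(UTTERANCE)
    print(run_query(GOLD_SQL))
```

A predicted query produced by a model could then be passed to `execution_match` alongside the reference statement. Because several utterances can map to the same SQL statement, such a check would be run once per utterance against the shared reference.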
