Auto-Cypher: Improving LLMs on Cypher generation via LLM-supervised generation-verification framework
This work addresses a specific bottleneck in graph database querying for developers and researchers, but it is incremental: it adapts existing Text2SQL methods to a new query language.
The paper tackles the problem of generating Cypher queries for graph databases from natural language, which is less explored than SQL generation. It develops an automated pipeline to create synthetic training data; open-source LLMs trained on this data show performance gains of up to 40% on a Text2Cypher test split and 30% on an adapted SPIDER benchmark.
Graph databases like Neo4j are gaining popularity over traditional relational databases for modeling and querying complex, interconnected data. While translating natural language into SQL queries is well-researched, generating Cypher queries for Neo4j remains relatively underexplored. In this work, we present an automated, LLM-supervised pipeline to generate high-quality synthetic data for Text2Cypher. Our Cypher data generation pipeline introduces LLM-As-Database-Filler, a novel strategy for ensuring Cypher query correctness, thus resulting in high-quality generations. Using our pipeline, we generate SynthCypher, a high-quality Text2Cypher dataset containing 29.8k instances across various domains and queries of varying complexity. Training open-source LLMs like LLaMa-3.1-8B, Mistral-7B, and QWEN-7B on SynthCypher results in performance gains of up to 40% on the Text2Cypher test split and 30% on the SPIDER benchmark, adapted for graph databases.
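
The abstract does not spell out how the verification step works. As a rough illustration only, the sketch below shows one generic way a generation-verification filter can be structured: each candidate query is executed, and it is kept only if it runs without error and returns the expected answer. The names `Candidate`, `verify`, and `fake_run_query`, and the toy lookup table standing in for a database, are assumptions made for this example and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Sequence


@dataclass(frozen=True)
class Candidate:
    """One synthetic Text2Cypher instance (hypothetical structure)."""
    question: str
    cypher: str
    expected: tuple  # answer the query is supposed to return


def verify(
    candidates: Iterable[Candidate],
    run_query: Callable[[str], Sequence],
) -> List[Candidate]:
    """Keep candidates whose query executes and returns the expected answer."""
    kept = []
    for cand in candidates:
        try:
            result = run_query(cand.cypher)
        except Exception:
            # A query that fails to execute is discarded.
            continue
        if tuple(result) == cand.expected:
            kept.append(cand)
    return kept


# Toy stand-in for a graph database: maps a query string to its result rows.
# In a real setup, run_query would execute the query against a live database.
FAKE_DB = {
    "MATCH (p:Person) RETURN p.name ORDER BY p.name": ["Ada", "Grace"],
    "MATCH (p:Person) RETURN count(p)": [2],
}


def fake_run_query(cypher: str) -> Sequence:
    if cypher not in FAKE_DB:
        raise ValueError("query failed to execute")
    return FAKE_DB[cypher]


good = Candidate(
    question="List all people by name.",
    cypher="MATCH (p:Person) RETURN p.name ORDER BY p.name",
    expected=("Ada", "Grace"),
)
broken = Candidate(
    question="List all people by name.",
    cypher="MATCH (p:Persn RETURN p.name",  # not in FAKE_DB, so it raises
    expected=("Ada", "Grace"),
)
wrong_answer = Candidate(
    question="List all people by name.",
    cypher="MATCH (p:Person) RETURN count(p)",  # runs, but answers another question
    expected=("Ada", "Grace"),
)

kept = verify([good, broken, wrong_answer], fake_run_query)
print([c.cypher for c in kept])
```

Only the first candidate survives the filter: the second fails to execute and the third returns a result that does not match the expected answer. Passing the executor in as a function keeps the filtering logic independent of any particular database client.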