CL · AI · IR · LG · Dec 17, 2024

Auto-Cypher: Improving LLMs on Cypher generation via LLM-supervised generation-verification framework

arXiv:2412.12612v2 · 16 citations · h-index: 15 · Has Code · NAACL
Originality: Incremental advance
AI Analysis

This work addresses a specific bottleneck in graph database querying for developers and researchers, but it is incremental as it builds on existing Text2SQL methods by adapting them to a new query language.

The paper tackles the generation of Cypher queries for graph databases from natural language, a problem less explored than SQL generation. It develops an automated pipeline that creates synthetic training data; open-source LLMs trained on this data show performance gains of up to 40% on the Text2Cypher test split and 30% on a SPIDER benchmark adapted for graph databases.

Graph databases like Neo4j are gaining popularity over traditional relational databases for modeling and querying complex, interconnected data. While translating natural language into SQL queries is well researched, generating Cypher queries for Neo4j remains relatively underexplored. In this work, we present an automated, LLM-supervised pipeline to generate high-quality synthetic data for Text2Cypher. Our Cypher data generation pipeline introduces LLM-As-Database-Filler, a novel strategy for ensuring Cypher query correctness and thus high-quality generations. Using our pipeline, we generate SynthCypher, a high-quality Text2Cypher dataset containing 29.8k instances across various domains and queries of varying complexity. Training open-source LLMs like LLaMa-3.1-8B, Mistral-7B, and QWEN-7B on SynthCypher results in performance gains of up to 40% on the Text2Cypher test split and 30% on the SPIDER benchmark, adapted for graph databases.
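The abstract does not spell out how generated queries are verified, so the following is only a minimal sketch of one way a generation-verification filter could look, not the authors' implementation. It assumes each synthetic example carries an expected answer and that a candidate Cypher query is kept only if executing it returns that answer. The names `Candidate`, `verify`, `normalize`, and `fake_run_query` are hypothetical, and the database is replaced by an in-memory stub rather than a real Neo4j instance.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List


@dataclass(frozen=True)
class Candidate:
    """One synthetic example awaiting verification (hypothetical structure)."""
    question: str
    cypher: str
    expected_answer: List[Any]


def normalize(rows: Iterable[Any]) -> List[str]:
    """Turn a result set into an order-insensitive, comparable form."""
    return sorted(repr(row) for row in rows)


def verify(
    candidates: Iterable[Candidate],
    run_query: Callable[[str], List[Any]],
) -> List[Candidate]:
    """Keep only candidates whose query executes and returns the expected answer."""
    kept = []
    for cand in candidates:
        try:
            result = run_query(cand.cypher)
        except Exception:
            # A query that fails to execute is discarded, not repaired.
            continue
        if normalize(result) == normalize(cand.expected_answer):
            kept.append(cand)
    return kept


# Stub executor standing in for a populated graph database (assumption:
# a real pipeline would send the query to a database instead).
_FAKE_RESULTS = {
    "MATCH (p:Person)-[:ACTED_IN]->(:Movie {title: 'Heat'}) RETURN p.name":
        ["Al Pacino", "Robert De Niro"],
    "MATCH (p:Person) RETURN p.name":
        ["Al Pacino", "Robert De Niro", "Val Kilmer", "Michael Mann"],
}


def fake_run_query(cypher: str) -> List[Any]:
    if cypher not in _FAKE_RESULTS:
        raise ValueError("query failed to execute")
    return _FAKE_RESULTS[cypher]


question = "Who acted in Heat?"
answer = ["Robert De Niro", "Al Pacino"]
candidates = [
    # Correct query: result matches the expected answer up to row order.
    Candidate(question,
              "MATCH (p:Person)-[:ACTED_IN]->(:Movie {title: 'Heat'}) RETURN p.name",
              answer),
    # Runs, but returns too many rows.
    Candidate(question, "MATCH (p:Person) RETURN p.name", answer),
    # Does not execute in the stub.
    Candidate(question, "MATCH (p:Person RETURN p.name", answer),
]

kept = verify(candidates, fake_run_query)
print(len(kept), "of", len(candidates), "candidates kept")
```

Comparing normalized result sets instead of query strings lets two differently written queries both pass as long as they return the same rows, which is the usual motivation for execution-based checks in Text2SQL-style evaluation.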
