Jianling Gao

CLJan 8

EvolSQL: Structure-Aware Evolution for Scalable Text-to-SQL Data Synthesis

Xuanguang Pan, Chongyang Tao, Jiayuan Bai et al.

Training effective Text-to-SQL models remains challenging due to the scarcity of high-quality, diverse, and structurally complex datasets. Existing methods either rely on limited human-annotated corpora, or synthesize datasets directly by simply prompting LLMs without explicit control over SQL structures, often resulting in limited structural diversity and complexity. To address this, we introduce EvolSQL, a structure-aware data synthesis framework that evolves SQL queries from seed data into richer and more semantically diverse forms. EvolSQL starts with an exploratory Query-SQL expansion to broaden question diversity and improve schema coverage, and then applies an adaptive directional evolution strategy using six atomic transformation operators derived from the SQL Abstract Syntax Tree to progressively increase query complexity across relational, predicate, aggregation, and nesting dimensions. An execution-grounded SQL refinement module and schema-aware deduplication further ensure the creation of high-quality, structurally diverse mapping pairs. Experimental results show that a 7B model fine-tuned on our data outperforms one trained on the much larger SynSQL dataset using only 1/18 of the data.

LGNov 26, 2025

SetAD: Semi-Supervised Anomaly Learning in Contextual Sets

Jianling Gao, Chongyang Tao, Xuelian Lin et al.

Semi-supervised anomaly detection (AD) has shown great promise by effectively leveraging limited labeled data. However, existing methods are typically structured around scoring individual points or simple pairs. Such {point- or pair-centric} view not only overlooks the contextual nature of anomalies, which are defined by their deviation from a collective group, but also fails to exploit the rich supervisory signals that can be generated from the combinatorial composition of sets. Consequently, such models struggle to exploit the high-order interactions within the data, which are critical for learning discriminative representations. To address these limitations, we propose SetAD, a novel framework that reframes semi-supervised AD as a Set-level Anomaly Detection task. SetAD employs an attention-based set encoder trained via a graded learning objective, where the model learns to quantify the degree of anomalousness within an entire set. This approach directly models the complex group-level interactions that define anomalies. Furthermore, to enhance robustness and score calibration, we propose a context-calibrated anomaly scoring mechanism, which assesses a point's anomaly score by aggregating its normalized deviations from peer behavior across multiple, diverse contextual sets. Extensive experiments on 10 real-world datasets demonstrate that SetAD significantly outperforms state-of-the-art models. Notably, we show that our model's performance consistently improves with increasing set size, providing strong empirical support for the set-based formulation of anomaly detection.

Jianling Gao

2 Papers