SEAI · May 6, 2025

Synthline: A Product Line Approach for Synthetic Requirements Engineering Data Generation using Large Language Models

arXiv:2505.03265v1 · 2 citations · h-index: 2 · RCIS
Originality: Incremental advance
AI Analysis

The paper addresses data scarcity in Requirements Engineering for practitioners who use ML. The advance is incremental because it builds on existing synthetic data generation methods.

The paper tackles the scarcity of high-quality datasets in Requirements Engineering by introducing Synthline, a Product Line approach that uses Large Language Models to generate synthetic data. It finds that combining synthetic and real data improves precision by up to 85% and doubles recall compared with training on real data alone.

While modern Requirements Engineering (RE) heavily relies on natural language processing and Machine Learning (ML) techniques, their effectiveness is limited by the scarcity of high-quality datasets. This paper introduces Synthline, a Product Line (PL) approach that leverages Large Language Models to systematically generate synthetic RE data for classification-based use cases. Through an empirical evaluation conducted in the context of using ML for the identification of requirements specification defects, we investigated both the diversity of the generated data and its utility for training downstream models. Our analysis reveals that while synthetic datasets exhibit less diversity than real data, they are good enough to serve as viable training resources. Moreover, our evaluation shows that combining synthetic and real data leads to substantial performance improvements. Specifically, hybrid approaches achieve up to 85% improvement in precision and a 2x increase in recall compared to models trained exclusively on real data. These findings demonstrate the potential of PL-based synthetic data generation to address data scarcity in RE. We make both our implementation and generated datasets publicly available to support reproducibility and advancement in the field.
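To make the headline numbers concrete, the sketch below shows how an "85% improvement in precision" and a "2x increase in recall" could be computed from a real-only baseline and a hybrid model. It assumes the 85% figure is a relative improvement rather than a gain in percentage points, which the abstract does not specify. The scores and the helper `relative_improvement` are hypothetical, chosen only to illustrate the arithmetic; they are not taken from the paper.

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Return the change of `new` over `baseline` as a fraction of `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (new - baseline) / baseline


# Hypothetical scores for illustration only; NOT results reported in the paper.
real_only = {"precision": 0.40, "recall": 0.30}
hybrid = {"precision": 0.74, "recall": 0.60}

# Relative precision gain of the hybrid model over the real-only baseline.
precision_gain = relative_improvement(real_only["precision"], hybrid["precision"])

# Recall expressed as a multiple of the baseline recall.
recall_ratio = hybrid["recall"] / real_only["recall"]

print(f"precision improvement: {precision_gain:.0%}")
print(f"recall ratio: {recall_ratio:.1f}x")
```

Under this reading, a relative gain says nothing about the absolute level reached: the same 85% improvement could come from a low or a high baseline, so the absolute scores in the paper remain the figures to check.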

Code Implementations: 1 repo