SE Apr 15

Characterizing Datasets for LLM-based Requirements Engineering: A Systematic Mapping Study

Quim Motger, Carlota Catot, Xavier Franch

arXiv:2510.18787 20.7 h-index: 8

AI Analysis

This work addresses the need for better dataset organization and reuse in LLM-based RE research. The contribution is incremental: it synthesizes existing data rather than introducing new methods.

The paper tackles data scarcity and the lack of systematic characterization of datasets used in LLM-based Requirements Engineering (RE) through a systematic mapping study of 45 primary studies referencing 62 publicly available datasets. The study reveals imbalances such as incomplete adoption of open-science practices and limited dataset support for elicitation activities.

Large Language Models (LLMs) depend on high-quality, domain-specific natural language datasets. This dependency is particularly pronounced in Requirements Engineering (RE), where core activities rely on textual artifacts such as requirements, specifications, and stakeholder feedback. Despite the increasing use of LLMs in RE, data scarcity remains a widely reported limitation. While several datasets support LLM-based RE research, they are scattered across studies and lack systematic characterization, hindering reuse, comparability and assessment. This paper addresses this gap by examining which public datasets are used in LLM-based RE, how they can be consistently characterized, and which RE tasks and dataset properties remain under-represented. We report on a systematic mapping study of 45 primary studies referencing 62 publicly available datasets. Each dataset is characterized using a structured scheme covering multiple dimensions, including relevant descriptors such as artifact type, granularity, RE activity, supported task, application domain, and language, among others. The results reveal notable imbalances, including an incomplete adoption of open-science practices, limited dataset support for elicitation activities, and a lack of language and socio-technical diversity. The resulting catalogue and characterisation scheme support informed dataset selection, comparison, and reuse, contributing to stronger empirical foundations for LLM-based RE research and evaluation.
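The characterization scheme described in the abstract can be pictured as one record per dataset, with one field per descriptor. The following is a minimal sketch under stated assumptions, not the authors' actual schema: the class name `DatasetRecord`, the helper `count_by`, the field names, and all example values are hypothetical, and only the six descriptors named in the abstract are modeled.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical catalogue entry; fields mirror descriptors named in the abstract."""
    name: str
    artifact_type: str
    granularity: str
    re_activity: str
    supported_task: str
    application_domain: str
    language: str


def count_by(records, dimension):
    """Count records per value of one descriptor, e.g. 're_activity'."""
    return Counter(getattr(r, dimension) for r in records)


# Placeholder entries for illustration only; these are NOT datasets from the study.
records = [
    DatasetRecord("example-a", "requirement", "sentence",
                  "analysis", "classification", "generic", "en"),
    DatasetRecord("example-b", "user feedback", "document",
                  "elicitation", "extraction", "mobile apps", "en"),
    DatasetRecord("example-c", "requirement", "sentence",
                  "analysis", "classification", "automotive", "de"),
]

# Tallying one dimension is how an under-represented activity would surface.
activity_counts = count_by(records, "re_activity")
```

Counting along a single dimension in this way is one plausible route to spotting the kind of imbalance the abstract reports, such as few datasets supporting elicitation or few non-English datasets.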
