CL · AI · LG · Jul 29, 2025

Automatic Classification of User Requirements from Online Feedback -- A Replication Study

arXiv:2507.21532v1 · h-index: 5 · 2025 IEEE 33rd International Requirements Engineering Conference Workshops (REW)
Originality: Synthesis-oriented
AI Analysis

This work addresses the need for replication studies in NLP for requirements engineering. Its contribution is incremental: it builds on prior research to validate and extend earlier findings.

The study replicates and extends a previous NLP for requirements engineering study on classifying user requirements from online feedback. Reproducibility varied across models, with Naive Bayes achieving perfect reproducibility. The baseline deep learning models generalized well to an external dataset, and GPT-4o performed comparably to traditional machine learning models.

Natural language processing (NLP) techniques have been widely applied in the requirements engineering (RE) field to support tasks such as classification and ambiguity detection. Although RE research is rooted in empirical investigation, it has paid limited attention to replicating NLP for RE (NLP4RE) studies. The rapidly advancing realm of NLP is creating new opportunities for efficient, machine-assisted workflows, which can bring new perspectives and results to the forefront. Thus, we replicate and extend a previous NLP4RE study (baseline), "Classifying User Requirements from Online Feedback in Small Dataset Environments using Deep Learning", which evaluated different deep learning models for requirement classification from user reviews. We reproduced the original results using publicly released source code, thereby helping to strengthen the external validity of the baseline study. We then extended the setup by evaluating model performance on an external dataset and comparing results to a GPT-4o zero-shot classifier. Furthermore, we prepared the replication study ID-card for the baseline study, which is important for evaluating replication readiness. Results showed diverse reproducibility levels across different models, with Naive Bayes demonstrating perfect reproducibility. In contrast, BERT and other models showed mixed results. Our findings revealed that the baseline deep learning models, BERT and ELMo, exhibited good generalization capabilities on an external dataset, and that GPT-4o showed performance comparable to traditional baseline machine learning models. Additionally, our assessment confirmed the baseline study's replication readiness; however, including the environment setup files, which were missing, would have further enhanced readiness. We include this missing information in our replication package and provide the replication study ID-card for our own study to encourage and support its replication.

