Text2Stories: Evaluating the Alignment Between Stakeholder Interviews and Generated User Stories
This work addresses the largely manual task of checking whether requirements align with stakeholder needs, a concern for both software developers and stakeholders. Its contribution is incremental, as it builds on existing LLM and embedding methods.
The paper tackles the problem of evaluating whether software requirements generated from stakeholder interviews faithfully reflect the original needs. It introduces Text2Stories, a task and metrics for text-to-story alignment. The results show that an LLM-based matcher achieves 0.86 macro-F1 on held-out annotations, and that the metrics enable scalable comparison across story sets.
Large language models (LLMs) can be employed for automating the generation of software requirements from natural language inputs such as the transcripts of elicitation interviews. However, evaluating whether those derived requirements faithfully reflect the stakeholders' needs remains a largely manual task. We introduce Text2Stories, a task and metrics for text-to-story alignment that allow quantifying the extent to which requirements (in the form of user stories) match the actual needs expressed by the elicitation session participants. Given an interview transcript and a set of user stories, our metric quantifies (i) correctness: the proportion of stories supported by the transcript, and (ii) completeness: the proportion of transcript supported by at least one story. We segment the transcript into text chunks and instantiate the alignment as a matching problem between chunks and stories. Experiments over four datasets show that an LLM-based matcher achieves 0.86 macro-F1 on held-out annotations, while embedding models alone remain behind but enable effective blocking. Finally, we show how our metrics enable the comparison across sets of stories (e.g., human vs. generated), positioning Text2Stories as a scalable, source-faithful complement to existing user-story quality criteria.
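The two metrics defined in the abstract can be illustrated with a minimal sketch. It assumes the matcher has already produced a list of (chunk, story) pairs judged to be aligned, and that completeness is computed as an unweighted fraction of transcript chunks; the abstract does not specify the weighting or the matcher interface, so the function name `alignment_metrics` and its parameters are hypothetical and not the authors' implementation.

```python
from typing import Iterable, Set, Tuple


def alignment_metrics(
    num_chunks: int,
    num_stories: int,
    matches: Iterable[Tuple[int, int]],
) -> Tuple[float, float]:
    """Return (correctness, completeness) from (chunk_index, story_index) matches.

    Assumption: every chunk and every story counts equally (no length weighting).
    """
    matched_chunks: Set[int] = set()
    matched_stories: Set[int] = set()
    for chunk_idx, story_idx in matches:
        matched_chunks.add(chunk_idx)
        matched_stories.add(story_idx)

    # Correctness: share of stories supported by at least one transcript chunk.
    correctness = len(matched_stories) / num_stories if num_stories else 0.0
    # Completeness: share of transcript chunks covered by at least one story.
    completeness = len(matched_chunks) / num_chunks if num_chunks else 0.0
    return correctness, completeness


# Toy input: 5 chunks, 3 stories; story 1 has no supporting chunk.
toy_matches = [(0, 0), (1, 0), (3, 2)]
correctness, completeness = alignment_metrics(
    num_chunks=5, num_stories=3, matches=toy_matches
)
print(f"correctness={correctness:.2f}, completeness={completeness:.2f}")
```

In this toy input, two of the three stories are supported and three of the five chunks are covered, so a story set with unsupported stories lowers correctness while uncovered transcript content lowers completeness.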