Livia Lestingi

h-index105

5papers

6citations

Novelty43%

AI Score44

Ranked #76,688 of 201,326 authors (top 38%)#885 in SE (top 26%)

5 Papers

CLJan 13

Do You Understand How I Feel?: Towards Verified Empathy in Therapy Chatbots

Francesco Dettori, Matteo Forasassi, Lorenzo Veronese et al.

Conversational agents are increasingly used as support tools along mental therapeutic pathways with significant societal impacts. In particular, empathy is a key non-functional requirement in therapeutic contexts, yet current chatbot development practices provide no systematic means to specify or verify it. This paper envisions a framework integrating natural language processing and formal verification to deliver empathetic therapy chatbots. A Transformer-based model extracts dialogue features, which are then translated into a Stochastic Hybrid Automaton model of dyadic therapy sessions. Empathy-related properties can then be verified through Statistical Model Checking, while strategy synthesis provides guidance for shaping agent behavior. Preliminary results show that the formal model captures therapy dynamics with good fidelity and that ad-hoc strategies improve the probability of satisfying empathy requirements.

80.7SEMay 8Code

Can LLMs Solve Science or Just Write Code? Evaluating Quantum Solver Generation

Luciano Baresi, Domenico Bianculli, Maryse Ernzer et al.

Large Language Models (LLMs) show strong capabilities in code generation, motivating their use in automated quantum solver development. However, in quantum computing, successful execution of generated code is not sufficient: correctness depends on numerically accurate results, which are sensitive to non-trivial mappings, hybrid quantum-classical workflows, and algorithm-specific approximations. This work introduces Q-SAGE, an iterative methodology to evaluate LLMs' capability in generating quantum solvers for scientific problems. The methodology adopts an iterative approach by executing the script generated by the LLM, comparing the result with the result of a classical solver, and refining the script until the two results match within a tolerance threshold. We empirically evaluated the methodology with five families of scientific problems of different complexities and five LLMs, both open source and proprietary. The results show that iterative refinement substantially improves success rates, but introduces a significant computational overhead. Moreover, as model capability increases, failure modes shift from execution errors to numerical inaccuracies, highlighting the current limitations of LLM-based quantum software.

8.7SEApr 20

Towards an Agentic LLM-based Approach to Requirement Formalization from Unstructured Specifications

Alberto Tagliaferro, Bruno Guindani, Livia Lestingi et al.

Early-stage specifications of safety-critical systems are typically expressed in natural language, making it difficult to derive formal properties suitable for verification and needed to guarantee safety. While recent Large Language Model (LLM)-based approaches can generate formal artifacts from text, they mainly focus on syntactic correctness and do not ensure semantic alignment between informal requirements and formally verifiable properties. We propose an agentic methodology that automatically extracts verification-ready properties from unstructured specifications. The modular pipeline combines requirement extraction, compatibility filtering with respect to a target formalism, and translation into formal properties. Experimental results across three scenarios show that the pipeline generates syntactically and semantically aligned formal properties with a 77.8% accuracy. By explicitly accounting for modeling and verification constraints, the approach is a paving step towards exploiting Artificial Intelligence (AI) to bridge the gap between informal descriptions and semantically meaningful formal verification.

SEJan 3, 2025

How Toxic Can You Get? Search-based Toxicity Testing for Large Language Models

Simone Corbo, Luca Bancale, Valeria De Gennaro et al.

Language is a deep-rooted means of perpetration of stereotypes and discrimination. Large Language Models (LLMs), now a pervasive technology in our everyday lives, can cause extensive harm when prone to generating toxic responses. The standard way to address this issue is to align the LLM , which, however, dampens the issue without constituting a definitive solution. Therefore, testing LLM even after alignment efforts remains crucial for detecting any residual deviations with respect to ethical standards. We present EvoTox, an automated testing framework for LLMs' inclination to toxicity, providing a way to quantitatively assess how much LLMs can be pushed towards toxic responses even in the presence of alignment. The framework adopts an iterative evolution strategy that exploits the interplay between two LLMs, the System Under Test (SUT) and the Prompt Generator steering SUT responses toward higher toxicity. The toxicity level is assessed by an automated oracle based on an existing toxicity classifier. We conduct a quantitative and qualitative empirical evaluation using five state-of-the-art LLMs as evaluation subjects having increasing complexity (7-671B parameters). Our quantitative evaluation assesses the cost-effectiveness of four alternative versions of EvoTox against existing baseline methods, based on random search, curated datasets of toxic prompts, and adversarial attacks. Our qualitative assessment engages human evaluators to rate the fluency of the generated prompts and the perceived toxicity of the responses collected during the testing sessions. Results indicate that the effectiveness, in terms of detected toxicity level, is significantly higher than the selected baseline methods (effect size up to 1.0 against random search and up to 0.99 against adversarial attacks). Furthermore, EvoTox yields a limited cost overhead (from 22% to 35% on average).

ROJul 23, 2020

Statistical Model Checking of Human-Robot Interaction Scenarios

Livia Lestingi, Mehrnoosh Askarpour, Marcello M. Bersani et al.

Robots are soon going to be deployed in non-industrial environments. Before society can take such a step, it is necessary to endow complex robotic systems with mechanisms that make them reliable enough to operate in situations where the human factor is predominant. This calls for the development of robotic frameworks that can soundly guarantee that a collection of properties are verified at all times during operation. While developing a mission plan, robots should take into account factors such as human physiology. In this paper, we present an example of how a robotic application that involves human interaction can be modeled through hybrid automata, and analyzed by using statistical model-checking. We exploit statistical techniques to determine the probability with which some properties are verified, thus easing the state-space explosion problem. The analysis is performed using the Uppaal tool. In addition, we used Uppaal to run simulations that allowed us to show non-trivial time dynamics that describe the behavior of the real system, including human-related variables. Overall, this process allows developers to gain useful insights into their application and to make decisions about how to improve it to balance efficiency and user satisfaction.