Alessandro Orso

SE
h-index10
13papers
717citations
Novelty53%
AI Score52

13 Papers

94.0SEMay 30Code
Sakura: An Approach for Generating Complex Tests from Natural Language Test Descriptions

Tyler Stennett, Rangeet Pan, Bridget McGinn et al.

Testing is a core activity in software development workflows, and research on its automation has spanned several decades. Most existing approaches generate unit tests for individual methods, validate isolated API endpoints, or target user interface (UI) layers, with non-API and non-UI automated test generators typically exercising only a single focal method. Recent empirical evidence shows a substantial gap between such generated tests and developer-written ones, which often span multiple focal methods, involve complex call sequences, and contain elaborate assertions that current automated approaches fail to capture. To address this gap, we propose generating tests from natural language (NL) descriptions of developer intent. We present Sakura, the first agent-based framework for generating structurally complex test cases from NL descriptions. Sakura decomposes NL descriptions into structured blocks and processes them using a multi-agent system consisting of a localization agent that grounds test steps in concrete application code via static analysis, a composition agent that synthesizes compilable test code and iteratively refines it using execution feedback, and a supervisor agent that coordinates agent interactions. To evaluate Sakura, we curate a novel dataset of NL test descriptions at three levels of abstraction, systematically generated from developer-written tests mined from Apache Commons projects. Across 20 applications and 1,464 test scenarios, Sakura significantly outperforms off-the-shelf agentic tools such as Gemini CLI. Specifically, Sakura achieves 50-78% higher test compilability and 38-66% higher coverage overlap with ground-truth tests compared to baselines using the same models. Moreover, Sakura paired with small open-source models such as Devstral Small 2 and Qwen3-Coder outperforms Gemini CLI using large proprietary models, while also being more cost-effective.

60.8SEMay 24Code
Hamster: A Large-Scale Study and Characterization of Developer-Written Tests

Rangeet Pan, Tyler Stennett, Raju Pavuluri et al.

Automated test generation (ATG), which aims to reduce the cost of manual test suite development, has been investigated for decades and has produced countless techniques based on a variety of approaches: symbolic analysis, search-based, random and adaptive-random, learning-based, and, most recently, large-language-model-based approaches. However, despite this large body of research, there is still a gap in our understanding of the characteristics of developer-written tests and, consequently, our assessment of how well ATG techniques and tools can generate realistic and representative tests. To bridge this gap, we conducted an extensive empirical study of developer-written tests for Java applications, covering 1.7 million test cases from open-source repositories. Our study is the first of its kind to evaluate aspects of developer-written tests that are mostly neglected in the existing literature -- including test scope, test fixtures and assertions, types of inputs, and use of mocking -- and characterize tests accordingly. Based on this characterization, we then compare existing tests with those generated by two state-of-the-art ATG tools. Our results highlight that the vast majority of developer-written tests exhibit characteristics that are beyond the capabilities of current ATG tools. Finally, based on our findings, we identify promising research directions that can help develop more effective tool support for developer testing practices. We believe this work can set the stage for additional research and bring ATG tools closer to generating the types of tests developers write.

40.3SEMay 24
SAINT: Service-level Integration Test Generation with Program Analysis and LLM-based Agents

Rangeet Pan, Raju Pavuluri, Ruikai Huang et al.

Enterprise applications are typically tested at multiple levels, with service-level testing playing an important role in validating application functionality. Existing service-level testing tools, especially for RESTful APIs, often employ fuzzing and/or depend on OpenAPI specifications which are not readily available in real-world enterprise codebases. Moreover, these tools are limited in their ability to generate functional tests that effectively exercise meaningful scenarios. In this work, we present SAINT, a novel white-box testing approach for service-level testing of enterprise Java applications. SAINT combines static analysis, large language models (LLMs), and LLM-based agents to automatically generate endpoint and scenario-based tests. The approach builds two key models: an endpoint model, capturing syntactic and semantic information about service endpoints, and an operation dependency graph, capturing inter-endpoint ordering constraints. SAINT then employs LLM-based agents to generate tests. Endpoint-focused tests aim to maximize code and database interaction coverage. Scenario-based tests are synthesized by extracting application use cases from code and refining them into executable tests via planning, action, and reflection phases of the agentic loop. We evaluated SAINT on eight Java applications, including a proprietary enterprise application. Our results illustrate the effectiveness of SAINT in coverage, fault detection, and scenario generation. Moreover, a developer survey provides strong endorsement of the scenario-based tests generated by SAINT. Overall, our work shows that combining static analysis with agentic LLM workflows enables more effective, functional, and developer-aligned service-level test generation.

LGNov 2, 2023
Learning Defect Prediction from Unrealistic Data

Kamel Alrashedy, Vincent J. Hellendoorn, Alessandro Orso

Pretrained models of code, such as CodeBERT and CodeT5, have become popular choices for code understanding and generation tasks. Such models tend to be large and require commensurate volumes of training data, which are rarely available for downstream tasks. Instead, it has become popular to train models with far larger but less realistic datasets, such as functions with artificially injected bugs. Models trained on such data, however, tend to only perform well on similar data, while underperforming on real world programs. In this paper, we conjecture that this discrepancy stems from the presence of distracting samples that steer the model away from the real-world task distribution. To investigate this conjecture, we propose an approach for identifying the subsets of these large yet unrealistic datasets that are most similar to examples in real-world datasets based on their learned representations. Our approach extracts high-dimensional embeddings of both real-world and artificial programs using a neural model and scores artificial samples based on their distance to the nearest real-world sample. We show that training on only the nearest, representationally most similar samples while discarding samples that are not at all similar in representations yields consistent improvements across two popular pretrained models of code on two code understanding tasks. Our results are promising, in that they show that training models on a representative subset of an unrealistic dataset can help us harness the power of large-scale synthetic data generation while preserving downstream task performance. Finally, we highlight the limitations of applying AI models for predicting vulnerabilities and bugs in real-world applications

SEMar 24, 2015Code
Automated Test Input Generation for Android: Are We There Yet?

Shauvik Roy Choudhary, Alessandra Gorla, Alessandro Orso

Mobile applications, often simply called "apps", are increasingly widespread, and we use them daily to perform a number of activities. Like all software, apps must be adequately tested to gain confidence that they behave correctly. Therefore, in recent years, researchers and practitioners alike have begun to investigate ways to automate apps testing. In particular, because of Android's open source nature and its large share of the market, a great deal of research has been performed on input generation techniques for apps that run on the Android operating systems. At this point in time, there are in fact a number of such techniques in the literature, which differ in the way they generate inputs, the strategy they use to explore the behavior of the app under test, and the specific heuristics they use. To better understand the strengths and weaknesses of these existing approaches, and get general insight on ways they could be made more effective, in this paper we perform a thorough comparison of the main existing test input generation tools for Android. In our comparison, we evaluate the effectiveness of these tools, and their corresponding techniques, according to four metrics: code coverage, ability to detect faults, ability to work on multiple platforms, and ease of use. Our results provide a clear picture of the state of the art in input generation for Android apps and identify future research directions that, if suitably investigated, could lead to more effective and efficient testing tools for Android.

SENov 11, 2024
A Multi-Agent Approach for REST API Testing with Semantic Graphs and LLM-Driven Inputs

Myeongsoo Kim, Tyler Stennett, Saurabh Sinha et al.

As modern web services increasingly rely on REST APIs, their thorough testing has become crucial. Furthermore, the advent of REST API documentation languages, such as the OpenAPI Specification, has led to the emergence of many black-box REST API testing tools. However, these tools often focus on individual test elements in isolation (e.g., APIs, parameters, values), resulting in lower coverage and less effectiveness in fault detection. To address these limitations, we present AutoRestTest, the first black-box tool to adopt a dependency-embedded multi-agent approach for REST API testing that integrates multi-agent reinforcement learning (MARL) with a semantic property dependency graph (SPDG) and Large Language Models (LLMs). Our approach treats REST API testing as a separable problem, where four agents -- API, dependency, parameter, and value agents -- collaborate to optimize API exploration. LLMs handle domain-specific value generation, the SPDG model simplifies the search space for dependencies using a similarity score between API operations, and MARL dynamically optimizes the agents' behavior. Our evaluation of AutoRestTest on 12 real-world REST services shows that it outperforms the four leading black-box REST API testing tools, including those assisted by RESTGPT (which generates realistic test inputs using LLMs), in terms of code coverage, operation coverage, and fault detection. Notably, AutoRestTest is the only tool able to trigger an internal server error in the Spotify service. Our ablation study illustrates that each component of AutoRestTest -- the SPDG, the LLM, and the agent-learning mechanism -- contributes to its overall effectiveness.

SEJan 15, 2025
LlamaRestTest: Effective REST API Testing with Small Language Models

Myeongsoo Kim, Saurabh Sinha, Alessandro Orso

Modern web services rely heavily on REST APIs, typically documented using the OpenAPI specification. The widespread adoption of this standard has resulted in the development of many black-box testing tools that generate tests based on OpenAPI specifications. Although Large Language Models (LLMs) have shown promising test-generation abilities, their application to REST API testing remains mostly unexplored. We present LlamaRestTest, a novel approach that employs two custom LLMs-created by fine-tuning and quantizing the Llama3-8B model using mined datasets of REST API example values and inter-parameter dependencies-to generate realistic test inputs and uncover inter-parameter dependencies during the testing process by analyzing server responses. We evaluated LlamaRestTest on 12 real-world services (including popular services such as Spotify), comparing it against RESTGPT, a GPT-powered specification-enhancement tool, as well as several state-of-the-art REST API testing tools, including RESTler, MoRest, EvoMaster, and ARAT-RL. Our results demonstrate that fine-tuning enables smaller models to outperform much larger models in detecting actionable parameter-dependency rules and generating valid inputs for REST API testing. We also evaluated different tool configurations, ranging from the base Llama3-8B model to fine-tuned versions, and explored multiple quantization techniques, including 2-bit, 4-bit, and 8-bit integer formats. Our study shows that small language models can perform as well as, or better than, large language models in REST API testing, balancing effectiveness and efficiency. Furthermore, LlamaRestTest outperforms state-of-the-art REST API testing tools in code coverage achieved and internal server errors identified, even when those tools use RESTGPT-enhanced specifications.

SEJan 15, 2025
AutoRestTest: A Tool for Automated REST API Testing Using LLMs and MARL

Tyler Stennett, Myeongsoo Kim, Saurabh Sinha et al.

As REST APIs have become widespread in modern web services, comprehensive testing of these APIs is increasingly crucial. Because of the vast search space of operations, parameters, and parameter values, along with their dependencies and constraints, current testing tools often achieve low code coverage, resulting in suboptimal fault detection. To address this limitation, we present AutoRestTest, a novel tool that integrates the Semantic Property Dependency Graph (SPDG) with Multi-Agent Reinforcement Learning (MARL) and large language models (LLMs) for effective REST API testing. AutoRestTest determines operation-dependent parameters using the SPDG and employs five specialized agents (operation, parameter, value, dependency, and header) to identify dependencies of operations and generate operation sequences, parameter combinations, and values. Through an intuitive command-line interface, users can easily configure and monitor tests with successful operation count, unique server errors detected, and time elapsed. Upon completion, AutoRestTest generates a detailed report highlighting errors detected and operations exercised. In this paper, we introduce our tool and present preliminary findings, with a demonstration video available at https://www.youtube.com/watch?v=VVus2W8rap8.

DBMay 20, 2021
Testing DBMS Performance with Mutations

Xinyu Liu, Qi Zhou, Joy Arulraj et al.

Because database systems are the critical component of modern data-intensive applications, it is important to ensure that they operate correctly. To this end, developers extensively test these systems to eliminate bugs that negatively affect functionality. In addition to functional bugs, however, there is another important class of bugs: performance bugs. These bugs negatively affect the response time of a database system and can therefore affect the overall performance of the system. Despite their impact on end-user experience, performance bugs have received considerably less attention than functional bugs. In this paper, we present AMOEBA, a system for automatically detecting performance bugs in database systems. The core idea behind AMOEBA is to construct query pairs that are semantically equivalent to each other and then compare their response time on the same database system. If the queries exhibit a significant difference in their runtime performance, then the root cause is likely a performance bug in the system. We propose a novel set of structure and predicate mutation rules for constructing query pairs that are likely to uncover performance bugs. We introduce feedback mechanisms for improving the efficacy and computational efficiency of the tool. We evaluate AMOEBA on two widely-used DBMSs, namely PostgreSQL and CockroachDB. AMOEBA has discovered 20 previously-unknown performance bugs, among which developers have already confirmed 14 and fixed 4.

LGFeb 15, 2019
Robustness of Neural Networks: A Probabilistic and Practical Approach

Ravi Mangal, Aditya V. Nori, Alessandro Orso

Neural networks are becoming increasingly prevalent in software, and it is therefore important to be able to verify their behavior. Because verifying the correctness of neural networks is extremely challenging, it is common to focus on the verification of other properties of these systems. One important property, in particular, is robustness. Most existing definitions of robustness, however, focus on the worst-case scenario where the inputs are adversarial. Such notions of robustness are too strong, and unlikely to be satisfied by-and verifiable for-practical neural networks. Observing that real-world inputs to neural networks are drawn from non-adversarial probability distributions, we propose a novel notion of robustness: probabilistic robustness, which requires the neural network to be robust with at least $(1 - ε)$ probability with respect to the input distribution. This probabilistic approach is practical and provides a principled way of estimating the robustness of a neural network. We also present an algorithm, based on abstract interpretation and importance sampling, for checking whether a neural network is probabilistically robust. Our algorithm uses abstract interpretation to approximate the behavior of a neural network and compute an overapproximation of the input regions that violate robustness. It then uses importance sampling to counter the effect of such overapproximation and compute an accurate estimate of the probability that the neural network violates the robustness property.

SEAug 11, 2016
From Manual Android Tests to Automated and Platform Independent Test Scripts

Mattia Fazzini, Eduardo Noronha de A. Freitas, Shauvik Roy Choudhary et al.

Because Mobile apps are extremely popular and often mission critical nowadays, companies invest a great deal of resources in testing the apps they provide to their customers. Testing is particularly important for Android apps, which must run on a multitude of devices and operating system versions. Unfortunately, as we confirmed in many interviews with quality assurance professionals, app testing is today a very human intensive, and therefore tedious and error prone, activity. To address this problem, and better support testing of Android apps, we propose a new technique that allows testers to easily create platform independent test scripts for an app and automatically run the generated test scripts on multiple devices and operating system versions. The technique does so without modifying the app under test or the runtime system, by (1) intercepting the interactions of the tester with the app and (2) providing the tester with an intuitive way to specify expected results that it then encode as test oracles. We implemented our technique in a tool named Barista and used the tool to evaluate the practical usefulness and applicability of our approach. Our results show that Barista can faithfully encode user defined test cases as test scripts with built-in oracles, generates test scripts that can run on multiple platforms, and can outperform a state-of-the-art tool with similar functionality. Barista and our experimental infrastructure are publicly available.

SESep 6, 2014
Improving Efficiency and Scalability of Formula-based Debugging

Wei Jin, Alessandro Orso

Formula-based debugging techniques are becoming increasingly popular, as they provide a principled way to identify potentially faulty statements together with information that can help fix such statements. Although effective, these approaches are computationally expensive, which limits their practical applicability. Moreover, they tend to focus on failing test cases alone, thus ignoring the wealth of information provided by passing tests. To mitigate these issues, we propose two techniques: on-demand formula computation (OFC) and clause weighting (CW). OFC improves the overall efficiency of formula-based debugging by exploring all and only the parts of a program that are relevant to a failure. CW improves the accuracy of formula-based debugging by leveraging statistical fault-localization information that accounts for passing tests. Our empirical results show that both techniques are effective and can improve the state of the art in formula-based debugging.

SEJun 6, 2013
MintHint: Automated Synthesis of Repair Hints

Shalini Kaleeswaran, Varun Tulsian, Aditya Kanade et al.

Being able to automatically repair programs is an extremely challenging task. In this paper, we present MintHint, a novel technique for program repair that is a departure from most of today's approaches. Instead of trying to fully automate program repair, which is often an unachievable goal, MintHint performs statistical correlation analysis to identify expressions that are likely to occur in the repaired code and generates, using pattern-matching based synthesis, repair hints from these expressions. Intuitively, these hints suggest how to rectify a faulty statement and help developers find a complete, actual repair. MintHint can address a variety of common faults, including incorrect, spurious, and missing expressions. We present a user study that shows that developers' productivity can improve manyfold with the use of repair hints generated by MintHint -- compared to having only traditional fault localization information. We also apply MintHint to several faults of a widely used Unix utility program to further assess the effectiveness of the approach. Our results show that MintHint performs well even in situations where (1) the repair space searched does not contain the exact repair, and (2) the operational specification obtained from the test cases for repair is incomplete or even imprecise.