SEMar 17Code
Talk is Cheap, Logic is Hard: Benchmarking LLMs on Post-Condition FormalizationI. S. W. B. Prasetya, Fitsum Kifetew, Davide Prandi
Formal specifications, such as pre- and post-conditions provide a solid basis for performing thorough program verification. However, developers rarely provide such formal specifications, hence if AI could help in constructing them, it would make formal verification possible or at least make automated testing much more effective. This paper presents a study on the ability of LLMs in generating formal FULL pre- and post-conditions of a program, given its natural language description. 24 state-of-the-art LLMs were evaluated on a freshly prepared dataset of 40 tasks. The paper investigates specifications of varying difficulty and discusses a set of more refined performance metrics in addition the general accept@1 performance. It also investigates the impact of using automatically generated tests for validation of the solutions proposed by LLMs. The results of the experiment reveal that, in general LLMs can produce valid pre- and post-conditions based on natural language descriptions of programs. Incorrect solutions from proprietary models are also often near correct. A closer inspection shows that open-source models tend to result in a higher proportion of erroneous results while proprietary models tend to have slightly higher false negative rates. Interestingly, the results also show that augmenting the manually prepared dataset with automatically generated tests leads to the exposure of wrong solutions, which would have otherwise been accepted as correct. In general, LLMs perform better in formalizing pre-conditions than on post-conditions and proprietary models perform better than open ones. However, none of the LLMs were able to correctly formalize all the tasks in our benchmark. Overall, the effectiveness and reliability of LLMs in formalizing pre- and post-conditions could be greatly improved by using a good test suite that checks the correctness of the LLM generated formalizations.
SEMay 12, 2021
An Appraisal Transition System for Event-driven Emotions in Agent-based Player Experience TestingSaba Gholizadeh Ansari, I. S. W. B. Prasetya, Mehdi Dastani et al.
Player experience (PX) evaluation has become a field of interest in the game industry. Several manual PX techniques have been introduced to assist developers to understand and evaluate the experience of players in computer games. However, automated testing of player experience still needs to be addressed. An automated player experience testing framework would allow designers to evaluate the PX requirements in the early development stages without the necessity of participating human players. In this paper, we propose an automated player experience testing approach by suggesting a formal model of event-based emotions. In particular, we discuss an event-based transition system to formalize relevant emotions using Ortony, Clore, & Collins (OCC) theory of emotions. A working prototype of the model is integrated on top of Aplib, a tactical agent programming library, to create intelligent PX test agents, capable of appraising emotions in a 3D game case study. The results are graphically shown e.g. as heat maps. Emotion visualization of the test agent would ultimately help game designers in creating content that evokes a certain experience in players.
SEApr 13, 2021
An Agent-based Architecture for AI-Enhanced Automated Testing for XR Systems, a Short PaperI. S. W. B. Prasetya, Samira Shirzadehhajimahmood, Saba Gholizadeh Ansari et al.
This short paper presents an architectural overview of an agent-based framework called iv4XR for automated testing that is currently under development by an H2020 project with the same name. The framework's intended main use case of is testing the family of Extended Reality (XR) based systems (e.g. 3D games, VR sytems, AR systems), though the approach can indeed be adapted to target other types of interactive systems. The framework is unique in that it is an agent-based system. Agents are inherently reactive, and therefore are arguably a natural match to deal with interactive systems. Moreover, it is also a natural vessel for mounting and combining different AI capabilities, e.g. reasoning, navigation, and learning.
SESep 15, 2020
Navigation and Exploration in 3D-Game Automated Play TestingI. S. W. B. Prasetya, Maurin Voshol, Tom Tanis et al.
To enable automated software testing, the ability to automatically navigate to a state of interest and to explore all, or at least sufficient number of, instances of such a state is fundamental. When testing a computer game the problem has an extra dimension, namely the virtual world where the game is played on. This world often plays a dominant role in constraining which logical states are reachable, and how to reach them. So, any automated testing algorithm for computer games will inevitably need a layer that deals with navigation on a virtual world. Unlike e.g. navigating through the GUI of a typical web-based application, navigating over a virtual world is much more challenging. This paper discusses how concepts from geometry and graph-based path finding can be applied in the context of game testing to solve the problem of automated navigation and exploration. As a proof of concept, the paper also briefly discusses the implementation of the proposed approach.
SEDec 14, 2019
IMPRESS: Improving Engagement in Software Engineering Courses through GamificationTanja E. J. Vos, I. S. W. B. Prasetya, Gordon Fraser et al.
Software Engineering courses play an important role for preparing students with the right knowledge and attitude for software development in practice. The implication is far reaching, as the quality of the software that we use ultimately depends on the quality of the people that make them. Educating Software Engineering, however, is quite challenging, as the subject is not considered as most exciting by students, while teachers often have to deal with exploding number of students. The EU project IMPRESS seeks to explore the use of gamification in educating software engineering at the university level to improve students' engagement and hence their appreciation for the taught subjects. This paper presents the project, its objectives, and its current progress.
AINov 12, 2019
Aplib: Tactical Programming of Intelligent AgentsI. S. W. B. Prasetya
This paper presents aplib, a Java library for programming intelligent agents, featuring BDI and multi agency, but adding on top of it a novel layer of tactical programming inspired by the domain of theorem proving. Aplib is also implemented in such a way to provide the fluency of a Domain Specific Language (DSL). Compared to dedicated BDI agent programming languages such as JASON, 2APL, or GOAL,aplib's embedded DSL approach does mean that \aplib\ programmers will still be limited by Java syntax, but on other hand they get all the advantages that Java programmers get: rich language features (object orientation, static type checking, $λ$-expression, libraries, etc), a whole array of development tools, integration with other technologies, large community, etc.
SESep 11, 2019
Test Model Coverage Analysis under UncertaintyI. S. W. B. Prasetya, Rick Klomp
In model-based testing (MBT) we may have to deal with a non-deterministic model, e.g. because abstraction was applied, or because the software under test itself is non-deterministic. The same test case may then trigger multiple possible execution paths, depending on some internal decisions made by the software. Consequently, performing precise test analyses, e.g. to calculate the test coverage, are not possible. This can be mitigated if developers can annotate the model with estimated probabilities for taking each transition. A probabilistic model checking algorithm can subsequently be used to do simple probabilistic coverage analysis. However, in practice developers often want to know what the achieved aggregate coverage, which unfortunately cannot be re-expressed as a standard model checking problem. This paper presents an extension to allow efficient calculation of probabilistic aggregate coverage, and moreover also in combination with k-wise coverage.
SEJun 29, 2019
Model Checking a C++ Software Framework, a Case StudyJohn Lång, I. S. W. B. Prasetya
This paper presents a case study on applying two model checkers, SPIN and DIVINE, to verify key properties of a C++ software framework, known as ADAPRO, originally developed at CERN. SPIN was used for verifying properties on the design level. DIVINE was used for verifying simple test applications that interacted with the implementation. Both model checkers were found to have their own respective sets of pros and cons, but the overall experience was positive. Because both model checkers were used in a complementary manner, they provided valuable new insights into the framework, which would arguably have been hard to gain by traditional testing and analysis tools only. Translating the C++ source code into the modeling language of the SPIN model checker helped to find flaws in the original design. With DIVINE, defects were found in parts of the code base that had already been subject to hundreds of hours of unit tests, integration tests, and acceptance tests. Most importantly, model checking was found to be easy to integrate into the workflow of the software project and bring added value, not only as verification, but also validation methodology. Therefore, using model checking for developing library-level code seems realistic and worth the effort.
SEMar 1, 2019
Having Fun in Learning Formal SpecificationsI. S. W. B. Prasetya, Craig Q. H. D. Leek, Orestis Melkonian et al.
There are many benefits in providing formal specifications for our software. However, teaching students to do this is not always easy as courses on formal methods are often experienced as dry by students. This paper presents a game called FormalZ that teachers can use to introduce some variation in their class. Students can have some fun in playing the game and, while doing so, also learn the basics of writing formal specifications in the form of pre- and post-conditions. Unlike existing software engineering themed education games such as Pex and Code Defenders, FormalZ takes the deep gamification approach where playing gets a more central role in order to generate more engagement. This short paper presents our work in progress: the first implementation of FormalZ along with the result of a preliminary users' evaluation. This implementation is functionally complete and tested, but the polishing of its user interface is still future work.
SESep 15, 2018
Neural Networks as Artificial SpecificationsI. S. W. B. Prasetya, Minh An Tran
In theory, a neural network can be trained to act as an artificial specification for a program by showing it samples of the programs executions. In practice, the training turns out to be very hard. Programs often operate on discrete domains for which patterns are difficult to discern. Earlier experiments reported too much false positives. This paper revisits an experiment by Vanmali et al. by investigating several aspects that were uninvestigated in the original work: the impact of using different learning modes, aggressiveness levels, and abstraction functions. The results are quite promising.
IRApr 18, 2018
Highly Relevant Routing Recommendation Systems for Handling Few Data Using MDL Principle and Embedded Relevance Boosting FactorsDiyah Puspitaningrum, I. S. W. B. Prasetya, P. A. Wicaksono
A route recommendation system can provide better recommendation if it also takes collected user reviews into account, e.g. places that generally get positive reviews may be preferred. However, to classify sentiment, many classification algorithms existing today suffer in handling small data items such as short written reviews. In this paper we propose a model for a strongly relevant route recommendation system that is based on an MDL-based (Minimum Description Length) sentiment classification and show that such a system is capable of handling small data items (short user reviews). Another highlight of the model is the inclusion of a set of boosting factors in the relevance calculation to improve the relevance in any recommendation system that implements the model.
SEApr 11, 2018
Modeling and Testing Implementations of Protocols with Complex MessagesTom Tervoort, I. S. W. B. Prasetya
This paper presents a new language called APSL for formally describing protocols to facilitate automated testing. Many real world communication protocols exchange messages whose structures are not trivial, e.g. they may consist of multiple and nested fields, some could be optional, and some may have values that depend on other fields. To properly test implementations of such a protocol, it is not sufficient to only explore different orders of sending and receiving messages. We also need to investigate if the implementation indeed produces correctly formatted messages, and if it responds correctly when it receives different variations of every message type. APSL's main contribution is its sublanguage that is expressive enough to describe complex message formats, both text-based and binary. As an example, this paper also presents a case study where APSL is used to model and test a subset of Courier IMAP email server.
IRNov 23, 2017
Wiki-MetaSemantik: A Wikipedia-derived Query Expansion Approach based on Network PropertiesD. Puspitaningrum, G. Yulianti, I. S. W. B. Prasetya
This paper discusses the use of Wikipedia for building semantic ontologies to do Query Expansion (QE) in order to improve the search results of search engines. In this technique, selecting related Wikipedia concepts becomes important. We propose the use of network properties (degree, closeness, and pageRank) to build an ontology graph of user query concepts which is derived directly from Wikipedia structures. The resulting expansion system is called Wiki-MetaSemantik. We tested this system against other online thesauruses and ontology based QE in both individual and meta-search engines setups. Despite that our system has to build a Wikipedia ontology graph in order to do its work, the technique turns out to work very fast (1:281) compared to another ontology QE baseline (Wikipedia Persian ontology QE). It has thus the potential to be utilized online. Furthermore, it shows significant improvement in accuracy. Wiki-MetaSemantik also shows better performance in a meta-search engine (MSE) set up rather than in an individual search engine set up.