Mira Mezini

SE
h-index14
34papers
502citations
Novelty41%
AI Score55

34 Papers

AIAug 14, 2024
Problem Solving Through Human-AI Preference-Based Cooperation

Subhabrata Dutta, Timo Kaufmann, Goran Glavaš et al.

While there is a widespread belief that artificial general intelligence (AGI) -- or even superhuman AI -- is imminent, complex problems in expert domains are far from being solved. We argue that such problems require human-AI cooperation and that the current state of the art in generative AI is unable to play the role of a reliable partner due to a multitude of shortcomings, including difficulty to keep track of a complex solution artifact (e.g., a software program), limited support for versatile human preference expression and lack of adapting to human preference in an interactive setting. To address these challenges, we propose HAICo2, a novel human-AI co-construction framework. We take first steps towards a formalization of HAICo2 and discuss the difficult open research problems that it faces.

AIMay 27
Efficient Post-training of LLMs for Code Generation With Offline Reinforcement Learning

Mingze Wu, Abhinav Anand, Shweta Verma et al.

Post-training using online reinforcement learning (RL) is an important training step for LLMs, including code-generating models. However, online RL for code generation involves LLM inference and verification of the generated output, which can take considerable time and resources. In this paper, we explore the application of offline RL to code-generating models by leveraging existing code datasets. Our experiments demonstrate that offline RL is an effective training strategy for improving LLM performance. We show that offline RL can be especially beneficial for small LLMs and challenging coding problems.

PLMay 6
Beyond BLEU: A Semantic Evaluation Method for Code Translation

Julius Näumann, Sven Keidel, Amir Molzam Sharifloo et al.

Code translation is one of the core capabilities of LLMs. However, evaluating the correctness of translations remains difficult, as commonly used metrics such as BLEU measure only syntactic similarity, disregarding program semantics. We propose a novel evaluation methodology for code translation tasks, emphasizing semantic equivalence over surface-level string similarity. Our approach applies established compiler testing methodology to a new domain, allowing the assessment of an LLM fine-tuned for binary lifting tasks (i.e. decompiling binaries to higher-level representations). We introduce a semantic correctness score, defined as the proportion of translations that produce correct execution outcomes, and demonstrate its application by evaluating LLM-based and heuristic decompilers. Our findings show that LLM-based approaches significantly outperform heuristic ones, while BLEU scores show negligible correlation with semantic correctness (r = -0.127 to 0.354), demonstrating that syntactic metrics fail to predict functional accuracy.

PLFeb 24
DeCo: A Core Calculus for Incremental Functional Programming with Generic Data Types

Timon Böhler, Tobias Reinhard, David Richter et al.

Incrementalization speeds up computations by avoiding unnecessary recomputations and by efficiently reusing previous results. While domain-specific techniques achieve impressive speedups, e.g., in the context of database queries, they are difficult to generalize. Meanwhile, general approaches offer little support for incrementalizing domain-specific operations. In this work, we present DeCo, a novel core calculus for incremental functional programming with support for a wide range of user-defined data types. Despite its generic nature, our approach statically incrementalizes domain-specific operations on user-defined data types. It is, hence, more fine-grained than other generic techniques which resort to treating domain-specific operations as black boxes. We mechanized our work in Lean and proved it sound, meaning incrementalized execution computes the same result as full reevaluation. We also provide an executable implementation with case studies featuring examples from linear algebra, relational algebra, dictionaries, trees, and conflict-free replicated data types, plus a brief performance evaluation on linear and relational algebra and on trees.

LGMay 20
Domain-Adaptable Reinforcement Learning for Code Generation with Dense Rewards

Erfan Aghadavoodi Jolfaei, Daniel Maninger, Abhinav Anand et al.

Large language models show strong potential for automated code generation, but lack guarantees for correctness, quality, safety, and domain-specific constraints. For instance in robotics, where code generation is increasingly being used for planning and executing actions, awareness of the environment and physical constraints is critical. To facilitate the adaption of code-generating LLMs to diverse requirements, including domain-specific ones, we present a reinforcement learning framework that fine-tunes pre-trained LLMs using proximal policy optimization. Our customizable execution-aware reward formula captures and optimizes syntax, functional correctness, code style, security, and simulator executability. A token-level reward mapping mechanism enables effective credit assignment from execution outcomes to generated tokens. The framework is evaluated on general-purpose code generation (MBPP/MBPP+) and robotic program synthesis (RoboEval). The results show substantial improvements in functional correctness and simulator executability, including an absolute pass@1 increase of 19% on MBPP and a reduction in execution failures by 51% on RoboEval. These findings demonstrate that structured reinforcement learning can effectively align language models to correct program generation and domain-specific requirements.

SENov 6, 2025
Where Do LLMs Still Struggle? An In-Depth Analysis of Code Generation Benchmarks

Amir Molzam Sharifloo, Maedeh Heydari, Parsa Kazerooni et al.

Large Language Models (LLMs) have achieved remarkable success in code generation, and the race to improve their performance has become a central focus of AI research. Benchmarks and leaderboards are increasingly popular, offering quantitative rankings of LLMs. However, they provide limited insight into the tasks that LLMs consistently fail to solve - information that is crucial for understanding current limitations and guiding the development of more capable models. To address this gap, we examined code generation tasks across four popular benchmarks, identifying those that major LLMs are most likely to fail. To understand the causes of these failures, we investigated whether the static complexity of solution code contributes to them, followed by a systematic inspection of 114 tasks that LLMs consistently struggled with. Our analysis revealed four recurring patterns of weaknesses in LLMs, as well as common complications within benchmark tasks that most often lead to failure.

SESep 24, 2025Code
Benchmarking Web API Integration Code Generation

Daniel Maninger, Leon Chemnitz, Amir Molzam Sharifloo et al.

API integration is a cornerstone of our digital infrastructure, enabling software systems to connect and interact. However, as shown by many studies, writing or generating correct code to invoke APIs, particularly web APIs, is challenging. Although large language models (LLMs) have become popular in software development, their effectiveness in automating the generation of web API integration code remains unexplored. In order to address this, we present WAPIIBench, a dataset and evaluation pipeline designed to assess the ability of LLMs to generate web API invocation code. Our experiments with several open-source LLMs reveal that generating API invocations poses a significant challenge, resulting in hallucinated endpoints, incorrect argument usage, and other errors. None of the evaluated open-source models was able to solve more than 40% of the tasks.

SEApr 21, 2025Code
Integrating Symbolic Execution into the Fine-Tuning of Code-Generating LLMs

Marina Sakharova, Abhinav Anand, Mira Mezini

Code-generating Large Language Models (LLMs) have become essential tools in modern software development, enhancing productivity and accelerating development. This paper aims to investigate the fine-tuning of code-generating LLMs using Reinforcement Learning and Direct Preference Optimization, further improving their performance. To achieve this, we enhance the training data for the reward model with the help of symbolic execution techniques, ensuring more comprehensive and objective data. With symbolic execution, we create a custom dataset that better captures the nuances in code evaluation. Our reward models, fine-tuned on this dataset, demonstrate significant improvements over the baseline, CodeRL, in estimating the quality of generated code. Our code-generating LLMs, trained with the help of reward model feedback, achieve similar results compared to the CodeRL benchmark.

SEAug 20, 2021Code
Fex: Assisted Identification of Domain Features from C Programs

Patrick Müller, Krishna Narasimhan, Mira Mezini

Modern software typically performs more than one functionality. These functionalities or features are not always organized in a way for modules representing these features to be used individually. Many software engineering approaches like programming language constructs, or product line visualization techniques have been proposed to organize projects as modules. Unfortunately, much legacy software suffer from years or decades of improper coding practices that leave the modules in the code almost undetectable. In such scenarios, a desirable requirement is to identify modules representing different features to be extracted. In this paper, we propose a novel approach that combines information retrieval and program analysis approaches to allow domain experts to identify slices of the program that represent modules using natural language search terms. We evaluate our approach by building a proof of concept tool in C, and extract modules from open source projects.

CROct 21, 2020Code
Uncovering the Hidden Dangers: Finding Unsafe Go Code in the Wild

Johannes Lauinger, Lars Baumgärtner, Anna-Katharina Wickert et al.

The Go programming language aims to provide memory and thread safety through measures such as automated memory management with garbage collection and a strict type system. However, it also offers a way of circumventing this safety net through the use of the unsafe package. While there are legitimate use cases for unsafe, developers must exercise caution to avoid introducing vulnerabilities like buffer overflows or memory corruption in general. Using go-geiger, we conducted a study on the usage of unsafe in the top 500 most popular open-source Go projects on GitHub, including a manual analysis of 1,400 code samples on how unsafe is used. From the projects using Go's module system, 38% directly contain at least one unsafe usage, and 91% contain at least one unsafe usage in the project itself or one of its transitive dependencies. Based on the usage patterns found, we present possible exploit vectors in different scenarios. Finally, we present go-safer, a novel static analysis tool to identify dangerous and common usage patterns that were previously undetected with existing tools.

CRFeb 11, 2020Code
Hidden in Plain Sight: Obfuscated Strings Threatening Your Privacy

Leonid Glanz, Patrick Müller, Lars Baumgärtner et al.

String obfuscation is an established technique used by proprietary, closed-source applications to protect intellectual property. Furthermore, it is also frequently used to hide spyware or malware in applications. In both cases, the techniques range from bit-manipulation over XOR operations to AES encryption. However, string obfuscation techniques/tools suffer from one shared weakness: They generally have to embed the necessary logic to deobfuscate strings into the app code. In this paper, we show that most of the string obfuscation techniques found in malicious and benign applications for Android can easily be broken in an automated fashion. We developed StringHound, an open-source tool that uses novel techniques that identify obfuscated strings and reconstruct the originals using slicing. We evaluated StringHound on both benign and malicious Android apps. In summary, we deobfuscate almost 30 times more obfuscated strings than other string deobfuscation tools. Additionally, we analyzed 100,000 Google Play Store apps and found multiple obfuscated strings that hide vulnerable cryptographic usages, insecure internet accesses, API keys, hard-coded passwords, and exploitation of privileges without the awareness of the developer. Furthermore, our analysis reveals that not only malware uses string obfuscation but also benign apps make extensive use of string obfuscation.

AIFeb 6
Towards Understanding What State Space Models Learn About Code

Jiali Wu, Abhinav Anand, Shweta Verma et al.

State Space Models (SSMs) have emerged as an efficient alternative to the transformer architecture. Recent studies show that SSMs can match or surpass Transformers on code understanding tasks, such as code retrieval, when trained under similar conditions. However, their internal mechanisms remain a black box. We present the first systematic analysis of what SSM-based code models actually learn and perform the first comparative analysis of SSM and Transformer-based code models. Our analysis reveals that SSMs outperform Transformers at capturing code syntax and semantics in pretraining but forgets certain syntactic and semantic relations during fine-tuning on task, especially when the task emphasizes short-range dependencies. To diagnose this, we introduce SSM-Interpret, a frequency-domain framework that exposes a spectral shift toward short-range dependencies during fine-tuning. Guided by these findings, we propose architectural modifications that significantly improve the performance of SSM-based code model, validating that our analysis directly enables better models.

LGNov 5, 2025
Adaptable Hindsight Experience Replay for Search-Based Learning

Alexandros Vazaios, Jannis Brugger, Cedric Derstroff et al.

AlphaZero-like Monte Carlo Tree Search systems, originally introduced for two-player games, dynamically balance exploration and exploitation using neural network guidance. This combination makes them also suitable for classical search problems. However, the original method of training the network with simulation results is limited in sparse reward settings, especially in the early stages, where the network cannot yet give guidance. Hindsight Experience Replay (HER) addresses this issue by relabeling unsuccessful trajectories from the search tree as supervised learning signals. We introduce Adaptable HER (\ours{}), a flexible framework that integrates HER with AlphaZero, allowing easy adjustments to HER properties such as relabeled goals, policy targets, and trajectory selection. Our experiments, including equation discovery, show that the possibility of modifying HER is beneficial and surpasses the performance of pure supervised or reinforcement learning.

SEMay 5
Deep Graph-Language Fusion for Structure-Aware Code Generation

Mert Tiftikci, Amir Molzam Sharifloo, Mira Mezini

Pre-trained Language Models (PLMs) have the potential to transform software development tasks. However, despite significant advances, current PLMs struggle to capture the structured and relational attributes of code, such as control flow and data dependencies. This limitation is rooted in an architectural mismatch: whereas code structure is best represented by graphs, transformer-based LLMs process input as sequential token patterns and therefore lack explicit structural awareness. While recent research has explored integrating graph-based code representations using techniques like graph feature extraction, retrieval-augmented generation, and prompt engineering, existing approaches suffer from information loss during dense feature extraction or prompt encoding; notably, the potential of deep, token-level fusion of graph features within model internals has not been systematically explored. In this paper, we initiate such an exploration by introducing CGFuse, a novel framework that enables token-level integration of graph-derived representations by infusing learned graph features directly into the intermediate layers of pre-trained language models. CGFuse combines a graph neural network (GNN) with a language model to explicitly preserve and exploit fine-grained structural information from code graphs, including abstract syntax trees and data-flow graphs. We systematically evaluate CGFuse across multiple LLMs, demonstrating up to 10-16% BLEU and 6-11% CodeBLEU improvements in code generation performance. These results highlight the potential of deep graph-PLM integration to advance the field toward more robust, capable AI-driven software development.

LGNov 5, 2025
Prompting Neural-Guided Equation Discovery Based on Residuals

Jannis Brugger, Viktor Pfanschilling, David Richter et al.

Neural-guided equation discovery systems use a data set as prompt and predict an equation that describes the data set without extensive search. However, if the equation does not meet the user's expectations, there are few options for getting other equation suggestions without intensive work with the system. To fill this gap, we propose Residuals for Equation Discovery (RED), a post-processing method that improves a given equation in a targeted manner, based on its residuals. By parsing the initial equation to a syntax tree, we can use node-based calculation rules to compute the residual for each subequation of the initial equation. It is then possible to use this residual as new target variable in the original data set and generate a new prompt. If, with the new prompt, the equation discovery system suggests a subequation better than the old subequation on a validation set, we replace the latter by the former. RED is usable with any equation discovery system, is fast to calculate, and is easy to extend for new mathematical operations. In experiments on 53 equations from the Feynman benchmark, we show that it not only helps to improve all tested neural-guided systems, but also all tested classical genetic programming systems.

SEDec 14, 2023
Towards Trustworthy AI Software Development Assistance

Daniel Maninger, Krishna Narasimhan, Mira Mezini

It is expected that in the near future, AI software development assistants will play an important role in the software industry. However, current software development assistants tend to be unreliable, often producing incorrect, unsafe, or low-quality code. We seek to resolve these issues by introducing a holistic architecture for constructing, training, and using trustworthy AI software development assistants. In the center of the architecture, there is a foundational LLM trained on datasets representative of real-world coding scenarios and complex software architectures, and fine-tuned on code quality criteria beyond correctness. The LLM will make use of graph-based code representations for advanced semantic comprehension. We envision a knowledge graph integrated into the system to provide up-to-date background knowledge and to enable the assistant to provide appropriate explanations. Finally, a modular framework for constrained decoding will ensure that certain guarantees (e.g., for correctness and security) hold for the generated code.

AIFeb 13, 2024
Amplifying Exploration in Monte-Carlo Tree Search by Focusing on the Unknown

Cedric Derstroff, Jannis Brugger, Jannis Blüml et al.

Monte-Carlo tree search (MCTS) is an effective anytime algorithm with a vast amount of applications. It strategically allocates computational resources to focus on promising segments of the search tree, making it a very attractive search algorithm in large search spaces. However, it often expends its limited resources on reevaluating previously explored regions when they remain the most promising path. Our proposed methodology, denoted as AmEx-MCTS, solves this problem by introducing a novel MCTS formulation. Central to AmEx-MCTS is the decoupling of value updates, visit count updates, and the selected path during the tree search, thereby enabling the exclusion of already explored subtrees or leaves. This segregation preserves the utility of visit counts for both exploration-exploitation balancing and quality metrics within MCTS. The resultant augmentation facilitates in a considerably broader search using identical computational resources, preserving the essential characteristics of MCTS. The expanded coverage not only yields more precise estimations but also proves instrumental in larger and more complex problems. Our empirical evaluation demonstrates the superior performance of AmEx-MCTS, surpassing classical MCTS and related approaches by a substantial margin.

LGJan 19
Analysis of Long Range Dependency Understanding in State Space Models

Srividya Ravikumar, Abhinav Anand, Shweta Verma et al.

Although state-space models (SSMs) have demonstrated strong performance on long-sequence benchmarks, most research has emphasized predictive accuracy rather than interpretability. In this work, we present the first systematic kernel interpretability study of the diagonalized state-space model (S4D) trained on a real-world task (vulnerability detection in source code). Through time and frequency domain analysis of the S4D kernel, we show that the long-range modeling capability of S4D varies significantly under different model architectures, affecting model performance. For instance, we show that the depending on the architecture, S4D kernel can behave as low-pass, band-pass or high-pass filter. The insights from our analysis can guide future work in designing better S4D-based models.

SEMay 2, 2025
CodeSSM: Towards State Space Models for Code Understanding

Shweta Verma, Abhinav Anand, Mira Mezini

Although transformers dominate many code-specific tasks, they have significant limitations. This paper explores State Space Models (SSMs) as a promising alternative for code understanding tasks such as retrieval, classification, and clone detection. We introduce CodeSSM, the first SSM-based model trained on code corpora to assess its effectiveness. Our results demonstrate that SSMs are more sample-efficient and can extrapolate to longer contexts beyond the pretraining length. Extensive experiments show that SSMs offer a viable alternative to transformers, addressing several their limitations. Additionally, CodeSSM reduces memory usage by up to 64\% compared to transformers at a context length of 2048, with greater savings as context length grows.

AIMar 21, 2025
Neural-Guided Equation Discovery

Jannis Brugger, Mattia Cerrato, David Richter et al.

Deep learning approaches are becoming increasingly attractive for equation discovery. We show the advantages and disadvantages of using neural-guided equation discovery by giving an overview of recent papers and the results of experiments using our modular equation discovery system MGMT ($\textbf{M}$ulti-Task $\textbf{G}$rammar-Guided $\textbf{M}$onte-Carlo $\textbf{T}$ree Search for Equation Discovery). The system uses neural-guided Monte-Carlo Tree Search (MCTS) and supports both supervised and reinforcement learning, with a search space defined by a context-free grammar. We summarize seven desirable properties of equation discovery systems, emphasizing the importance of embedding tabular data sets for such learning approaches. Using the modular structure of MGMT, we compare seven architectures (among them, RNNs, CNNs, and Transformers) for embedding tabular datasets on the auxiliary task of contrastive learning for tabular data sets on an equation discovery task. For almost all combinations of modules, supervised learning outperforms reinforcement learning. Moreover, our experiments indicate an advantage of using grammar rules as action space instead of tokens. Two adaptations of MCTS -- risk-seeking MCTS and AmEx-MCTS -- can improve equation discovery with that kind of search.

SEJun 17, 2024
A Critical Study of What Code-LLMs (Do Not) Learn

Abhinav Anand, Shweta Verma, Krishna Narasimhan et al.

Large Language Models trained on code corpora (code-LLMs) have demonstrated impressive performance in various coding assistance tasks. However, despite their increased size and training dataset, code-LLMs still have limitations such as suggesting codes with syntactic errors, variable misuse etc. Some studies argue that code-LLMs perform well on coding tasks because they use self-attention and hidden representations to encode relations among input tokens. However, previous works have not studied what code properties are not encoded by code-LLMs. In this paper, we conduct a fine-grained analysis of attention maps and hidden representations of code-LLMs. Our study indicates that code-LLMs only encode relations among specific subsets of input tokens. Specifically, by categorizing input tokens into syntactic tokens and identifiers, we found that models encode relations among syntactic tokens and among identifiers, but they fail to encode relations between syntactic tokens and identifiers. We also found that fine-tuned models encode these relations poorly compared to their pre-trained counterparts. Additionally, larger models with billions of parameters encode significantly less information about code than models with only a few hundred million parameters.

SEMay 19, 2023
Towards Code Generation from BDD Test Case Specifications: A Vision

Leon Chemnitz, David Reichenbach, Hani Aldebes et al.

Automatic code generation has recently attracted large attention and is becoming more significant to the software development process. Solutions based on Machine Learning and Artificial Intelligence are being used to increase human and software efficiency in potent and innovative ways. In this paper, we aim to leverage these developments and introduce a novel approach to generating frontend component code for the popular Angular framework. We propose to do this using behavior-driven development test specifications as input to a transformer-based machine learning model. Our approach aims to drastically reduce the development time needed for web applications while potentially increasing software quality and introducing new research ideas toward automatic code generation.

SESep 2, 2021
Python Crypto Misuses in the Wild

Anna-Katharina Wickert, Lars Baumgärtner, Florian Breitfelder et al.

Background: Previous studies have shown that up to 99.59 % of the Java apps using crypto APIs misuse the API at least once. However, these studies have been conducted on Java and C, while empirical studies for other languages are missing. For example, a controlled user study with crypto tasks in Python has shown that 68.5 % of the professional developers write a secure solution for a crypto task. Aims: To understand if this observation holds for real-world code, we conducted a study of crypto misuses in Python. Method: We developed a static analysis tool that covers common misuses of 5 different Python crypto APIs. With this analysis, we analyzed 895 popular Python projects from GitHub and 51 MicroPython projects for embedded devices. Further, we compared our results with the findings of previous studies. Results: Our analysis reveals that 52.26 % of the Python projects have at least one misuse. Further, some Python crypto libraries API design helps developers from misusing crypto functions, which were much more common in studies conducted with Java and C code. Conclusion: We conclude that we can see a positive impact of the good API design on crypto misuses for Python applications. Further, our analysis of MicroPython projects reveals the importance of hybrid analyses.

CRMay 11, 2021
Dealing with Variability in API Misuse Specification

Rodrigo Bonifacio, Stefan Krüger, Krishna Narasimhan et al.

APIs are the primary mechanism for developers to gain access to externally defined services and tools. However, previous research has revealed API misuses that violate the contract of APIs to be prevalent. Such misuses can have harmful consequences, especially in the context of cryptographic libraries. Various API misuse detectors have been proposed to address this issue including CogniCrypt, one of the most versatile of such detectors and that uses a language CrySL to specify cryptographic API usage contracts. Nonetheless, existing approaches to detect API misuse had not been designed for systematic reuse, ignoring the fact that different versions of a library, different versions of a platform, and different recommendations or guidelines might introduce variability in the correct usage of an API. Yet, little is known about how such variability impacts the specification of the correct API usage. This paper investigates this question by analyzing the impact of various sources of variability on widely used Java cryptographic libraries including JCA, Bouncy Castle, and Google Tink. The results of our investigation show that sources of variability like new versions of the API and security standards significantly impact the specifications. We then use the insights gained from our investigation to motivate an extension to the CrySL language named MetaCrySL, which builds on meta programming concepts. We evaluate MetaCrySL by specifying usage rules for a family of Android versions and illustrate that MetaCrySL can model all forms of variability we identified and drastically reduce the size of a family of specifications for the correct usage of cryptographic APIs

SEOct 9, 2020
Modular Collaborative Program Analysis in OPAL

Dominik Helm, Florian Kübler, Michael Reif et al.

Current approaches combining multiple static analyses deriving different, independent properties focus either on modularity or performance. Whereas declarative approaches facilitate modularity and automated, analysis-independent optimizations, imperative approaches foster manual, analysis-specific optimizations. In this paper, we present a novel approach to static analyses that leverages the modularity of blackboard systems and combines declarative and imperative techniques. Our approach allows exchangeability, and pluggable extension of analyses in order to improve sound(i)ness, precision, and scalability and explicitly enables the combination of otherwise incompatible analyses. With our approach integrated in the OPAL framework, we were able to implement various dissimilar analyses, including a points-to analysis that outperforms an equivalent analysis from Doop, the state-of-the-art points-to analysis framework.

CRJun 10, 2020
Mind the GAP: Security & Privacy Risks of Contact Tracing Apps

Lars Baumgärtner, Alexandra Dmitrienko, Bernd Freisleben et al.

Google and Apple have jointly provided an API for exposure notification in order to implement decentralized contract tracing apps using Bluetooth Low Energy, the so-called "Google/Apple Proposal", which we abbreviate by "GAP". We demonstrate that in real-world scenarios the current GAP design is vulnerable to (i) profiling and possibly de-anonymizing infected persons, and (ii) relay-based wormhole attacks that basically can generate fake contacts with the potential of affecting the accuracy of an app-based contact tracing system. For both types of attack, we have built tools that can easily be used on mobile phones or Raspberry Pis (e.g., Bluetooth sniffers). The goal of our work is to perform a reality check towards possibly providing empirical real-world evidence for these two privacy and security risks. We hope that our findings provide valuable input for developing secure and privacy-preserving digital contact tracing systems.

HCAug 27, 2019
Smart Street Lights and Mobile Citizen Apps for Resilient Communication in a Digital City

Lars Baumgärtner, Jonas Höchst, Patrick Lampe et al.

Currently, nearly four billion people live in urban areas. Since this trend is increasing, natural disasters or terrorist attacks in such areas affect an increasing number of people. While information and communication technology is crucial for the operation of urban infrastructures and the well-being of its inhabitants, current technology is quite vulnerable to disruptions of various kinds. In future smart cities, a more resilient urban infrastructure is imperative to handle the increasing number of hazardous situations. We present a novel resilient communication approach based on smart street lights as part of the public infrastructure. It supports people in their everyday life and adapts its functionality to the challenges of emergency situations. Our approach relies on various environmental sensors and in-situ processing for automatic situation assessment, and a range of communication mechanisms (e.g., public WiFi hotspot functionality and mesh networking) for maintaining a communication network. Furthermore, resilience is not only achieved based on infrastructure deployed by a digital city's municipality, but also based on integrating citizens through software that runs on their mobile devices (e.g., smartphones and tablets). Web-based zero-installation and platform-agnostic apps can switch to device-to-device communication to continue benefiting people even during a disaster situation. Our approach, featuring a covert channel for professional responders and the zero-installation app, is evaluated through a prototype implementation based on a commercially available street light.

SEDec 1, 2017
A Systematic Evaluation of Static API-Misuse Detectors

Sven Amann, Hoan Anh Nguyen, Sarah Nadi et al.

Application Programming Interfaces (APIs) often have usage constraints, such as restrictions on call order or call conditions. API misuses, i.e., violations of these constraints, may lead to software crashes, bugs, and vulnerabilities. Though researchers developed many API-misuse detectors over the last two decades, recent studies show that API misuses are still prevalent. Therefore, we need to understand the capabilities and limitations of existing detectors in order to advance the state of the art. In this paper, we present the first-ever qualitative and quantitative evaluation that compares static API-misuse detectors along the same dimensions, and with original author validation. To accomplish this, we develop MUC, a classification of API misuses, and MUBenchPipe, an automated benchmark for detector comparison, on top of our misuse dataset, MUBench. Our results show that the capabilities of existing detectors vary greatly and that existing detectors, though capable of detecting misuses, suffer from extremely low precision and recall. A systematic root-cause analysis reveals that, most importantly, detectors need to go beyond the naive assumption that a deviation from the most-frequent usage corresponds to a misuse and need to obtain additional usage examples to train their models. We present possible directions towards more-powerful API-misuse detectors.

SEOct 2, 2017
CrySL: Validating Correct Usage of Cryptographic APIs

Stefan Krüger, Johannes Späth, Karim Ali et al.

Various studies have empirically shown that the majority of Java and Android apps misuse cryptographic libraries, causing devastating breaches of data security. Therefore, it is crucial to detect such misuses early in the development process. The fact that insecure usages are not the exception but the norm precludes approaches based on property inference and anomaly detection. In this paper, we present CrySL, a definition language that enables cryptography experts to specify the secure usage of the cryptographic libraries that they provide. CrySL combines the generic concepts of method-call sequences and data-flow constraints with domain-specific constraints related to cryptographic algorithms and their parameters. We have implemented a compiler that translates a CrySL ruleset into a context- and flow-sensitive demand-driven static analysis. The analysis automatically checks a given Java or Android app for violations of the CrySL-encoded rules. We empirically evaluated our ruleset through analyzing 10,001 Android apps. Our results show that misuse of cryptographic APIs is still widespread, with 96% of apps containing at least one misuse. However, we observed fewer of the misuses that were reported in previous work.

SEDec 2, 2013
Abmash: Mashing Up Legacy Web Applications by Automated Imitation of Human Actions

Alper Ortac, Martin Monperrus, Mira Mezini

Many business web-based applications do not offer applications programming interfaces (APIs) to enable other applications to access their data and functions in a programmatic manner. This makes their composition difficult (for instance to synchronize data between two applications). To address this challenge, this paper presents Abmash, an approach to facilitate the integration of such legacy web applications by automatically imitating human interactions with them. By automatically interacting with the graphical user interface (GUI) of web applications, the system supports all forms of integrations including bi-directional interactions and is able to interact with AJAX-based applications. Furthermore, the integration programs are easy to write since they deal with end-user, visual user-interface elements. The integration code is simple enough to be called a "mashup".

SEJun 4, 2013
Detecting Missing Method Calls as Violations of the Majority Rule

Martin Monperrus, Mira Mezini

When using object-oriented frameworks it is easy to overlook certain important method calls that are required at particular places in code. In this paper, we provide a comprehensive set of empirical facts on this problem, starting from traces of missing method calls in a bug repository. We propose a new system that searches for missing method calls in software based on the other method calls that are observable. Our key insight is that the voting theory concept of majority rule holds for method calls: a call is likely to be missing if there is a majority of similar pieces of code where this call is present. The evaluation shows that the system predictions go further missing method calls and often reveal different kinds of code smells (e.g. violations of API best practices).

SEMay 29, 2012
What Should Developers Be Aware Of? An Empirical Study on the Directives of API Documentation

Martin Monperrus, Michael Eichberg, Elif Tekes et al.

Application Programming Interfaces (API) are exposed to developers in order to reuse software libraries. API directives are natural-language statements in API documentation that make developers aware of constraints and guidelines related to the usage of an API. This paper presents the design and the results of an empirical study on the directives of API documentation of object-oriented libraries. Its main contribution is to propose and extensively discuss a taxonomy of 23 kinds of API directives.

SEMay 29, 2012
Querying Source Code with Natural Language

Markus Kimmig, Martin Monperrus, Mira Mezini

One common task of developing or maintaining software is searching the source code for information like specific method calls or write accesses to certain fields. This kind of information is required to correctly implement new features and to solve bugs. This paper presents an approach for querying source code with natural language.

SEMar 23, 2012
Semi-Automatically Extracting FAQs to Improve Accessibility of Software Development Knowledge

Stefan Henß, Martin Monperrus, Mira Mezini

Frequently asked questions (FAQs) are a popular way to document software development knowledge. As creating such documents is expensive, this paper presents an approach for automatically extracting FAQs from sources of software development discussion, such as mailing lists and Internet forums, by combining techniques of text mining and natural language processing. We apply the approach to popular mailing lists and carry out a survey among software developers to show that it is able to extract high-quality FAQs that may be further improved by experts.