Alexander Pretschner

SE
28papers
406citations
Novelty35%
AI Score53

28 Papers

12.6SEApr 30Code
Analyzing the Availability of E-Mail Addresses for PyPI Libraries

Alexandros Tsakpinis, Alexander Pretschner

Background: Open Source Software (OSS) libraries form the backbone of modern software systems, yet their long-term sustainability often depends on maintainers being reachable for support, coordination, and security reporting. Aims: In this paper, we empirically analyze the availability of contact information, specifically e-mail addresses, across 754,413 Python libraries on the Python Package Index (PyPI) and their associated GitHub repositories. Method: We examine where maintainers provide this information, assess its validity, and explore coverage across individual libraries and their dependency chains. Results: Our findings show that 79.1% of libraries include at least one valid e-mail address, with PyPI serving as the primary source (76.5%). When analyzing dependency chains, we observe that up to 97.7% of direct and 97.5% of transitive dependencies provide valid contact information. At the same time, we identify over 793,000 invalid entries, primarily due to missing fields. Conclusions: Our results indicate strong maintainer reachability, while highlighting opportunities for improvement, such as offering clearer guidance to maintainers during the packaging process and introducing opt-in validation mechanisms for existing e-mail addresses.

11.5SEApr 30Code
Forecasting the Maintained Score from the OpenSSF Scorecard: A Study of GitHub Repositories Linked to PyPI Packages

Alexandros Tsakpinis, Efe Berk Ergüleç, Emil Schwenger et al.

Background: The OpenSSF Scorecard is widely used to assess the security posture of open-source software repositories, with the Maintained metric serving as a key indicator of recent maintenance activities, helping users identify actively maintained projects and potentially abandoned dependencies. However, the metric is inherently retrospective, providing only a short-term snapshot based on the past 90 days of repository activity and offering no insight into the future. This limitation complicates risk assessment for developers and organizations that rely on open-source dependencies. Aims: In this paper, we investigate the feasibility of forecasting future maintenance activities as captured by the OpenSSF Maintained score. Method: Focusing on 3,220 GitHub repositories linked to one of the top 1% most central PyPI libraries, as ranked by PageRank, we reconstruct historical Maintained scores over a three-year period and frame the problem as a multivariate time series forecasting task. We study four target representations: the raw Maintained score (0-10), a bucketed score capturing low (0-2), moderate (3-7), and high (8-10) maintenance levels, the numerical trend slope between consecutive scores, and categorical trend types (downward, stable, upward). We compare a machine learning model (Random Forest) and a deep learning model (LSTM) using training windows of 3-12 months and forecasting horizons of 1-6 months. Results: Our results show that future maintenance activity can be forecasted with meaningful accuracy, particularly when using aggregated representations such as bucketed scores and trend types leading to accuracies above 0.95 and 0.79. Notably, simpler machine learning models perform at least on par with deep learning approaches, suggesting that effective forecasting does not require complex architectures.

39.6SEApr 30Code
Explaining Notable Metadata Practices in PyPI Libraries: An Empirical Study about Repository and Donation Platform URLs

Alexandros Tsakpinis, Nicolas Raube, Alexander Pretschner

Background: Open source software (OSS) libraries are critical components of modern software systems, yet their metadata-particularly links to source code repositories and donation platforms-is often incomplete, outdated, or inconsistent. Such deficiencies hinder dependency monitoring, security assessment, and the sustainability of OSS projects. Aims: This study aims to explain notable metadata practices in PyPI libraries, focusing on platform dominance, outdated links, and missing references to repositories and donation platforms. As this investigation relies on large-scale qualitative survey data, we further evaluate the robustness and quality of the LLM-based topic modeling approach used to derive the findings. Method: We conducted two surveys targeting PyPI authors and maintainers, collecting 1,776 open-ended responses. To analyze these responses, we developed a LLM-based topic modeling pipeline using LLaMA 3.3 70B, including preprocessing, topic extraction, and topic merging. Robustness was assessed across 30 repeated runs using Jaccard and cosine similarity, while topic quality was evaluated by 23 experts using a structured assessment framework and Randolph's Kappa. Results: The findings reveal that missing or outdated repository links are primarily associated with oversight, lack of awareness, or perceived irrelevance, while platform dominance is driven by ideological, technical, and organizational factors. Donation platform links are often omitted due to skepticism, limited perceived benefit, or lack of knowledge, and are preferentially placed on GitHub for visibility reasons. The topic modeling approach demonstrated high robustness (up to 88% lexical and 92% semantic similarity) and produced high-quality topics, with approximately 77-78% meeting all evaluation criteria and moderate inter-rater agreement.

10.9SEMay 7Code
Modeling Dependency-Propagated Ecosystem Impact of Changes in Maintenance Activities: Evaluating Support Strategies in the PyPI Network

Alexandros Tsakpinis, Emil Schwenger, Alexander Pretschner

Background: Open source software ecosystems exhibit dense dependency networks in which maintenance degradation of structurally central packages can propagate widely. Despite increasing attention to open source sustainability, existing support mechanisms lack an explicit, dependencyaware notion of ecosystem-level impact to guide support decisions. Aims: In this paper, we introduce a dependency-aware model of ecosystem impact that captures how changes in maintenance activities propagate through the Python Package Index (PyPI) ecosystem and affect its overall state. Based on this model, we prioritize packages for ecosystem support using our dependency-propagated notion of ecosystem impact. Method: Applying this framework to a snapshot of 718,750 PyPI packages and over 2 million dependencies, we compare our impact-driven support strategy with existing support mechanisms (Tidelift, Ecosyste.ms, and GitHub Sponsors) and with PageRank as a baseline measure of structural importance. Results: Our results show that a large share of the modeled ecosystem impact (approximately 80%) can be attributed to just 0.1% of all PyPI packages when prioritized based on dependency-propagated impact. In contrast, externally defined support sets vary substantially in their alignment with ecosystem impact. We further analyze maintainer reach and metadata accessibility, revealing that ecosystem impact, social footprint, and operational feasibility represent distinct but complementary dimensions of ecosystem support. Conclusions: Dependencyaware ecosystem impact modeling provides a transparent and systematic basis for prioritizing support in large-scale software ecosystems. Our findings suggest that effective support strategies, driven by ecosystem stewards, funding bodies, and organizations operating support programs, should complement existing allocation logic with impact-informed decision making.

SEMar 7, 2019Code
Compositional Fuzzing Aided by Targeted Symbolic Execution

Saahil Ognawala, Fabian Kilger, Alexander Pretschner

Guided fuzzing has, in recent years, been able to uncover many new vulnerabilities in real-world software due to its fast input mutation strategies guided by path-coverage. However, most fuzzers are unable to achieve high coverage in deeper parts of programs. Moreover, fuzzers heavily rely on the diversity of the seed inputs, often manually provided, to be able to produce meaningful results. In this paper, we present Wildfire, a novel open-source compositional fuzzing framework. Wildfire finds vulnerabilities by fuzzing isolated functions in a C-program and, then, using targeted symbolic execution it determines the feasibility of exploitation for these vulnerabilities. Based on our evaluation of 23 open-source programs (nearly 1 million LOC), we show that Wildfire, as a result of the increased coverage, finds more true-positives than baseline symbolic execution and fuzzing tools, as well as state-of-the-art coverage-guided tools, in only 10% of the analysis time taken by them. Additionally, Wildfire finds many other potential vulnerabilities whose feasibility can be determined compositionally to confirm if they are false-positives. Wildfire could also reproduce all of the known vulnerabilities and found several previously-unknown vulnerabilities in three open-source libraries.

SEJul 24, 2018Code
Automatically Assessing Vulnerabilities Discovered by Compositional Analysis

Saahil Ognawala, Ricardo Nales Amato, Alexander Pretschner et al.

Testing is the most widely employed method to find vulnerabilities in real-world software programs. Compositional analysis, based on symbolic execution, is an automated testing method to find vulnerabilities in medium- to large-scale programs consisting of many interacting components. However, existing compositional analysis frameworks do not assess the severity of reported vulnerabilities. In this paper, we present a framework to analyze vulnerabilities discovered by an existing compositional analysis tool and assign CVSS3 (Common Vulnerability Scoring System v3.0) scores to them, based on various heuristics such as interaction with related components, ease of reachability, complexity of design and likelihood of accepting unsanitized input. By analyzing vulnerabilities reported with CVSS3 scores in the past, we train simple machine learning models. By presenting our interactive framework to developers of popular open-source software and other security experts, we gather feedback on our trained models and further improve the features to increase the accuracy of our predictions. By providing qualitative (based on community feedback) and quantitative (based on prediction accuracy) evidence from 21 open-source programs, we show that our severity prediction framework can effectively assist developers with assessing vulnerabilities.

SENov 26, 2017Code
Improving Function Coverage with Munch: A Hybrid Fuzzing and Directed Symbolic Execution Approach

Saahil Ognawala, Thomas Hutzelmann, Eirini Psallida et al.

Fuzzing and symbolic execution are popular techniques for finding vulnerabilities and generating test-cases for programs. Fuzzing, a blackbox method that mutates seed input values, is generally incapable of generating diverse inputs that exercise all paths in the program. Due to the path-explosion problem and dependence on SMT solvers, symbolic execution may also not achieve high path coverage. A hybrid technique involving fuzzing and symbolic execution may achieve better function coverage than fuzzing or symbolic execution alone. In this paper, we present Munch, an open source framework implementing two hybrid techniques based on fuzzing and symbolic execution. We empirically show using nine large open-source programs that overall, Munch achieves higher (in-depth) function coverage than symbolic execution or fuzzing alone. Using metrics based on total analyses time and number of queries issued to the SMT solver, we also show that Munch is more efficient at achieving better function coverage.

10.8SEApr 27
Leveraging LLMs for Multi-File DSL Code Generation: An Industrial Case Study

Sivajeet Chand, Kevin Nguyen, Peter Kuntz et al.

Large language models (LLMs) perform strongly on general-purpose code generation, yet their applicability to enterprise domain-specific languages (DSLs) remains underexplored, especially for repository-scale change generation spanning multiple files and folder structures from a single natural-language (NL) instruction. We report an industrial case study at BMW that adapts code-oriented LLMs to generate and modify project-root DSL artifacts for an Xtext-based DSL that drives downstream Java/TypeScript code generation. We develop an end-to-end pipeline for dataset construction, multi-file task representation, model adaptation, and evaluation. We encode DSL folder hierarchies as structured, path-preserving JSON, allowing single-response generation at repository scale and learning cross-file dependencies. We evaluate two instruction-tuned code LLMs (Qwen2.5-Coder and DeepSeek-Coder, 7B) under three configurations: baseline prompting, one-shot in-context learning, and parameter-efficient fine-tuning (QLoRA). Beyond standard similarity metrics, we introduce task-specific measures that assess edit correctness and repository structural fidelity. Fine-tuning yields the most significant gains across models and metrics, achieving high exact-match accuracy, substantial edit similarity, and structural fidelity of 1.00 on our held-out set for multi-file outputs. At the same time, one-shot in-context learning provides smaller but consistent improvements over baseline prompting. We further validate practical utility via an expert developer survey and an execution-based check using the existing code generator.

SEJul 15, 2021
Empowered and Embedded: Ethics and Agile Processes

Niina Zuber, Severin Kacianka, Jan Gogoll et al.

In this article we focus on the structural aspects of the development of ethical software, and argue that ethical considerations need to be embedded into the (agile) software development process. In fact, we claim that agile processes of software development lend themselves specifically well for this endeavour. First, we contend that ethical evaluations need to go beyond the use of software products and include an evaluation of the software itself. This implies that software engineers influence peoples' lives through the features of their designed products. Embedded values are thus approached best by software engineers themselves. Therefore, we put emphasis on the possibility to implement ethical deliberations in already existing and well established agile software development processes. Our approach relies on software engineers making their own judgments throughout the entire development process to ensure that technical features and ethical evaluation can be addressed adequately to transport and foster desirable values and norms. We argue that agile software development processes may help the implementation of ethical deliberation for five reasons: 1) agile methods are widely spread, 2) their emphasis on flat hierarchies promotes independent thinking, 3) their reliance on existing team structures serve as an incubator for deliberation, 4) agile development enhances object-focused techno-ethical realism, and, finally, 5) agile structures provide a salient endpoint to deliberation.

SEMar 19, 2021
Trustworthy Transparency by Design

Valentin Zieglmeier, Alexander Pretschner

Individuals lack oversight over systems that process their data. This can lead to discrimination and hidden biases that are hard to uncover. Recent data protection legislation tries to tackle these issues, but it is inadequate. It does not prevent data misusage while stifling sensible use cases for data. We think the conflict between data protection and increasingly data-based systems should be solved differently. When access to data is given, all usages should be made transparent to the data subjects. This enables their data sovereignty, allowing individuals to benefit from sensible data usage while addressing potentially harmful consequences of data misusage. We contribute to this with a technical concept and an empirical evaluation. First, we conceptualize a transparency framework for software design, incorporating research on user trust and experience. Second, we instantiate and empirically evaluate the framework in a focus group study over three months, centering on the user perspective. Our transparency framework enables developing software that incorporates transparency in its design. The evaluation shows that it satisfies usability and trustworthiness requirements. The provided transparency is experienced as beneficial and participants feel empowered by it. This shows that our framework enables Trustworthy Transparency by Design.

SEJan 20, 2021
Designing Accountable Systems

Severin Kacianka, Alexander Pretschner

Accountability is an often called for property of technical systems. It is a requirement for algorithmic decision systems, autonomous cyber-physical systems, and for software systems in general. As a concept, accountability goes back to the early history of Liberalism and is suggested as a tool to limit the use of power. This long history has also given us many, often slightly differing, definitions of accountability. The problem that software developers now face is to understand what accountability means for their systems and how to reflect it in a system's design. To enable the rigorous study of accountability in a system, we need models that are suitable for capturing such a varied concept. In this paper, we present a method to express and compare different definitions of accountability using Structural Causal Models. We show how these models can be used to evaluate a system's design and present a small use case based on an autonomous car.

SENov 5, 2020
Ethics in the Software Development Process: From Codes of Conduct to Ethical Deliberation

Jan Gogoll, Niina Zuber, Severin Kacianka et al.

Software systems play an ever more important role in our lives and software engineers and their companies find themselves in a position where they are held responsible for ethical issues that may arise. In this paper, we try to disentangle ethical considerations that can be performed at the level of the software engineer from those that belong in the wider domain of business ethics. The handling of ethical problems that fall into the responsibility of the engineer have traditionally been addressed by the publication of Codes of Ethics and Conduct. We argue that these Codes are barely able to provide normative orientation in software development. The main contribution of this paper is, thus, to analyze the normative features of Codes of Ethics in software engineering and to explicate how their value-based approach might prevent their usefulness from a normative perspective. Codes of Conduct cannot replace ethical deliberation because they do not and cannot offer guidance because of their underdetermined nature. This lack of orientation, we argue, triggers reactive behavior such as "cherry-picking", "risk of indifference", "ex-post orientation" and the "desire to rely on gut feeling". In the light of this, we propose to implement ethical deliberation within software development teams as a way out.

CRJul 1, 2020
Maat: Automatically Analyzing VirusTotal for Accurate Labeling and Effective Malware Detection

Aleieldin Salem, Sebastian Banescu, Alexander Pretschner

The malware analysis and detection research community relies on the online platform VirusTotal to label Android apps based on the scan results of around 60 antiviral scanners. Unfortunately, there are no standards on how to best interpret the scan results acquired from VirusTotal, which leads to the utilization of different threshold-based labeling strategies (e.g., if ten or more scanners deem an app malicious, it is considered malicious). While some of the utilized thresholds may be able to accurately approximate the ground truths of apps, the fact that VirusTotal changes the set and versions of the scanners it uses makes such thresholds unsustainable over time. We implemented a method, Maat, that tackles these issues of standardization and sustainability by automatically generating a Machine Learning (ML)-based labeling scheme, which outperforms threshold-based labeling strategies. Using the VirusTotal scan reports of 53K Android apps that span one year, we evaluated the applicability of Maat's ML-based labeling strategies by comparing their performance against threshold-based strategies. We found that such ML-based strategies (a) can accurately and consistently label apps based on their VirusTotal scan reports, and (b) contribute to training ML-based detection methods that are more effective at classifying out-of-sample apps than their threshold-based counterparts.

AIJun 5, 2020
From Checking to Inference: Actual Causality Computations as Optimization Problems

Amjad Ibrahim, Alexander Pretschner

Actual causality is increasingly well understood. Recent formal approaches, proposed by Halpern and Pearl, have made this concept mature enough to be amenable to automated reasoning. Actual causality is especially vital for building accountable, explainable systems. Among other reasons, causality reasoning is computationally hard due to the requirements of counterfactuality and the minimality of causes. Previous approaches presented either inefficient or restricted, and domain-specific, solutions to the problem of automating causality reasoning. In this paper, we present a novel approach to formulate different notions of causal reasoning, over binary acyclic models, as optimization problems, based on quantifiable notions within counterfactual computations. We contribute and compare two compact, non-trivial, and sound integer linear programming (ILP) and Maximum Satisfiability (MaxSAT) encodings to check causality. Given a candidate cause, both approaches identify what a minimal cause is. Also, we present an ILP encoding to infer causality without requiring a candidate cause. We show that both notions are efficiently automated. Using models with more than $8000$ variables, checking is computed in a matter of seconds, with MaxSAT outperforming ILP in many cases. In contrast, inference is computed in a matter of minutes.

SEMay 7, 2020
Expressing Accountability Patterns using Structural Causal Models

Severin Kacianka, Amjad Ibrahim, Alexander Pretschner

While the exact definition and implementation of accountability depend on the specific context, at its core accountability describes a mechanism that will make decisions transparent and often provides means to sanction "bad" decisions. As such, accountability is specifically relevant for Cyber-Physical Systems, such as robots or drones, that embed themselves into a human society, take decisions and might cause lasting harm. Without a notion of accountability, such systems could behave with impunity and would not fit into society. Despite its relevance, there is currently no agreement on its meaning and, more importantly, no way to express accountability properties for these systems. As a solution we propose to express the accountability properties of systems using Structural Causal Models. They can be represented as human-readable graphical models while also offering mathematical tools to analyze and reason over them. Our central contribution is to show how Structural Causal Models can be used to express and analyze the accountability properties of systems and that this approach allows us to identify accountability patterns. These accountability patterns can be catalogued and used to improve systems and their architectures.

AIOct 31, 2019
Extending Causal Models from Machines into Humans

Severin Kacianka, Amjad Ibrahim, Alexander Pretschner et al.

Causal Models are increasingly suggested as a means to reason about the behavior of cyber-physical systems in socio-technical contexts. They allow us to analyze courses of events and reason about possible alternatives. Until now, however, such reasoning is confined to the technical domain and limited to single systems or at most groups of systems. The humans that are an integral part of any such socio-technical system are usually ignored or dealt with by "expert judgment". We show how a technical causal model can be extended with models of human behavior to cover the complexity and interplay between humans and technical systems. This integrated socio-technical causal model can then be used to reason not only about actions and decisions taken by the machine, but also about those taken by humans interacting with the system. In this paper we demonstrate the feasibility of merging causal models about machines with causal models about humans and illustrate the usefulness of this approach with a highly automated vehicle example.

CRSep 25, 2019
VirtSC: Combining Virtualization Obfuscation with Self-Checksumming

Mohsen Ahmadvand, Daniel Below, Sebastian Banescu et al.

Self-checksumming (SC) is a tamper-proofing technique that ensures certain program segments (code) in memory hash to known values at runtime. SC has few restrictions on application and hence can protect a vast majority of programs. The code verification in SC requires computation of the expected hashes after compilation, as the machine-code is not known before. This means the expected hash values need to be adjusted in the binary executable, hence combining SC with other protections is limited due to this adjustment step. However, obfuscation protections are often necessary, as SC protections can be otherwise easily detected and disabled via pattern matching. In this paper, we present a layered protection using virtualization obfuscation, yielding an architecture-agnostic SC protection that requires no post-compilation adjustment. We evaluate the performance of our scheme using a dataset of 25 real-world programs (MiBench and 3 CLI games). Our results show that the SC scheme induces an average overhead of 43% for a complete protection (100% coverage). The overhead is tolerable for less CPU-intensive programs (e.g. games) and when only parts of programs (e.g. license checking) are protected. However, large overheads stemming from the virtualization obfuscation were encountered.

AIApr 30, 2019
Efficiently Checking Actual Causality with SAT Solving

Amjad Ibrahim, Simon Rehwald, Alexander Pretschner

Recent formal approaches towards causality have made the concept ready for incorporation into the technical world. However, causality reasoning is computationally hard; and no general algorithmic approach exists that efficiently infers the causes for effects. Thus, checking causality in the context of complex, multi-agent, and distributed socio-technical systems is a significant challenge. Therefore, we conceptualize an intelligent and novel algorithmic approach towards checking causality in acyclic causal models with binary variables, utilizing the optimization power in the solvers of the Boolean Satisfiability Problem (SAT). We present two SAT encodings, and an empirical evaluation of their efficiency and scalability. We show that causality is computed efficiently in less than 5 seconds for models that consist of more than 4000 variables.

CRMar 25, 2019
Don't Pick the Cherry: An Evaluation Methodology for Android Malware Detection Methods

Aleieldin Salem, Sebastian Banescu, Alexander Pretschner

In evaluating detection methods, the malware research community relies on scan results obtained from online platforms such as VirusTotal. Nevertheless, given the lack of standards on how to interpret the obtained data to label apps, researchers hinge on their intuitions and adopt different labeling schemes. The dynamicity of VirusTotal's results along with adoption of different labeling schemes significantly affect the accuracies achieved by any given detection method even on the same dataset, which gives subjective views on the method's performance and hinders the comparison of different malware detection techniques. In this paper, we demonstrate the effect of varying (1) time, (2) labeling schemes, and (3) attack scenarios on the performance of an ensemble of Android repackaged malware detection methods, called dejavu, using over 30,000 real-world Android apps. Our results vividly show the impact of varying the aforementioned 3 dimensions on dejavu's performance. With such results, we encourage the adoption of a standard methodology that takes into account those 3 dimensions in evaluating newly-devised methods to detect Android (repackaged) malware.

CRNov 27, 2018
A Real-Time Remote IDS Testbed for Connected Vehicles

Valentin Zieglmeier, Severin Kacianka, Thomas Hutzelmann et al.

Connected vehicles are becoming commonplace. A constant connection between vehicles and a central server enables new features and services. This added connectivity raises the likelihood of exposure to attackers and risks unauthorized access. A possible countermeasure to this issue are intrusion detection systems (IDS), which aim at detecting these intrusions during or after their occurrence. The problem with IDS is the large variety of possible approaches with no sensible option for comparing them. Our contribution to this problem comprises the conceptualization and implementation of a testbed for an automotive real-world scenario. That amounts to a server-side IDS detecting intrusions into vehicles remotely. To verify the validity of our approach, we evaluate the testbed from multiple perspectives, including its fitness for purpose and the quality of the data it generates. Our evaluation shows that the testbed makes the effective assessment of various IDS possible. It solves multiple problems of existing approaches, including class imbalance. Additionally, it enables reproducibility and generating data of varying detection difficulties. This allows for comprehensive evaluation of real-time, remote IDS.

SEOct 23, 2018
Understanding and Formalizing Accountability for Cyber-Physical Systems

Severin Kacianka, Alexander Pretschner

Accountability is the property of a system that enables the uncovering of causes for events and helps understand who or what is responsible for these events. Definitions and interpretations of accountability differ; however, they are typically expressed in natural language that obscures design decisions and the impact on the overall system. This paper presents a formal model to express the accountability properties of cyber-physical systems. To illustrate the usefulness of our approach, we demonstrate how three different interpretations of accountability can be expressed using the proposed model and describe the implementation implications through a case study. This formal model can be used to highlight context specific-elements of accountability mechanisms, define their capabilities, and express different notions of accountability. In addition, it makes design decisions explicit and facilitates discussion, analysis and comparison of different approaches.

LOOct 11, 2018
Model-Based Safety and Security Engineering

Vivek Nigam, Alexander Pretschner, Harald Ruess

By exploiting the increasing surface attack of systems, cyber-attacks can cause catastrophic events, such as, remotely disable safety mechanisms. This means that in order to avoid hazards, safety and security need to be integrated, exchanging information, such as, key hazards/threats, risk evaluations, mechanisms used. This white paper describes some steps towards this integration by using models. We start by identifying some key technical challenges. Then we demonstrate how models, such as Goal Structured Notation (GSN) for safety and Attack Defense Trees (ADT) for security, can address these challenges. In particular, (1) we demonstrate how to extract in an automated fashion security relevant information from safety assessments by translating GSN-Models into ADTs; (2) We show how security results can impact the confidence of safety assessments; (3) We propose a collaborative development process where safety and security assessments are built by incrementally taking into account safety and security analysis; (4) We describe how to carry out trade-off analysis in an automated fashion, such as identifying when safety and security arguments contradict each other and how to solve such contradictions. We conclude pointing out that these are the first steps towards a wide range of techniques to support Safety and Security Engineering. As a white paper, we avoid being too technical, preferring to illustrate features by using examples and thus being more accessible.

SEMar 13, 2018
Reviewing KLEE's Sonar-Search Strategy in Context of Greybox Fuzzing

Saahil Ognawala, Alexander Pretschner, Thomas Hutzelmann et al.

Automatic test-case generation techniques of symbolic execution and fuzzing are the most widely used methods to discover vulnerabilities in, both, academia and industry. However, both these methods suffer from fundamental drawbacks that stop them from achieving high path coverage that may, consequently, lead to discovering vulnerabilities at the numerical scale of static analysis. In this presentation, we examine systems-under-test (SUTs) at the granularity level of functions and postulate that achieving higher function coverage (execution of functions in a program at least once) than, both, symbolic execution and fuzzing may be a necessary condition for discovering more vulnerabilities than both. We will start this presentation with the design of a targeted search strategy for KLEE, sonar-search, that prioritizes paths leading to a target function, rather than maximizing overall path coverage in the program. Then, we will show that examining SUTs at the level of functions (compositional analysis) leads to discovering more vulnerabilities than symbolic execution from a single entry point. Using this finding, we will, then, demonstrate a greybox fuzzing method that can achieve higher function coverage than symbolic execution. Finally, we will present a framework to effectively manage vulnerabilities and assess their severities.

AIOct 10, 2017
ACCBench: A Framework for Comparing Causality Algorithms

Simon Rehwald, Amjad Ibrahim, Kristian Beckers et al.

Modern socio-technical systems are increasingly complex. A fundamental problem is that the borders of such systems are often not well-defined a-priori, which among other problems can lead to unwanted behavior during runtime. Ideally, unwanted behavior should be prevented. If this is not possible the system shall at least be able to help determine potential cause(s) a-posterori, identify responsible parties and make them accountable for their behavior. Recently, several algorithms addressing these concepts have been proposed. However, the applicability of the corresponding approaches, specifically their effectiveness and performance, is mostly unknown. Therefore, in this paper, we propose ACCBench, a benchmark tool that allows to compare and evaluate causality algorithms under a consistent setting. Furthermore, we contribute an implementation of the two causality algorithms by Gößler and Metayer and Gößler and Astefanoaei as well as of a policy compliance approach based on some concepts of Main et al. Lastly, we conduct a case study of an Intelligent Door Control System, which exposes concrete strengths and weaknesses of all algorithms under different aspects. In the course of this, we show that the effectiveness of the algorithms in terms of cause detection as well as their performance differ to some extent. In addition, our analysis reports on some qualitative aspects that should be considered when evaluating each algorithm. For example, the human effort needed to configure the algorithm and model the use case is analyzed.

SEJan 24, 2017
One evaluation of model-based testing and its automation

Alexander Pretschner, Wolfgang Prenninger, Stefan Wagner et al.

Model-based testing relies on behavior models for the generation of model traces: input and expected output---test cases---for an implementation. We use the case study of an automotive network controller to assess different test suites in terms of error detection, model coverage, and implementation coverage. Some of these suites were generated automatically with and without models, purely at random, and with dedicated functional test selection criteria. Other suites were derived manually, with and without the model at hand. Both automatically and manually derived model-based test suites detected significantly more requirements errors than hand-crafted test suites that were directly derived from the requirements. The number of detected programming errors did not depend on the use of models. Automatically generated model-based test suites detected as many errors as hand-crafted model-based suites with the same number of tests. A sixfold increase in the number of model-based tests led to an 11% increase in detected errors.

CYAug 29, 2016
Towards a Unified Model of Accountability Infrastructures

Severin Kacianka, Florian Kelbert, Alexander Pretschner

Accountability aims to provide explanations for why unwanted situations occurred, thus providing means to assign responsibility and liability. As such, accountability has slightly different meanings across the sciences. In computer science, our focus is on providing explanations for technical systems, in particular if they interact with their physical environment using sensors and actuators and may do serious harm. Accountability is relevant when considering safety, security and privacy properties and we realize that all these incarnations are facets of the same core idea. Hence, in this paper we motivate and propose a model for accountability infrastructures that is expressive enough to capture all of these domains. At its core, this model leverages formal causality models from the literature in order to provide a solid reasoning framework. We show how this model can be instantiated for several real-world use cases.

CRFeb 11, 2015
FEEBO: An Empirical Evaluation Framework for Malware Behavior Obfuscation

Sebastian Banescu, Tobias Wüchner, Marius Guggenmos et al.

Program obfuscation is increasingly popular among malware creators. Objectively comparing different malware detection approaches with respect to their resilience against obfuscation is challenging. To the best of our knowledge, there is no common empirical framework for evaluating the resilience of malware detection approaches w.r.t. behavior obfuscation. We propose and implement such a framework that obfuscates the observable behavior of malware binaries. To assess the framework's utility, we use it to obfuscate known malware binaries and then investigate the impact on detection effectiveness of different $n$-gram based detection approaches. We find that the obfuscation transformations employed by our framework significantly affect the precision of such detection approaches. Several $n$-gram-based approaches can hence be concluded not to be resilient against this simple kind of obfuscation.

CRFeb 5, 2015
Robust and Effective Malware Detection through Quantitative Data Flow Graph Metrics

Tobias Wüchner, Martín Ochoa, Alexander Pretschner

We present a novel malware detection approach based on metrics over quantitative data flow graphs. Quantitative data flow graphs (QDFGs) model process behavior by interpreting issued system calls as aggregations of quantifiable data flows.Due to the high abstraction level we consider QDFG metric based detection more robust against typical behavior obfuscation like bogus call injection or call reordering than other common behavioral models that base on raw system calls. We support this claim with experiments on obfuscated malware logs and demonstrate the superior obfuscation robustness in comparison to detection using n-grams. Our evaluations on a large and diverse data set consisting of about 7000 malware and 500 goodware samples show an average detection rate of 98.01% and a false positive rate of 0.48%. Moreover, we show that our approach is able to detect new malware (i.e. samples from malware families not included in the training set) and that the consideration of quantities in itself significantly improves detection precision.