Jürgen Cito

SE
h-index17
18papers
450citations
Novelty32%
AI Score53

18 Papers

56.5CRMay 31
Ethics Statements in Autonomous Penetration-Testing Agent Research

Andreas Happe, Jürgen Cito

Large Language Models (LLMs) have rapidly evolved over the past few years and are currently evaluated for their efficacy within the domain of offensive cyber-security. While initial forays showcase the potential of LLMs to enhance security research, they also raise critical ethical concerns regarding the dual-use of offensive security tooling. This paper analyzes a set of papers that leverage LLMs for offensive security, focusing on how ethical considerations are expressed and justified in their work. The goal is to assess the culture of AI in offensive security research regarding ethics communication, highlighting trends, best practices, and gaps in current discourse. We provide insights into how the academic community navigates the fine line between innovation and ethical responsibility. Particularly, our results show that 13 of 15 reviewed prototypes (86.6\%) mentioned ethical considerations and are thus aware of the potential dual-use of their research. Main motivation given for the research was allowing broader access to penetration-testing as well as preparing defenders for AI-guided attackers.

60.4SEMay 31Code
Reducing Token Usage of State-in-Context Agents using Minification

Nicolas Hrubec, Jürgen Cito

This paper presents a replication and extension of the recently introduced state-in-context agent framework. We independently re-implement the DirectSolve variant and evaluate it on the SWE-bench Verified benchmark. We report end-to-end full-benchmark results using GPT-5-mini and run selected ablations with GPT-4.1. In addition, we investigate a complementary research question: What is the impact of token-reducing input transformation strategies on the performance of software engineering agents? Based on a preliminary prompt analysis, we identify source code as the dominant contributor to token consumption. We therefore apply a series of code minification techniques that remove or shorten non-essential lexical elements while preserving program semantics. The proposed transformations are integrated into the agent and systematically evaluated. Experiments show that minification reduces average input token usage by 42% with a 12 percentage-point drop in resolution rate. These findings demonstrate that lightweight source code transformations can yield substantial efficiency gains while retaining a substantial fraction of the baseline performance, indicating a promising path toward more cost-effective agents. The full implementation is publicly available on GitHub: https://github.com/ipa-lab/minified-state-in-context-agent

CLJul 24, 2023
Getting pwn'd by AI: Penetration Testing with Large Language Models

Andreas Happe, Jürgen Cito

The field of software security testing, more specifically penetration testing, is an activity that requires high levels of expertise and involves many manual testing and analysis steps. This paper explores the potential usage of large-language models, such as GPT3.5, to augment penetration testers with AI sparring partners. We explore the feasibility of supplementing penetration testers with AI models for two distinct use cases: high-level task planning for security testing assignments and low-level vulnerability hunting within a vulnerable virtual machine. For the latter, we implemented a closed-feedback loop between LLM-generated low-level actions with a vulnerable virtual machine (connected through SSH) and allowed the LLM to analyze the machine state for vulnerabilities and suggest concrete attack vectors which were automatically executed within the virtual machine. We discuss promising initial results, detail avenues for improvement, and close deliberating on the ethics of providing AI-based sparring partners.

SEAug 8, 2022
Learning to Learn to Predict Performance Regressions in Production at Meta

Moritz Beller, Hongyu Li, Vivek Nair et al.

Catching and attributing code change-induced performance regressions in production is hard; predicting them beforehand, even harder. A primer on automatically learning to predict performance regressions in software, this article gives an account of the experiences we gained when researching and deploying an ML-based regression prediction pipeline at Meta. In this paper, we report on a comparative study with four ML models of increasing complexity, from (1) code-opaque, over (2) Bag of Words, (3) off-the-shelve Transformer-based, to (4) a bespoke Transformer-based model, coined SuperPerforator. Our investigation shows the inherent difficulty of the performance prediction problem, which is characterized by a large imbalance of benign onto regressing changes. Our results also call into question the general applicability of Transformer-based architectures for performance prediction: an off-the-shelve CodeBERT-based approach had surprisingly poor performance; our highly customized SuperPerforator architecture initially achieved prediction performance that was just on par with simpler Bag of Words models, and only outperformed them for down-stream use cases. This ability of SuperPerforator to transfer to an application with few learning examples afforded an opportunity to deploy it in practice at Meta: it can act as a pre-filter to sort out changes that are unlikely to introduce a regression, truncating the space of changes to search a regression in by up to 43%, a 45x improvement over a random baseline. To gain further insight into SuperPerforator, we explored it via a series of experiments computing counterfactual explanations. These highlight which parts of a code change the model deems important, thereby validating the learned black-box model.

SEJan 13, 2025Code
Evaluating Agent-based Program Repair at Google

Pat Rondon, Renyao Wei, José Cambronero et al.

Agent-based program repair offers to automatically resolve complex bugs end-to-end by combining the planning, tool use, and code generation abilities of modern LLMs. Recent work has explored the use of agent-based repair approaches on the popular open-source SWE-Bench, a collection of bugs from highly-rated GitHub Python projects. In addition, various agentic approaches such as SWE-Agent have been proposed to solve bugs in this benchmark. This paper explores the viability of using an agentic approach to address bugs in an enterprise context. To investigate this, we curate an evaluation set of 178 bugs drawn from Google's issue tracking system. This dataset spans both human-reported (78) and machine-reported bugs (100). To establish a repair performance baseline on this benchmark, we implement Passerine, an agent similar in spirit to SWE-Agent that can work within Google's development environment. We show that with 20 trajectory samples and Gemini 1.5 Pro, Passerine can produce a patch that passes bug tests (i.e., plausible) for 73% of machine-reported and 25.6% of human-reported bugs in our evaluation set. After manual examination, we found that 43% of machine-reported bugs and 17.9% of human-reported bugs have at least one patch that is semantically equivalent to the ground-truth patch. These results establish a baseline on an industrially relevant benchmark, which as we show, contains bugs drawn from a different distribution -- in terms of language diversity, size, and spread of changes, etc. -- compared to those in the popular SWE-Bench dataset.

SEJan 26
Adversarial Bug Reports as a Security Risk in Language Model-Based Automated Program Repair

Piotr Przymus, Andreas Happe, Jürgen Cito

Large Language Model (LLM) - based Automated Program Repair (APR) systems are increasingly integrated into modern software development workflows, offering automated patches in response to natural language bug reports. However, this reliance on untrusted user input introduces a novel and underexplored attack surface. In this paper, we investigate the security risks posed by adversarial bug reports -- realistic-looking issue submissions crafted to mislead APR systems into producing insecure or harmful code changes. We develop a comprehensive threat model and conduct an empirical study to evaluate the vulnerability of APR systems to such attacks. Our demonstration comprises 51 adversarial bug reports generated across a spectrum of strategies, ranging from manual curation to fully automated pipelines. We test these against a leading LLM-based APR system and assess both pre-repair defenses (e.g., LlamaGuard variants, PromptGuard variants, Granite-Guardian, and custom LLM filters) and post-repair detectors (GitHub Copilot, CodeQL). Our findings show that current defenses are insufficient: 90% of crafted bug reports triggered attacker-aligned patches. The best pre-repair filter blocked only 47%, while post-repair analysis -- often requiring human oversight -- was effective in just 58% of cases. To support scalable security testing, we introduce a prototype framework for automating the generation of adversarial bug reports. Our analysis exposes a structural asymmetry: generating adversarial inputs is inexpensive, while detecting or mitigating them remains costly and error-prone. We conclude with recommendations for improving the robustness of APR systems against adversarial misuse and highlight directions for future work on secure APR.

SEFeb 3, 2025Code
Agentic Bug Reproduction for Effective Automated Program Repair at Google

Runxiang Cheng, Michele Tufano, Jürgen Cito et al.

Bug reports often lack sufficient detail for developers to reproduce and fix the underlying defects. Bug Reproduction Tests (BRTs), tests that fail when the bug is present and pass when it has been resolved, are crucial for debugging, but they are rarely included in bug reports, both in open-source and in industrial settings. Thus, automatically generating BRTs from bug reports has the potential to accelerate the debugging process and lower time to repair. This paper investigates automated BRT generation within an industry setting, specifically at Google, focusing on the challenges of a large-scale, proprietary codebase and considering real-world industry bugs extracted from Google's internal issue tracker. We adapt and evaluate a state-of-the-art BRT generation technique, LIBRO, and present our agent-based approach, BRT Agent, which makes use of a fine-tuned Large Language Model (LLM) for code editing. Our BRT Agent significantly outperforms LIBRO, achieving a 28% plausible BRT generation rate, compared to 10% by LIBRO, on 80 human-reported bugs from Google's internal issue tracker. We further investigate the practical value of generated BRTs by integrating them with an Automated Program Repair (APR) system at Google. Our results show that providing BRTs to the APR system results in 30% more bugs with plausible fixes. Additionally, we introduce Ensemble Pass Rate (EPR), a metric which leverages the generated BRTs to select the most promising fixes from all fixes generated by APR system. Our evaluation on EPR for Top-K and threshold-based fix selections demonstrates promising results and trade-offs. For example, EPR correctly selects a plausible fix from a pool of 20 candidates in 70% of cases, based on its top-1 ranking.

49.3CRMay 12
Cochise: A Reference Harness for Autonomous Penetration Testing

Andreas Happe, Jürgen Cito

Recent work on LLM-driven autonomous penetration testing reports promising results, but existing systems often combine many architectural, prompting, and tool-integration choices, making it difficult to tell what is gained over a simple agent scaffold. We present cochise, a 597 LOC Python reference harness for autonomous penetration-testing experiments. Cochise connects an LLM-driven agent to a Linux execution host over SSH and supports controlled target environments reachable from that jump host. The prototype implements a separated Planner--Executor architecture in which long-term state is maintained outside the LLM context, while a ReAct-style executor issues commands over SSH and self-corrects based on command outputs. The scenario prompt can be adapted to different target environments. To demonstrate the efficacy of our minimal harness, we evaluate it against a live third-party testbed called Game of Active Directory (GOAD). Alongside the harness, we release replay and analysis tools: (i) cochise-replay for offline visualization of captured runs, (ii) cochise-analyze-alogs and cochise-analyze-graphs for cost, token, duration, and compromise analysis, and (iii) a corpus of JSON trajectory logs from GOAD runs, allowing researchers to study agent behavior without provisioning the 48--64 GB RAM / 190 GB storage testbed themselves. Cochise is intended not as a state-of-the-art pen-testing agent, but as reusable experimental infrastructure for comparing models, agent architectures, and penetration-testing traces.

SEMay 4, 2021Code
Interactive Static Software Performance Analysis in the IDE

Aaron Beigelbeck, Maurício Aniche, Jürgen Cito

Detecting performance issues due to suboptimal code during the development process can be a daunting task, especially when it comes to localizing them after noticing performance degradation after deployment. Static analysis has the potential to provide early feedback on performance problems to developers without having to run profilers with expensive (and often unavailable) performance tests. We develop a VSCode tool that integrates the static performance analysis results from Infer via code annotations and decorations (surfacing complexity analysis results in context) and side panel views showing details and overviews (enabling explainability of the results). Additionally, we design our system for interactivity to allow for more responsiveness to code changes as they happen. We evaluate the efficacy of our tool by measuring the overhead that the static performance analysis integration introduces in the development workflow. Further, we report on a case study that illustrates how our system can be used to reason about software performance in the context of a real performance bug in the ElasticSearch open-source project. Demo video: https://www.youtube.com/watch?v=-GqPb_YZMOs Repository: https://github.com/ipa-lab/vscode-infer-performance

LGDec 14, 2020Code
Enabling Collaborative Data Science Development with the Ballet Framework

Micah J. Smith, Jürgen Cito, Kelvin Lu et al.

While the open-source software development model has led to successful large-scale collaborations in building software systems, data science projects are frequently developed by individuals or small teams. We describe challenges to scaling data science collaborations and present a conceptual framework and ML programming model to address them. We instantiate these ideas in Ballet, a lightweight framework for collaborative, open-source data science through a focus on feature engineering, and an accompanying cloud-based development environment. Using our framework, collaborators incrementally propose feature definitions to a repository which are each subjected to an ML performance evaluation and can be automatically merged into an executable feature engineering pipeline. We leverage Ballet to conduct a case study analysis of an income prediction problem with 27 collaborators, and discuss implications for future designers of collaborative projects.

74.4CRApr 29
Enhancing Linux Privilege Escalation Attack Capabilities of Local LLM Agents

Benjamin Probst, Andreas Happe, Jürgen Cito

Recent research has demonstrated the potential of Large Language Models (LLMs) for autonomous penetration testing, particularly when using cloud-based restricted-weight models. However, reliance on such models introduces security, privacy, and sovereignty concerns, motivating the use of locally hosted open-weight alternatives. Prior work shows that small open-weight models perform poorly on automated Linux privilege escalation, limiting their practical applicability. In this paper, we present a systematic empirical study of whether targeted system-level and prompting interventions can bridge this performance gap. We analyze failure modes of open-weight models in autonomous privilege escalation, map them to established enhancement techniques, and evaluate five concrete interventions (chain-of-thought prompting, retrieval-augmented generation, structured prompts, history compression, and reflective analysis) implemented as extensions to hackingBuddyGPT. Our results show that open-weight models can match or outperform cloud-based baselines such as GPT-4o. With our treatments enabled, Llama3.1 70B exploits 83% of tested vulnerabilities, while smaller models including Llama3.1 8B and Qwen2.5 7B achieve 67% when using guidance. A full-factorial ablation study over all treatment combinations reveals that reflection-based treatments contribute most, while also identifying vulnerability discovery as a remaining bottleneck for local models.

CRApr 14, 2025
Benchmarking Practices in LLM-driven Offensive Security: Testbeds, Metrics, and Experiment Design

Andreas Happe, Jürgen Cito

Large Language Models (LLMs) have emerged as a powerful approach for driving offensive penetration-testing tooling. Due to the opaque nature of LLMs, empirical methods are typically used to analyze their efficacy. The quality of this analysis is highly dependent on the chosen testbed, captured metrics and analysis methods employed. This paper analyzes the methodology and benchmarking practices used for evaluating Large Language Model (LLM)-driven attacks, focusing on offensive uses of LLMs in cybersecurity. We review 19 research papers detailing 18 prototypes and their respective testbeds. We detail our findings and provide actionable recommendations for future research, emphasizing the importance of extending existing testbeds, creating baselines, and including comprehensive metrics and qualitative analysis. We also note the distinction between security research and practice, suggesting that CTF-based challenges may not fully represent real-world penetration testing scenarios.

SEFeb 2, 2022
Grammars for Free: Toward Grammar Inference for Ad Hoc Parsers

Michael Schröder, Jürgen Cito

Ad hoc parsers are everywhere: they appear any time a string is split, looped over, interpreted, transformed, or otherwise processed. Every ad hoc parser gives rise to a language: the possibly infinite set of input strings that the program accepts without going wrong. Any language can be described by a formal grammar: a finite set of rules that can generate all strings of that language. But programmers do not write grammars for ad hoc parsers -- even though they would be eminently useful. Grammars can serve as documentation, aid program comprehension, generate test inputs, and allow reasoning about language-theoretic security. We propose an automatic grammar inference system for ad hoc parsers that would enable all of these use cases, in addition to opening up new possibilities in mining software repositories and bi-directional parser synthesis.

SENov 10, 2021
Counterfactual Explanations for Models of Code

Jürgen Cito, Isil Dillig, Vijayaraghavan Murali et al.

Machine learning (ML) models play an increasingly prevalent role in many software engineering tasks. However, because most models are now powered by opaque deep neural networks, it can be difficult for developers to understand why the model came to a certain conclusion and how to act upon the model's prediction. Motivated by this problem, this paper explores counterfactual explanations for models of source code. Such counterfactual explanations constitute minimal changes to the source code under which the model "changes its mind". We integrate counterfactual explanation generation to models of source code in a real-world setting. We describe considerations that impact both the ability to find realistic and plausible counterfactual explanations, as well as the usefulness of such explanation to the user of the model. In a series of experiments we investigate the efficacy of our approach on three different models, each based on a BERT-like architecture operating over source code.

HCMar 29, 2021
Meeting in the notebook: a notebook-based environment for micro-submissions in data science collaborations

Micah J. Smith, Jürgen Cito, Kalyan Veeramachaneni

Developers in data science and other domains frequently use computational notebooks to create exploratory analyses and prototype models. However, they often struggle to incorporate existing software engineering tooling into these notebook-based workflows, leading to fragile development processes. We introduce Assemblé, a new development environment for collaborative data science projects, in which promising code fragments of data science pipelines can be contributed as pull requests to an upstream repository entirely from within JupyterLab, abstracting away low-level version control tool usage. We describe the design and implementation of Assemblé and report on a user study of 23 data scientists.

SEDec 18, 2020
An Empirical Investigation of Command-Line Customization

Michael Schröder, Jürgen Cito

The interactive command line, also known as the shell, is a prominent mechanism used extensively by a wide range of software professionals (engineers, system administrators, data scientists, etc.). Shell customizations can therefore provide insight into the tasks they repeatedly perform, how well the standard environment supports those tasks, and ways in which the environment could be productively extended or modified. To characterize the patterns and complexities of command-line customization, we mined the collective knowledge of command-line users by analyzing more than 2.2 million shell alias definitions found on GitHub. Shell aliases allow command-line users to customize their environment by defining arbitrarily complex command substitutions. Using inductive coding methods, we found three types of aliases that each enable a number of customization practices: Shortcuts (for nicknaming commands, abbreviating subcommands, and bookmarking locations), Modifications (for substituting commands, overriding defaults, colorizing output, and elevating privilege), and Scripts (for transforming data and chaining subcommands). We conjecture that identifying common customization practices can point to particular usability issues within command-line programs, and that a deeper understanding of these practices can support researchers and tool developers in designing better user experiences. In addition to our analysis, we provide an extensive reproducibility package in the form of a curated dataset together with well-documented computational notebooks enabling further knowledge discovery and a basis for learning approaches to improve command-line workflows.

SEJul 15, 2019
Characterizing Developer Use of Automatically Generated Patches

José Pablo Cambronero, Jiasi Shen, Jürgen Cito et al.

We present a study that characterizes the way developers use automatically generated patches when fixing software defects. Our study tasked two groups of developers with repairing defects in C programs. Both groups were provided with the defective line of code. One was also provided with five automatically generated and validated patches, all of which modified the defective line of code, and one of which was correct. Contrary to our initial expectations, the group with access to the generated patches did not produce more correct patches and did not produce patches in less time. We characterize the main behaviors observed in experimental subjects: a focus on understanding the defect and the relationship of the patches to the original source code. Based on this characterization, we highlight various potentially productive directions for future developer-centric automatic patch generation systems.

SESep 23, 2014
The Making of Cloud Applications An Empirical Study on Software Development for the Cloud

Jürgen Cito, Philipp Leitner, Thomas Fritz et al.

Cloud computing is gaining more and more traction as a deployment and provisioning model for software. While a large body of research already covers how to optimally operate a cloud system, we still lack insights into how professional software engineers actually use clouds, and how the cloud impacts development practices. This paper reports on the first systematic study on how software developers build applications in the cloud. We conducted a mixed-method study, consisting of qualitative interviews of 25 professional developers and a quantitative survey with 294 responses. Our results show that adopting the cloud has a profound impact throughout the software development process, as well as on how developers utilize tools and data in their daily work. Among other things, we found that (1) developers need better means to anticipate runtime problems and rigorously define metrics for improved fault localization and (2) the cloud offers an abundance of operational data, however, developers still often rely on their experience and intuition rather than utilizing metrics. From our findings, we extracted a set of guidelines for cloud development and identified challenges for researchers and tool vendors.