Harald Gall

SE
5papers
152citations
Novelty53%
AI Score43

5 Papers

56.6SEMay 27
Do LLMs Favor Their Providers? Measuring Vertical Integration Bias in Code Generation

Melih Catal, Alex Wolf, Tiago Ferreiro Matos et al.

Large Language Models (LLMs) have become an integral part of software development, especially with the advent of agentic capabilities. Yet, many frontier LLMs are affiliated with specific providers. This raises the question of whether generated code favors the provider's own ecosystem over comparable alternatives, potentially constraining developers' choices and increasing dependence on a single provider. We define this behavior as Vertical Integration Bias (VIB) and introduce \textsc{VIBench}, a benchmark for measuring VIB in direct and agentic code generation across $20$ provider-selectable software-integration scenarios. Evaluating $10$ frontier provider-affiliated models against $3$ non-affiliated controls, we find positive VIB in direct generation, with six of ten affiliated models showing statistically significant effects up to $+18.8$ percentage points (pp). Agentic workflows further amplify VIB, reaching $+39.2$ pp. Moreover, early affiliated-ecosystem choices in agentic workflows can persist into conceptually decoupled downstream files, with persistence as high as $90.3\%$. These findings underscore the need to measure and account for VIB in code generation, especially as agentic capabilities become more prevalent.

SEJul 31, 2021
Adversarial Robustness of Deep Code Comment Generation

Yu Zhou, Xiaoqing Zhang, Juanjuan Shen et al.

Deep neural networks (DNNs) have shown remarkable performance in a variety of domains such as computer vision, speech recognition, or natural language processing. Recently they also have been applied to various software engineering tasks, typically involving processing source code. DNNs are well-known to be vulnerable to adversarial examples, i.e., fabricated inputs that could lead to various misbehaviors of the DNN model while being perceived as benign by humans. In this paper, we focus on the code comment generation task in software engineering and study the robustness issue of the DNNs when they are applied to this task. We propose ACCENT, an identifier substitution approach to craft adversarial code snippets, which are syntactically correct and semantically close to the original code snippet, but may mislead the DNNs to produce completely irrelevant code comments. In order to improve the robustness, ACCENT also incorporates a novel training method, which can be applied to existing code comment generation models. We conduct comprehensive experiments to evaluate our approach by attacking the mainstream encoder-decoder architectures on two large-scale publicly available datasets. The results show that ACCENT efficiently produces stable attacks with functionality-preserving adversarial examples, and the generated examples have better transferability compared with baselines. We also confirm, via experiments, the effectiveness in improving model robustness with our training method.

SEFeb 4, 2020
Boosting API Recommendation with Implicit Feedback

Yu Zhou, Xinying Yang, Taolue Chen et al.

Developers often need to use appropriate APIs to program efficiently, but it is usually a difficult task to identify the exact one they need from a vast of candidates. To ease the burden, a multitude of API recommendation approaches have been proposed. However, most of the currently available API recommenders do not support the effective integration of users' feedback into the recommendation loop. In this paper, we propose a framework, BRAID (Boosting RecommendAtion with Implicit FeeDback), which leverages learning-to-rank and active learning techniques to boost recommendation performance. By exploiting users' feedback information, we train a learning-to-rank model to re-rank the recommendation results. In addition, we speed up the feedback learning process with active learning. Existing query-based API recommendation approaches can be plugged into BRAID. We select three state-of-the-art API recommendation approaches as baselines to demonstrate the performance enhancement of BRAID measured by Hit@k (Top-k), MAP, and MRR. Empirical experiments show that, with acceptable overheads, the recommendation performance improves steadily and substantially with the increasing percentage of feedback data, comparing with the baselines.

SEMar 3, 2019
User Review-Based Change File Localization for Mobile Applications

Yu Zhou, Yanqi Su, Taolue Chen et al.

In the current mobile app development, novel and emerging DevOps practices (e.g., Continuous Delivery, Integration, and user feedback analysis) and tools are becoming more widespread. For instance, the integration of user feedback (provided in the form of user reviews) in the software release cycle represents a valuable asset for the maintenance and evolution of mobile apps. To fully make use of these assets, it is highly desirable for developers to establish semantic links between the user reviews and the software artefacts to be changed (e.g., source code and documentation), and thus to localize the potential files to change for addressing the user feedback. In this paper, we propose RISING (Review Integration via claSsification, clusterIng, and linkiNG), an automated approach to support the continuous integration of user feedback via classification, clustering, and linking of user reviews. RISING leverages domain-specific constraint information and semi-supervised learning to group user reviews into multiple fine-grained clusters concerning similar users' requests. Then, by combining the textual information from both commit messages and source code, it automatically localizes potential change files to accommodate the users' requests. Our empirical studies demonstrate that the proposed approach outperforms the state-of-the-art baseline work in terms of clustering and localization accuracy, and thus produces more reliable results.

SEAug 20, 2014
Cloud WorkBench - Infrastructure-as-Code Based Cloud Benchmarking

Joel Scheuner, Philipp Leitner, Jurgen Cito et al.

To optimally deploy their applications, users of Infrastructure-as-a-Service clouds are required to evaluate the costs and performance of different combinations of cloud configurations to find out which combination provides the best service level for their specific application. Unfortunately, benchmarking cloud services is cumbersome and error-prone. In this paper, we propose an architecture and concrete implementation of a cloud benchmarking Web service, which fosters the definition of reusable and representative benchmarks. In distinction to existing work, our system is based on the notion of Infrastructure-as-Code, which is a state of the art concept to define IT infrastructure in a reproducible, well-defined, and testable way. We demonstrate our system based on an illustrative case study, in which we measure and compare the disk IO speeds of different instance and storage types in Amazon EC2.