Dominik Sobania

LG
h-index11
9papers
290citations
Novelty42%
AI Score42

9 Papers

HCSep 5, 2023
Do You Trust ChatGPT? -- Perceived Credibility of Human and AI-Generated Content

Martin Huschens, Martin Briesch, Dominik Sobania et al.

This paper examines how individuals perceive the credibility of content originating from human authors versus content generated by large language models, like the GPT language model family that powers ChatGPT, in different user interface versions. Surprisingly, our results demonstrate that regardless of the user interface presentation, participants tend to attribute similar levels of credibility. While participants also do not report any different perceptions of competence and trustworthiness between human and AI-generated content, they rate AI-generated content as being clearer and more engaging. The findings from this study serve as a call for a more discerning approach to evaluating information sources, encouraging users to exercise caution and critical thinking when engaging with content generated by AI systems.

LGNov 28, 2023
Large Language Models Suffer From Their Own Output: An Analysis of the Self-Consuming Training Loop

Martin Briesch, Dominik Sobania, Franz Rothlauf

Large Language Models (LLM) are already widely used to generate content for a variety of online platforms. As we are not able to safely distinguish LLM-generated content from human-produced content, LLM-generated content is used to train the next generation of LLMs, giving rise to a self-consuming training loop. From the image generation domain we know that such a self-consuming training loop reduces both quality and diversity of images finally ending in a model collapse. However, it is unclear whether this alarming effect can also be observed for LLMs. Therefore, we present the first study investigating the self-consuming training loop for LLMs. Further, we propose a novel method based on logic expressions that allows us to unambiguously verify the correctness of LLM-generated content, which is difficult for natural language text. We find that the self-consuming training loop produces correct outputs, however, the output declines in its diversity depending on the proportion of the used generated data. Fresh data can slow down this decline, but not stop it. Given these concerning results, we encourage researchers to study methods to negate this process.

SEOct 18, 2023
Enhancing Genetic Improvement Mutations Using Large Language Models

Alexander E. I. Brownlee, James Callan, Karine Even-Mendoza et al.

Large language models (LLMs) have been successfully applied to software engineering tasks, including program repair. However, their application in search-based techniques such as Genetic Improvement (GI) is still largely unexplored. In this paper, we evaluate the use of LLMs as mutation operators for GI to improve the search process. We expand the Gin Java GI toolkit to call OpenAI's API to generate edits for the JCodec tool. We randomly sample the space of edits using 5 different edit types. We find that the number of patches passing unit tests is up to 75% higher with LLM-based edits than with standard Insert edits. Further, we observe that the patches found with LLMs are generally less diverse compared to standard edits. We ran GI with local search to find runtime improvements. Although many improving patches are found by LLM-enhanced GI, the best improving patch was found by standard GI.

LGNov 12, 2025
Transformer Semantic Genetic Programming for d-dimensional Symbolic Regression Problems

Philipp Anthes, Dominik Sobania, Franz Rothlauf

Transformer Semantic Genetic Programming (TSGP) is a semantic search approach that uses a pre-trained transformer model as a variation operator to generate offspring programs with controlled semantic similarity to a given parent. Unlike other semantic GP approaches that rely on fixed syntactic transformations, TSGP aims to learn diverse structural variations that lead to solutions with similar semantics. We find that a single transformer model trained on millions of programs is able to generalize across symbolic regression problems of varying dimension. Evaluated on 24 real-world and synthetic datasets, TSGP significantly outperforms standard GP, SLIM_GSGP, Deep Symbolic Regression, and Denoising Autoencoder GP, achieving an average rank of 1.58 across all benchmarks. Moreover, TSGP produces more compact solutions than SLIM_GSGP, despite its higher accuracy. In addition, the target semantic distance $\mathrm{SD}_t$ is able to control the step size in the semantic space: small values of $\mathrm{SD}_t$ enable consistent improvement in fitness but often lead to larger programs, while larger values promote faster convergence and compactness. Thus, $\mathrm{SD}_t$ provides an effective mechanism for balancing exploration and exploitation.

SEApr 10
SkillMOO: Multi-Objective Optimization of Agent Skills for Software Engineering

Jingzhi Gong, Ruizhen Gu, Zhiwei Fei et al.

Agent skills provide modular, task-specific guidance for LLM- based coding agents, but manually tuning skill bundles to balance success rate, cost, and runtime is expensive and fragile. We present SkillMOO, a multi-objective optimization framework that automatically evolves skill bundles using LLM-proposed edits and NSGA-II survivor selection: a solver agent evaluates candidate skill bundles on coding tasks and an optimizer agent proposes bundle edits based on failure analysis. On three SkillsBench software engineering tasks, SkillMOO improves pass rate by up to 131% while reducing cost up to 32% relative to the best baseline per task at low optimization overhead. Pattern analysis reveals pruning and substitution as primary drivers of improvement, suggesting effective bundles favor minimal, focused content over accumulated instructions.

CVNov 21, 2024
ComfyGI: Automatic Improvement of Image Generation Workflows

Dominik Sobania, Martin Briesch, Franz Rothlauf

Automatic image generation is no longer just of interest to researchers, but also to practitioners. However, current models are sensitive to the settings used and automatic optimization methods often require human involvement. To bridge this gap, we introduce ComfyGI, a novel approach to automatically improve workflows for image generation without the need for human intervention driven by techniques from genetic improvement. This enables image generation with significantly higher quality in terms of the alignment with the given description and the perceived aesthetics. On the performance side, we find that overall, the images generated with an optimized workflow are about 50% better compared to the initial workflow in terms of the median ImageReward score. These already good results are even surpassed in our human evaluation, as the participants preferred the images improved by ComfyGI in around 90% of the cases.

SENov 15, 2021
Choose Your Programming Copilot: A Comparison of the Program Synthesis Performance of GitHub Copilot and Genetic Programming

Dominik Sobania, Martin Briesch, Franz Rothlauf

GitHub Copilot, an extension for the Visual Studio Code development environment powered by the large-scale language model Codex, makes automatic program synthesis available for software developers. This model has been extensively studied in the field of deep learning, however, a comparison to genetic programming, which is also known for its performance in automatic program synthesis, has not yet been carried out. In this paper, we evaluate GitHub Copilot on standard program synthesis benchmark problems and compare the achieved results with those from the genetic programming literature. In addition, we discuss the performance of both approaches. We find that the performance of the two approaches on the benchmark problems is quite similar, however, in comparison to GitHub Copilot, the program synthesis approaches based on genetic programming are not yet mature enough to support programmers in practical software development. Genetic programming usually needs a huge amount of expensive hand-labeled training cases and takes too much time to generate solutions. Furthermore, source code generated by genetic programming approaches is often bloated and difficult to understand. For future work on program synthesis with genetic programming, we suggest researchers to focus on improving the execution time, readability, and usability.

NEAug 27, 2021
Recent Developments in Program Synthesis with Evolutionary Algorithms

Dominik Sobania, Dirk Schweim, Franz Rothlauf

The automatic generation of computer programs is one of the main applications with practical relevance in the field of evolutionary computation. With program synthesis techniques not only software developers could be supported in their everyday work but even users without any programming knowledge could be empowered to automate repetitive tasks and implement their own new functionality. In recent years, many novel program synthesis approaches based on evolutionary algorithms have been proposed and evaluated on common benchmark problems. Therefore, we identify in this work the relevant evolutionary program synthesis approaches and provide an in-depth analysis of their performance. The most influential approaches we identify are stack-based, grammar-guided, as well as linear genetic programming. Further, we find that these approaches perform well on benchmark problems if there is a simple mapping from the given input to the correct output. On problems where this mapping is complex, e.g., if the problem consists of several sub-problems or requires iteration/recursion for a correct solution, results tend to be worse. Consequently, for future work, we encourage researchers not only to use a program's output for assessing the quality of a solution but also the way towards a solution (e.g., correctly solved sub-problems).

LGJun 8, 2021
The Randomness of Input Data Spaces is an A Priori Predictor for Generalization

Martin Briesch, Dominik Sobania, Franz Rothlauf

Over-parameterized models can perfectly learn various types of data distributions, however, generalization error is usually lower for real data in comparison to artificial data. This suggests that the properties of data distributions have an impact on generalization capability. This work focuses on the search space defined by the input data and assumes that the correlation between labels of neighboring input values influences generalization. If correlation is low, the randomness of the input data space is high leading to high generalization error. We suggest to measure the randomness of an input data space using Maurer's universal. Results for synthetic classification tasks and common image classification benchmarks (MNIST, CIFAR10, and Microsoft's cats vs. dogs data set) find a high correlation between the randomness of input data spaces and the generalization error of deep neural networks for binary classification problems.