HCSep 5, 2023
Do You Trust ChatGPT? -- Perceived Credibility of Human and AI-Generated ContentMartin Huschens, Martin Briesch, Dominik Sobania et al.
This paper examines how individuals perceive the credibility of content originating from human authors versus content generated by large language models, like the GPT language model family that powers ChatGPT, in different user interface versions. Surprisingly, our results demonstrate that regardless of the user interface presentation, participants tend to attribute similar levels of credibility. While participants also do not report any different perceptions of competence and trustworthiness between human and AI-generated content, they rate AI-generated content as being clearer and more engaging. The findings from this study serve as a call for a more discerning approach to evaluating information sources, encouraging users to exercise caution and critical thinking when engaging with content generated by AI systems.
LGNov 28, 2023
Large Language Models Suffer From Their Own Output: An Analysis of the Self-Consuming Training LoopMartin Briesch, Dominik Sobania, Franz Rothlauf
Large Language Models (LLM) are already widely used to generate content for a variety of online platforms. As we are not able to safely distinguish LLM-generated content from human-produced content, LLM-generated content is used to train the next generation of LLMs, giving rise to a self-consuming training loop. From the image generation domain we know that such a self-consuming training loop reduces both quality and diversity of images finally ending in a model collapse. However, it is unclear whether this alarming effect can also be observed for LLMs. Therefore, we present the first study investigating the self-consuming training loop for LLMs. Further, we propose a novel method based on logic expressions that allows us to unambiguously verify the correctness of LLM-generated content, which is difficult for natural language text. We find that the self-consuming training loop produces correct outputs, however, the output declines in its diversity depending on the proportion of the used generated data. Fresh data can slow down this decline, but not stop it. Given these concerning results, we encourage researchers to study methods to negate this process.
LGAug 28, 2024
Robust Statistical Scaling of Outlier Scores: Improving the Quality of Outlier Probabilities for Outliers (Extended Version)Philipp Röchner, Henrique O. Marques, Ricardo J. G. B. Campello et al.
Outlier detection algorithms typically assign an outlier score to each observation in a dataset, indicating the degree to which an observation is an outlier. However, these scores are often not comparable across algorithms and can be difficult for humans to interpret. Statistical scaling addresses this problem by transforming outlier scores into outlier probabilities without using ground-truth labels, thereby improving interpretability and comparability across algorithms. However, the quality of this transformation can be different for outliers and inliers. Missing outliers in scenarios where they are of particular interest - such as healthcare, finance, or engineering - can be costly or dangerous. Thus, ensuring good probabilities for outliers is essential. This paper argues that statistical scaling, as commonly used in the literature, does not produce equally good probabilities for outliers as for inliers. Therefore, we propose robust statistical scaling, which uses robust estimators to improve the probabilities for outliers. We evaluate several variants of our method against other outlier score transformations for real-world datasets and outlier detection algorithms, where it can improve the probabilities for outliers.
LGNov 12, 2025
Transformer Semantic Genetic Programming for d-dimensional Symbolic Regression ProblemsPhilipp Anthes, Dominik Sobania, Franz Rothlauf
Transformer Semantic Genetic Programming (TSGP) is a semantic search approach that uses a pre-trained transformer model as a variation operator to generate offspring programs with controlled semantic similarity to a given parent. Unlike other semantic GP approaches that rely on fixed syntactic transformations, TSGP aims to learn diverse structural variations that lead to solutions with similar semantics. We find that a single transformer model trained on millions of programs is able to generalize across symbolic regression problems of varying dimension. Evaluated on 24 real-world and synthetic datasets, TSGP significantly outperforms standard GP, SLIM_GSGP, Deep Symbolic Regression, and Denoising Autoencoder GP, achieving an average rank of 1.58 across all benchmarks. Moreover, TSGP produces more compact solutions than SLIM_GSGP, despite its higher accuracy. In addition, the target semantic distance $\mathrm{SD}_t$ is able to control the step size in the semantic space: small values of $\mathrm{SD}_t$ enable consistent improvement in fitness but often lead to larger programs, while larger values promote faster convergence and compactness. Thus, $\mathrm{SD}_t$ provides an effective mechanism for balancing exploration and exploitation.
GNApr 4
An Imbalanced Dataset with Multiple Feature Representations for Studying Quality Control of Next-Generation SequencingPhilipp Röchner, Clarissa Krämer, Johannes U Mayer et al.
Next-generation sequencing (NGS) is a key technique for studying the DNA and RNA of organisms. However, identifying quality problems in NGS data across different experimental settings remains challenging. To develop automated quality-control tools, researchers require datasets with features that capture the characteristics of quality problems. Existing NGS repositories, however, offer only a limited number of quality-related features. To address this gap, we propose a dataset derived from 37.491 NGS samples with two types of quality-related feature representations. The first type consists of 34 features derived from quality control tools (QC-34 features). The second type has a variable number of features ranging from eight to 1.183. These features were derived from read counts in problematic genomic regions identified by the ENCODE blocklist (BL features). All features describe the same human and mouse samples from five genomic assays, allowing direct comparison of feature representations. The proposed dataset includes a binary quality label, derived from automated quality control and domain experts. Among all samples, $3.2\%$ are of low quality. Supervised machine learning algorithms accurately predicted quality labels from the features, confirming the relevance of the provided feature representations. The proposed feature representations enable researchers to study how different feature types (QC-34 vs. BL features) and granularities (varying number of BL features) affect the detection of quality problems.
CVNov 21, 2024
ComfyGI: Automatic Improvement of Image Generation WorkflowsDominik Sobania, Martin Briesch, Franz Rothlauf
Automatic image generation is no longer just of interest to researchers, but also to practitioners. However, current models are sensitive to the settings used and automatic optimization methods often require human involvement. To bridge this gap, we introduce ComfyGI, a novel approach to automatically improve workflows for image generation without the need for human intervention driven by techniques from genetic improvement. This enables image generation with significantly higher quality in terms of the alignment with the given description and the perceived aesthetics. On the performance side, we find that overall, the images generated with an optimized workflow are about 50% better compared to the initial workflow in terms of the median ImageReward score. These already good results are even surpassed in our human evaluation, as the participants preferred the images improved by ComfyGI in around 90% of the cases.
LGJul 21, 2025
We Need to Rethink Benchmarking in Anomaly DetectionPhilipp Röchner, Simon Klüttermann, Franz Rothlauf et al.
Despite the continuous proposal of new anomaly detection algorithms and extensive benchmarking efforts, progress seems to stagnate, with only minor performance differences between established baselines and new algorithms. In this position paper, we argue that this stagnation is due to limitations in how we evaluate anomaly detection algorithms. Current benchmarking does not, for example, sufficiently reflect the diversity of anomalies in applications ranging from predictive maintenance to scientific discovery. Consequently, we need to rethink benchmarking in anomaly detection. In our opinion, anomaly detection should be studied using scenarios that capture the relevant characteristics of different applications. We identify three key areas for improvement: First, we need to identify anomaly detection scenarios based on a common taxonomy. Second, anomaly detection pipelines should be analyzed end-to-end and by component. Third, evaluating anomaly detection algorithms should be meaningful regarding the scenario's objectives.
SENov 15, 2021
Choose Your Programming Copilot: A Comparison of the Program Synthesis Performance of GitHub Copilot and Genetic ProgrammingDominik Sobania, Martin Briesch, Franz Rothlauf
GitHub Copilot, an extension for the Visual Studio Code development environment powered by the large-scale language model Codex, makes automatic program synthesis available for software developers. This model has been extensively studied in the field of deep learning, however, a comparison to genetic programming, which is also known for its performance in automatic program synthesis, has not yet been carried out. In this paper, we evaluate GitHub Copilot on standard program synthesis benchmark problems and compare the achieved results with those from the genetic programming literature. In addition, we discuss the performance of both approaches. We find that the performance of the two approaches on the benchmark problems is quite similar, however, in comparison to GitHub Copilot, the program synthesis approaches based on genetic programming are not yet mature enough to support programmers in practical software development. Genetic programming usually needs a huge amount of expensive hand-labeled training cases and takes too much time to generate solutions. Furthermore, source code generated by genetic programming approaches is often bloated and difficult to understand. For future work on program synthesis with genetic programming, we suggest researchers to focus on improving the execution time, readability, and usability.
NEAug 27, 2021
Recent Developments in Program Synthesis with Evolutionary AlgorithmsDominik Sobania, Dirk Schweim, Franz Rothlauf
The automatic generation of computer programs is one of the main applications with practical relevance in the field of evolutionary computation. With program synthesis techniques not only software developers could be supported in their everyday work but even users without any programming knowledge could be empowered to automate repetitive tasks and implement their own new functionality. In recent years, many novel program synthesis approaches based on evolutionary algorithms have been proposed and evaluated on common benchmark problems. Therefore, we identify in this work the relevant evolutionary program synthesis approaches and provide an in-depth analysis of their performance. The most influential approaches we identify are stack-based, grammar-guided, as well as linear genetic programming. Further, we find that these approaches perform well on benchmark problems if there is a simple mapping from the given input to the correct output. On problems where this mapping is complex, e.g., if the problem consists of several sub-problems or requires iteration/recursion for a correct solution, results tend to be worse. Consequently, for future work, we encourage researchers not only to use a program's output for assessing the quality of a solution but also the way towards a solution (e.g., correctly solved sub-problems).
LGJun 8, 2021
The Randomness of Input Data Spaces is an A Priori Predictor for GeneralizationMartin Briesch, Dominik Sobania, Franz Rothlauf
Over-parameterized models can perfectly learn various types of data distributions, however, generalization error is usually lower for real data in comparison to artificial data. This suggests that the properties of data distributions have an impact on generalization capability. This work focuses on the search space defined by the input data and assumes that the correlation between labels of neighboring input values influences generalization. If correlation is low, the randomness of the input data space is high leading to high generalization error. We suggest to measure the randomness of an input data space using Maurer's universal. Results for synthetic classification tasks and common image classification benchmarks (MNIST, CIFAR10, and Microsoft's cats vs. dogs data set) find a high correlation between the randomness of input data spaces and the generalization error of deep neural networks for binary classification problems.
NESep 22, 2015
Deep Boltzmann Machines in Estimation of Distribution Algorithms for Combinatorial OptimizationMalte Probst, Franz Rothlauf
Estimation of Distribution Algorithms (EDAs) require flexible probability models that can be efficiently learned and sampled. Deep Boltzmann Machines (DBMs) are generative neural networks with these desired properties. We integrate a DBM into an EDA and evaluate the performance of this system in solving combinatorial optimization problems with a single objective. We compare the results to the Bayesian Optimization Algorithm. The performance of DBM-EDA was superior to BOA for difficult additively decomposable functions, i.e., concatenated deceptive traps of higher order. For most other benchmark problems, DBM-EDA cannot clearly outperform BOA, or other neural network-based EDAs. In particular, it often yields optimal solutions for a subset of the runs (with fewer evaluations than BOA), but is unable to provide reliable convergence to the global optimum competitively. At the same time, the model building process is computationally more expensive than that of other EDAs using probabilistic models from the neural network family, such as DAE-EDA.
LGSep 3, 2015
Training a Restricted Boltzmann Machine for Classification by Labeling Model SamplesMalte Probst, Franz Rothlauf
We propose an alternative method for training a classification model. Using the MNIST set of handwritten digits and Restricted Boltzmann Machines, it is possible to reach a classification performance competitive to semi-supervised learning if we first train a model in an unsupervised fashion on unlabeled data only, and then manually add labels to model samples instead of training data samples with the help of a GUI. This approach can benefit from the fact that model samples can be presented to the human labeler in a video-like fashion, resulting in a higher number of labeled examples. Also, after some initial training, hard-to-classify examples can be distinguished from easy ones automatically, saving manual work.
NENov 27, 2014
Scalability of using Restricted Boltzmann Machines for Combinatorial OptimizationMalte Probst, Franz Rothlauf, Jörn Grahl
Estimation of Distribution Algorithms (EDAs) require flexible probability models that can be efficiently learned and sampled. Restricted Boltzmann Machines (RBMs) are generative neural networks with these desired properties. We integrate an RBM into an EDA and evaluate the performance of this system in solving combinatorial optimization problems with a single objective. We assess how the number of fitness evaluations and the CPU time scale with problem size and with problem complexity. The results are compared to the Bayesian Optimization Algorithm, a state-of-the-art EDA. Although RBM-EDA requires larger population sizes and a larger number of fitness evaluations, it outperforms BOA in terms of CPU times, in particular if the problem is large or complex. RBM-EDA requires less time for model building than BOA. These results highlight the potential of using generative neural networks for combinatorial optimization.