CL | Feb 12, 2023
ASR Bundestag: A Large-Scale political debate dataset in German
Johannes Wirth, René Peinl
We present ASR Bundestag, a dataset for automatic speech recognition in German, consisting of 610 hours of aligned audio-transcript pairs for supervised training as well as 1,038 hours of unlabeled audio snippets for self-supervised learning, based on raw audio data and transcriptions from plenary sessions and committee meetings of the German parliament. In addition, we discuss the approaches used for the automated creation of speech datasets and assess the quality of the resulting dataset based on evaluations and finetuning of a pre-trained state-of-the-art model. We make the dataset publicly available, including all subsets.
CV | May 12
Unlocking UML Class Diagram Understanding in Vision Language Models
Artem Naboichenko, René Peinl
Although Vision Language Models (VLMs) have seen tremendous progress across all kinds of use cases, they still fall behind in answering questions regarding diagrams compared to photos. Although progress has been made on bar charts, line charts and similar diagrams, there is still little research on other types of diagrams, e.g. in the computer science domain. Our work presents a benchmark for visual question answering based on UML class diagrams which is both challenging and manageable. We further construct a large-scale training dataset with 16,000 image-question-answer triples and show that a LoRA-based finetune easily outperforms Qwen 3.5 27B, which is a recent and well-performing VLM in many other benchmarks.
CL | May 19, 2023 | Code
Evaluation of medium-large Language Models at zero-shot closed book generative question answering
René Peinl, Johannes Wirth
Large language models (LLMs) have garnered significant attention, but the definition of "large" lacks clarity. This paper focuses on medium-sized language models (MLMs), defined as having at least six billion parameters but fewer than 100 billion. The study evaluates MLMs regarding zero-shot generative question answering, which requires models to provide elaborate answers without external document retrieval. The paper introduces its own test dataset and presents results from human evaluation. Results show that combining the best answers from different MLMs yielded an overall correct answer rate of 82.7%, which is better than the 60.9% of ChatGPT. The best MLM achieved 71.8% and has 33B parameters, which highlights the importance of using appropriate training data for fine-tuning rather than solely relying on the number of parameters. More fine-grained feedback should be used to further improve the quality of answers. The open source community is quickly closing the gap to the best commercial models.
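The combined correct answer rate mentioned in this abstract can be illustrated with a minimal Python sketch. It assumes that "combining the best answers" means a question counts as correct when at least one model answered it correctly; the abstract does not spell out the exact procedure, so this is an interpretation. The model names and judgements below are hypothetical toy data, not results from the paper.

```python
from typing import Dict, List


def correct_rate(judgements: List[bool]) -> float:
    """Share of questions judged correct for a single model."""
    return sum(judgements) / len(judgements)


def combined_correct_rate(per_model: Dict[str, List[bool]]) -> float:
    """Share of questions answered correctly by at least one model.

    Assumption: every model was judged on the same questions in the same order.
    """
    columns = list(per_model.values())
    n_questions = len(columns[0])
    if any(len(col) != n_questions for col in columns):
        raise ValueError("all models must be judged on the same questions")
    any_correct = [any(col[i] for col in columns) for i in range(n_questions)]
    return correct_rate(any_correct)


# Hypothetical human judgements (True = answer judged correct).
judgements = {
    "model_a": [True, False, True, False],
    "model_b": [False, False, True, True],
}

for name, marks in judgements.items():
    print(name, correct_rate(marks))
print("combined", combined_correct_rate(judgements))
```

In this toy example each model alone is right on half of the questions, while the combination covers three of four, which is the kind of gap between a single model and the combination that the abstract reports.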
SD | Jun 11, 2021 | Code
HUI-Audio-Corpus-German: A high quality TTS dataset
Pascal Puchtler, Johannes Wirth, René Peinl
The increasing availability of audio data on the internet has led to a multitude of datasets for the development and training of text-to-speech applications based on neural networks. Highly differing quality of voice, low sampling rates, lack of text normalization and disadvantageous alignment of audio samples to corresponding transcript sentences still limit the performance of deep neural networks trained on this task. Additionally, data resources in languages like German are still very limited. We introduce the "HUI-Audio-Corpus-German", a large, open-source dataset for TTS engines, created with a processing pipeline that produces high-quality audio-to-transcription alignments and reduces the manual effort needed for its creation.
CL | Apr 15, 2025
Benchmarking Vision Language Models on German Factual Data
René Peinl, Vincent Tischler
Similar to LLMs, the development of vision language models is mainly driven by English datasets and models trained in the English and Chinese languages, whereas support for other languages, even those considered high-resource languages such as German, remains significantly weaker. In this work we present an analysis of open-weight VLMs on factual knowledge in the German and English languages. We disentangle the image-related aspects from the textual ones by analyzing accuracy with jury-as-a-judge in both prompt languages and images from German and international contexts. We found that for celebrities and sights, VLMs struggle because they are lacking visual cognition of German image contents. For animals and plants, the tested models can often correctly identify the image contents according to the scientific name or English common name but fail in the German language. Cars and supermarket products were identified equally well in English and German images across both prompt languages.
AI | Jun 13, 2025
VLM@school -- Evaluation of AI image understanding on German middle school knowledge
René Peinl, Vincent Tischler
This paper introduces a novel benchmark dataset designed to evaluate the capabilities of Vision Language Models (VLMs) on tasks that combine visual reasoning with subject-specific background knowledge in the German language. In contrast to widely used English-language benchmarks that often rely on artificially difficult or decontextualized problems, this dataset draws from real middle school curricula across nine domains including mathematics, history, biology, and religion. The benchmark includes over 2,000 open-ended questions grounded in 486 images, ensuring that models must integrate visual interpretation with factual reasoning rather than rely on superficial textual cues. We evaluate thirteen state-of-the-art open-weight VLMs across multiple dimensions, including domain-specific accuracy and performance on adversarially crafted questions. Our findings reveal that even the strongest models achieve less than 45% overall accuracy, with particularly poor performance in music, mathematics, and adversarial settings. Furthermore, the results indicate significant discrepancies between success on popular benchmarks and real-world multimodal understanding. We conclude that middle school-level tasks offer a meaningful and underutilized avenue for stress-testing VLMs, especially in non-English contexts. The dataset and evaluation protocol serve as a rigorous testbed to better understand and improve the visual and linguistic reasoning capabilities of future AI systems.
CL | Apr 15, 2025
Using LLMs as prompt modifier to avoid biases in AI image generators
René Peinl
This study examines how Large Language Models (LLMs) can reduce biases in text-to-image generation systems by modifying user prompts. We define bias as a model's unfair deviation from population statistics given neutral prompts. Our experiments with Stable Diffusion XL, 3.5 and Flux demonstrate that LLM-modified prompts significantly increase image diversity and reduce bias without the need to change the image generators themselves. While occasionally producing results that diverge from original user intent for elaborate prompts, this approach generally provides more varied interpretations of underspecified requests rather than superficial variations. The method works particularly well for less advanced image generators, though limitations persist for certain contexts like disability representation. All prompts and generated images are available at https://iisys-hof.github.io/llm-prompt-img-gen/
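The bias definition in this abstract (deviation from population statistics given neutral prompts) can be made concrete with a small Python sketch. The metric below, total variation distance between the attribute shares observed in generated images and the reference population shares, is an assumption chosen for illustration; the abstract does not state which measure the study uses. The labels and reference shares are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def total_variation(observed_labels: List[str], reference: Dict[str, float]) -> float:
    """Total variation distance between observed label shares and reference shares.

    Returns 0.0 when the generated images match the reference population
    exactly and 1.0 when the two distributions do not overlap at all.
    """
    counts = Counter(observed_labels)
    n = len(observed_labels)
    categories = set(reference) | set(counts)
    return 0.5 * sum(
        abs(counts.get(cat, 0) / n - reference.get(cat, 0.0)) for cat in categories
    )


# Hypothetical reference shares and hypothetical labels assigned to ten
# images generated from a neutral prompt.
reference_shares = {"female": 0.5, "male": 0.5}
original_prompt = ["male"] * 9 + ["female"] * 1
modified_prompt = ["male"] * 6 + ["female"] * 4

print(round(total_variation(original_prompt, reference_shares), 3))
print(round(total_variation(modified_prompt, reference_shares), 3))
```

Under this assumed metric, a lower value for images from the LLM-modified prompt than for the original prompt would correspond to the bias reduction the abstract describes.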
CL | Jun 11, 2021
Sprachsynthese -- State-of-the-Art in englischer und deutscher Sprache
René Peinl
Reading text aloud is an important feature for modern computer applications. It not only facilitates access to information for visually impaired people, but is also a pleasant convenience for non-impaired users. In this article, the state of the art of speech synthesis is presented separately for mel-spectrogram generation and vocoders. It concludes with an overview of available data sets for English and German with a discussion of the transferability of the good speech synthesis results from English to German language.
HC | Feb 17, 2020
Presence in VR experiences -- an empirical cost-benefit-analysis
René Peinl, Tobias Wirth
Virtual reality (VR) is on the verge of becoming a mainstream platform for gaming, education and product design. The feeling of being present in the virtual world is influenced by many factors and, more intriguingly, a single negative influence can destroy the illusion that was created with a lot of effort by other measures. Therefore, it is crucial to balance the influencing factors, to know the importance of each factor and to have a good estimate of how much effort it takes to bring each factor to a certain level of fidelity. This paper collects influencing factors discussed in the literature, analyses the immersion of current off-the-shelf VR solutions and presents results from an empirical study on the efforts and benefits of certain aspects influencing presence in VR experiences. It turns out that high fidelity is sometimes easier to achieve than medium fidelity, while for other aspects it is worthwhile to invest more effort in higher fidelity because it improves presence considerably.
SE | Sep 14, 2017
ClouNS - A Cloud-native Application Reference Model for Enterprise Architects
Nane Kratzke, René Peinl
The capability to operate cloud-native applications can generate enormous business growth and value. But enterprise architects should be aware that cloud-native applications are vulnerable to vendor lock-in. We investigated cloud-native application design principles, public cloud service providers, and industrial cloud standards. All results indicate that most cloud service categories seem to foster vendor lock-in situations, which might be especially problematic for enterprise architectures. This might sound disillusioning at first. However, we present a reference model for cloud-native applications that relies only on a small subset of well-standardized IaaS services. The reference model can be used for codifying cloud technologies. It can guide technology identification, classification, adoption, research and development processes for cloud-native applications and for vendor lock-in aware enterprise architecture engineering methodologies.