Georg Groh

CL
h-index24
48papers
3,220citations
Novelty34%
AI Score54

48 Papers

AIApr 10, 2022
"That Is a Suspicious Reaction!": Interpreting Logits Variation to Detect NLP Adversarial Attacks

Edoardo Mosca, Shreyash Agarwal, Javier Rando et al. · eth-zurich

Adversarial attacks are a major challenge faced by current machine learning research. These purposely crafted inputs fool even the most advanced models, precluding their deployment in safety-critical applications. Extensive research in computer vision has been carried to develop reliable defense strategies. However, the same issue remains less explored in natural language processing. Our work presents a model-agnostic detector of adversarial text examples. The approach identifies patterns in the logits of the target classifier when perturbing the input text. The proposed detector improves the current state-of-the-art performance in recognizing adversarial inputs and exhibits strong generalization capabilities across different NLP models, datasets, and word-level attacks.

CLMar 6, 2023
IFAN: An Explainability-Focused Interaction Framework for Humans and NLP Models

Edoardo Mosca, Daryna Dementieva, Tohid Ebrahim Ajdari et al.

Interpretability and human oversight are fundamental pillars of deploying complex NLP models into real-world applications. However, applying explainability and human-in-the-loop methods requires technical proficiency. Despite existing toolkits for model understanding and analysis, options to integrate human feedback are still limited. We propose IFAN, a framework for real-time explanation-based interaction with NLP models. Through IFAN's interface, users can provide feedback to selected model explanations, which is then integrated through adapter layers to align the model with human rationale. We show the system to be effective in debiasing a hate speech classifier with minimal impact on performance. IFAN also offers a visual admin system and API to manage models (and datasets) as well as control access rights. A demo is live at https://ifan.ml.

87.4CLMay 26
Granuscore: A Reference-Free Measure of Granularity for Text Analysis and Question Answering

Lukas Ellinger, Alexander Fichtl, Miriam Anschütz et al.

Natural language conveys information at varying levels of granularity, from fine-grained references to broad descriptions. While granularity is fundamental to human communication, existing measures mostly capture surface detail or sentence specificity. We introduce Granuscore, a reference-free measure of granularity that leverages structural properties of a hierarchical embedding space. Granuscore reliably recovers hierarchical orderings on the Granola-EQ dataset and captures expected differences in granularity across discourse contexts. Across domains, we further show that Granuscore explains non-linear variation in sentence specificity beyond sentence length. Finally, we apply Granuscore to four question-answering benchmarks and analyze how granularity differs for questions, gold answers, and model outputs across response outcomes. The analysis reveals consistent differences in model behavior and provides a principled lens for characterizing the difficulty of QA datasets. Together, the results position Granuscore as a scalable, broadly applicable tool for analyzing granularity in text.

CLJul 26, 2023
This is not correct! Negation-aware Evaluation of Language Generation Systems

Miriam Anschütz, Diego Miguel Lozano, Georg Groh

Large language models underestimate the impact of negations on how much they change the meaning of a sentence. Therefore, learned evaluation metrics based on these models are insensitive to negations. In this paper, we propose NegBLEURT, a negation-aware version of the BLEURT evaluation metric. For that, we designed a rule-based sentence negation tool and used it to create the CANNOT negation evaluation dataset. Based on this dataset, we fine-tuned a sentence transformer and an evaluation metric to improve their negation sensitivity. Evaluating these models on existing benchmarks shows that our fine-tuned models outperform existing metrics on the negated sentences by far while preserving their base models' performances on other perturbations.

89.4CLMay 10Code
Assessment of RAG and Fine-Tuning for Industrial Question-Answering-Applications

Jakob Sturm, Josef Pichlmeier, Christian Bernhard et al.

Large Language Models (LLMs) are increasingly employed in enterprise question-answering (QA) systems, requiring adaptation to domain-specific knowledge. Among the most prevalent methods for incorporating such knowledge are Retrieval-Augmented Generation (RAG) and fine-tuning (FT). Yet, from a cost-accuracy trade-off perspective, it remains unclear which approach best suits industry scenarios. This study examines the impact of RAG and FT on two closed datasets specific to the automotive industry, assessing answer quality and operational costs. We extend the Cost-of-Pass framework proposed by Erol et al. (arXiv:2504.13359) to jointly assess output quality, generation cost, and user interaction cost. Our findings reveal that while premium models perform best out of the box, open-source models can achieve comparable quality when enhanced with RAG. Overall, RAG emerges as the most effective and cost-efficient adaptation method for both closed- and open-source models.

AIJan 29
Depth-Recurrent Attention Mixtures: Giving Latent Reasoning the Attention it Deserves

Jonas Knupp, Jan Hendrik Metzen, Jeremias Bohn et al.

Depth-recurrence facilitates latent reasoning by sharing parameters across depths. However, prior work lacks combined FLOP-, parameter-, and memory-matched baselines, underutilizes depth-recurrence due to partially fixed layer stacks, and ignores the bottleneck of constant hidden-sizes that restricts many-step latent reasoning. To address this, we introduce a modular framework of depth-recurrent attention mixtures (Dreamer), combining sequence attention, depth attention, and sparse expert attention. It alleviates the hidden-size bottleneck through attention along depth, decouples scaling dimensions, and allows depth-recurrent models to scale efficiently and effectively. Across language reasoning benchmarks, our models require 2 to 8x fewer training tokens for the same accuracy as FLOP-, parameter-, and memory-matched SOTA, and outperform ca. 2x larger SOTA models with the same training tokens. We further present insights into knowledge usage across depths, e.g., showing 2 to 11x larger expert selection diversity than SOTA MoEs.

CLAug 20, 2024
Crafting Tomorrow's Headlines: Neural News Generation and Detection in English, Turkish, Hungarian, and Persian

Cem Üyük, Danica Rovó, Shaghayegh Kolli et al.

In the era dominated by information overload and its facilitation with Large Language Models (LLMs), the prevalence of misinformation poses a significant threat to public discourse and societal well-being. A critical concern at present involves the identification of machine-generated news. In this work, we take a significant step by introducing a benchmark dataset designed for neural news detection in four languages: English, Turkish, Hungarian, and Persian. The dataset incorporates outputs from multiple multilingual generators (in both, zero-shot and fine-tuned setups) such as BloomZ, LLaMa-2, Mistral, Mixtral, and GPT-4. Next, we experiment with a variety of classifiers, ranging from those based on linguistic features to advanced Transformer-based models and LLMs prompting. We present the detection results aiming to delve into the interpretablity and robustness of machine-generated texts detectors across all target languages.

IROct 27, 2022
Retrieving Users' Opinions on Social Media with Multimodal Aspect-Based Sentiment Analysis

Miriam Anschütz, Tobias Eder, Georg Groh

People post their opinions and experiences on social media, yielding rich databases of end-users' sentiments. This paper shows to what extent machine learning can analyze and structure these databases. An automated data analysis pipeline is deployed to provide insights into user-generated content for researchers in other domains. First, the domain expert can select an image and a term of interest. Then, the pipeline uses image retrieval to find all images showing similar content and applies aspect-based sentiment analysis to outline users' opinions about the selected term. As part of an interdisciplinary project between architecture and computer science researchers, an empirical study of Hamburg's Elbphilharmonie was conveyed. Therefore, we selected 300 thousand posts with the hashtag \enquote{\texttt{hamburg}} from the platform Flickr. Image retrieval methods generated a subset of slightly more than 1.5 thousand images displaying the Elbphilharmonie. We found that these posts mainly convey a neutral or positive sentiment towards it. With this pipeline, we suggest a new semantic computing method that offers novel insights into end-users opinions, e.g., for architecture domain experts.

CLOct 20, 2024Code
A Comprehensive Evaluation of Cognitive Biases in LLMs

Simon Malberg, Roman Poletukhin, Carolin M. Schuster et al.

We present a large-scale evaluation of 30 cognitive biases in 20 state-of-the-art large language models (LLMs) under various decision-making scenarios. Our contributions include a novel general-purpose test framework for reliable and large-scale generation of tests for LLMs, a benchmark dataset with 30,000 tests for detecting cognitive biases in LLMs, and a comprehensive assessment of the biases found in the 20 evaluated LLMs. Our work confirms and broadens previous findings suggesting the presence of cognitive biases in LLMs by reporting evidence of all 30 tested biases in at least some of the 20 LLMs. We publish our framework code to encourage future research on biases in LLMs: https://github.com/simonmalberg/cognitive-biases-in-llms

HCSep 29, 2025
Diamonds in the rough: Transforming SPARCs of imagination into a game concept by leveraging medium sized LLMs

Julian Geheeb, Farhan Abid Ivan, Daniel Dyrda et al.

Recent research has demonstrated that large language models (LLMs) can support experts across various domains, including game design. In this study, we examine the utility of medium-sized LLMs, models that operate on consumer-grade hardware typically available in small studios or home environments. We began by identifying ten key aspects that contribute to a strong game concept and used ChatGPT to generate thirty sample game ideas. Three medium-sized LLMs, LLaMA 3.1, Qwen 2.5, and DeepSeek-R1, were then prompted to evaluate these ideas according to the previously identified aspects. A qualitative assessment by two researchers compared the models' outputs, revealing that DeepSeek-R1 produced the most consistently useful feedback, despite some variability in quality. To explore real-world applicability, we ran a pilot study with ten students enrolled in a storytelling course for game development. At the early stages of their own projects, students used our prompt and DeepSeek-R1 to refine their game concepts. The results indicate a positive reception: most participants rated the output as high quality and expressed interest in using such tools in their workflows. These findings suggest that current medium-sized LLMs can provide valuable feedback in early game design, though further refinement of prompting methods could improve consistency and overall effectiveness.

AIFeb 5Code
Agent2Agent Threats in Safety-Critical LLM Assistants: A Human-Centric Taxonomy

Lukas Stappen, Ahmet Erkan Turan, Johann Hagerer et al.

The integration of Large Language Model (LLM)-based conversational agents into vehicles creates novel security challenges at the intersection of agentic AI, automotive safety, and inter-agent communication. As these intelligent assistants coordinate with external services via protocols such as Google's Agent-to-Agent (A2A), they establish attack surfaces where manipulations can propagate through natural language payloads, potentially causing severe consequences ranging from driver distraction to unauthorized vehicle control. Existing AI security frameworks, while foundational, lack the rigorous "separation of concerns" standard in safety-critical systems engineering by co-mingling the concepts of what is being protected (assets) with how it is attacked (attack paths). This paper addresses this methodological gap by proposing a threat modeling framework called AgentHeLLM (Agent Hazard Exploration for LLM Assistants) that formally separates asset identification from attack path analysis. We introduce a human-centric asset taxonomy derived from harm-oriented "victim modeling" and inspired by the Universal Declaration of Human Rights, and a formal graph-based model that distinguishes poison paths (malicious data propagation) from trigger paths (activation actions). We demonstrate the framework's practical applicability through an open-source attack path suggestion tool AgentHeLLM Attack Path Generator that automates multi-stage threat discovery using a bi-level search strategy.

CLDec 23, 2022
From Judgement's Premises Towards Key Points

Oren Sultan, Rayen Dhahri, Yauheni Mardan et al.

Key Point Analysis(KPA) is a relatively new task in NLP that combines summarization and classification by extracting argumentative key points (KPs) for a topic from a collection of texts and categorizing their closeness to the different arguments. In our work, we focus on the legal domain and develop methods that identify and extract KPs from premises derived from texts of judgments. The first method is an adaptation to an existing state-of-the-art method, and the two others are new methods that we developed from scratch. We present our methods and examples of their outputs, as well a comparison between them. The full evaluation of our results is done in the matching task -- match between the generated KPs to arguments (premises).

SEMay 29, 2025Code
From Knowledge to Noise: CTIM-Rover and the Pitfalls of Episodic Memory in Software Engineering Agents

Tobias Lindenbauer, Georg Groh, Hinrich Schütze

We introduce CTIM-Rover, an AI agent for Software Engineering (SE) built on top of AutoCodeRover (Zhang et al., 2024) that extends agentic reasoning frameworks with an episodic memory, more specifically, a general and repository-level Cross-Task-Instance Memory (CTIM). While existing open-source SE agents mostly rely on ReAct (Yao et al., 2023b), Reflexion (Shinn et al., 2023), or Code-Act (Wang et al., 2024), all of these reasoning and planning frameworks inefficiently discard their long-term memory after a single task instance. As repository-level understanding is pivotal for identifying all locations requiring a patch for fixing a bug, we hypothesize that SE is particularly well positioned to benefit from CTIM. For this, we build on the Experiential Learning (EL) approach ExpeL (Zhao et al., 2024), proposing a Mixture-Of-Experts (MoEs) inspired approach to create both a general-purpose and repository-level CTIM. We find that CTIM-Rover does not outperform AutoCodeRover in any configuration and thus conclude that neither ExpeL nor DoT-Bank (Lingam et al., 2024) scale to real-world SE problems. Our analysis indicates noise introduced by distracting CTIM items or exemplar trajectories as the likely source of the performance degradation.

CLSep 24, 2024
Tuning Into Bias: A Computational Study of Gender Bias in Song Lyrics

Danqing Chen, Adithi Satish, Rasul Khanbayov et al.

The application of text mining methods is becoming increasingly prevalent, particularly within Humanities and Computational Social Sciences, as well as in a broader range of disciplines. This paper presents an analysis of gender bias in English song lyrics using topic modeling and bias measurement techniques. Leveraging BERTopic, we cluster a dataset of 537,553 English songs into distinct topics and analyze their temporal evolution. Our results reveal a significant thematic shift in song lyrics over time, transitioning from romantic themes to a heightened focus on the sexualization of women. Additionally, we observe a substantial prevalence of profanity and misogynistic content across various topics, with a particularly high concentration in the largest thematic cluster. To further analyse gender bias across topics and genres in a quantitative way, we employ the Single Category Word Embedding Association Test (SC-WEAT) to calculate bias scores for word embeddings trained on the most prominent topics as well as individual genres. The results indicate a consistent male bias in words associated with intelligence and strength, while appearance and weakness words show a female bias. Further analysis highlights variations in these biases across topics, illustrating the interplay between thematic content and gender stereotypes in song lyrics.

CLAug 25, 2025Code
German4All -- A Dataset and Model for Readability-Controlled Paraphrasing in German

Miriam Anschütz, Thanh Mai Pham, Eslam Nasrallah et al.

The ability to paraphrase texts across different complexity levels is essential for creating accessible texts that can be tailored toward diverse reader groups. Thus, we introduce German4All, the first large-scale German dataset of aligned readability-controlled, paragraph-level paraphrases. It spans five readability levels and comprises over 25,000 samples. The dataset is automatically synthesized using GPT-4 and rigorously evaluated through both human and LLM-based judgments. Using German4All, we train an open-source, readability-controlled paraphrasing model that achieves state-of-the-art performance in German text simplification, enabling more nuanced and reader-specific adaptations. We opensource both the dataset and the model to encourage further research on multi-level paraphrasing

47.2HCMay 10
LLMs are the Ideal Candidate for Mixed-Initiative Game Design Pillar Workflows

Julian Geheeb, Marvin Julian Schwarz, Daniel Dyrda et al.

Game Design Pillars are natural language artifacts commonly used in game development to communicate a project's core vision and ensure a coherent player experience. Their linguistic nature aligns well with the strengths of Large Language Models (LLMs), which excel at generating and interpreting natural language, making them strong candidates for supporting mixed-initiative workflows centered on design pillars. In this study, we introduce a formal definition of game design pillars, present an initial prototype -- SPINE -- and investigate the utility of LLMs in the creation and decision-making processes associated with pillar-driven workflows. We begin with a pre-study to identify an appropriate model, comparing \texttt{gemini-2.0-flash} and \texttt{GPT-4o-mini}. Results show that Gemini is better suited to our tasks due to its greater output variety and consistency. We then conduct a case study by deploying the tool at a local game jam. Findings indicate positive reception and clear value in integrating SPINE into early-stage development. Finally, we interview four experts, demonstrating the tool and allowing them to experiment with it in a controlled environment. While individual perspectives vary, the overall perception is encouraging and supports our intuition: LLMs can meaningfully contribute to game design pillar workflows. These early findings highlight the potential of formalizing pillar-driven design as a research space and point toward several promising avenues for future work.

AIFeb 18
Framework of Thoughts: A Foundation Framework for Dynamic and Optimized Reasoning based on Chains, Trees, and Graphs

Felix Fricke, Simon Malberg, Georg Groh

Prompting schemes such as Chain of Thought, Tree of Thoughts, and Graph of Thoughts can significantly enhance the reasoning capabilities of large language models. However, most existing schemes require users to define static, problem-specific reasoning structures that lack adaptability to dynamic or unseen problem types. Additionally, these schemes are often under-optimized in terms of hyperparameters, prompts, runtime, and prompting cost. To address these limitations, we introduce Framework of Thoughts (FoT)--a general-purpose foundation framework for building and optimizing dynamic reasoning schemes. FoT comes with built-in features for hyperparameter tuning, prompt optimization, parallel execution, and intelligent caching, unlocking the latent performance potential of reasoning schemes. We demonstrate FoT's capabilities by implementing three popular schemes--Tree of Thoughts, Graph of Thoughts, and ProbTree--within FoT. We empirically show that FoT enables significantly faster execution, reduces costs, and achieves better task scores through optimization. We release our codebase to facilitate the development of future dynamic and efficient reasoning schemes.

72.8CLMar 24
Beyond Hate: Differentiating Uncivil and Intolerant Speech in Multimodal Content Moderation

Nils A. Herrmann, Tobias Eder, Jingyi He et al.

Current multimodal toxicity benchmarks typically use a single binary hatefulness label. This coarse approach conflates two fundamentally different characteristics of expression: tone and content. Drawing on communication science theory, we introduce a fine-grained annotation scheme that distinguishes two separable dimensions: incivility (rude or dismissive tone) and intolerance (content that attacks pluralism and targets groups or identities) and apply it to 2,030 memes from the Hateful Memes dataset. We evaluate different vision-language models under coarse-label training, transfer learning across label schemes and a joint learning approach that combines the coarse hatefulness label with our fine-grained annotations. Our results show that fine-grained annotations complement existing coarse labels and, when used jointly, improve overall model performance. Moreover, models trained with the fine-grained scheme exhibit more balanced moderation-relevant error profiles and are less prone to under-detection of harmful content than models trained on hatefulness labels alone (FNR-FPR, the difference between false negative and false positive rates: 0.74 to 0.42 for LLaVA-1.6-Mistral-7B; 0.54 to 0.28 for Qwen2.5-VL-7B). This work contributes to data-centric approaches in content moderation by improving the reliability and accuracy of moderation systems through enhanced data quality. Overall, combining both coarse and fine-grained labels provides a practical route to more reliable multimodal moderation.

CLApr 27, 2024
Toxicity Classification in Ukrainian

Daryna Dementieva, Valeriia Khylenko, Nikolay Babakov et al.

The task of toxicity detection is still a relevant task, especially in the context of safe and fair LMs development. Nevertheless, labeled binary toxicity classification corpora are not available for all languages, which is understandable given the resource-intensive nature of the annotation process. Ukrainian, in particular, is among the languages lacking such resources. To our knowledge, there has been no existing toxicity classification corpus in Ukrainian. In this study, we aim to fill this gap by investigating cross-lingual knowledge transfer techniques and creating labeled corpora by: (i)~translating from an English corpus, (ii)~filtering toxic samples using keywords, and (iii)~annotating with crowdsourcing. We compare LLMs prompting and other cross-lingual transfer approaches with and without fine-tuning offering insights into the most robust and efficient baselines.

CLApr 2, 2024
Cross-lingual Text Classification Transfer: The Case of Ukrainian

Daryna Dementieva, Valeriia Khylenko, Georg Groh

Despite the extensive amount of labeled datasets in the NLP text classification field, the persistent imbalance in data availability across various languages remains evident. To support further fair development of NLP models, exploring the possibilities of effective knowledge transfer to new languages is crucial. Ukrainian, in particular, stands as a language that still can benefit from the continued refinement of cross-lingual methodologies. Due to our knowledge, there is a tremendous lack of Ukrainian corpora for typical text classification tasks, i.e., different types of style, or harmful speech, or texts relationships. However, the amount of resources required for such corpora collection from scratch is understandable. In this work, we leverage the state-of-the-art advances in NLP, exploring cross-lingual knowledge transfer methods avoiding manual data curation: large multilingual encoders and translation systems, LLMs, and language adapters. We test the approaches on three text classification tasks -- toxicity classification, formality classification, and natural language inference (NLI) -- providing the ``recipe'' for the optimal setups for each task.

69.4HCApr 21
seneca: A Personalized Conversational Planner

Simon Bohnen, Gabriel Garbers, Lukas Ellinger et al.

Knowledge work demands sustained self-regulation, prioritization, and reflection-yet existing planning tools only partially support these needs. Digital to-do list applications feature task persistence but lack goal representation. Paper-based planning frameworks offer effective planning strategies but cannot adapt to individual users. Conversational AI systems enable flexible reflection but lack persistence and accountability. Moreover, none of these tools address a fundamental challenge: users' expressed demands often diverge from their underlying needs. This paper introduces seneca, a conceptual framework for a personalized, AI-assisted planner that integrates the complementary strengths of these three approaches. seneca combines a conversational agent that scaffolds reflection and asks clarifying questions, a persistent database that tracks goals and behavioral patterns, and a processor that synchronizes information between them. We describe this architecture and outline a phased evaluation strategy combining automated testing with simulated users and longitudinal human studies measuring goal attainment, planning realism, and goal-value alignment.

CLNov 25, 2024
Profiling Bias in LLMs: Stereotype Dimensions in Contextual Word Embeddings

Carolin M. Schuster, Maria-Alexandra Dinisor, Shashwat Ghatiwala et al.

Large language models (LLMs) are the foundation of the current successes of artificial intelligence (AI), however, they are unavoidably biased. To effectively communicate the risks and encourage mitigation efforts these models need adequate and intuitive descriptions of their discriminatory properties, appropriate for all audiences of AI. We suggest bias profiles with respect to stereotype dimensions based on dictionaries from social psychology research. Along these dimensions we investigate gender bias in contextual embeddings, across contexts and layers, and generate stereotype profiles for twelve different LLMs, demonstrating their intuition and use case for exposing and visualizing bias.

CLJul 16, 2025
Simplifications are Absolutists: How Simplified Language Reduces Word Sense Awareness in LLM-Generated Definitions

Lukas Ellinger, Miriam Anschütz, Georg Groh

Large Language Models (LLMs) can provide accurate word definitions and explanations for any context. However, the scope of the definition changes for different target groups, like children or language learners. This is especially relevant for homonyms, words with multiple meanings, where oversimplification might risk information loss by omitting key senses, potentially misleading users who trust LLM outputs. We investigate how simplification impacts homonym definition quality across three target groups: Normal, Simple, and ELI5. Using two novel evaluation datasets spanning multiple languages, we test DeepSeek v3, Llama 4 Maverick, Qwen3-30B A3B, GPT-4o mini, and Llama 3.1 8B via LLM-as-Judge and human annotations. Our results show that simplification drastically degrades definition completeness by neglecting polysemy, increasing the risk of misunderstanding. Fine-tuning Llama 3.1 8B with Direct Preference Optimization substantially improves homonym response quality across all prompt types. These findings highlight the need to balance simplicity and completeness in educational NLP to ensure reliable, context-aware definitions for all learners.

CLJul 1, 2025
TUM-MiKaNi at SemEval-2025 Task 3: Towards Multilingual and Knowledge-Aware Non-factual Hallucination Identification

Miriam Anschütz, Ekaterina Gikalo, Niklas Herbster et al.

Hallucinations are one of the major problems of LLMs, hindering their trustworthiness and deployment to wider use cases. However, most of the research on hallucinations focuses on English data, neglecting the multilingual nature of LLMs. This paper describes our submission to the SemEval-2025 Task-3 - Mu-SHROOM, the Multilingual Shared-task on Hallucinations and Related Observable Overgeneration Mistakes. We propose a two-part pipeline that combines retrieval-based fact verification against Wikipedia with a BERT-based system fine-tuned to identify common hallucination patterns. Our system achieves competitive results across all languages, reaching top-10 results in eight languages, including English. Moreover, it supports multiple languages beyond the fourteen covered by the shared task. This multilingual hallucination identifier can help to improve LLM outputs and their usefulness in the future.

CLNov 25, 2024
Adapter-based Approaches to Knowledge-enhanced Language Models -- A Survey

Alexander Fichtl, Juraj Vladika, Georg Groh

Knowledge-enhanced language models (KELMs) have emerged as promising tools to bridge the gap between large-scale language models and domain-specific knowledge. KELMs can achieve higher factual accuracy and mitigate hallucinations by leveraging knowledge graphs (KGs). They are frequently combined with adapter modules to reduce the computational load and risk of catastrophic forgetting. In this paper, we conduct a systematic literature review (SLR) on adapter-based approaches to KELMs. We provide a structured overview of existing methodologies in the field through quantitative and qualitative analysis and explore the strengths and potential shortcomings of individual approaches. We show that general knowledge and domain-specific approaches have been frequently explored along with various adapter architectures and downstream tasks. We particularly focused on the popular biomedical domain, where we provided an insightful performance comparison of existing KELMs. We outline the main trends and propose promising future directions.

CLApr 10, 2024
Simpler becomes Harder: Do LLMs Exhibit a Coherent Behavior on Simplified Corpora?

Miriam Anschütz, Edoardo Mosca, Georg Groh

Text simplification seeks to improve readability while retaining the original content and meaning. Our study investigates whether pre-trained classifiers also maintain such coherence by comparing their predictions on both original and simplified inputs. We conduct experiments using 11 pre-trained models, including BERT and OpenAI's GPT 3.5, across six datasets spanning three languages. Additionally, we conduct a detailed analysis of the correlation between prediction change rates and simplification types/strengths. Our findings reveal alarming inconsistencies across all languages and models. If not promptly addressed, simplified inputs can be easily exploited to craft zero-iteration model-agnostic adversarial attacks with success rates of up to 50%

MAFeb 19
DIALECTIC: A Multi-Agent System for Startup Evaluation

Jae Yoon Bae, Simon Malberg, Joyce Galang et al.

Venture capital (VC) investors face a large number of investment opportunities but only invest in few of these, with even fewer ending up successful. Early-stage screening of opportunities is often limited by investor bandwidth, demanding tradeoffs between evaluation diligence and number of opportunities assessed. To ease this tradeoff, we introduce DIALECTIC, an LLM-based multi-agent system for startup evaluation. DIALECTIC first gathers factual knowledge about a startup and organizes these facts into a hierarchical question tree. It then synthesizes the facts into natural-language arguments for and against an investment and iteratively critiques and refines these arguments through a simulated debate, which surfaces only the most convincing arguments. Our system also produces numeric decision scores that allow investors to rank and thus efficiently prioritize opportunities. We evaluate DIALECTIC through backtesting on real investment opportunities aggregated from five VC funds, showing that DIALECTIC matches the precision of human VCs in predicting startup success.

CROct 13, 2025
Deep Research Brings Deeper Harm

Shuo Chen, Zonggen Li, Zhen Han et al. · deepmind, oxford

Deep Research (DR) agents built on Large Language Models (LLMs) can perform complex, multi-step research by decomposing tasks, retrieving online information, and synthesizing detailed reports. However, the misuse of LLMs with such powerful capabilities can lead to even greater risks. This is especially concerning in high-stakes and knowledge-intensive domains such as biosecurity, where DR can generate a professional report containing detailed forbidden knowledge. Unfortunately, we have found such risks in practice: simply submitting a harmful query, which a standalone LLM directly rejects, can elicit a detailed and dangerous report from DR agents. This highlights the elevated risks and underscores the need for a deeper safety analysis. Yet, jailbreak methods designed for LLMs fall short in exposing such unique risks, as they do not target the research ability of DR agents. To address this gap, we propose two novel jailbreak strategies: Plan Injection, which injects malicious sub-goals into the agent's plan; and Intent Hijack, which reframes harmful queries as academic research questions. We conducted extensive experiments across different LLMs and various safety benchmarks, including general and biosecurity forbidden prompts. These experiments reveal 3 key findings: (1) Alignment of the LLMs often fail in DR agents, where harmful prompts framed in academic terms can hijack agent intent; (2) Multi-step planning and execution weaken the alignment, revealing systemic vulnerabilities that prompt-level safeguards cannot address; (3) DR agents not only bypass refusals but also produce more coherent, professional, and dangerous content, compared with standalone LLMs. These results demonstrate a fundamental misalignment in DR agents and call for better alignment techniques tailored to DR agents. Code and datasets are available at https://chenxshuo.github.io/deeper-harm.

CLOct 6, 2025
The End of Transformers? On Challenging Attention and the Rise of Sub-Quadratic Architectures

Alexander M. Fichtl, Jeremias Bohn, Josefin Kelber et al.

Transformers have dominated sequence processing tasks for the past seven years -- most notably language modeling. However, the inherent quadratic complexity of their attention mechanism remains a significant bottleneck as context length increases. This paper surveys recent efforts to overcome this bottleneck, including advances in (sub-quadratic) attention variants, recurrent neural networks, state space models, and hybrid architectures. We critically analyze these approaches in terms of compute and memory complexity, benchmark results, and fundamental limitations to assess whether the dominance of pure-attention transformers may soon be challenged.

CLSep 19, 2025
It Depends: Resolving Referential Ambiguity in Minimal Contexts with Commonsense Knowledge

Lukas Ellinger, Georg Groh

Ambiguous words or underspecified references require interlocutors to resolve them, often by relying on shared context and commonsense knowledge. Therefore, we systematically investigate whether Large Language Models (LLMs) can leverage commonsense to resolve referential ambiguity in multi-turn conversations and analyze their behavior when ambiguity persists. Further, we study how requests for simplified language affect this capacity. Using a novel multilingual evaluation dataset, we test DeepSeek v3, GPT-4o, Qwen3-32B, GPT-4o-mini, and Llama-3.1-8B via LLM-as-Judge and human annotations. Our findings indicate that current LLMs struggle to resolve ambiguity effectively: they tend to commit to a single interpretation or cover all possible references, rather than hedging or seeking clarification. This limitation becomes more pronounced under simplification prompts, which drastically reduce the use of commonsense reasoning and diverse response strategies. Fine-tuning Llama-3.1-8B with Direct Preference Optimization substantially improves ambiguity resolution across all request types. These results underscore the need for advanced fine-tuning to improve LLMs' handling of ambiguity and to ensure robust performance across diverse communication styles.

CLMay 19, 2025
To Bias or Not to Bias: Detecting bias in News with bias-detector

Himel Ghosh, Ahmed Mosharafa, Georg Groh

Media bias detection is a critical task in ensuring fair and balanced information dissemination, yet it remains challenging due to the subjectivity of bias and the scarcity of high-quality annotated data. In this work, we perform sentence-level bias classification by fine-tuning a RoBERTa-based model on the expert-annotated BABE dataset. Using McNemar's test and the 5x2 cross-validation paired t-test, we show statistically significant improvements in performance when comparing our model to a domain-adaptively pre-trained DA-RoBERTa baseline. Furthermore, attention-based analysis shows that our model avoids common pitfalls like oversensitivity to politically charged terms and instead attends more meaningfully to contextually relevant tokens. For a comprehensive examination of media bias, we present a pipeline that combines our model with an already-existing bias-type classifier. Our method exhibits good generalization and interpretability, despite being constrained by sentence-level analysis and dataset size because of a lack of larger and more advanced bias corpora. We talk about context-aware modeling, bias neutralization, and advanced bias type classification as potential future directions. Our findings contribute to building more robust, explainable, and socially responsible NLP systems for media bias detection.

CLOct 28, 2024
Semantic Component Analysis: Introducing Multi-Topic Distributions to Clustering-Based Topic Modeling

Florian Eichin, Carolin M. Schuster, Georg Groh et al.

Topic modeling is a key method in text analysis, but existing approaches fail to efficiently scale to large datasets or are limited by assuming one topic per document. Overcoming these limitations, we introduce Semantic Component Analysis (SCA), a topic modeling technique that discovers multiple topics per sample by introducing a decomposition step to the clustering-based topic modeling framework. We evaluate SCA on Twitter datasets in English, Hausa and Chinese. There, it achieves competitive coherence and diversity compared to BERTopic, while uncovering at least double the topics and maintaining a noise rate close to zero. We also find that SCA outperforms the LLM-based TopicGPT in scenarios with similar compute budgets. SCA thus provides an effective and efficient approach for topic modeling of large datasets.

CLMay 22, 2023
Language Models for German Text Simplification: Overcoming Parallel Data Scarcity through Style-specific Pre-training

Miriam Anschütz, Joshua Oehms, Thomas Wimmer et al.

Automatic text simplification systems help to reduce textual information barriers on the internet. However, for languages other than English, only few parallel data to train these systems exists. We propose a two-step approach to overcome this data scarcity issue. First, we fine-tuned language models on a corpus of German Easy Language, a specific style of German. Then, we used these models as decoders in a sequence-to-sequence simplification task. We show that the language models adapt to the style characteristics of Easy Language and output more accessible texts. Moreover, with the style-specific pre-training, we reduced the number of trainable parameters in text simplification models. Hence, less parallel data is sufficient for training. Our results indicate that pre-training on unaligned data can reduce the required parallel data while improving the performance on downstream tasks.

CLMay 15, 2023
AdamR at SemEval-2023 Task 10: Solving the Class Imbalance Problem in Sexism Detection with Ensemble Learning

Adam Rydelek, Daryna Dementieva, Georg Groh

The Explainable Detection of Online Sexism task presents the problem of explainable sexism detection through fine-grained categorisation of sexist cases with three subtasks. Our team experimented with different ways to combat class imbalance throughout the tasks using data augmentation and loss alteration techniques. We tackled the challenge by utilising ensembles of Transformer models trained on different datasets, which are tested to find the balance between performance and interpretability. This solution ranked us in the top 40\% of teams for each of the tracks.

CLMay 15, 2023
Adam-Smith at SemEval-2023 Task 4: Discovering Human Values in Arguments with Ensembles of Transformer-based Models

Daniel Schroter, Daryna Dementieva, Georg Groh

This paper presents the best-performing approach alias "Adam Smith" for the SemEval-2023 Task 4: "Identification of Human Values behind Arguments". The goal of the task was to create systems that automatically identify the values within textual arguments. We train transformer-based models until they reach their loss minimum or f1-score maximum. Ensembling the models by selecting one global decision threshold that maximizes the f1-score leads to the best-performing system in the competition. Ensembling based on stacking with logistic regressions shows the best performance on an additional dataset provided to evaluate the robustness ("Nahj al-Balagha"). Apart from outlining the submitted system, we demonstrate that the use of the large ensemble model is not necessary and that the system size can be significantly reduced.

CLNov 18, 2021
How to Build Robust FAQ Chatbot with Controllable Question Generator?

Yan Pan, Mingyang Ma, Bernhard Pflugfelder et al.

Many unanswerable adversarial questions fool the question-answer (QA) system with some plausible answers. Building a robust, frequently asked questions (FAQ) chatbot needs a large amount of diverse adversarial examples. Recent question generation methods are ineffective at generating many high-quality and diverse adversarial question-answer pairs from unstructured text. We propose the diversity controllable semantically valid adversarial attacker (DCSA), a high-quality, diverse, controllable method to generate standard and adversarial samples with a semantic graph. The fluent and semantically generated QA pairs fool our passage retrieval model successfully. After that, we conduct a study on the robustness and generalization of the QA model with generated QA pairs among different domains. We find that the generated data set improves the generalizability of the QA model to the new target domain and the robustness of the QA model to detect unanswerable adversarial questions.

CLNov 3, 2021
End-to-End Annotator Bias Approximation on Crowdsourced Single-Label Sentiment Analysis

Gerhard Johann Hagerer, David Szabo, Andreas Koch et al.

Sentiment analysis is often a crowdsourcing task prone to subjective labels given by many annotators. It is not yet fully understood how the annotation bias of each annotator can be modeled correctly with state-of-the-art methods. However, resolving annotator bias precisely and reliably is the key to understand annotators' labeling behavior and to successfully resolve corresponding individual misconceptions and wrongdoings regarding the annotation task. Our contribution is an explanation and improvement for precise neural end-to-end bias modeling and ground truth estimation, which reduces an undesired mismatch in that regard of the existing state-of-the-art. Classification experiments show that it has potential to improve accuracy in cases where each sample is annotated only by one single annotator. We provide the whole source code publicly and release an own domain-specific sentiment dataset containing 10,000 sentences discussing organic food products. These are crawled from social media and are singly labeled by 10 non-expert annotators.

CLNov 3, 2021
A Case Study and Qualitative Analysis of Simple Cross-Lingual Opinion Mining

Gerhard Johann Hagerer, Wing Sheung Leung, Qiaoxi Liu et al.

User-generated content from social media is produced in many languages, making it technically challenging to compare the discussed themes from one domain across different cultures and regions. It is relevant for domains in a globalized world, such as market research, where people from two nations and markets might have different requirements for a product. We propose a simple, modern, and effective method for building a single topic model with sentiment analysis capable of covering multiple languages simultanteously, based on a pre-trained state-of-the-art deep neural network for natural language understanding. To demonstrate its feasibility, we apply the model to newspaper articles and user comments of a specific domain, i.e., organic food products and related consumption behavior. The themes match across languages. Additionally, we obtain an high proportion of stable and domain-relevant topics, a meaningful relation between topics and their respective textual contents, and an interpretable representation for social media documents. Marketing can potentially benefit from our method, since it provides an easy-to-use means of addressing specific customer interests from different market regions around the globe. For reproducibility, we provide the code, data, and results of our study.

HCOct 28, 2021
An Analysis of Programming Course Evaluations Before and After the Introduction of an Autograder

Gerhard Johann Hagerer, Laura Lahesoo, Miriam Anschütz et al.

Commonly, introductory programming courses in higher education institutions have hundreds of participating students eager to learn to program. The manual effort for reviewing the submitted source code and for providing feedback can no longer be managed. Manually reviewing the submitted homework can be subjective and unfair, particularly if many tutors are responsible for grading. Different autograders can help in this situation; however, there is a lack of knowledge about how autograders can impact students' overall perception of programming classes and teaching. This is relevant for course organizers and institutions to keep their programming courses attractive while coping with increasing students. This paper studies the answers to the standardized university evaluation questionnaires of multiple large-scale foundational computer science courses which recently introduced autograding. The differences before and after this intervention are analyzed. By incorporating additional observations, we hypothesize how the autograder might have contributed to the significant changes in the data, such as, improved interactions between tutors and students, improved overall course quality, improved learning success, increased time spent, and reduced difficulty. This qualitative study aims to provide hypotheses for future research to define and conduct quantitative surveys and data analysis. The autograder technology can be validated as a teaching method to improve student satisfaction with programming courses.

CLOct 20, 2021
SocialVisTUM: An Interactive Visualization Toolkit for Correlated Neural Topic Models on Social Media Opinion Mining

Gerhard Johann Hagerer, Martin Kirchhoff, Hannah Danner et al.

Recent research in opinion mining proposed word embedding-based topic modeling methods that provide superior coherence compared to traditional topic modeling. In this paper, we demonstrate how these methods can be used to display correlated topic models on social media texts using SocialVisTUM, our proposed interactive visualization toolkit. It displays a graph with topics as nodes and their correlations as edges. Further details are displayed interactively to support the exploration of large text collections, e.g., representative words and sentences of topics, topic and sentiment distributions, hierarchical topic clustering, and customizable, predefined topic labels. The toolkit optimizes automatically on custom data for optimal coherence. We show a working instance of the toolkit on data crawled from English social media discussions about organic food consumption. The visualization confirms findings of a qualitative consumer research study. SocialVisTUM and its training procedures are accessible online.

CLSep 15, 2021
Introducing an Abusive Language Classification Framework for Telegram to Investigate the German Hater Community

Maximilian Wich, Adrian Gorniak, Tobias Eder et al.

Since traditional social media platforms continue to ban actors spreading hate speech or other forms of abusive languages (a process known as deplatforming), these actors migrate to alternative platforms that do not moderate users content. One popular platform relevant for the German hater community is Telegram for which limited research efforts have been made so far. This study aims to develop a broad framework comprising (i) an abusive language classification model for German Telegram messages and (ii) a classification model for the hatefulness of Telegram channels. For the first part, we use existing abusive language datasets containing posts from other platforms to develop our classification models. For the channel classification model, we develop a method that combines channel-specific content information collected from a topic model with a social graph to predict the hatefulness of channels. Furthermore, we complement these two approaches for hate speech detection with insightful results on the evolution of the hater community on Telegram in Germany. We also propose methods for conducting scalable network analyses for social media platforms to the hate speech research community. As an additional output of this study, we provide an annotated abusive language dataset containing 1,149 annotated Telegram messages.

LGJun 29, 2021
Classification of Consumer Belief Statements From Social Media

Gerhard Johann Hagerer, Wenbin Le, Hannah Danner et al.

Social media offer plenty of information to perform market research in order to meet the requirements of customers. One way how this research is conducted is that a domain expert gathers and categorizes user-generated content into a complex and fine-grained class structure. In many of such cases, little data meets complex annotations. It is not yet fully understood how this can be leveraged successfully for classification. We examine the classification accuracy of expert labels when used with a) many fine-grained classes and b) few abstract classes. For scenario b) we compare abstract class labels given by the domain expert as baseline and by automatic hierarchical clustering. We compare this to another baseline where the entire class structure is given by a completely unsupervised clustering approach. By doing so, this work can serve as an example of how complex expert annotations are potentially beneficial and can be utilized in the most optimal way for opinion mining in highly specific domains. By exploring across a range of techniques and experiments, we find that automated class abstraction approaches in particular the unsupervised approach performs remarkably well against domain expert baseline on text classification tasks. This has the potential to inspire opinion mining applications in order to support market researchers in practice and to inspire fine-grained automated content analysis on a large scale.

CLMay 4, 2021
GraphTMT: Unsupervised Graph-based Topic Modeling from Video Transcripts

Lukas Stappen, Jason Thies, Gerhard Hagerer et al.

To unfold the tremendous amount of multimedia data uploaded daily to social media platforms, effective topic modeling techniques are needed. Existing work tends to apply topic models on written text datasets. In this paper, we propose a topic extractor on video transcripts. Exploiting neural word embeddings through graph-based clustering, we aim to improve usability and semantic coherence. Unlike most topic models, this approach works without knowing the true number of topics, which is important when no such assumption can or should be made. Experimental results on the real-life multimodal dataset MuSe-CaR demonstrates that our approach GraphTMT extracts coherent and meaningful topics and outperforms baseline methods. Furthermore, we successfully demonstrate the applicability of our approach on the popular Citysearch corpus.

CLAug 12, 2018
Sequence Labeling: A Practical Approach

Adnan Akhundov, Dietrich Trautmann, Georg Groh

We take a practical approach to solving sequence labeling problem assuming unavailability of domain expertise and scarcity of informational and computational resources. To this end, we utilize a universal end-to-end Bi-LSTM-based neural sequence labeling model applicable to a wide range of NLP tasks and languages. The model combines morphological, semantic, and structural cues extracted from data to arrive at informed predictions. The model's performance is evaluated on eight benchmark datasets (covering three tasks: POS-tagging, NER, and Chunking, and four languages: English, German, Dutch, and Spanish). We observe state-of-the-art results on four of them: CoNLL-2012 (English NER), CoNLL-2002 (Dutch NER), GermEval 2014 (German NER), Tiger Corpus (German POS-tagging), and competitive performance on the rest.

CYJul 7, 2016
Estimating the Dissemination of Social and Mobile Search in Categories of Information Needs Using Websites as Proxies

Christoph Fuchs, Akash Nayyar, Ruth Nussbaumer et al.

With the increasing popularity of social means to satisfy information needs using Social Media (e.g., Social Media Question Asking, SMQA) or Social Information Retrieval approaches, this paper tries to identify types of information needs which are inherently social and therefore better suited for those techniques. We describe an experiment where prominent websites from various content categories are used to represent their respective content area and allow to correlate attributes of the content areas. The underlying assumption is that successful websites for focused content areas perfectly align with the information seekers' requirements when satisfying information needs in the respective content areas. Based on a manually collected dataset of URLs from websites covering a broad range of topics taken from Alexa (http://www.alexa.com} (retrieved 2015-11-04)) (a company that publishes statistics about web traffic), a crowdsourcing approach is employed to rate the information needs that could get solved by the respective URLs according to several dimensions (incl. sociality and mobility) to investigate possible correlations with other attributes. Our results suggest that information needs which do not require a certain formal expertise play an important role in social information retrieval and that some content areas are better suited for social information retrieval (e.g., Factual Knowledge & News, Games, Lifestyle) than others (e.g., Health & Lifestyle).

MMJun 23, 2014
Designing Sound Collaboratively - Perceptually Motivated Audio Synthesis

Niklas Klügel, Timo Becker, Georg Groh

In this contribution, we will discuss a prototype that allows a group of users to design sound collaboratively in real time using a multi-touch tabletop. We make use of a machine learning method to generate a mapping from perceptual audio features to synthesis parameters. This mapping is then used for visualization and interaction. Finally, we discuss the results of a comparative evaluation study.

SIFeb 11, 2014
An evaluation of keyword extraction from online communication for the characterisation of social relations

Jan Hauffa, Tobias Lichtenberg, Georg Groh

The set of interpersonal relationships on a social network service or a similar online community is usually highly heterogenous. The concept of tie strength captures only one aspect of this heterogeneity. Since the unstructured text content of online communication artefacts is a salient source of information about a social relationship, we investigate the utility of keywords extracted from the message body as a representation of the relationship's characteristics as reflected by the conversation topics. Keyword extraction is performed using standard natural language processing methods. Communication data and human assessments of the extracted keywords are obtained from Facebook users via a custom application. The overall positive quality assessment provides evidence that the keywords indeed convey relevant information about the relationship.

SISep 13, 2012
Spatio-Temporal Small Worlds for Decentralized Information Retrieval in Social Networking

Georg Groh, Florian Straub, Benjamin Koster

We discuss foundations and options for alternative, agent-based information retrieval (IR) approaches in Social Networking, especially Decentralized and Mobile Social Networking scenarios. In addition to usual semantic contexts, these approaches make use of long-term social and spatio-temporal contexts in order to satisfy conscious as well as unconscious information needs according to Human IR heuristics. Using a large Twitter dataset, we investigate these approaches and especially investigate the question in how far spatio-temporal contexts can act as a conceptual bracket implicating social and semantic cohesion, giving rise to the concept of Spatio-Temporal Small Worlds.