Simone Kopeinik

LG
h-index21
10papers
292citations
Novelty31%
AI Score37

10 Papers

LGJul 19, 2023
Reproducibility in Machine Learning-Driven Research

Harald Semmelrock, Simone Kopeinik, Dieter Theiler et al.

Research is facing a reproducibility crisis, in which the results and findings of many studies are difficult or even impossible to reproduce. This is also the case in machine learning (ML) and artificial intelligence (AI) research. Often, this is the case due to unpublished data and/or source-code, and due to sensitivity to ML training conditions. Although different solutions to address this issue are discussed in the research community such as using ML platforms, the level of reproducibility in ML-driven research is not increasing substantially. Therefore, in this mini survey, we review the literature on reproducibility in ML-driven research with three main aims: (i) reflect on the current situation of ML reproducibility in various research fields, (ii) identify reproducibility issues and barriers that exist in these research fields applying ML, and (iii) identify potential drivers such as tools, practices, and interventions that support ML reproducibility. With this, we hope to contribute to decisions on the viability of different solutions for supporting ML reproducibility.

LGAug 17, 2022
Modelling the long-term fairness dynamics of data-driven targeted help on job seekers

Sebastian Scher, Simone Kopeinik, Andreas Trügler et al.

The use of data-driven decision support by public agencies is becoming more widespread and already influences the allocation of public resources. This raises ethical concerns, as it has adversely affected minorities and historically discriminated groups. In this paper, we use an approach that combines statistics and data-driven approaches with dynamical modeling to assess long-term fairness effects of labor market interventions. Specifically, we develop and use a model to investigate the impact of decisions caused by a public employment authority that selectively supports job-seekers through targeted help. The selection of who receives what help is based on a data-driven intervention model that estimates an individual's chances of finding a job in a timely manner and rests upon data that describes a population in which skills relevant to the labor market are unevenly distributed between two groups (e.g., males and females). The intervention model has incomplete access to the individual's actual skills and can augment this with knowledge of the individual's group affiliation, thus using a protected attribute to increase predictive accuracy. We assess this intervention model's dynamics -- especially fairness-related issues and trade-offs between different fairness goals -- over time and compare it to an intervention model that does not use group affiliation as a predictive feature. We conclude that in order to quantify the trade-off correctly and to assess the long-term fairness effects of such a system in the real-world, careful modeling of the surrounding labor market is indispensable.

LGFeb 7, 2023
A conceptual model for leaving the data-centric approach in machine learning

Sebastian Scher, Bernhard Geiger, Simone Kopeinik et al.

For a long time, machine learning (ML) has been seen as the abstract problem of learning relationships from data independent of the surrounding settings. This has recently been challenged, and methods have been proposed to include external constraints in the machine learning models. These methods usually come from application-specific fields, such as de-biasing algorithms in the field of fairness in ML or physical constraints in the fields of physics and engineering. In this paper, we present and discuss a conceptual high-level model that unifies these approaches in a common language. We hope that this will enable and foster exchange between the different fields and their different methods for including external constraints into ML models, and thus leaving purely data-centric approaches.

IRJan 2, 2019Code
The TagRec Framework as a Toolkit for the Development of Tag-Based Recommender Systems

Dominik Kowald, Simone Kopeinik, Elisabeth Lex

Recommender systems have become important tools to support users in identifying relevant content in an overloaded information space. To ease the development of recommender systems, a number of recommender frameworks have been proposed that serve a wide range of application domains. Our TagRec framework is one of the few examples of an open-source framework tailored towards developing and evaluating tag-based recommender systems. In this paper, we present the current, updated state of TagRec, and we summarize and reflect on four use cases that have been implemented with TagRec: (i) tag recommendations, (ii) resource recommendations, (iii) recommendation evaluation, and (iv) hashtag recommendations. To date, TagRec served the development and/or evaluation process of tag-based recommender systems in two large scale European research projects, which have been described in 17 research papers. Thus, we believe that this work is of interest for both researchers and practitioners of tag-based recommender systems.

LGNov 15, 2024
Establishing and Evaluating Trustworthy AI: Overview and Research Challenges

Dominik Kowald, Sebastian Scher, Viktoria Pammer-Schindler et al.

Artificial intelligence (AI) technologies (re-)shape modern life, driving innovation in a wide range of sectors. However, some AI systems have yielded unexpected or undesirable outcomes or have been used in questionable manners. As a result, there has been a surge in public and academic discussions about aspects that AI systems must fulfill to be considered trustworthy. In this paper, we synthesize existing conceptualizations of trustworthy AI along six requirements: 1) human agency and oversight, 2) fairness and non-discrimination, 3) transparency and explainability, 4) robustness and accuracy, 5) privacy and security, and 6) accountability. For each one, we provide a definition, describe how it can be established and evaluated, and discuss requirement-specific research challenges. Finally, we conclude this analysis by identifying overarching research challenges across the requirements with respect to 1) interdisciplinary research, 2) conceptual clarity, 3) context-dependency, 4) dynamics in evolving systems, and 5) investigations in real-world contexts. Thus, this paper synthesizes and consolidates a wide-ranging and active discussion currently taking place in various academic sub-communities and public forums. It aims to serve as a reference for a broad audience and as a basis for future research directions.

LGOct 2, 2025
Private and Fair Machine Learning: Revisiting the Disparate Impact of Differentially Private SGD

Lea Demelius, Dominik Kowald, Simone Kopeinik et al.

Differential privacy (DP) is a prominent method for protecting information about individuals during data analysis. Training neural networks with differentially private stochastic gradient descent (DPSGD) influences the model's learning dynamics and, consequently, its output. This can affect the model's performance and fairness. While the majority of studies on the topic report a negative impact on fairness, it has recently been suggested that fairness levels comparable to non-private models can be achieved by optimizing hyperparameters for performance directly on differentially private models (rather than re-using hyperparameters from non-private models, as is common practice). In this work, we analyze the generalizability of this claim by 1) comparing the disparate impact of DPSGD on different performance metrics, and 2) analyzing it over a wide range of hyperparameter settings. We highlight that a disparate impact on one metric does not necessarily imply a disparate impact on another. Most importantly, we show that while optimizing hyperparameters directly on differentially private models does not mitigate the disparate impact of DPSGD reliably, it can still lead to improved utility-fairness trade-offs compared to re-using hyperparameters from non-private models. We stress, however, that any form of hyperparameter tuning entails additional privacy leakage, calling for careful considerations of how to balance privacy, utility and fairness. Finally, we extend our analyses to DPSGD-Global-Adapt, a variant of DPSGD designed to mitigate the disparate impact on accuracy, and conclude that this alternative may not be a robust solution with respect to hyperparameter choice.

CLJul 22, 2025
Exploring Gender Bias in Large Language Models: An In-depth Dive into the German Language

Kristin Gnadt, David Thulke, Simone Kopeinik et al.

In recent years, various methods have been proposed to evaluate gender bias in large language models (LLMs). A key challenge lies in the transferability of bias measurement methods initially developed for the English language when applied to other languages. This work aims to contribute to this research strand by presenting five German datasets for gender bias evaluation in LLMs. The datasets are grounded in well-established concepts of gender bias and are accessible through multiple methodologies. Our findings, reported for eight multilingual LLM models, reveal unique challenges associated with gender bias in German, including the ambiguous interpretation of male occupational terms and the influence of seemingly neutral nouns on gender perception. This work contributes to the understanding of gender bias in LLMs across languages and underscores the necessity for tailored evaluation frameworks.

SEJun 20, 2024
Reproducibility in Machine Learning-based Research: Overview, Barriers and Drivers

Harald Semmelrock, Tony Ross-Hellauer, Simone Kopeinik et al.

Many research fields are currently reckoning with issues of poor levels of reproducibility. Some label it a "crisis", and research employing or building Machine Learning (ML) models is no exception. Issues including lack of transparency, data or code, poor adherence to standards, and the sensitivity of ML training conditions mean that many papers are not even reproducible in principle. Where they are, though, reproducibility experiments have found worryingly low degrees of similarity with original results. Despite previous appeals from ML researchers on this topic and various initiatives from conference reproducibility tracks to the ACM's new Emerging Interest Group on Reproducibility and Replicability, we contend that the general community continues to take this issue too lightly. Poor reproducibility threatens trust in and integrity of research results. Therefore, in this article, we lay out a new perspective on the key barriers and drivers (both procedural and technical) to increased reproducibility at various levels (methods, code, data, and experiments). We then map the drivers to the barriers to give concrete advice for strategies for researchers to mitigate reproducibility issues in their own work, to lay out key areas where further research is needed in specific areas, and to further ignite discussion on the threat presented by these urgent issues.

IRApr 28, 2021
Societal Biases in Retrieved Contents: Measurement Framework and Adversarial Mitigation for BERT Rankers

Navid Rekabsaz, Simone Kopeinik, Markus Schedl

Societal biases resonate in the retrieved contents of information retrieval (IR) systems, resulting in reinforcing existing stereotypes. Approaching this issue requires established measures of fairness in respect to the representation of various social groups in retrieval results, as well as methods to mitigate such biases, particularly in the light of the advances in deep ranking models. In this work, we first provide a novel framework to measure the fairness in the retrieved text contents of ranking models. Introducing a ranker-agnostic measurement, the framework also enables the disentanglement of the effect on fairness of collection from that of rankers. To mitigate these biases, we propose AdvBert, a ranking model achieved by adapting adversarial bias mitigation for IR, which jointly learns to predict relevance and remove protected attributes. We conduct experiments on two passage retrieval collections (MSMARCO Passage Re-ranking and TREC Deep Learning 2019 Passage Re-ranking), which we extend by fairness annotations of a selected subset of queries regarding gender attributes. Our results on the MSMARCO benchmark show that, (1) all ranking models are less fair in comparison with ranker-agnostic baselines, and (2) the fairness of Bert rankers significantly improves when using the proposed AdvBert models. Lastly, we investigate the trade-off between fairness and utility, showing that we can maintain the significant improvements in fairness without any significant loss in utility.

IRJan 30, 2015
Attention Please! A Hybrid Resource Recommender Mimicking Attention-Interpretation Dynamics

Paul Seitlinger, Dominik Kowald, Simone Kopeinik et al.

Classic resource recommenders like Collaborative Filtering (CF) treat users as being just another entity, neglecting non-linear user-resource dynamics shaping attention and interpretation. In this paper, we propose a novel hybrid recommendation strategy that refines CF by capturing these dynamics. The evaluation results reveal that our approach substantially improves CF and, depending on the dataset, successfully competes with a computationally much more expensive Matrix Factorization variant.